Earlier I wrote about finding the most common Firefox issues. I had
wanted to automate that process so it would keep surfacing these issues,
but I never had the time.
When Firefox Input was announced, I thought about trying again, this time
with Firefox Input data, but then I went on paternity leave and the time kind
of slipped away. I mentioned the idea this week, though, and it piqued some interest.
So I found myself with a bit of time to work on it. The first stage was
releasing a Python library called textcluster.
textcluster takes the work I did earlier and makes it a bit more
general purpose. The idea is that I can hand it a handful of short strings
and get back groups of similar ones, with a score for each match:
    [
        (
            "Rats don't sleep.",
            {'Cats eat rats.': 0.21353467285253394}
        ),
        (
            'Every good girl does well.',
            {'Every good boy does fine.': 0.32030200927880093}
        )
    ]
The number is the “similarity” between the strings relative to the entire
document corpus.
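
To make "relative to the entire document corpus" a bit more concrete, here is a minimal sketch of one common way to get such a score: TF-IDF weighting plus cosine similarity. This is an assumption for illustration, not textcluster's actual code, and the function names (`tokenize`, `tfidf_vectors`, `cosine`, `pairwise_similarities`) are made up here. The scores it produces will not match the numbers above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters and apostrophes as words.
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    # Weight each word by its count in the document times log(N / df),
    # where df is the number of documents containing the word. A word
    # that shows up in every document gets weight 0, which is what makes
    # the score depend on the whole corpus and not just the two strings.
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    return [
        {t: c * math.log(n / df[t]) for t, c in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def pairwise_similarities(docs):
    # Return {(i, j): score} for every pair of documents that share
    # at least one weighted word.
    vectors = tfidf_vectors(docs)
    scores = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            score = cosine(vectors[i], vectors[j])
            if score > 0:
                scores[(i, j)] = score
    return scores


docs = [
    "Cats eat rats.",
    "Rats don't sleep.",
    "Every good boy does fine.",
    "Every good girl does well.",
]
for (i, j), score in pairwise_similarities(docs).items():
    print(docs[i], "|", docs[j], "|", round(score, 3))
```

With these four strings, only the two "rats" sentences and the two "every good" sentences share words, so only those two pairs get a non-zero score. Adding or removing documents changes the weights, and therefore the scores, even for pairs that were not touched.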
My next trick is to see if I can run this memory-intensive calculation over a
dataset of 25,000 submitted opinions. If I can, we can get some interesting
data about what people think of the new Firefox beta.
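
For a rough sense of why memory is the worry, here is a back-of-the-envelope sketch. It assumes the naive approach of scoring every pair of opinions and keeping each score as one 8-byte float; how textcluster actually stores its results is not something this post specifies, so treat both assumptions as hypothetical.

```python
def pair_count(n):
    # Number of distinct unordered pairs among n items: n choose 2.
    return n * (n - 1) // 2


n_opinions = 25_000
bytes_per_score = 8  # assumption: one raw 64-bit float per stored score

pairs = pair_count(n_opinions)
raw_gigabytes = pairs * bytes_per_score / 1e9

print(pairs, "pairs")
print(round(raw_gigabytes, 2), "GB of raw scores")
```

That works out to roughly 312 million pairs and about 2.5 GB of raw scores before any container overhead, so discarding low scores as they are computed looks like the obvious first thing to try.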