<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Dave Dash</title>
 <link href="http://davedash.com/tag/clustering/atom.xml" rel="self"/>
 <link href="http://davedash.com/tag/clustering"/>
 <updated>2012-01-17T21:54:19-08:00</updated>
 <id>http://davedash.com/</id>
 <author>
   <name>Dave Dash</name>
   <email>dd+atom1@davedash.com</email>
 </author>

 
 <entry>
   <title>The Python textcluster Package</title>
   <link href="http://davedash.com/2010/07/08/the-python-textcluster-package/"/>
   <updated>2010-07-08T00:00:00-07:00</updated>
   <id>http://davedash.com/2010/07/08/the-python-textcluster-package</id>
   <content type="html">&lt;p&gt;Earlier I wrote about &lt;a href=&quot;http://davedash.com/2010/03/18/finding-the-most-common-firefox-issues/&quot;&gt;finding the most common Firefox issues&lt;/a&gt;.  I had
wanted to automate that process and continually find these issues.
Unfortunately I never had time to do this.&lt;/p&gt;

&lt;p&gt;When &lt;a href=&quot;http://aakash.doesthings.com/2010/06/25/hi-my-name-is-firefox-input/&quot;&gt;Firefox Input&lt;/a&gt; was announced, I thought about doing this again,
this time with Firefox Input data, but then I went on paternity leave and the time
slipped away.  I mentioned the idea this week, though, and it piqued some interest.&lt;/p&gt;

&lt;p&gt;So I found myself with a bit of time to work on it.  The first stage was
releasing a Python library called &lt;a href=&quot;http://github.com/davedash/textcluster&quot;&gt;&lt;code&gt;textcluster&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://github.com/davedash/textcluster&quot;&gt;&lt;code&gt;textcluster&lt;/code&gt;&lt;/a&gt; takes the &lt;a href=&quot;http://davedash.com/2010/03/18/finding-the-most-common-firefox-issues/&quot;&gt;work I did earlier&lt;/a&gt; and makes it a bit more
general purpose.  The idea is I can do something like this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;docs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&amp;#39;Every good boy does fine.&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&amp;#39;Every good girl does well.&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&amp;#39;Cats eat rats.&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&amp;quot;Rats don&amp;#39;t sleep.&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Corpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;docs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Which results in:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[
    (
        &quot;Rats don't sleep.&quot;,
        {'Cats eat rats.': 0.21353467285253394}
    ),
    (
        'Every good girl does well.',
        {'Every good boy does fine.': 0.32030200927880093}
    )
]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The number is the &quot;similarity&quot; between the strings relative to the entire
document corpus.&lt;/p&gt;
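&lt;p&gt;To make a corpus-relative similarity concrete, here is a minimal sketch of one common approach: TF-IDF weighting followed by cosine similarity.  This is an assumption for illustration, not a description of what textcluster does internally, and the helper names (tokenize, tfidf_vectors, cosine) are hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    token_lists = [tokenize(doc) for doc in docs]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    total = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Term frequency times inverse document frequency.
        vectors.append({
            term: (count / len(tokens)) * math.log(total / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    'Every good boy does fine.',
    'Every good girl does well.',
    'Cats eat rats.',
    "Rats don't sleep.",
]
vectors = tfidf_vectors(docs)
print(cosine(vectors[0], vectors[1]))  # the two "Every good ..." sentences
print(cosine(vectors[0], vectors[2]))  # no shared words
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Because the weights depend on how rare each word is across all four strings, adding or removing a document changes every score.  The exact numbers from this sketch will not match the output above, since the weighting used by the library may differ.&lt;/p&gt;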

&lt;p&gt;My next trick is to see if I can run this memory-intensive calculation over a
dataset of 25,000 submitted opinions.  If I can, we can get some interesting
data about what people think of the new &lt;a href=&quot;http://www.mozilla.com/en-US/firefox/all-beta.html&quot;&gt;Firefox beta&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Finding the most common Firefox issues</title>
   <link href="http://davedash.com/2010/03/18/finding-the-most-common-firefox-issues/"/>
   <updated>2010-03-18T00:00:00-07:00</updated>
   <id>http://davedash.com/2010/03/18/finding-the-most-common-firefox-issues</id>
   <content type="html">&lt;p&gt;A few months back, Cheng Wang of the Mozilla Support team decided to present on some design ideas for &lt;a href=&quot;http://support.mozilla.com/en-US/kb/&quot;&gt;Firefox Support&lt;/a&gt;.  One of the problems he noted was that there are a lot of repeated issues and that it would be useful to group them.  Grouping them lets you see how often something occurs and, secondly, how urgent it might be.&lt;/p&gt;

&lt;p&gt;Luckily, grouping and clustering text is something computers can do.  So I wrote &lt;a href=&quot;http://github.com/davedash/SUMO-issues&quot;&gt;this utility&lt;/a&gt; that does just that.&lt;/p&gt;

&lt;p&gt;I ran this script over a sampling of data from the last week:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firefox won't start after update. (65 related issues)

&lt;ul&gt;
&lt;li&gt;5.6:  Firefox updated, Gmail not delivering mails&lt;/li&gt;
&lt;li&gt;5.6:  How to change My Profile when Firefox won't load?&lt;/li&gt;
&lt;li&gt;7.5:  Once I close firefox, cannot start firefox again except system restart&lt;/li&gt;
&lt;li&gt;5.6:  When intalling updates Firefox uninstalls itself&lt;/li&gt;
&lt;li&gt;16.8:  firefox won't start after update 3.6&lt;/li&gt;
&lt;li&gt;11.2:  Upgraded to Firefox 3.6 and now it won't start&lt;/li&gt;
&lt;li&gt;14.9:  Firefox won't start with most extensions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;How do I add a bookmark to more than one folder? (64 related issues)

&lt;ul&gt;
&lt;li&gt;8.9:  How do I get my bookmarks on the bookmarks toolbar to show up as an icon only with no text?&lt;/li&gt;
&lt;li&gt;7.5:  Bookmarks lost after upgrade and cannot save new bookmarks&lt;/li&gt;
&lt;li&gt;7.5:  why do i have to add the .com now to addy's?&lt;/li&gt;
&lt;li&gt;8.7:  When I open sidebar to edit bookmarks, I only see the folder for Bookmarks Toolbar. I do not see a folder just called Bookmarks nor do I see my list of bookmarks, that separately appear under bookmarks menu at top of screen&lt;/li&gt;
&lt;li&gt;7.5:  All my impoted bookmarks go to the same webpage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;How do I remove the &quot;ask toolbar&quot;? (50 related issues)

&lt;ul&gt;
&lt;li&gt;14.9:  How do I remove an unwanted toolbar?&lt;/li&gt;
&lt;li&gt;5.6:  how to remove temporary video files from computer&lt;/li&gt;
&lt;li&gt;7.5:  I have no Toolbars or searchbar and i cant bring them back&lt;/li&gt;
&lt;li&gt;7.5:  nowhere says how to REMOVE a toolbar - only how to add or modify one&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;not able to open youtube videos (45 related issues)

&lt;ul&gt;
&lt;li&gt;5.6:  Cannot open bookmark/history sidebar&lt;/li&gt;
&lt;li&gt;5.6:  After working well for years Firefox will now not open&lt;/li&gt;
&lt;li&gt;6.7:  opening bookmarks do not open in new tab&lt;/li&gt;
&lt;li&gt;5.6:  I can't watch videos on youtube with firefox, but on internet explorer i can&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;I cannot download Firefox 3.6.  I've tried erasing the download file.  I cannot get beyond logging out of Firefox. (44 related issues)

&lt;ul&gt;
&lt;li&gt;8.4:  when downloading files firefox download manager will freeze and i will have to start over the file download&lt;/li&gt;
&lt;li&gt;5.6:  Firefox will not let me download anything! Can someone help?&lt;/li&gt;
&lt;li&gt;6.3:  cannot download epixHD.com: not compatible with firefox 3.6&lt;/li&gt;
&lt;li&gt;5.0:  Several tabs are coming up when i try to downloads things&lt;/li&gt;
&lt;li&gt;5.0:  Firefox wont open since I downloaded the 3.6 update.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The number to the left of each related issue is a score of how strongly it relates to the main issue.&lt;/p&gt;

&lt;p&gt;The full sample is 352 clusters from an original 3000+ issues.  That's a lot less stuff to go through.  We can tune this to produce either fewer clusters with more related issues in each, or more clusters, which might result in more accuracy.&lt;/p&gt;

&lt;p&gt;Despite the inaccuracy of clustering we can make some general observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firefox not starting is a big issue.&lt;/li&gt;
&lt;li&gt;Bookmarks are either confusing or broken.&lt;/li&gt;
&lt;li&gt;People don't like toolbars.&lt;/li&gt;
&lt;li&gt;Opening things is hard.&lt;/li&gt;
&lt;li&gt;Downloading things or Firefox is hard.&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Hopefully we can fine-tune these reports and have them run regularly... maybe automatically posting to Tumblr?&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Delicious keeps you in the know</title>
   <link href="http://davedash.com/2009/08/04/delicious-keeps-you-in-the-know/"/>
   <updated>2009-08-04T00:00:00-07:00</updated>
   <id>http://davedash.com/2009/08/04/delicious-keeps-you-in-the-know</id>
   <content type="html">&lt;p&gt;My last task at Delicious was to build, along with the amazing &lt;a href=&quot;http://zooie.wordpress.com/&quot;&gt;Vik Singh&lt;/a&gt;, a new feed of bookmarks that was heavily influenced by Twitter.  It was one of the most interesting and enjoyable pieces of code that I worked on at Delicious.&lt;/p&gt;

&lt;p&gt;More than two months after my final check-in, the code is &lt;a href=&quot;http://delicious.com/&quot;&gt;now in production&lt;/a&gt;.  It is mostly as intended, but it lacks an RSS or JSON feed (which I had already built).  This is somewhat disappointing, since I was hoping that Delicious would remain as open as it had been in the past.&lt;/p&gt;

&lt;p&gt;The algorithm is fairly simple: we look at what trending topics exist at any moment in time (via Google Trends and Twitter) and combine them with a list of popular terms.  We take the whole lot of these items, query Twitter Search, and store the tweets in an in-memory data table.  We also take a snapshot of URLs that are new to the Delicious corpus (basically anything on &lt;a href=&quot;http://delicious.com/recent/&quot;&gt;Delicious recent&lt;/a&gt; with 1 save).  We cluster the Delicious URLs and then find tweets that are similar to each of these clusters.&lt;/p&gt;
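&lt;p&gt;The last two steps, clustering the new bookmarks and then attaching similar tweets, can be sketched roughly as follows.  This is not the production code: the similarity measure is not stated here, so the sketch swaps in plain Jaccard overlap of word sets, and the function names, thresholds, bookmark titles and tweets are all made up for illustration.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import re


def tokens(text):
    """Return the set of lowercase words in a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Overlap between two token sets, from 0.0 to 1.0."""
    union = a.union(b)
    return len(a.intersection(b)) / len(union) if union else 0.0


def cluster(titles, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is (seed_tokens, [titles])
    for title in titles:
        words = tokens(title)
        for seed, members in clusters:
            if jaccard(seed, words) >= threshold:
                members.append(title)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append((words, [title]))
    return clusters


def match_tweets(clusters, tweets, threshold=0.2):
    """Pair each cluster with the tweets that resemble its seed."""
    matches = []
    for seed, members in clusters:
        related = [t for t in tweets if jaccard(seed, tokens(t)) >= threshold]
        matches.append((members, related))
    return matches


# Made-up bookmark titles and tweets, for illustration only.
bookmarks = [
    'Python 3.1 released',
    'Python 3.1 released today',
    'New solar panel efficiency record',
]
tweets = [
    'python 3.1 is released, time to upgrade',
    'lunch was great',
]
for members, related in match_tweets(cluster(bookmarks), tweets):
    print(members, related)
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;With this sample data the two Python bookmarks land in one cluster and pick up the matching tweet, while the solar panel bookmark forms its own cluster with no tweets.  Both thresholds are arbitrary knobs in this sketch.&lt;/p&gt;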

&lt;p&gt;The code for this is similar to &lt;a href=&quot;http://zooie.wordpress.com/2009/01/15/twitter-boss-real-time-search/&quot;&gt;Vik's TweetNews&lt;/a&gt; - but I think the Delicious data is a nicer fit.&lt;/p&gt;

&lt;p&gt;We optimized this quite a bit, built a very fast inverted index, and tweaked the code to run in about a minute.  Like TweetNews, the heart of this was built using Python.  Python, while being a dynamic language, is quite amazing for manipulating and iterating over sets of data.&lt;/p&gt;
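&lt;p&gt;For readers unfamiliar with the term, an inverted index maps each word to the documents that contain it, so similar items can be found without comparing every pair.  The following is a minimal sketch of the idea, not the index we built; the function names and sample strings are hypothetical.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def candidates(index, query):
    """Return ids of documents sharing at least one word with the query."""
    found = set()
    for word in re.findall(r"[a-z0-9]+", query.lower()):
        # .get avoids creating empty entries for unknown words.
        found.update(index.get(word, ()))
    return found


# Made-up sample strings, for illustration only.
docs = [
    'firefox will not start',
    'bookmarks toolbar missing',
    'firefox bookmarks lost',
]
index = build_index(docs)
print(sorted(candidates(index, 'Firefox crash')))
```
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Only the candidates returned by the lookup need a full similarity score, which is where most of the time savings come from in this kind of design.&lt;/p&gt;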

&lt;p&gt;While I was building this tool, it became my way to feel the pulse of what's going on.  I could ditch a lot of my RSS feeds and rely solely on Delicious to stay in the know.  Unfortunately I can't subscribe to a feed for this.  Either Delicious made a mistake and didn't launch the feeds at the same time as the web version (entirely possible, since Delicious hasn't been updated for most of 2009), or they are deliberately taking a step backwards.&lt;/p&gt;

&lt;p&gt;This step backwards is weird from a usability standpoint.  Delicious has always been a tool that allowed for multiple types of consumers and that appealed to developers thanks to its myriad RSS and JSON feeds.  I'm glad I didn't have to be on the losing side of that decision.  Delicious relies heavily on Google Trends and Twitter Search.  While there is no requirement for them to share the data they are mashing up, it would be the right thing to do.&lt;/p&gt;

&lt;p&gt;Let me know what you think of the new feeds.  I wish I could share a GitHub link or something snazzy so you could play around with it, but this post should be a good starting point for other real-time data mashups.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Update&lt;/strong&gt;: Read &lt;a href=&quot;http://blog.delicious.com/blog/2009/08/delicious-homepage-gets-%E2%80%9Cfresh%E2%80%9D.html&quot;&gt;Vik's account of this on the Delicious Blog&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 

</feed>

