<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Dave Dash</title>
 <link href="http://davedash.com/tag/search/atom.xml" rel="self"/>
 <link href="http://davedash.com/tag/search"/>
 <updated>2012-04-07T22:42:44-07:00</updated>
 <id>http://davedash.com/</id>
 <author>
   <name>Dave Dash</name>
   <email>dd+atom1@davedash.com</email>
 </author>

 
 <entry>
   <title>Bulk load ElasticSearch using pyes</title>
   <link href="http://davedash.com/2011/02/25/bulk-load-elasticsearch-using-pyes/"/>
   <updated>2011-02-25T00:00:00-08:00</updated>
   <id>http://davedash.com/2011/02/25/bulk-load-elasticsearch-using-pyes</id>
   <content type="html">&lt;p&gt;When indexing a lot of data, you can save time by bulk loading data.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;pyes&lt;/code&gt; you can do the following:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyes&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ES&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-index&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-type&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-index&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-type&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-index&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-type&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-index&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-type&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;This will make 4 independent network calls.&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pyes&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ES&lt;/span&gt;


&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-index&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-type&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bulk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-index&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-type&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bulk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-index&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-type&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bulk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-index&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&amp;#39;my-type&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bulk&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;es&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;refresh&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;With &lt;code&gt;bulk=True&lt;/code&gt;, the same work is done in one call.  This is handy for those &quot;reindex all the items we
can&quot; weekends.&lt;/p&gt;
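&lt;p&gt;To make the idea concrete, here is a minimal sketch of what buffering buys you.  It is not how &lt;code&gt;pyes&lt;/code&gt; is implemented; &lt;code&gt;BulkBuffer&lt;/code&gt; and its &lt;code&gt;send&lt;/code&gt; callable are hypothetical names used only for illustration:&lt;/p&gt;

```python
class BulkBuffer:
    """Hypothetical helper: collect documents, send them as one batch."""

    def __init__(self, send):
        self.send = send      # callable that takes a list of (id, doc) pairs
        self.pending = []

    def index(self, doc, doc_id):
        # Nothing goes over the network here; the document is only queued.
        self.pending.append((doc_id, doc))

    def flush(self):
        # One call to send() for everything queued so far.
        if self.pending:
            self.send(self.pending)
            self.pending = []


calls = []                    # stands in for the network
buf = BulkBuffer(calls.append)
for doc_id in (1, 2, 3, 4):
    buf.index({"name": "example"}, doc_id)
buf.flush()
print(len(calls), len(calls[0]))  # 1 4
```

&lt;p&gt;Four documents are queued, but the stand-in for the network is called once.&lt;/p&gt;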
</content>
 </entry>
 
 <entry>
   <title>Installing ElasticSearch plugins</title>
   <link href="http://davedash.com/2011/02/24/installing-elasticsearch-plugins/"/>
   <updated>2011-02-24T00:00:00-08:00</updated>
   <id>http://davedash.com/2011/02/24/installing-elasticsearch-plugins</id>
   <content type="html">&lt;p&gt;I'm slowly trying to familiarize myself with ElasticSearch and the &lt;code&gt;pyes&lt;/code&gt;
Python interface.  ElasticSearch uses a lot of plugins, and while the plugin
system is easy to use, it's not obvious where to find the plugins.&lt;/p&gt;

&lt;p&gt;They are &lt;a href=&quot;http://elasticsearch.googlecode.com/svn/plugins/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you want to install the attachments plugin, you can do:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;bin/plugin install mapper-attachments
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And voilà, it's installed.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Counting Sphinx groupBy Queries</title>
   <link href="http://davedash.com/2010/10/15/counting-sphinx-groupby-queries/"/>
   <updated>2010-10-15T00:00:00-07:00</updated>
   <id>http://davedash.com/2010/10/15/counting-sphinx-groupby-queries</id>
   <content type="html">&lt;p&gt;I quickly implemented Sphinx on Input, while revisiting it, I saw that we try
to answer this type of question:&lt;/p&gt;

&lt;blockquote&gt;&lt;p&gt;Of the results displayed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many are happy and how many are sad?&lt;/li&gt;
&lt;li&gt;How many are for Windows, Linux or Mac?&lt;/li&gt;
&lt;li&gt;How many are for English, French or Japanese?&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finding these involves faceted search.  Unfortunately this is a bit
awkward to do using Sphinx.  For the first example, happy or sad, you would have
to run the query like so:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take the query, remove any filters on &lt;em&gt;happiness&lt;/em&gt; and do a group by on
happy opinions&lt;/li&gt;
&lt;li&gt;Restore any filters on happiness and run the query as normal.&lt;/li&gt;
&lt;li&gt;Return both the results, and the aggregate data from step 1.&lt;/li&gt;
&lt;/ol&gt;


&lt;p&gt;Doing the group by is easy, but you only learn how many distinct feelings there
are and what they are.  In our case: happy and sad.  What we really want is
how many results of our original search were happy and how many were sad.&lt;/p&gt;
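&lt;p&gt;As a rough illustration of the three steps above, here is a plain Python sketch over an in-memory list.  It does not use Sphinx at all; the sample data, field names and function names are made up:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical sample data; field names are illustrative only.
OPINIONS = [
    {"text": "love the new tabs", "feeling": "happy", "os": "Windows"},
    {"text": "tabs crash a lot", "feeling": "sad", "os": "Windows"},
    {"text": "tabs are fast", "feeling": "happy", "os": "Linux"},
    {"text": "bookmarks are lost", "feeling": "sad", "os": "Mac"},
]


def search(term, filters):
    """Return opinions containing term that match every filter."""
    return [
        o for o in OPINIONS
        if term in o["text"]
        and all(o[field] == value for field, value in filters.items())
    ]


def faceted_search(term, filters, facet):
    # Step 1: drop any filter on the facet and count per facet value.
    unfiltered = {k: v for k, v in filters.items() if k != facet}
    counts = Counter(o[facet] for o in search(term, unfiltered))
    # Step 2: run the query as normal, with all filters restored.
    results = search(term, filters)
    # Step 3: return both the results and the aggregate data.
    return results, dict(counts)


results, counts = faceted_search("tabs", {"feeling": "happy"}, "feeling")
print(len(results), counts)  # 2 {'happy': 2, 'sad': 1}
```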

&lt;p&gt;I assumed something like this would work:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sphinx.SetSelect('feeling, @count')
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;code&gt;@count&lt;/code&gt; is one of those magic variables that Sphinx uses.  Unfortunately this
doesn't work.  &lt;code&gt;COUNT(*)&lt;/code&gt; doesn't work either.  Here's what did:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sphinx.SetSelect('feeling, SUM(1) AS count')
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Not the straightforward MySQL-ish syntax I've come to expect from Sphinx, but
it works.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>The Python textcluster Package</title>
   <link href="http://davedash.com/2010/07/08/the-python-textcluster-package/"/>
   <updated>2010-07-08T00:00:00-07:00</updated>
   <id>http://davedash.com/2010/07/08/the-python-textcluster-package</id>
   <content type="html">&lt;p&gt;Earlier I wrote about &lt;a href=&quot;http://davedash.com/2010/03/18/finding-the-most-common-firefox-issues/&quot;&gt;finding the most common Firefox issues&lt;/a&gt;.  I had
wanted to automate that process and continually find these issues.
Unfortunately I never had time to do this.&lt;/p&gt;

&lt;p&gt;When they announced &lt;a href=&quot;http://aakash.doesthings.com/2010/06/25/hi-my-name-is-firefox-input/&quot;&gt;Firefox Input&lt;/a&gt;, I thought about doing this again...
just with Firefox Input data but then I went on paternity leave and time kind
of crept away.  But I mentioned the idea this week and it piqued some interest.&lt;/p&gt;

&lt;p&gt;So I found myself with a bit of time to work on it.  The first stage was
releasing a Python library called &lt;a href=&quot;http://github.com/davedash/textcluster&quot;&gt;&lt;code&gt;textcluster&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://github.com/davedash/textcluster&quot;&gt;&lt;code&gt;textcluster&lt;/code&gt;&lt;/a&gt; takes the &lt;a href=&quot;http://davedash.com/2010/03/18/finding-the-most-common-firefox-issues/&quot;&gt;work I did earlier&lt;/a&gt; and makes it a bit more
general purpose.  The idea is I can do something like this:&lt;/p&gt;

&lt;div class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;docs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&amp;#39;Every good boy does fine.&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&amp;#39;Every good girl does well.&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&amp;#39;Cats eat rats.&amp;#39;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&amp;quot;Rats don&amp;#39;t sleep.&amp;quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Corpus&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;docs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;doc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cluster&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;


&lt;p&gt;Which results in:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;[
    (
        &quot;Rats don't sleep.&quot;,
        {'Cats eat rats.': 0.21353467285253394}
    ),
    (
        'Every good girl does well.',
        {'Every good boy does fine.': 0.32030200927880093}
    )
]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The number is the &quot;similarity&quot; between the strings relative to the entire
document corpus.&lt;/p&gt;

&lt;p&gt;My next trick is to see if I can run this memory-intensive calculation over a
data-set of 25,000 opinions submitted.  If I can, we can get some interesting
data about what people think of the new &lt;a href=&quot;http://www.mozilla.com/en-US/firefox/all-beta.html&quot;&gt;Firefox beta&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Firefox Input, powered by Sphinx</title>
   <link href="http://davedash.com/2010/07/06/firefox-input%2C-powered-by-sphinx/"/>
   <updated>2010-07-06T00:00:00-07:00</updated>
   <id>http://davedash.com/2010/07/06/firefox-input,-powered-by-sphinx</id>
   <content type="html">&lt;p&gt;Thursday, I decided to take a half-day for my sanity, but saw an email about
how Whoosh wasn't going to cut it for &lt;a href=&quot;http://aakash.doesthings.com/2010/06/25/hi-my-name-is-firefox-input/&quot;&gt;Firefox Input&lt;/a&gt;.  I was CC'd about
this and there was mention that Sphinx might be possible.&lt;/p&gt;

&lt;p&gt;Sphinx is my hammer, and everything is a nail.  So I said, let's do this.
That translated into me spending my weekend, soothing &lt;a href=&quot;/tag/baby&quot;&gt;my newborn&lt;/a&gt; and
working on Sphinx.  Luckily this was easy, since &lt;a href=&quot;https://addons.mozilla.org/en-US/firefox/&quot;&gt;AMO&lt;/a&gt; and &lt;a href=&quot;http://support.mozilla.com/en-US/kb/&quot;&gt;SUMO&lt;/a&gt;
are both running Sphinx in a similar &lt;a href=&quot;http://fredericiana.com/2010/06/23/under-the-hood-of-firefox-input/&quot;&gt;Django environment&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In order to move quickly, I copied code from the &lt;a href=&quot;http://github.com/jbalogh/zamboni/&quot;&gt;Zamboni&lt;/a&gt; project to
&lt;a href=&quot;http://github.com/fwenzel/reporter&quot;&gt;Firefox Input&lt;/a&gt;.  Even our deployment into staging and production wasn't
done by our usual &quot;Sphinx guy&quot; in IT.  Ultimately, everything landed in place.&lt;/p&gt;

&lt;p&gt;So &lt;a href=&quot;http://input.mozilla.com/&quot;&gt;try it out&lt;/a&gt; and file bugs or let me know if searches don't go as
planned.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Alphabetical sorting in Sphinx</title>
   <link href="http://davedash.com/2010/04/21/alphabetical-sorting-in-sphinx/"/>
   <updated>2010-04-21T00:00:00-07:00</updated>
   <id>http://davedash.com/2010/04/21/alphabetical-sorting-in-sphinx</id>
   <content type="html">&lt;p&gt;Sphinx 0.9.9 is great at searching full text, but treating actual strings as attributes takes some work.&lt;/p&gt;

&lt;p&gt;Initially I employed the strategy of indexing my full text fields &lt;em&gt;and&lt;/em&gt; storing them as attributes.  E.g.:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sql_query = SELECT name, name AS name_ord FROM documents
sql_attr_str2ordinal = name_ord
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This stores each attribute in lexical order.  Meaning if your names are Apple, Aardvark, Button, Choco-room they would be given the ordinals 2, 1, 3, 4 respectively.&lt;/p&gt;

&lt;p&gt;However, this is case-sensitive.  So trying this approach:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sql_query = SELECT name, UPPER(name) AS name_ord FROM documents
sql_attr_str2ordinal = name_ord
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This will allow for case-insensitive alphabetical sorting in Sphinx.&lt;/p&gt;
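&lt;p&gt;A small Python sketch of the ordinal idea, for illustration only.  Python's default string comparison stands in for whatever ordering Sphinx applies, and &lt;code&gt;ordinals&lt;/code&gt; is a hypothetical helper, not part of Sphinx:&lt;/p&gt;

```python
def ordinals(names):
    """Map each name to its 1-based rank in sorted order."""
    ranked = sorted(set(names))
    position = {name: i for i, name in enumerate(ranked, start=1)}
    return [position[name] for name in names]


names = ["Apple", "Aardvark", "Button", "Choco-room"]
print(ordinals(names))  # [2, 1, 3, 4]

# With mixed case, uppercase letters sort before lowercase ones.
mixed = ["apple", "Button", "aardvark"]
case_sensitive = ordinals(mixed)
print(case_sensitive)  # [3, 1, 2]

# Upper-casing first, as UPPER(name) does in the query, fixes the order.
case_insensitive = ordinals([name.upper() for name in mixed])
print(case_insensitive)  # [2, 3, 1]
```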
</content>
 </entry>
 
 <entry>
   <title>Finding the most common Firefox issues</title>
   <link href="http://davedash.com/2010/03/18/finding-the-most-common-firefox-issues/"/>
   <updated>2010-03-18T00:00:00-07:00</updated>
   <id>http://davedash.com/2010/03/18/finding-the-most-common-firefox-issues</id>
   <content type="html">&lt;p&gt;Cheng Wang of the Mozilla Support team, a few months back, decided to present on some design ideas for &lt;a href=&quot;http://support.mozilla.com/en-US/kb/&quot;&gt;Firefox Support&lt;/a&gt;.  One of the issues he noted was that there are a lot of repeated issues and that it would be useful to group them.  Grouping them lets you see how often something occurs, and secondly let's you see how urgent it might be.&lt;/p&gt;

&lt;p&gt;Luckily grouping and clustering text is something computers can do.  So I wrote &lt;a href=&quot;http://github.com/davedash/SUMO-issues&quot;&gt;this utility&lt;/a&gt; that does just that.&lt;/p&gt;

&lt;p&gt;I ran this script over a sampling of data from the last week:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firefox won't start after update. (65 related issues)

&lt;ul&gt;
&lt;li&gt;5.6:  Firefox updated, Gmail not delivering mails&lt;/li&gt;
&lt;li&gt;5.6:  How to change My Profile when Firefox won't load?&lt;/li&gt;
&lt;li&gt;7.5:  Once I close firefox, cannot start firefox again except system restart&lt;/li&gt;
&lt;li&gt;5.6:  When intalling updates Firefox uninstalls itself&lt;/li&gt;
&lt;li&gt;16.8:  firefox won't start after update 3.6&lt;/li&gt;
&lt;li&gt;11.2:  Upgraded to Firefox 3.6 and now it won't start&lt;/li&gt;
&lt;li&gt;14.9:  Firefox won't start with most extensions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;How do I add a bookmark to more than one folder? (64 related issues)

&lt;ul&gt;
&lt;li&gt;8.9:  How do I get my bookmarks on the bookmarks toolbar to show up as an icon only with no text?&lt;/li&gt;
&lt;li&gt;7.5:  Bookmarks lost after upgrade and cannot save new bookmarks&lt;/li&gt;
&lt;li&gt;7.5:  why do i have to add the .com now to addy's?&lt;/li&gt;
&lt;li&gt;8.7:  When I open sidebar to edit bookmarks, I only see the folder for Bookmarks Toolbar. I do not see a folder just called Bookmarks nor do I see my list of bookmarks, that separately appear under bookmarks menu at top of screen&lt;/li&gt;
&lt;li&gt;7.5:  All my impoted bookmarks go to the same webpage&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;How do I remove the &quot;ask toolbar&quot;? (50 related issues)

&lt;ul&gt;
&lt;li&gt;14.9:  How do I remove an unwanted toolbar?&lt;/li&gt;
&lt;li&gt;5.6:  how to remove temporary video files from computer&lt;/li&gt;
&lt;li&gt;7.5:  I have no Toolbars or searchbar and i cant bring them back&lt;/li&gt;
&lt;li&gt;7.5:  nowhere says how to REMOVE a toolbar - only how to add or modify one&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;not able to open youtube videos (45 related issues)

&lt;ul&gt;
&lt;li&gt;5.6:  Cannot open bookmark/history sidebar&lt;/li&gt;
&lt;li&gt;5.6:  After working well for years Firefox will now not open&lt;/li&gt;
&lt;li&gt;6.7:  opening bookmarks do not open in new tab&lt;/li&gt;
&lt;li&gt;5.6:  I can't watch videos on youtube with firefox, but on internet explorer i can&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;I cannot download Firefox 3.6.  I've tried erasing the download file.  I cannot get beyond logging out of Firefox. (44 related issues)

&lt;ul&gt;
&lt;li&gt;8.4:  when downloading files firefox download manager will freeze and i will have to start over the file download&lt;/li&gt;
&lt;li&gt;5.6:  Firefox will not let me download anything! Can someone help?&lt;/li&gt;
&lt;li&gt;6.3:  cannot download epixHD.com: not compatible with firefox 3.6&lt;/li&gt;
&lt;li&gt;5.0:  Several tabs are coming up when i try to downloads things&lt;/li&gt;
&lt;li&gt;5.0:  Firefox wont open since I downloaded the 3.6 update.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;The number to the left of each related issue is a score of how strongly it relates to the main issue.&lt;/p&gt;

&lt;p&gt;The full sample is 352 clusters from an original 3000+ issues.  That's a lot less stuff to go through.  We can tune this to have either fewer clusters with more related issues in each, or more clusters of issues, which might result in more accuracy.&lt;/p&gt;

&lt;p&gt;Despite the inaccuracy of clustering we can make some general observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Firefox not starting is a big issue.&lt;/li&gt;
&lt;li&gt;Bookmarks are either confusing or broken.&lt;/li&gt;
&lt;li&gt;People don't like toolbars&lt;/li&gt;
&lt;li&gt;Opening things is hard&lt;/li&gt;
&lt;li&gt;Downloading things or Firefox is hard&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Hopefully we can fine tune these reports and have them run regularly... maybe automatically posting to Tumblr?&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>AMO Search: Powered by Sphinx</title>
   <link href="http://davedash.com/2009/09/30/amo-search-powered-by-sphinx/"/>
   <updated>2009-09-30T00:00:00-07:00</updated>
   <id>http://davedash.com/2009/09/30/amo-search-powered-by-sphinx</id>
   <content type="html">&lt;p&gt;Last night, I gave a talk at the &lt;a href=&quot;https://wiki.mozilla.org/AddonMeetups:2009:Chicago&quot;&gt;Addons Meetup&lt;/a&gt; at Threadless HQ in Chicago on the new search engine powering &lt;a href=&quot;http://addons.mozilla.org/&quot;&gt;addons.mozilla.org&lt;/a&gt;.  I'll recap the technical portion of the talk and give a bit more details.&lt;/p&gt;

&lt;p&gt;First, I'd like to thank Harper and Threadless.  It was a great location in the greatest city in the universe.  Before and after the meetup, Harper was just an all-around great guy to hang with and the threadless headquarters was a nice hangout place for meeting people interested in addons.&lt;/p&gt;

&lt;p&gt;Shortly after my talk, our Engineering Ops team deployed the new AMO 5.1 complete with a new Sphinx powered search engine.&lt;/p&gt;

&lt;p&gt;So let's talk about search.  Note: parts of this are a rehash of my talk, so feel free to skip around.&lt;/p&gt;

&lt;!--more--&gt;


&lt;h3&gt;A bit about addons&lt;/h3&gt;

&lt;p&gt;Addons is a huge growing space.  Arguably it's Mozilla's best kept secret.  Sure readers of this blog probably know what Addons are, but ask people who aren't as web-savvy.  Most people don't know what a browser is - and it's hard to explain it to people without getting technical.&lt;/p&gt;

&lt;p&gt;We can just skip that step.  Because Addons are small things that people can easily &quot;get&quot;.&lt;/p&gt;

&lt;p&gt;&quot;It's an easy way to customize the internet when you're surfing.&quot;&lt;/p&gt;

&lt;p&gt;While perhaps not technically correct, it's one way of explaining it to people.  Maybe a better way is just showing people what they can do with addons.&lt;/p&gt;

&lt;p&gt;On my flight out to Chicago, I talked to a person on the plane who didn't know what a browser was, but after showing her &lt;a href=&quot;http://addons.mozilla.org/&quot;&gt;AMO&lt;/a&gt; she was really intrigued.&lt;/p&gt;

&lt;p&gt;If everyday non-technical people can realize the potential of addons, it's only a matter of time before they start knocking down the doors to AMO.&lt;/p&gt;

&lt;p&gt;So we better be prepared to handle them, and get them what they want.&lt;/p&gt;

&lt;h3&gt;The technical details of addons.mozilla.org&lt;/h3&gt;

&lt;p&gt;Every time you open Firefox, it pings &lt;a href=&quot;http://addons.mozilla.org/&quot;&gt;AMO&lt;/a&gt; to see if there are any updates to any of the addons that happen to be installed.  Over a third of the people using Firefox have at least one addon, and Firefox is roughly 22% of the browser market.  That means roughly 7% of people opening their browsers (a third of 22%) are pinging our servers for updates.&lt;/p&gt;

&lt;p&gt;Needless to say it's a lot of traffic, and to support it we need a fair amount of hardware.  AMO is clearly the largest site in the Mozilla universe in both respects.&lt;/p&gt;

&lt;p&gt;Some stats:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 MySQL master&lt;/li&gt;
&lt;li&gt;4 MySQL slaves&lt;/li&gt;
&lt;li&gt;2 memcached servers&lt;/li&gt;
&lt;li&gt;2 Sphinx indexer/search daemons&lt;/li&gt;
&lt;li&gt;24 web frontends&lt;/li&gt;
&lt;li&gt;Multiple Zeus ZXTM clusters all&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Most of this is standard.  We'll talk about Sphinx later, but Zeus is amazing.  I didn't know what Zeus was until earlier this year when I interviewed with Mozilla's VP of Engineering Operations.  All our requests get cached, so most of our hits actually hit our Zeus cluster and not our web servers.&lt;/p&gt;

&lt;p&gt;To see just how amazing they are, read &lt;a href=&quot;http://blog.mozilla.com/mrz/&quot;&gt;mrz's ops blog&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Why search matters&lt;/h3&gt;

&lt;p&gt;If you have any kind of custom content and unique meta data a custom search solution is a must.  Browsing through a site isn't going to cut it.  Browsing is dead.  Search is how you find things on a web site.  On &lt;a href=&quot;http://addons.mozilla.org/&quot;&gt;AMO&lt;/a&gt; you may see an addon that's featured somewhere, or you might want to see what's out there, but the right search query will find you the right addon in two clicks.&lt;/p&gt;

&lt;h3&gt;Improve Search&lt;/h3&gt;

&lt;p&gt;So my first job on AMO was to &lt;a href=&quot;https://bugzilla.mozilla.org/show_bug.cgi?id=498999&quot;&gt;improve addons search&lt;/a&gt;.  It was a vague request, born out of frustration with what we had.  It wasn't one specific problem, such as certain things not being indexed, or unicode not working, or results not being sorted.  We may have had all those problems, but as a product, search needed to be replaced.&lt;/p&gt;

&lt;p&gt;To me it meant that we needed some framework that would allow developers to quickly debug and fix any future search calamities at a moment's notice.&lt;/p&gt;

&lt;p&gt;So here were the goals I made for myself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Do something that sucks less than what we’ve got&lt;/li&gt;
&lt;li&gt;Do something that makes it easier to suck less in the future&lt;/li&gt;
&lt;li&gt;Do something that’s easy to use for our operations team, web developers and most importantly, end-users&lt;/li&gt;
&lt;li&gt;Reduce strain on our databases, developers and operations teams&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;Complex Data&lt;/h3&gt;

&lt;p&gt;Our data set is small (we have 5,000 addons), but there's a lot of secondary meta data about the addons that we track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Addons work in 1 or more locales (e.g. en-US, fr, de, etc)&lt;/li&gt;
&lt;li&gt;Addons are optionally platform specific (Linux, OS X, etc)&lt;/li&gt;
&lt;li&gt;Addons work with one or more products (Firefox, Thunderbird, Seamonkey, Sunbird or Fennec)&lt;/li&gt;
&lt;li&gt;Addons come in multiple flavors (extensions, themes, dictionaries and more)&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;We want to index all this data.  Unfortunately, getting at much of this data involves either numerous queries or numerous joins, both of which put a strain on MySQL.  How much strain?&lt;/p&gt;

&lt;p&gt;At peak we get about 10 search queries per second.  If we do something smarter this won't have to cause a lot of strain.&lt;/p&gt;

&lt;h3&gt;Using Sphinx&lt;/h3&gt;

&lt;p&gt;Sphinx is an open source search indexer and daemon.  It's used by Craigslist, the Pirate Bay and &lt;a href=&quot;http://support.mozilla.com&quot;&gt;Mozilla Support&lt;/a&gt;.  It was very easy to use and despite a complicated set of data and business logic, Sphinx was up to the task.&lt;/p&gt;

&lt;h3&gt;The challenges&lt;/h3&gt;

&lt;p&gt;We needed to search for addons in several languages.  So indexing just addons wouldn't work, we need to make sure we have every translation of every addon indexed.  For those counting, we have 5,000 addons, but 18,000 translations of addons.&lt;/p&gt;

&lt;p&gt;All the joining and filtering that needed to be done for our old search still needs to be done, but we can do this all in one shot by using a mysql view.  This view is a flat list of each translated addon as well as all meta data associated with it.  This then gets fed into the sphinx indexer.&lt;/p&gt;

&lt;p&gt;Along the way we ran into some issues which used to be dealt with outside of mysql, such as comparing versions.  It was gross and quite a hack, so we turned the variety of &lt;a href=&quot;http://spindrop.us/2009/08/07/v-is-for-version-hell/&quot;&gt;acceptable version strings into integers&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We also learned that stemming wasn't as good an idea as we assumed it would be.  Stemming was great for searching through lots of text, but a great deal of addon searches were really just searches for product names, so we opted for substring searches.  We'll see how that fares.  There is probably room for improvement.&lt;/p&gt;

&lt;p&gt;Much of this, however, involved knowing our data, and knowing how it will be used by our users.  Once we got that down, we could hammer it all out using Sphinx.&lt;/p&gt;

&lt;h3&gt;Wins&lt;/h3&gt;

&lt;p&gt;So Sphinx gains us a bit architecturally.  We have a complicated query, but it only gets run once every 5 minutes versus the 180,000 times it was run &quot;on demand.&quot;&lt;/p&gt;

&lt;p&gt;Indexing happens rather quickly, just over a minute.&lt;/p&gt;

&lt;p&gt;The API was a breeze to work with, and was easy to drop into our own codebase.&lt;/p&gt;

&lt;p&gt;Because of our relatively small data set, and quick indexing, we're able to scale this simply by cloning and load balancing.  That means we only need to scale for traffic; addon growth, which is slower than traffic growth, is something we can safely not worry about for a while.&lt;/p&gt;

&lt;p&gt;Our ops team can monitor the sphinx clusters and just deploy additional nodes as needed.&lt;/p&gt;

&lt;h3&gt;Building a platform&lt;/h3&gt;

&lt;p&gt;What we've done is build a foundation for search.  Not all the problems are gone, but a lot of the problems that our QA team finds can be resolved quickly.  We have a nice pile of unit tests as well that help us keep our results in check when we start tweaking dials.&lt;/p&gt;

&lt;p&gt;We even have the groundwork for some nifty advanced search syntax, that hopefully we can inject into future releases of AMO.&lt;/p&gt;

&lt;p&gt;Enjoy.  And if you find anything, &lt;a href=&quot;http://bit.ly/search-bugs&quot;&gt;let me know&lt;/a&gt;.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>V is for Version Hell</title>
   <link href="http://davedash.com/2009/08/07/v-is-for-version-hell/"/>
   <updated>2009-08-07T00:00:00-07:00</updated>
   <id>http://davedash.com/2009/08/07/v-is-for-version-hell</id>
   <content type="html">&lt;p&gt;Versioning is quite difficult to deal with.  Versions are nearly-numbers, but
you can't quite sort them using standard numerical algorithms.&lt;/p&gt;

&lt;p&gt;While the following is true:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;1.1 &amp;lt; 1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The following is also true:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;1.2 &amp;lt; 1.18 &amp;lt; 1.20
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The &quot;.&quot; is not a decimal point but a separator.&lt;/p&gt;
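
&lt;p&gt;A minimal Python sketch of that distinction (the helper name &lt;code&gt;split_version&lt;/code&gt; is made up for illustration and only handles purely numeric versions): treating each dot-separated part as its own integer gives the ordering above, whereas parsing the whole string as a decimal cannot even tell 1.2 and 1.20 apart.&lt;/p&gt;

```python
def split_version(version):
    """Split a purely numeric version such as "1.18" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


versions = ["1.20", "1.2", "1.18"]

# Tuples compare part by part, so 2 sorts before 18, which sorts before 20.
by_parts = sorted(versions, key=split_version)
print(by_parts)

# Read as decimals, "1.2" and "1.20" collapse into the same number.
print(float("1.2") == float("1.20"))
```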

&lt;p&gt;Mozilla uses a modestly complicated &lt;a href=&quot;https://developer.mozilla.org/en/Toolkit_version_format&quot;&gt;versioning system&lt;/a&gt; that involves stars,
plusses, and sometimes &quot;x&quot;.&lt;/p&gt;

&lt;p&gt;I found a very convoluted way to translate these versions into large integers.
The versions for applications in the AMO database have four parts at most.  They
are potentially alpha or beta and potentially a pre-release.  In some cases we
have multiple versions represented with &lt;code&gt;.*&lt;/code&gt;, &lt;code&gt;.x&lt;/code&gt; or &lt;code&gt;+&lt;/code&gt; at the end.&lt;/p&gt;

&lt;!--more--&gt;


&lt;p&gt;The &lt;a href=&quot;https://developer.mozilla.org/en/Toolkit_version_format&quot;&gt;Toolkit docs&lt;/a&gt; let us translate &quot;+&quot; to mean &quot;pre-release of the next
version&quot;.  E.g. 1.0+ is 1.1pre0.  Since my primary purpose of all this is for
sorting, &lt;code&gt;.*&lt;/code&gt; and &lt;code&gt;.x&lt;/code&gt; may as well just be a very large &quot;version part.&quot;  Since
all the version parts I deal with are at most two digits, I turned &lt;code&gt;.*&lt;/code&gt; and
&lt;code&gt;.x&lt;/code&gt; into &lt;code&gt;.99&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;3.5.* =&amp;gt; '03'+'05'+'99' =&amp;gt; 030599
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We also need to deal with versions that may be alpha, beta or neither.  If
everything else is equal:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;3.5a &amp;lt; 3.5a5 &amp;lt; 3.5b &amp;lt; 3.5b2 &amp;lt; 3.5 &amp;lt; 3.5+
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We assign a single integer to represent a version's &quot;non-alphaness&quot;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;a =&amp;gt; 0
b =&amp;gt; 1
non alpha/beta =&amp;gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We assume that &lt;code&gt;3.5a = 3.5a1&lt;/code&gt;.  Therefore:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;3.5a =&amp;gt; 3.5.0a1 =&amp;gt; '03'+'05'+'00'+'0'+'01' =&amp;gt; 030500001
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Similarly, we assign a 0 (pre-release) or 1 (not a pre-release) to represent
&quot;non-pre-releaseness&quot;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;3.5a pre2 =&amp;gt; 3.5.0a1pre2
=&amp;gt; '03'+'05'+'00'+'0'+'01'+'0'+'02'
=&amp;gt; 030500001002
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;So what does this get us?  Integers which we can use for comparison, sorting,
etc.  It's a one-time calculation for each version, and we can do some nice SQL
statements in AMO like:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;mysql&amp;gt; SELECT version,version_int FROM appversions WHERE application_id = 1 ORDER BY version_int LIMIT 15;
+---------+--------------+
| version | version_int  |
+---------+--------------+
| 0.3     |  30000200100 |
| 0.6     |  60000200100 |
| 0.7     |  70000200100 |
| 0.7+    |  80000200000 |
| 0.8     |  80000200100 |
| 0.8+    |  90000200000 |
| 0.9     |  90000200100 |
| 0.9.0+  |  90100200000 |
| 0.9.1+  |  90200200000 |
| 0.9.2+  |  90300200000 |
| 0.9.3   |  90300200100 |
| 0.9.3+  |  90400200000 |
| 0.9.x   |  99900200100 |
| 0.9+    | 100000200000 |
| 0.10    | 100000200100 |
+---------+--------------+
15 rows in set (0.00 sec)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;I can now index these integers using Sphinx and do some very easy searches for
addons based on version number.&lt;/p&gt;
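
&lt;p&gt;For readers who want to experiment, here is a rough Python sketch of the conversion.  It is not the code AMO runs: the regular expression and the name &lt;code&gt;version_int&lt;/code&gt; are made up for illustration, and the field widths (four two-digit parts, one alpha/beta digit, a two-digit alpha/beta number, one pre-release digit, a two-digit pre-release number) are inferred from the &lt;code&gt;appversions&lt;/code&gt; rows shown above.  It assumes every part fits in two digits and treats a trailing &quot;+&quot; as a pre-release of the next version.&lt;/p&gt;

```python
import re

# Hypothetical pattern: up to four dotted parts (digits, "x" or "*"), an
# optional a/b tag with a number, an optional "pre" tag with a number, and
# an optional trailing "+".
VERSION_RE = re.compile(
    r"^(\d+|[x*])(?:\.(\d+|[x*]))?(?:\.(\d+|[x*]))?(?:\.(\d+|[x*]))?"
    r"(?:(a|b)(\d*))?\s*(?:pre(\d*))?(\+)?$"
)


def version_int(version):
    """Turn a version string into a sortable integer (illustrative only)."""
    match = VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError("unrecognised version: %r" % version)
    *parts, alpha, alpha_num, pre_num, plus = match.groups()

    given = len([p for p in parts if p is not None])
    # Wildcards become 99; parts that were left out become 0.
    numbers = [99 if p in ("x", "*") else int(p or 0) for p in parts]

    is_pre = pre_num is not None
    if plus:
        # "1.0+" is read as "1.1pre0": bump the last part that was given.
        numbers[given - 1] += 1
        is_pre = True

    if alpha is None:
        alpha_digit, alpha_value = 2, 0       # neither alpha nor beta
    else:
        alpha_digit = {"a": 0, "b": 1}[alpha]
        alpha_value = int(alpha_num or 1)     # "3.5a" is treated as "3.5a1"

    pre_digit = 0 if is_pre else 1            # pre-releases sort first
    pre_value = int(pre_num or 0)

    digits = "%02d%02d%02d%02d" % tuple(numbers)
    digits += "%d%02d%d%02d" % (alpha_digit, alpha_value, pre_digit, pre_value)
    return int(digits)


for v in sorted(["0.10", "0.9", "0.9+", "0.9.x"], key=version_int):
    print(v, version_int(v))
```

&lt;p&gt;Sorting with &lt;code&gt;version_int&lt;/code&gt; as the key puts these four versions in the same order as the query result above: 0.9, 0.9.x, 0.9+, 0.10.&lt;/p&gt;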
</content>
 </entry>
 
 <entry>
   <title>Question: Building a Better Search Engine</title>
   <link href="http://davedash.com/2009/06/18/question-building-a-better-search-engine/"/>
   <updated>2009-06-18T00:00:00-07:00</updated>
   <id>http://davedash.com/2009/06/18/question-building-a-better-search-engine</id>
   <content type="html">&lt;p&gt;So I finally have one of those jobs where I can tell people almost every little detail about what I'm doing and I'm encouraged to talk to people on the intar-webs and solicit opinions.&lt;/p&gt;

&lt;p&gt;Uh - this is more or less how I've operated at previous jobs, just now I can be overt about it.&lt;/p&gt;

&lt;p&gt;So my &lt;a href=&quot;https://bugzilla.mozilla.org/show_bug.cgi?id=498999&quot;&gt;new task&lt;/a&gt; is to work on improving the &lt;a href=&quot;http://addons.mozilla.org&quot;&gt;addons.mozilla.org&lt;/a&gt; search engine.  I've built various &quot;search engines&quot; over time: in PHP, powered by Lucene, and most recently in Python using an inverted index.&lt;/p&gt;

&lt;p&gt;One tool that I've been looking at briefly is &lt;a href=&quot;http://sphinxsearch.com/&quot;&gt;Sphinx&lt;/a&gt;.  While my record count is low (5-10K), Sphinx basically bakes in a lot of the things I would want in a search engine.  Indexing, merging, etc.&lt;/p&gt;

&lt;p&gt;Since I'm fairly new to the add-ons team I'm still understanding the basics of what we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fast automated indexing of addons for Firefox, Thunderbird and any other Mozilla product&lt;/li&gt;
&lt;li&gt;Quick result sets&lt;/li&gt;
&lt;li&gt;Easy deployability&lt;/li&gt;
&lt;li&gt;Extendible&lt;/li&gt;
&lt;li&gt;Customized ranking&lt;/li&gt;
&lt;li&gt;Filtering (e.g. by Firefox version)&lt;/li&gt;
&lt;li&gt;Basics: Stemming and stop-words&lt;/li&gt;
&lt;/ul&gt;


&lt;p&gt;Whether it's Sphinx, Lucene or some home-grown solution, I have all that to support.  But this should be fairly straightforward.  What are people's thoughts?&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Boosting terms in  Zend Search Lucene</title>
   <link href="http://davedash.com/2007/05/29/boosting-terms-in-zend-search-lucene/"/>
   <updated>2007-05-29T00:00:00-07:00</updated>
   <id>http://davedash.com/2007/05/29/boosting-terms-in-zend-search-lucene</id>
   <content type="html">

&lt;h3&gt;Boosting terms &amp;mdash; some fields are better than others&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;Lucene&lt;/a&gt; supports boosting or weighting terms.  For example, if I search for members of a web site, and I type in &lt;q&gt;Dash&lt;/q&gt;, I want people with the name &lt;q&gt;Dash&lt;/q&gt; to take precedence over somebody who has a hobby of running the 50-yard Dash.&lt;/p&gt;

&lt;p&gt;In the &lt;code&gt;generateZSLDocument()&lt;/code&gt; method we defined earlier, we just need to adjust a few lines:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;

        $doc-&gt;addField(Zend_Search_Lucene_Field::Text('firstname', $this-&gt;getFirstname()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Text('lastname', $this-&gt;getLastname()));
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;Should be turned into:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;

        $field = Zend_Search_Lucene_Field::Text('firstname', $this-&gt;getFirstname());
        $field-&gt;boost = 1.5;
        $doc-&gt;addField($field);
        $field = Zend_Search_Lucene_Field::Text('lastname', $this-&gt;getLastname());
        $field-&gt;boost = 1.5;
        $doc-&gt;addField($field);

&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;This is a pretty straightforward way to add weight (1.5 times the weight of a normal term), and you can customize it to the needs of your site.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Finding things using Zend Search Lucene in symfony</title>
   <link href="http://davedash.com/2007/05/23/finding-things-using-zend-search-lucene-in-symfony/"/>
   <updated>2007-05-23T00:00:00-07:00</updated>
   <id>http://davedash.com/2007/05/23/finding-things-using-zend-search-lucene-in-symfony</id>
   <content type="html">

&lt;p&gt;&lt;span class=&quot;notice&quot;&gt;This is part of an &lt;a href=&quot;http://spindrop.us/tag/zsl&quot;&gt;ongoing series&lt;/a&gt; about the Zend Search Lucene libraries and symfony.  We'll pretty everything up when we're done =)&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;We now know how to &lt;a href=&quot;http://spindrop.us/2007/04/24/creating-updating-deleting-documents-in-a-lucene-index-with-symfony/&quot;&gt;manipulate the index via our model classes&lt;/a&gt;.  But let's actually do something useful with our search engine... let's search!&lt;/p&gt;

&lt;!--more--&gt;



&lt;p&gt;At the time of this writing we're dealing with Propel, which uses &lt;code&gt;Peer&lt;/code&gt; classes meant for dealing with multiple objects&lt;sup id=&quot;#fnr_1&quot;&gt;&lt;a href=&quot;#fn_1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.  This is the perfect place for a &lt;code&gt;::search()&lt;/code&gt; method.  In other words, &lt;code&gt;UserPeer::search('dave');&lt;/code&gt; should query Lucene for users matching &quot;dave&quot;.  Let's make that happen:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;

    public static function search($query)
    {
        $index = self::getLuceneIndex();
        
        $hits = $index-&gt;find(strtolower($query));
        $pks = array();
    
        foreach($hits AS $hit)
        {
            // 'uid' is the keyword field stored by generateZSLDocument()
            $pks[] = $hit-&gt;uid;
        }
        
        return self::retrieveByPks($pks);
    }

&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;What we're doing is retrieving our Lucene index.  Somewhere between tutorials we wrote this &lt;code&gt;Peer&lt;/code&gt; function to handle that:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public static function getLuceneIndex($autoIndex = true)
    {
        try 
        {
            return $index = Zend_Search_Lucene::open(sfConfig::get(self::$luceneIndex));
        } 
        catch (Exception $e) 
        {
            $index = $autoIndex ? self::reindex() : null;
            return $index;
        }
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;If our index is missing we'll conveniently create it on the fly.  We then use the Zend Search Lucene API to retrieve the matching hits in this index and then use some Propel trickery to retrieve by an array of primary keys.&lt;/p&gt;

&lt;p&gt;It's now simple to use &lt;code&gt;::search()&lt;/code&gt; functions in the same manner as you use &lt;code&gt;::doSelect()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;At this point you should be able to create a basic symfony app that can utilize a Lucene index.&lt;/p&gt;

&lt;div id=&quot;footnotes&quot;&gt;
    &lt;hr/&gt;
    &lt;ol&gt;
        &lt;li id=&quot;fn_1&quot;&gt;The examples refer to using Propel, but it's trivial to adapt this to sfDoctrine &lt;a href=&quot;#fnr_1&quot; class=&quot;footnoteBackLink&quot;  title=&quot;Jump back to footnote 1 in the text.&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/li&gt;
    &lt;/ol&gt;
&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>Creating, Updating, Deleting documents in a Lucene Index with symfony</title>
   <link href="http://davedash.com/2007/04/24/creating-updating-deleting-documents-in-a-lucene-index-with-symfony/"/>
   <updated>2007-04-24T00:00:00-07:00</updated>
   <id>http://davedash.com/2007/04/24/creating-updating-deleting-documents-in-a-lucene-index-with-symfony</id>
   <content type="html">&lt;p&gt;Previously we covered &lt;a href=&quot;http://spindrop.us/2007/04/23/the-lucene-search-index-and-symfony/&quot;&gt;an all-at-once approach&lt;/a&gt; to indexing objects in your symfony app.  But for some reason, people find the need to allow users to sign up, or change their email addresses and then all of a sudden our wonderful Lucene index is out of date.&lt;/p&gt;

&lt;p&gt;Here lies the strength of using &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;Zend Search Lucene&lt;/a&gt; in your app: you can interact with a Lucene index, no matter how it was created, and add, update and delete documents in it.&lt;/p&gt;

&lt;!--more--&gt;


&lt;p&gt;The last thing you want to do is have a cron job in charge of making sure your index is always up to date by reindexing regularly.  This is an inelegant and inefficient process.&lt;/p&gt;

&lt;p&gt;A smarter method would be to trigger an update of the index each time you update your database.  Luckily the &lt;acronym title=&quot;Object Relational Mapping&quot;&gt;ORM&lt;/acronym&gt; layer allows us to do this using objects (in our case Propel objects).&lt;/p&gt;

&lt;p&gt;If we look at our &lt;a href=&quot;http://spindrop.us/2007/04/23/the-lucene-search-index-and-symfony/&quot;&gt;user example from before&lt;/a&gt;, we did set ourselves up to easily do this using our &lt;code&gt;User::generateZSLDocument()&lt;/code&gt; function, which did most of the heavy lifting.&lt;/p&gt;

&lt;p&gt;We can make a few small changes to the &lt;code&gt;User&lt;/code&gt; class:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    var $reindex = false;
    public function setUsername ( $v )
    {
        parent::setUsername($v);
        $this-&gt;reindex = true;
    }
    public function setFirstname ( $v )
    {
        parent::setFirstname($v);
        $this-&gt;reindex = true;
    }
    public function setLastname ( $v )
    {
        parent::setLastname($v);
        $this-&gt;reindex = true;
    }
    public function setEmail ( $v )
    {
        parent::setEmail($v);
        $this-&gt;reindex = true;
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;We have an attribute called &lt;code&gt;$reindex&lt;/code&gt;.  When it is false we don't need to worry about the index.  When something significant changes, like an update to your name or email address, we set &lt;code&gt;$reindex&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.  Then, when we save, an overridden &lt;code&gt;save&lt;/code&gt; method updates the index:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public function save ($con = null)
    {
        parent::save($con);
      
        if ($this-&gt;reindex) 
        {
            $index = $this-&gt;removeFromIndex();
            $doc   = $this-&gt;generateZSLDocument();
            $index-&gt;addDocument($doc);
        }
    }

    public function removeFromIndex() 
    {
        $index = Zend_Search_Lucene::open(sfConfig::get('app_search_user_index'));  

        // remove old documents
        $term  = new Zend_Search_Lucene_Index_Term($this-&gt;getId(), 'uid');
        $query = new Zend_Search_Lucene_Search_Query_Term($term);
        $hits  = array();
        $hits  = $index-&gt;find($query);

        foreach ($hits AS $hit) 
        {  
            $index-&gt;delete($hit-&gt;id);  
        }

        return $index;      
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;Now we've got the &lt;em&gt;exact&lt;/em&gt; same data that we created during &lt;a href=&quot;http://spindrop.us/2007/04/23/the-lucene-search-index-and-symfony/&quot;&gt;our original indexing&lt;/a&gt;.  This handles creating and updating objects, but we still need to update the index when deleting objects.&lt;/p&gt;

&lt;p&gt;Luckily we already made a function &lt;code&gt;User::removeFromIndex()&lt;/code&gt; to remove any related documents from the index, so our delete function can be pretty simple:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public function delete($con = null)
    {
        parent::delete($con);
        $this-&gt;removeFromIndex();
    }
&lt;/textarea&gt;&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>The Lucene Search Index and symfony</title>
   <link href="http://davedash.com/2007/04/23/the-lucene-search-index-and-symfony/"/>
   <updated>2007-04-23T00:00:00-07:00</updated>
   <id>http://davedash.com/2007/04/23/the-lucene-search-index-and-symfony</id>
   <content type="html">

&lt;p&gt;This article is meant to follow up on &lt;a href=&quot;http://spindrop.us/2007/04/10/sfzendplugin/&quot;&gt;sfZendPlugin&lt;/a&gt;, where we learned a newer way of obtaining the &lt;a href=&quot;http://framework.zend.com/&quot;&gt;Zend Framework&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this tutorial we're going to delve into the Lucene index.  &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;Zend Search Lucene&lt;/a&gt; relies on building a Lucene index.  The index is a directory of files that Lucene, or any of its ports, can write to and query.  In our example we'll be creating a search for user profiles.&lt;/p&gt;

&lt;!--more--&gt;


&lt;p&gt;We'll want to store in our &lt;code&gt;app.yml&lt;/code&gt; the precise location of this index file so we can refer to it in our app&lt;sup id=&quot;#fnr_lucene_index1&quot;&gt;&lt;a href=&quot;#fn_lucene_index1&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;p&gt;Here's an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;all:
  search:
    user_index: /tmp/myapp.user.lucene.index
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now when we need to refer to the index we can do &lt;code&gt;sfConfig::get('app_search_user_index')&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;Index Something&lt;/h3&gt;

&lt;p&gt;Let's try a user search where we can find a user by their name or email address.  It's fairly simple to accomplish, and hardly requires the use of &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;&lt;acronym title=&quot;Zend Search Lucene&quot;&gt;ZSL&lt;/acronym&gt;&lt;/a&gt;, but by using &lt;acronym title=&quot;Zend Search Lucene&quot;&gt;ZSL&lt;/acronym&gt; we can easily extend it to do a full-text search of a user's profile or any other textual data.&lt;/p&gt;

&lt;p&gt;Each &quot;thing&quot; stored in the index is a Lucene &quot;document&quot;.  Each document then consists of several &quot;fields&quot; (&lt;code&gt;Zend_Search_Lucene_Field&lt;/code&gt; objects).  In our example, each document will be an individual user and the fields will be relevant attributes of the user (username, first name, last name, email, the text of their profile).&lt;/p&gt;

&lt;p&gt;Initially we'll want to populate our index.  We may also want to regularly reindex all the users at once to optimize the search performance.  Since reindexing involves multiple users, it makes sense to have a static &lt;code&gt;reindex&lt;/code&gt; method in our &lt;code&gt;UserPeer&lt;/code&gt; class&lt;sup id=&quot;#fnr_lucene_index2&quot;&gt;&lt;a href=&quot;#fn_lucene_index2&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;.&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
class UserPeer extends BaseUserPeer
{
    public static function reindex()
    {
        $index = Zend_Search_Lucene::create(sfConfig::get('app_search_user_index'));

        $users = UserPeer::doSelect(new Criteria());
        foreach ($users AS $user)
        {
            $index-&gt;addDocument($user-&gt;generateZSLDocument());
        }

        return $index;
    }
}
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;Very simply, we're creating a new index, getting all the users, adding a document for each user to the index and then committing the index (to disk).  You might have noticed that there's a strange function, &lt;code&gt;User::generateZSLDocument()&lt;/code&gt;.  This function contains all the magic.  In order to not repeat ourselves we keep the internals of making a document for the Lucene index in the &lt;code&gt;User&lt;/code&gt; class itself.  Let's look at it:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public function generateZSLDocument()
    {
        $doc = new Zend_Search_Lucene_Document();
        $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('uid', $this-&gt;getId()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('username', $this-&gt;getUsername()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('email', $this-&gt;getEmail()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Text('firstname', $this-&gt;getFirstname()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Text('lastname', $this-&gt;getLastname()));
        /* An unstored contents field as an aggregate 
          * of all data is no longer needed in *ZEND* Lucene 
          * But it's here.
          */
        $doc-&gt;addField(Zend_Search_Lucene_Field::Unstored('contents', implode(' ', array($this-&gt;getEmail(), $this-&gt;getFirstname(), $this-&gt;getLastname(), $this-&gt;getUsername()))));
        return $doc;
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;We're really just dumping the relevant search terms into this document.  The beauty of keeping this code internalized in the &lt;code&gt;User&lt;/code&gt; class is we can reuse it later if we need to index a single &lt;code&gt;User&lt;/code&gt; at a time.&lt;/p&gt;

&lt;p&gt;A couple things to note.  &lt;code&gt;Zend_Search_Lucene_Field::Keyword&lt;/code&gt; allows us to store data that we can lookup later.  We store the &lt;code&gt;User::id&lt;/code&gt; in a field called &lt;code&gt;uid&lt;/code&gt; since &lt;code&gt;id&lt;/code&gt; is a reserved word for the index and we can't access it from &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;Zend Search Lucene&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In a batch script or a reindex action we can now just call &lt;code&gt;UserPeer::reindex()&lt;/code&gt; and have a working search index for our users.&lt;/p&gt;

&lt;div id=&quot;footnotes&quot;&gt;
    &lt;hr/&gt;
    &lt;ol&gt;
        &lt;li id=&quot;fn_lucene_index1&quot;&gt;Storing things in &lt;code&gt;app.yml&lt;/code&gt; is great for indexes that don't need to be searched in multiple applications. &lt;a href=&quot;#fnr_lucene_index1&quot; class=&quot;footnoteBackLink&quot;  title=&quot;Jump back to footnote 1 in the text.&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/li&gt;
        &lt;li id=&quot;fn_lucene_index2&quot;&gt;
Since we're using a Lucene index, which has an open documented structure, we aren't limited to just using Zend Search Lucene or Apache Lucene (java).  We can mix and match and read and write to the same index file.  For very large indexes (65,000+ documents), I rewrote a Java application to index all the documents at once as PHP would time out during such a task.
&lt;a href=&quot;#fnr_lucene_index2&quot; class=&quot;footnoteBackLink&quot;  title=&quot;Jump back to footnote 2 in the text.&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/li&gt;
    &lt;/ol&gt;
&lt;/div&gt;

</content>
 </entry>
 
 <entry>
   <title>sfZendPlugin</title>
   <link href="http://davedash.com/2007/04/10/sfzendplugin/"/>
   <updated>2007-04-10T00:00:00-07:00</updated>
   <id>http://davedash.com/2007/04/10/sfzendplugin</id>
   <content type="html">

&lt;p&gt;I originally intended to rewrite &lt;a href=&quot;http://spindrop.us/2006/08/25/using-zend-search-lucene-in-a-symfony-app/&quot;&gt;my Zend Search Lucene tutorial&lt;/a&gt;, but &lt;a href=&quot;http://archivemati.ca/2007/03/08/zend-search-lucene-symfony-and-the-ica-atom-application/&quot;&gt;Peter Van Garderen&lt;/a&gt; covered the bulk of what's changed and I was too busy developing search functionality for &lt;a href=&quot;http://lyro.com/&quot;&gt;lyro.com&lt;/a&gt; (not to mention finding inconsistencies with the Zend Search Lucene port and Lucene) to finish the tutorial.  So I broke it up into smaller pieces.&lt;/p&gt;

&lt;p&gt;I packaged &lt;a href=&quot;http://framework.zend.com/&quot;&gt;Zend Framework&lt;/a&gt; into a &lt;a href=&quot;http://www.symfony-project.com/trac/browser/plugins/sfZendPlugin&quot;&gt;symfony plugin&lt;/a&gt;.  &lt;a href=&quot;http://symfony-project.com/&quot;&gt;symfony&lt;/a&gt; is easily extended using plugins.&lt;/p&gt;

&lt;p&gt;You can obtain this from subversion with the following command (from your &lt;code&gt;/plugins&lt;/code&gt; directory):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;svn export http://svn.symfony-project.com/plugins/sfZendPlugin
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href=&quot;http://symfony-project.com/&quot;&gt;symfony&lt;/a&gt; has a &lt;a href=&quot;http://www.symfony-project.com/book/trunk/17-Extending-Symfony#Bridges%20to%20Other%20Framework%20Components&quot;&gt;Zend Framework Bridge&lt;/a&gt; which lets us autoload the framework by adding the following to &lt;code&gt;settings.yml&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;.settings:
  zend_lib_dir:   %SF_ROOT_DIR%/plugins/sfZendPlugin/lib
  autoloading_functions:
    - [sfZendFrameworkBridge, autoload]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;First we define &lt;code&gt;sf_zend_lib_dir&lt;/code&gt; to be in our plugin's &lt;code&gt;lib&lt;/code&gt; directory.  Then we autoload the bridge framework.&lt;/p&gt;

&lt;p&gt;After setting this up, all the Zend classes will be available and auto-loaded from elsewhere in your code.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Using Zend Search Lucene in a symfony app</title>
   <link href="http://davedash.com/2006/08/25/using-zend-search-lucene-in-a-symfony-app/"/>
   <updated>2006-08-25T00:00:00-07:00</updated>
   <id>http://davedash.com/2006/08/25/using-zend-search-lucene-in-a-symfony-app</id>
   <content type="html">

&lt;p&gt;If you're like me you've probably followed the &lt;a href=&quot;http://symfony-project.com/askeet/21&quot;&gt;Askeet tutorial on Search&lt;/a&gt; in order to create a decent search engine for your web app.  It's fairly straightforward, but they hinted that when &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;Zend Search Lucene&lt;/a&gt; (&lt;acronym title=&quot;Zend Search Lucene&quot;&gt;ZSL&lt;/acronym&gt;) is released, that might be the way to go.  Well, we are in luck: &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;&lt;acronym title=&quot;Zend Search Lucene&quot;&gt;ZSL&lt;/acronym&gt;&lt;/a&gt; is available, so let's just dive right in.&lt;/p&gt;

&lt;!--more--&gt;


&lt;p&gt;If you aren't using &lt;a href=&quot;http://symfony-project.com/&quot;&gt;symfony&lt;/a&gt; have a look at &lt;a href=&quot;http://devzone.zend.com/node/view/id/91&quot; title=&quot;Roll Your Own Search Engine with Zend_Search_Lucene&quot;&gt;this article&lt;/a&gt; from the &lt;a href=&quot;http://devzone.zend.com/&quot;&gt;Zend Developer Zone&lt;/a&gt;.  It covers just enough to get you started.  If you are using &lt;a href=&quot;http://symfony-project.com/&quot;&gt;symfony&lt;/a&gt;, just follow along and we'll get you where you need to go.&lt;/p&gt;

&lt;h3&gt;Obtaining Zend Search Lucene&lt;/h3&gt;

&lt;p&gt;First &lt;a href=&quot;http://framework.zend.com/download&quot; title=&quot;Zend Framework Download&quot;&gt;download&lt;/a&gt; the &lt;a href=&quot;http://framework.zend.com/&quot;&gt;Zend Framework&lt;/a&gt; (&lt;acronym title=&quot;Zend Developer Framework&quot;&gt;ZF&lt;/acronym&gt;).  The &lt;a href=&quot;http://framework.zend.com/&quot;&gt;Zend Framework&lt;/a&gt;  is supposed to be fairly &quot;easy&quot; in terms of installation.  So let's put that to the test.  Open your &lt;a href=&quot;http://framework.zend.com/&quot;&gt;&lt;acronym title=&quot;Zend Developer Framework&quot;&gt;ZF&lt;/acronym&gt;&lt;/a&gt; archive.  Copy &lt;code&gt;Zend.php&lt;/code&gt; and &lt;code&gt;Zend/Search&lt;/code&gt; to your &lt;a href=&quot;http://symfony-project.com/&quot;&gt;symfony&lt;/a&gt; project's library folder:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cp Zend.php $SF_PROJECT/lib              
mkdir $SF_PROJECT/lib/Zend
cp -r Zend/Search $SF_PROJECT/lib/Zend
cp Zend/Exception.php $SF_PROJECT/lib/Zend                 
chmod -R a+r $SF_PROJECT/lib/Zend*
&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Index Something&lt;/h3&gt;

&lt;p&gt;We'll deviate slightly from &lt;a href=&quot;http://spindrop.us/category/reviewsbyus&quot; title=&quot;ReviewsBy.Us category of Spindrop&quot;&gt;food themed&lt;/a&gt; tutorials and do something generic.  Let's try a user search where we can find a user by their name or email address.  It's fairly simple to accomplish, and hardly requires the use of &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;&lt;acronym title=&quot;Zend Search Lucene&quot;&gt;ZSL&lt;/acronym&gt;&lt;/a&gt;, but by using &lt;acronym title=&quot;Zend Search Lucene&quot;&gt;ZSL&lt;/acronym&gt; we can easily extend it to do a full-text search of a user's profile or any other textual data.&lt;/p&gt;

&lt;p&gt;Each &quot;thing&quot; stored in the index is a &quot;document&quot; in &lt;acronym title=&quot;Zend Search Lucene&quot;&gt;ZSL&lt;/acronym&gt;, specifically a &lt;code&gt;Zend_Search_Lucene_Document&lt;/code&gt;.  Each document then consists of several &quot;fields&quot; (&lt;code&gt;Zend_Search_Lucene_Field&lt;/code&gt; objects).  In our example, our document will be an individual user and the fields will be relevant attributes of the user (username, first name, last name, email, the text of their profile).&lt;/p&gt;

&lt;p&gt;We're going to write a general re-indexing tool.  Something that will index all users.&lt;/p&gt;

&lt;p&gt;In our &lt;code&gt;userActions&lt;/code&gt; class let's add the following action:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public function executeReindex()
    {
        require_once 'Zend/Search/Lucene.php';
        $index = new Zend_Search_Lucene(sfConfig::get('app_search_user_index_file'),true);
        
        $users = UserPeer::doSelect(new Criteria());
        foreach ($users AS $user)
        {
            $doc = new Zend_Search_Lucene_Document();
            $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('id', $user-&gt;getId()));
            $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('username', $user-&gt;getUsername()));
            $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('email', $user-&gt;getEmail()));
            $doc-&gt;addField(Zend_Search_Lucene_Field::Text('firstname', $user-&gt;getFirstname()));
            $doc-&gt;addField(Zend_Search_Lucene_Field::Text('lastname', $user-&gt;getLastname()));
            $doc-&gt;addField(Zend_Search_Lucene_Field::Unstored('contents', &quot;{$user-&gt;getEmail()} {$user-&gt;getFirstname()} {$user-&gt;getLastname()} {$user-&gt;getUsername()}&quot;));
            $index-&gt;addDocument($doc);
        }
        
        $index-&gt;commit();
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;The code should be fairly easy to follow.  First of all we require the necessary libraries for Lucene.  On the next line we create the index:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    $index = new Zend_Search_Lucene(sfConfig::get('app_search_user_index_file'),true);
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;&lt;code&gt;app_search_user_index_file&lt;/code&gt; is a symfony configuration that you define in your &lt;code&gt;app.yml&lt;/code&gt;.  It defines which file you want to use for your index.  &lt;code&gt;/tmp/lucene.user.index&lt;/code&gt; works for our purposes.   The second parameter tells Lucene we are creating a new index.&lt;/p&gt;

&lt;p&gt;We then loop through all the users and create a document for each one.  For every search-relevant attribute that a user might have, we add a field to the document.  Note the last field:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    $doc-&gt;addField(Zend_Search_Lucene_Field::Unstored('contents', &quot;{$user-&gt;getEmail()} {$user-&gt;getFirstname()} {$user-&gt;getLastname()} {$user-&gt;getUsername()}&quot;));
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;By default, searches run against the &quot;contents&quot; field.  So in this example we pack the name, email and username into that one field, which lets people type in any of them without having to specify which field to search.&lt;/p&gt;
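
&lt;p&gt;As a rough sketch of the difference, assuming the index built above lives at &lt;code&gt;/tmp/lucene.user.index&lt;/code&gt; (the search term &lt;code&gt;dave&lt;/code&gt; is only an illustration), an unqualified query goes against &lt;code&gt;contents&lt;/code&gt;, while a &lt;code&gt;field:term&lt;/code&gt; query is limited to a single field:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    require_once 'Zend/Search/Lucene.php';

    // Open the existing index (one argument, so nothing is recreated).
    // The path is an assumption; use whatever your app.yml points to.
    $index = new Zend_Search_Lucene('/tmp/lucene.user.index');

    // No field prefix: matched against the catch-all "contents" field.
    $anyField = $index-&gt;find('dave');

    // "field:term" restricts the match to one field.  Keyword fields are
    // not tokenized, so the whole username has to match.
    $byUsername = $index-&gt;find('username:dave');
&lt;/textarea&gt;&lt;/div&gt;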

&lt;h3&gt;Find those users&lt;/h3&gt;

&lt;p&gt;Finding the users is equally straightforward.  We make a new action called &lt;code&gt;search&lt;/code&gt;:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public function executeSearch()
    {
        require_once('Zend/Search/Lucene.php');
        $query = $this-&gt;getRequestParameter('q');
    
        $this-&gt;getResponse()-&gt;setTitle('Search for \'' . $query . '\' &amp;laquo; ' . sfConfig::get('app_title'), true);
    
        $hits = array();
    
        if ($query)
        {
            $index = new Zend_Search_Lucene(sfConfig::get('app_search_user_index_file'));
            $hits = $index-&gt;find(strtolower($query));
        }
        $this-&gt;hits = $hits;
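    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;The magic happens in our &lt;code&gt;if&lt;/code&gt; statement:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    if ($query)
    {
        $index = new Zend_Search_Lucene(sfConfig::get('app_search_user_index_file'));
        $hits = $index-&gt;find(strtolower($query));
    }
&lt;/textarea&gt;&lt;/div&gt;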


&lt;p&gt;If we have a query, we open the &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;ZSL&lt;/a&gt; index (note that we pass only one parameter here, so the existing index is opened rather than created).  We then run the &lt;code&gt;find&lt;/code&gt; method on our query and store the results in the &lt;code&gt;$hits&lt;/code&gt; array.  Note that the query is lowercased with &lt;code&gt;strtolower&lt;/code&gt; first, since &lt;a href=&quot;http://framework.zend.com/manual/en/zend.search.html&quot;&gt;ZSL&lt;/a&gt; is case sensitive.&lt;/p&gt;

&lt;p&gt;The template takes care of the rest:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    &lt;?php use_helper('Form');?&gt;
    &lt;?php echo form_tag('@search_users') ?&gt;
    &lt;?php echo input_tag('q'); ?&gt;
    &lt;?php echo submit_tag() ?&gt;
    &lt;/form&gt;
    &lt;?php foreach ($hits as $hit): ?&gt;
      &lt;?php echo $hit-&gt;score ?&gt;
      &lt;?php echo $hit-&gt;firstname ?&gt;
      &lt;?php echo $hit-&gt;lastname ?&gt;
      &lt;?php echo $hit-&gt;email ?&gt;
    &lt;?php endforeach ?&gt;
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;Fairly simple... but it could use some cleaning up (enjoy).&lt;/p&gt;
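
&lt;p&gt;One possible cleanup, offered as a sketch rather than the definitive template: wrap the results in a list, skip the list when there are no hits, and escape the stored values before printing them.  The list markup, the &lt;code&gt;Search&lt;/code&gt; button label and the use of &lt;code&gt;htmlspecialchars&lt;/code&gt; are my assumptions; if your application already has symfony's output escaping turned on, the explicit escaping is not needed.&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    &lt;?php use_helper('Form') ?&gt;
    &lt;?php echo form_tag('@search_users') ?&gt;
      &lt;?php echo input_tag('q') ?&gt;
      &lt;?php echo submit_tag('Search') ?&gt;
    &lt;/form&gt;
    &lt;?php if (count($hits)): ?&gt;
      &lt;ul&gt;
      &lt;?php foreach ($hits as $hit): ?&gt;
        &lt;li&gt;
          &lt;?php echo htmlspecialchars($hit-&gt;firstname . ' ' . $hit-&gt;lastname) ?&gt;
          (&lt;?php echo htmlspecialchars($hit-&gt;email) ?&gt;),
          score &lt;?php echo round($hit-&gt;score, 2) ?&gt;
        &lt;/li&gt;
      &lt;?php endforeach ?&gt;
      &lt;/ul&gt;
    &lt;?php endif ?&gt;
&lt;/textarea&gt;&lt;/div&gt;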

&lt;h3&gt;What about new users?&lt;/h3&gt;

&lt;p&gt;Regularly reindexing might be nice in terms of keeping the search index optimized, but it's lousy if you want to be able to search the network immediately when new people join.  So why not automatically reindex each user every time they are created or every time one of their indexed fields changes?&lt;/p&gt;

&lt;p&gt;This is fairly simple to do by adding the following to the &lt;code&gt;User&lt;/code&gt; class:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    protected $reindex = false;
    public function setUsername ( $v )
    {
        parent::setUsername($v);
        $this-&gt;reindex = true;
    }
    public function setFirstname ( $v )
    {
        parent::setFirstname($v);
        $this-&gt;reindex = true;
    }
    public function setLastname ( $v )
    {
        parent::setLastname($v);
        $this-&gt;reindex = true;
    }
    public function setEmail ( $v )
    {
        parent::setEmail($v);
        $this-&gt;reindex = true;
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;We have an attribute called &lt;code&gt;$reindex&lt;/code&gt;.  When it is false we don't need to worry about indexes.  When something significant changes, like an update to your name or email address, then we set &lt;code&gt;$reindex&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.  Then when we save:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public function save ($con = null)
    {
        // keep the parent's return value so callers still receive it
        $affectedRows = parent::save($con);
        if ($this-&gt;reindex) {
            require_once 'Zend/Search/Lucene.php';
            $index = new Zend_Search_Lucene(sfConfig::get('app_search_user_index_file'));
            // first find any references to this user and delete them
            $hits = $index-&gt;find('id:'. $this-&gt;getId());
            foreach ($hits AS $hit) {
                $index-&gt;delete($hit-&gt;id);
            }
        
            $doc = $this-&gt;generateZSLDocument();
            $index-&gt;addDocument($doc);
            $index-&gt;commit();
            // the index is up to date again, so clear the flag
            $this-&gt;reindex = false;
        }
        return $affectedRows;
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;We're calling a new function called &lt;code&gt;generateZSLDocument&lt;/code&gt;.  It might look familiar:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public function generateZSLDocument()
    {
    
        require_once 'Zend/Search/Lucene.php';
        $doc = new Zend_Search_Lucene_Document();
        $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('id', $this-&gt;getId()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('username', $this-&gt;getUsername()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Keyword('email', $this-&gt;getEmail()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Text('firstname', $this-&gt;getFirstname()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Text('lastname', $this-&gt;getLastname()));
        $doc-&gt;addField(Zend_Search_Lucene_Field::Unstored('contents', &quot;{$this-&gt;getEmail()} {$this-&gt;getFirstname()} {$this-&gt;getLastname()} {$this-&gt;getUsername()}&quot;));
        return $doc;
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;Now, whenever a user is updated, so is our index.  Additionally we can modify our reindex action:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public function executeReindex()
    {
        require_once('Zend/Search/Lucene.php');
        $index = new Zend_Search_Lucene(sfConfig::get('app_search_user_index_file'),true);
        
        $users = UserPeer::doSelect(new Criteria());
        foreach ($users AS $user)
        {
            
            $index-&gt;addDocument($user-&gt;generateZSLDocument());
        }
        
        $index-&gt;commit();
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;That's a &lt;strong&gt;lot&lt;/strong&gt; easier to deal with.&lt;/p&gt;

&lt;h3&gt;...and beyond&lt;/h3&gt;

&lt;p&gt;I hope this article helps some of you jumpstart your &lt;a href=&quot;http://symfony-project.com/&quot;&gt;symfony&lt;/a&gt; apps.  Really cool, easy-to-implement search is here.  We no longer have to stick with shoddy solutions like ht://Dig or spend time rolling our own full text search, as the &lt;a href=&quot;http://symfony-project.com/askeet/21&quot;&gt;symfony team diligently showed us we could&lt;/a&gt;.  But there is a lot more ground to cover, including optimization techniques and best practices.&lt;/p&gt;

&lt;p&gt;Let me know what you think, and if you use this in any of your apps.&lt;/p&gt;
</content>
 </entry>
 

</feed>

