Dave Dash

Bulk load ElasticSearch using pyes

2011-02-25T00:00:00+00:00

When indexing a lot of data, you can save time by bulk loading data.

With pyes you can do the following:

from pyes import ES


es = ES()
es.index(data, 'my-index', 'my-type', 1)
es.index(data, 'my-index', 'my-type', 2)
es.index(data, 'my-index', 'my-type', 3)
es.index(data, 'my-index', 'my-type', 4)

This will make 4 independent network calls.

from pyes import ES


es = ES()
es.index(data, 'my-index', 'my-type', 1, bulk=True)
es.index(data, 'my-index', 'my-type', 2, bulk=True)
es.index(data, 'my-index', 'my-type', 3, bulk=True)
es.index(data, 'my-index', 'my-type', 4, bulk=True)
es.refresh()

This version does the same work in one call. This is handy for those “reindex all the items we can” weekends.
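To see why batching saves round trips, here is a minimal sketch of the newline-delimited body that ElasticSearch's `_bulk` endpoint accepts for typed indexes: one action line followed by one source line per document. The helper name `build_bulk_body` and the sample documents are made up for illustration; this is not pyes's internal code.

```python
import json


def build_bulk_body(docs, index, doc_type):
    """Build a _bulk request body from (doc_id, source) pairs.

    Hypothetical helper: each document becomes an action line
    followed by its source line, and the body ends with a newline.
    """
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Made-up sample documents.
docs = [(1, {"title": "first"}), (2, {"title": "second"})]
body = build_bulk_body(docs, "my-index", "my-type")
```

All of the documents travel in a single request body, which is why four `index` calls can collapse into one network call.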

Installing ElasticSearch plugins

2011-02-24T00:00:00+00:00

I’m slowly trying to familiarize myself with ElasticSearch and the pyes python interface. ElasticSearch uses a lot of plugins, and while the plugin system is easy to use, it’s not obvious where to find the plugins.

They are here.

If you want to install the attachments plugin, you can do:

bin/plugin install mapper-attachments

And voilà it’s installed.

Counting Sphinx groupBy Queries

2010-10-15T00:00:00+00:00

I quickly implemented Sphinx on Input. While revisiting it, I saw that we try to answer this type of question:

Of the results displayed:

How many are happy and how many are sad?

How many are for Windows, Linux or Mac?

How many are for English, French or Japanese?

Finding these involves faceted search. Unfortunately, this is a bit awkward to do using Sphinx. For the first example, happy or sad, you would have to run the query like this:

Take the query, remove any filters on happiness and do a group by on happy opinions
Restore any filters on happiness and run the query as normal.
Return both the results, and the aggregate data from step 1.

Doing the group by is easy, but you only get to know how many feelings there are and what they were. In our case: happy and sad. What we really want is how many of our original search were happy and how many were sad?

I assumed something like this would work:

sphinx.SetSelect('feeling, @count')

@count is one of those magic variables that Sphinx uses. Unfortunately this doesn’t work. COUNT(*) doesn’t work either. Here’s what did:

sphinx.SetSelect('feeling, SUM(1) AS count')

Not the straightforward mysqlish syntax I’ve come to expect from Sphinx, but it works.
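The three-step flow described above can be sketched without Sphinx at all. The following is a minimal in-memory illustration over made-up sample data; `search_with_facet` and `matches` are hypothetical names, and a real implementation would issue two Sphinx queries instead of two list comprehensions.

```python
from collections import Counter

# Made-up sample data standing in for indexed opinions.
OPINIONS = [
    {"id": 1, "feeling": "happy", "os": "Windows"},
    {"id": 2, "feeling": "sad", "os": "Windows"},
    {"id": 3, "feeling": "happy", "os": "Linux"},
    {"id": 4, "feeling": "happy", "os": "Windows"},
]


def matches(doc, filters):
    """True when the document satisfies every attribute filter."""
    return all(doc[key] == value for key, value in filters.items())


def search_with_facet(docs, filters, facet):
    # Step 1: drop any filter on the facet and count per facet value.
    relaxed = {k: v for k, v in filters.items() if k != facet}
    counts = Counter(d[facet] for d in docs if matches(d, relaxed))
    # Step 2: run the query as normal, with every filter restored.
    results = [d for d in docs if matches(d, filters)]
    # Step 3: return the results together with the aggregate counts.
    return results, dict(counts)


results, counts = search_with_facet(
    OPINIONS, {"os": "Windows", "feeling": "happy"}, "feeling"
)
```

Here `results` holds only the happy Windows opinions, while `counts` still reports both happy and sad totals for Windows.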

The Python textcluster Package

2010-07-08T00:00:00+00:00

Earlier I wrote about finding the most common Firefox issues. I had wanted to automate that process and continually find these issues. Unfortunately I never had time to do this.

When they announced Firefox Input, I thought about doing this again… just with Firefox Input data but then I went on paternity leave and time kind of crept away. But I mentioned the idea this week and it piqued some interest.

So I found myself with a bit of time to work on it. The first stage was releasing a python library called textcluster.

textcluster takes the work I did earlier and makes it a bit more general purpose. The idea is I can do something like this:

docs = (
        'Every good boy does fine.',
        'Every good girl does well.',
        'Cats eat rats.',
        "Rats don't sleep.",
        )

c = Corpus()
for doc in docs:
    c.add(doc)

print c.cluster()

Which results in:

[
    (
        "Rats don't sleep.",
        {'Cats eat rats.': 0.21353467285253394}
    ),
    (
        'Every good girl does well.',
        {'Every good boy does fine.': 0.32030200927880093}
    )
]

The number is the “similarity” between the strings relative to the entire document corpus.
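As a rough idea of what a similarity "relative to the entire document corpus" can look like, here is a minimal sketch using TF-IDF weights and cosine similarity. This is an assumption chosen for illustration, not necessarily the calculation textcluster performs, and it will not reproduce the exact numbers above.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, stripping simple punctuation."""
    return [w.strip(".,!?'\"").lower() for w in text.split()]


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log(n / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


docs = [
    "Every good boy does fine.",
    "Every good girl does well.",
    "Cats eat rats.",
    "Rats don't sleep.",
]
vecs = tfidf_vectors(docs)
```

With these weights, the two "Every good ..." sentences score above zero against each other, while sentences with no words in common score exactly zero.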

My next trick is to see if I can run this memory-intensive calculation over a data-set of 25,000 opinions submitted. If I can we can get some interesting data about what people think of the new Firefox beta.

Firefox Input, powered by Sphinx

2010-07-06T00:00:00+00:00

Thursday, I decided to take a half-day for my sanity, but saw an email about how Whoosh wasn’t going to cut it for Firefox Input. I was CC’d about this and there was mention that Sphinx might be possible.

Sphinx is my hammer, and everything is a nail. So I said, let’s do this. That translated into me spending my weekend, soothing my newborn and working on Sphinx. Luckily this was easy, since AMO and SUMO are both running Sphinx in a similar Django environment.

In order to move quickly, I copied code from the Zamboni project to Firefox Input. Even our deployment into staging and production wasn’t done by our usual “Sphinx guy” in IT. Ultimately, everything landed in place.

So try it out and file bugs or let me know if searches don’t go as planned.

Alphabetical sorting in Sphinx

2010-04-21T00:00:00+00:00

Sphinx 0.9.9 is great at searching full text, but treating actual strings as attributes takes some work.

Initially I employed the strategy of indexing my full text fields and storing them as attributes. E.g.:

sql_query = SELECT name, name AS name_ord FROM documents
sql_attr_str2ordinal = name_ord

This stores each attribute in lexical order. Meaning if your names are Apple, Aardvark, Button, Choco-room they would be given the ordinals 2, 1, 3, 4 respectively.

However, this is case-sensitive. So trying this approach:

sql_query = SELECT name, UPPER(name) AS name_ord FROM documents
sql_attr_str2ordinal = name_ord

Will allow for case-insensitive alphabetical sorting in Sphinx.
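The effect of the `UPPER()` call can be illustrated outside Sphinx. This is a minimal sketch assuming the ordinals come from a plain code-point comparison, where uppercase ASCII letters sort before lowercase ones; `ordinals` is a hypothetical helper, not a Sphinx function.

```python
def ordinals(names, key=None):
    """Map each name to its 1-based rank in sorted order.

    Hypothetical stand-in for what sql_attr_str2ordinal stores.
    Python compares strings by code point, so "Button" sorts
    before "apple" unless the values are normalized first.
    """
    transform = key or (lambda s: s)
    ranked = sorted(set(transform(n) for n in names))
    rank = {value: i + 1 for i, value in enumerate(ranked)}
    return [rank[transform(n)] for n in names]


names = ["Apple", "Aardvark", "Button", "Choco-room"]
mixed = ["apple", "Aardvark", "Button", "Choco-room"]

case_sensitive = ordinals(mixed)              # "apple" lands last
case_insensitive = ordinals(mixed, str.upper) # same ranks as `names`
```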

Finding the most common Firefox issues

2010-03-18T00:00:00+00:00

Cheng Wang of the Mozilla Support team, a few months back, decided to present on some design ideas for Firefox Support. One of the issues he noted was that there are a lot of repeated issues and that it would be useful to group them. Grouping them lets you see how often something occurs, and secondly lets you see how urgent it might be.

Luckily grouping and clustering text is something computers can do. So I wrote this utility that does just that.

I ran this script over a sampling of data from the last week:

Firefox won’t start after update. (65 related issues)
- 5.6: Firefox updated, Gmail not delivering mails
- 5.6: How to change My Profile when Firefox won’t load?
- 7.5: Once I close firefox, cannot start firefox again except system restart
- 5.6: When intalling updates Firefox uninstalls itself
- 16.8: firefox won’t start after update 3.6
- 11.2: Upgraded to Firefox 3.6 and now it won’t start
- 14.9: Firefox won’t start with most extensions
How do I add a bookmark to more than one folder? (64 related issues)
- 8.9: How do I get my bookmarks on the bookmarks toolbar to show up as an icon only with no text?
- 7.5: Bookmarks lost after upgrade and cannot save new bookmarks
- 7.5: why do i have to add the .com now to addy’s?
- 8.7: When I open sidebar to edit bookmarks, I only see the folder for Bookmarks Toolbar. I do not see a folder just called Bookmarks nor do I see my list of bookmarks, that separately appear under bookmarks menu at top of screen
- 7.5: All my impoted bookmarks go to the same webpage
How do I remove the “ask toolbar”? (50 related issues)
- 14.9: How do I remove an unwanted toolbar?
- 5.6: how to remove temporary video files from computer
- 7.5: I have no Toolbars or searchbar and i cant bring them back
- 7.5: nowhere says how to REMOVE a toolbar - only how to add or modify one
not able to open youtube videos (45 related issues)
- 5.6: Cannot open bookmark/history sidebar
- 5.6: After working well for years Firefox will now not open
- 6.7: opening bookmarks do not open in new tab
- 5.6: I can’t watch videos on youtube with firefox, but on internet explorer i can
I cannot download Firefox 3.6. I’ve tried erasing the download file. I cannot get beyond logging out of Firefox. (44 related issues)
- 8.4: when downloading files firefox download manager will freeze and i will have to start over the file download
- 5.6: Firefox will not let me download anything! Can someone help?
- 6.3: cannot download epixHD.com: not compatible with firefox 3.6
- 5.0: Several tabs are coming up when i try to downloads things
- 5.0: Firefox wont open since I downloaded the 3.6 update.

The number before each related issue is a score of how strongly it relates to the main issue.

The full sample is 352 clusters from an original 3000+ issues. That’s a lot less stuff to go through. We can tune this to have either fewer clusters, with more related issues in each cluster, or more clusters of issues, which might result in more accuracy.

Despite the inaccuracy of clustering we can make some general observations:

Firefox not starting is a big issue.
Bookmarks are either confusing or broken.
People don’t like toolbars
Opening things is hard
Downloading things or Firefox is hard

Hopefully we can fine tune these reports and have them run regularly… maybe automatically posting to Tumblr?

AMO Search: Powered by Sphinx

2009-09-30T00:00:00+00:00

Last night, I gave a talk at the Addons Meetup at Threadless HQ in Chicago on the new search engine powering addons.mozilla.org. I’ll recap the technical portion of the talk and give a bit more details.

First, I’d like to thank Harper and Threadless. It was a great location in the greatest city in the universe. Before and after the meetup, Harper was just an all-around great guy to hang with and the Threadless headquarters was a nice hangout place for meeting people interested in addons.

Shortly after my talk, our Engineering Ops team deployed the new AMO 5.1 complete with a new Sphinx powered search engine.

So let’s talk about search. Note: parts of this are a rehash of my talk, so feel free to skip around.

A bit about addons

Addons is a huge growing space. Arguably it’s Mozilla’s best kept secret. Sure readers of this blog probably know what Addons are, but ask people who aren’t as web-savvy. Most people don’t know what a browser is - and it’s hard to explain it to people without getting technical.

We can just skip that step. Because Addons are small things that people can easily “get”.

“It’s an easy way to customize the internet when you’re surfing.”

While perhaps not technically correct, it’s one way of explaining it to people. Maybe a better way is just showing people what they can do with addons.

On my flight out to Chicago, I talked to a person on the plane who didn’t know what a browser was, but after showing her AMO she was really intrigued.

If everyday non-technical people can realize the potential of addons, it’s only a matter of time before they start knocking down the doors to AMO.

So we better be prepared to handle them, and get them what they want.

The technical details of addons.mozilla.org

Every time you open Firefox, it pings AMO to see if there are any updates to any of the addons that happen to be installed. Over a third of the people using Firefox have at least one addon, and Firefox is roughly 22% of the browser market. That means roughly 7% of people opening their browsers are pinging our servers for updates.

Needless to say it’s a lot of traffic, and to support it we need a fair amount of hardware. AMO is clearly the largest site in the Mozilla universe in both respects.

Some stats:

1 MySQL master
4 MySQL slaves
2 memcached servers
2 Sphinx indexer/search daemons
24 Web Frontend
Multiple Zeus ZXTM clusters all

Most of this is standard, we’ll talk about Sphinx later, but Zeus is amazing. I didn’t know what Zeus was until earlier this year when I interviewed with Mozilla’s VP of Engineering Operations. All our requests get cached so much of our hits actually hit our Zeus cluster and not our web servers.

To see just how amazing they are read our mrz’s ops blog.

Why search matters

If you have any kind of custom content and unique meta data a custom search solution is a must. Browsing through a site isn’t going to cut it. Browsing is dead. Search is how you find things on a web site. On AMO you may see an addon that’s featured somewhere, or you might want to see what’s out there, but the right search query will find you the right addon in two clicks.

Improve Search

So my first job on AMO was to improve addons search. It was a vague request and born out of frustration with what we had. It wasn’t a problem that certain things were indexed, or unicode didn’t work, or results weren’t sorted. We may have had all those problems, but as a product search needed to be replaced.

To me it meant that we needed some framework that would allow developers to quickly debug and fix any future search calamities at a moment’s notice.

So here were the goals I made for myself:

Do something that sucks less than what we’ve got
Do something that makes it easier to suck less in the future
Do something that’s easy to use for our operations team, web developers and most importantly, end-users
Reduce strain on our databases, developers and operations teams

Complex Data

Our data set is small (we have 5,000 addons), but there’s a lot of secondary meta data about the addons that we track:

Addons work in 1 or more locales (e.g. en-US, fr, de, etc)
Addons are optionally platform specific (Linux, OS X, etc)
Addons work with one or more products (Firefox, Thunderbird, Seamonkey, Sunbird or Fennec)
Addons come in multiple flavors (extensions, themes, dictionaries and more)

We want to index all this data. Unfortunately to get at much of this data it involves either numerous queries, or numerous joins which put a strain on mysql. How much strain?

At peak we get about 10 search queries per second. If we do something smarter this won’t have to cause a lot of strain.

Using Sphinx

Sphinx is an open source search indexer and daemon. It’s used by Craigslist, the Pirate Bay and Mozilla Support. It was very easy to use and despite a complicated set of data and business logic, Sphinx was up to the task.

The challenges

We needed to search for addons in several languages. So indexing just addons wouldn’t work, we need to make sure we have every translation of every addon indexed. For those counting, we have 5,000 addons, but 18,000 translations of addons.

All the joining and filtering that needed to be done for our old search still needs to be done, but we can do this all in one shot by using a mysql view. This view is a flat list of each translated addon as well as all meta data associated with it. This then gets fed into the sphinx indexer.

Along the way we ran into some issues which used to be dealt with outside of mysql, such as comparing versions. It was gross and quite a hack, so we turned the variety of acceptable version strings into integers.

We also learned that stemming wasn’t as good an idea as we assumed it would be. Stemming was great for searching through lots of text, but a great deal of addon searches were really just searches for product names, so we opted for substring searches. We’ll see how that fares. There is probably room for improvement.

Much of this, however, involved knowing our data, and knowing how it will be used by our users. Once we got that down, we could hammer it all out using Sphinx.

Wins

So Sphinx gains us a bit architecturally. We have a complicated query, but it only gets run once every 5 minutes versus the 180,000 times it was run “on demand.”

Indexing happens rather quickly, just over a minute.

The API was a breeze to work with, and was easy to drop into our own codebase.

Because of our relatively small data set, and quick indexing, we’re able to scale this simply by cloning and load balancing. Meaning, we just need to scale for traffic, but addon growth (which is slower than traffic growth) we can safely not worry about for a while.

Our ops team can monitor the sphinx clusters and just deploy additional nodes as needed.

Building a platform

What we’ve done is built a foundation for search. Not all the problems are gone, but a lot of the problems that our QA team finds are able to be resolved quickly. We have a nice pile of unit tests as well that help us keep our results in check when we start tweaking dials.

We even have the groundwork for some nifty advanced search syntax, that hopefully we can inject into future releases of AMO.

Enjoy. And if you find anything, let me know.

V is for Version Hell

2009-08-07T00:00:00+00:00

Versioning is quite difficult to deal with. Versions are nearly-numbers, but you can’t quite sort them using standard numerical algorithms.

While the following is true:

1.1 < 1.2

The following is also true:

1.2 < 1.18 < 1.20

The “.” is not a decimal point but a separator.
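A minimal sketch of the difference, for purely numeric versions only (the stars, plusses and letters come later); `version_key` is a hypothetical helper name:

```python
def version_key(version):
    """Split a purely numeric dotted version into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# Treated as decimals, 1.18 sorts before 1.2 ...
as_floats = sorted(["1.2", "1.18", "1.20"], key=float)
# ... but compared part by part, 1.2 < 1.18 < 1.20.
as_versions = sorted(["1.20", "1.2", "1.18"], key=version_key)
```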

Mozilla uses a modestly complicated versioning system that involves stars, plusses, and sometimes “x”.

I found a very convoluted way to translate these versions into large integers. The versions for applications in the AMO database have four parts at most, they are potentially alpha or beta and potentially a pre-release. In some cases we have multiple versions represented with .*, .x or + at the end.

The Toolkit docs let us translate “+” to mean “pre-release of the next version”. E.g. 1.0+ is 1.1pre0. Since my primary purpose of all this is for sorting, .* and .+ may as well just be a very large “version part.” Since all the version parts I deal with are a maximum of 2-digits, I turned .* and .+ into .99.

For example:

3.5+ => '03'+'05'+'99' => 030599

We also need to deal with versions that may be alpha, beta or not. If everything else is equal:

3.5a < 3.5a5 < 3.5b < 3.5b2 < 3.5 < 3.5+

We assign a single integer to represent a version’s “non-alphaness”:

a => 0
b => 1
non alpha/beta => 2

We assume that 3.5a = 3.5a1. Therefore:

3.5a => 3.5.0a1 => '03'+'05'+'00'+'0'+'01' => 030500001

Similarly if it’s a pre-release we assign a 0 or 1 to represent “non-pre-releaseness”:

3.5a pre2 => 3.5.0a1pre2
=> '03'+'05'+'00'+'0'+'01'+'0'+'02'
=> 030500001002

So what does this get us? Integers which we can use for comparison, sorting, etc. It’s a one time calculation for each version and we can do some nice SQL statements in AMO like:

mysql> SELECT version,version_int FROM appversions WHERE application_id = 1 ORDER BY version_int LIMIT 15;
+---------+--------------+
| version | version_int  |
+---------+--------------+
| 0.3     |  30000200100 |
| 0.6     |  60000200100 |
| 0.7     |  70000200100 |
| 0.7+    |  80000200000 |
| 0.8     |  80000200100 |
| 0.8+    |  90000200000 |
| 0.9     |  90000200100 |
| 0.9.0+  |  90100200000 |
| 0.9.1+  |  90200200000 |
| 0.9.2+  |  90300200000 |
| 0.9.3   |  90300200100 |
| 0.9.3+  |  90400200000 |
| 0.9.x   |  99900200100 |
| 0.9+    | 100000200000 |
| 0.10    | 100000200100 |
+---------+--------------+
15 rows in set (0.00 sec)

I can now index these integers using Sphinx and do some very easy searches for addons based on version number.
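For illustration, here is a rough Python sketch of an encoder whose layout is inferred from the sample rows above: four two-digit version parts, one digit for alpha/beta, two digits for the alpha/beta number, one digit for pre-release, and two digits for the pre-release number. The function name `version_int`, the regular expression, and the handling of `+` (bump the last part and mark it as a pre-release) are assumptions made to match those rows; this is not the actual AMO code.

```python
import re

VERSION_RE = re.compile(
    r"^(?P<nums>\d+(?:\.(?:\d+|\*|x))*)"      # dotted parts, maybe .* or .x
    r"(?P<plus>\+)?"                          # trailing "+"
    r"(?:(?P<alpha>[ab])(?P<alpha_ver>\d*))?" # a / b with optional number
    r"(?:pre(?P<pre_ver>\d*))?$"              # "pre" with optional number
)


def version_int(version):
    """Encode a version string as a sortable integer (assumed layout)."""
    m = VERSION_RE.match(version)
    if m is None:
        raise ValueError("unsupported version: %r" % version)
    # ".*" and ".x" become a very large version part.
    parts = [99 if p in ("*", "x") else int(p) for p in m.group("nums").split(".")]
    is_pre = m.group("pre_ver") is not None
    if m.group("plus"):
        # "1.0+" is treated as a pre-release of the next version.
        parts[-1] += 1
        is_pre = True
    parts += [0] * (4 - len(parts))
    alpha = m.group("alpha")
    alpha_flag = {"a": 0, "b": 1}.get(alpha, 2)  # non alpha/beta => 2
    alpha_ver = int(m.group("alpha_ver") or 1) if alpha else 0
    pre_flag = 0 if is_pre else 1
    pre_ver = int(m.group("pre_ver") or 0)
    digits = "".join("%02d" % p for p in parts)
    digits += "%d%02d%d%02d" % (alpha_flag, alpha_ver, pre_flag, pre_ver)
    return int(digits)
```

Sorting version strings with `key=version_int` then follows the ordering described earlier, such as 3.5a < 3.5a5 < 3.5b < 3.5b2 < 3.5 < 3.5+.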

Question: Building a Better Search Engine

2009-06-18T00:00:00+00:00

So I finally have one of those jobs where I can tell people almost every little detail about what I’m doing and I’m encouraged to talk to people on the intar-webs and solicit opinions.

Uh - this is more or less how I’ve operated at previous jobs, just now I can be overt about it.

So my new task is to work on improving the addons.mozilla.org search engine. I’ve built various “search engines” over time in PHP, powered by Lucene and most recently in python using an inverted index.

One tool that I’ve been looking at briefly is Sphinx. While my record count is low (5-10K), Sphinx basically bakes in a lot of the things I would want in a search engine. Indexing, merging, etc.

Since I’m fairly new to the add-ons team I’m still understanding the basics of what we need:

Fast automated indexing of addons for Firefox, Thunderbird and any other Mozilla product
Quick result sets
Easy deployability
Extendible
Customized ranking
Filtering (e.g. by Firefox version, etc).
Basics: Stemming and stop-words

Whether it’s Sphinx, Lucene or some home grown solution, I have all that to support. But this should be fairly straightforward. What are people’s thoughts?

Boosting terms in Zend Search Lucene

2007-05-29T00:00:00+00:00

[tags]Zend, Zend Search Lucene, Search, Lucene, php, symfony, zsl[/tags]

Boosting terms — some fields are better than others

Lucene supports boosting or weighting terms. For example, if I search for members of a web site, and I type in Dash, I want people with the name Dash to take precedence over somebody who has a hobby of running the 50-yard Dash.

If we look at our generateZSLDocument() method we defined we just need to adjust a few lines:

Should be turned into:

This is a pretty straightforward way to add weight (1.5 times the weight of a normal term) and you can customize it to the needs of your site.
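To show the effect of a 1.5 boost in isolation, here is a toy scorer in Python. It is only a conceptual sketch: it is not the Zend Search Lucene API, the member data is made up, and real Lucene scoring also weighs term frequency, document frequency and field length.

```python
def score(query, document, boosts):
    """Sum per-field boosts for every field containing the query term."""
    term = query.lower()
    total = 0.0
    for field, text in document.items():
        if term in text.lower().split():
            # Unboosted fields count with a weight of 1.0.
            total += boosts.get(field, 1.0)
    return total


# Made-up members for illustration.
members = [
    {"name": "Alex Smith", "hobbies": "running the 50-yard dash"},
    {"name": "Dave Dash", "hobbies": "search engines"},
]
boosts = {"name": 1.5}  # 1.5 times the weight of a normal field
ranked = sorted(members, key=lambda m: score("Dash", m, boosts), reverse=True)
```

Without the boost both members would tie; with it, the member named Dash ranks first.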

Finding things using Zend Search Lucene in symfony

2007-05-23T00:00:00+00:00

[tags]Zend, Zend Search Lucene, Search, Lucene, php, symfony, zsl[/tags]

This is part of an ongoing series about the Zend Search Lucene libraries and symfony. We’ll pretty everything up when we’re done =)

We now know how to manipulate the index via our model classes. But let’s actually do something useful with our search engine… let’s search!


At the time of this writing we’re dealing with Propel which uses Peer classes which are meant for dealing with multiple objects¹. This is the perfect place for a ::search() method. In other words, UserPeer::search('dave'); should query Lucene for users matching “dave”. Let’s make that happen:

What we’re doing is retrieving our Lucene index. Somewhere between tutorials we wrote this Peer function to handle that:

If our index is missing we’ll conveniently create it on the fly. We then use the Zend Search Lucene API to retrieve the matching hits in this index and then use some Propel trickery to retrieve by an array of primary keys.

It’s now simple to use ::search() functions in the same manner as you use ::doSelect().

At this point you should be able to create a basic symfony app that can utilize a Lucene index.

¹ The examples refer to using Propel, but it's trivial to adapt this to sfDoctrine.

Creating, Updating, Deleting documents in a Lucene Index with symfony

2007-04-24T00:00:00+00:00

Previously we covered an all-at-once approach to indexing objects in your symfony app. But for some reason, people find the need to allow users to sign up, or change their email addresses and then all of a sudden our wonderful Lucene index is out of date.

Here lies the strength of using Zend Search Lucene in your app, you can now get the flexibility of interacting with a Lucene index, no matter how it was created and add, update and delete documents to it.

The last thing you want to do is have a cron job in charge of making sure your index is always up to date by reindexing regularly. This is an inelegant and inefficient process.

A smarter method would be to trigger an update of the index each time you update your database. Luckily the ORM layer allows us to do this using objects (in our case Propel objects).

If we look at our user example from before, we did set ourselves up to easily do this using our User::generateZSLDocument() function, which did most of the heavy lifting.

We can make a few small changes to the User class:

We have an attribute called $reindex. When it is false we don’t need to worry about the index. When something significant changes, like an update to your name or email address, then we set $reindex to true. Then when we save with an overridden save method:

Now we’ve got the exact same data that we created during our original indexing. This handled creating and updating objects, but we miss updating the index when deleting objects.

Luckily we already made a function User::removeFromIndex() to remove any related documents from the index, so our delete function can be pretty simple:
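As a rough, language-neutral illustration of the pattern this post describes (written in Python rather than the Propel/PHP code the post refers to), the sketch below uses a reindex flag, an overridden save and a delete that cleans up the index. `SearchIndex` is a made-up in-memory stand-in for a Lucene index, and all method names are hypothetical.

```python
class SearchIndex:
    """In-memory stand-in for a Lucene index, keyed by user id."""

    def __init__(self):
        self.documents = {}

    def add(self, uid, document):
        self.documents[uid] = document

    def remove(self, uid):
        self.documents.pop(uid, None)


class User:
    def __init__(self, uid, name, email, index):
        self.uid = uid
        self._name = name
        self._email = email
        self._index = index
        self._reindex = True  # a new object always needs indexing

    def set_email(self, email):
        if email != self._email:
            self._email = email
            self._reindex = True  # an indexed field changed

    def generate_document(self):
        return {"uid": self.uid, "name": self._name, "email": self._email}

    def remove_from_index(self):
        self._index.remove(self.uid)

    def save(self):
        # (the database write would happen here)
        if self._reindex:
            # Replace any stale document with a fresh one.
            self.remove_from_index()
            self._index.add(self.uid, self.generate_document())
            self._reindex = False

    def delete(self):
        # (the database delete would happen here)
        self.remove_from_index()


index = SearchIndex()
user = User(1, "Dave", "dave@example.com", index)
user.save()
user.set_email("dash@example.com")
user.save()
```

The index only gets touched when an indexed field has changed, and deleting the object removes its document as well.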

The Lucene Search Index and symfony

2007-04-23T00:00:00+00:00

[tags]Zend, Zend Search Lucene, Search, Lucene, php, symfony, zsl, index[/tags]

This article is meant to follow up sfZendPlugin where we learn a newer way of obtaining the Zend Framework.

In this tutorial we’re going to delve into the Lucene index. Zend Search Lucene relies on building a Lucene index. This is a directory that contains files that can be indexed and queried by Lucene or other ports. In our example we’ll be creating a search for user profiles.

We’ll want to store in our app.yml the precise location of this index file so we can refer to it in our app¹.

Here’s an example:

all:
  search:
    user_index: /tmp/myapp.user.lucene.index

Now when we need to refer to the index we can do sfConfig::get('app_search_user_index').
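The key name follows from flattening the nested YAML. As a minimal sketch, assuming nested keys are simply joined with underscores under an `app` prefix (which is what the key above suggests; this is not symfony's own code):

```python
def flatten(config, prefix):
    """Flatten nested dicts into underscore-joined keys."""
    flat = {}
    for key, value in config.items():
        name = "%s_%s" % (prefix, key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


# The "all" section of the app.yml example above, as a dict.
app_yml_all = {"search": {"user_index": "/tmp/myapp.user.lucene.index"}}
settings = flatten(app_yml_all, "app")
```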

Index Something

Let’s try a user search where we can find a user by their name or email address. It’s fairly simple to accomplish, and hardly requires the use of ZSL, but by using ZSL we can easily extend it to do a full-text search of a user’s profile or any other textual data.

Each “thing” stored in the index is a Lucene “document”. Each document then consists of several “fields” (Zend_Search_Lucene_Field objects). In our example, each document will be an individual user and the fields will be relevant attributes of the user (username, first name, last name, email, the text of their profile).

Initially we’ll want to populate our index. We may also want to regularly reindex all the users at once to optimize the search performance. Since reindexing involves multiple users it would make sense to have a static reindex method in our UserPeer class².

Very simply, we’re creating a new index, getting all the users, adding a document to the index and then committing the index (to disk). You might have noticed that there’s a strange function, User::generateZSLDocument(). This function contains all the magic. In order to not repeat ourselves we keep the internals of making a document for the Lucene index in the User class itself. Let’s look at it:

We’re really just dumping the relevant search terms into this document. The beauty of keeping this code internalized in the User class is we can reuse it later if we need to index a single User at a time.

A couple things to note. Zend_Search_Lucene_Field::Keyword allows us to store data that we can lookup later. We store the User::id in a field called uid since id is a reserved word for the index and we can’t access it from Zend Search Lucene.

In a batch script or a reindex action we can now just call UserPeer::reindex() and have a working search index for our users.

¹ Storing things in app.yml is great for indexes that don't need to be searched in multiple applications.
² Since we're using a Lucene index, which has an open documented structure, we aren't limited to just using Zend Search Lucene or Apache Lucene (java). We can mix and match and read and write to the same index file. For very large indexes (65,000+ documents), I rewrote a Java application to index all the documents at once as PHP would time out during such a task.

sfZendPlugin

2007-04-10T00:00:00+00:00

[tags]Zend, Zend Search Lucene, Search, Lucene, php, symfony, zsl, plugins[/tags]

I originally intended to rewrite my Zend Search Lucene tutorial, but Peter Van Garderen covered the bulk of what’s changed and I was too busy developing search functionality for lyro.com (not to mention finding inconsistencies with the Zend Search Lucene port and Lucene) to finish the tutorial. So I broke it up into smaller pieces.

I packaged Zend Framework into a symfony plugin. symfony is easily extended using plugins.

You can obtain this from subversion with the following command (from your /plugins directory):

svn export http://svn.symfony-project.com/plugins/sfZendPlugin

symfony has a Zend Framework Bridge which lets us autoload the framework by adding the following to settings.yml:

.settings:
  zend_lib_dir:   %SF_ROOT_DIR%/plugins/sfZendPlugin/lib
  autoloading_functions:
    - [sfZendFrameworkBridge, autoload]

First we define sf_zend_lib_dir to be in our plugin’s lib directory. Then we autoload the bridge framework.

After setting this up, all the Zend classes will be available and auto-loaded from elsewhere in your code.

Using Zend Search Lucene in a symfony app

2006-08-25T00:00:00+00:00

[tags]zend, search, lucene, zend search lucene, zsl, symfony,php[/tags]

If you’re like me you’ve probably followed the Askeet tutorial on Search in order to create a decent search engine for your web app. It’s fairly straightforward, but they hinted that when Zend Search Lucene (ZSL) is released, that might be the way to go. Well we are in luck, ZSL is available, so let’s just dive right in.

If you aren’t using symfony have a look at this article from the Zend Developer Zone. It covers just enough to get you started. If you are using symfony, just follow along and we’ll get you where you need to go.

Obtaining Zend Search Lucene

First download the Zend Framework (ZF). The Zend Framework is supposed to be fairly “easy” in terms of installation. So let’s put that to the test. Open your ZF archive. Copy Zend.php and Zend/Search to your symfony project’s library folder:

cp Zend.php $SF_PROJECT/lib
mkdir $SF_PROJECT/lib/Zend
cp -r Zend/Search $SF_PROJECT/lib/Zend
cp Zend/Exception.php $SF_PROJECT/lib/Zend
chmod -R a+r $SF_PROJECT/lib/Zend*

Index Something

We’ll deviate slightly from food themed tutorials and do something generic. Let’s try a user search where we can find a user by their name or email address. It’s fairly simple to accomplish, and hardly requires the use of ZSL, but by using ZSL we can easily extend it to do a full-text search of a user’s profile or any other textual data.

Each “thing” stored in the index is a “document” in ZSL, specifically a Zend_Search_Lucene_Document. Each document then consists of several “fields” (Zend_Search_Lucene_Field objects). In our example, our document will be an individual user and the fields will be relevant attributes of the user (username, first name, last name, email, the text of their profile).

We’re going to write a general re-indexing tool. Something that will index all users.

In our userActions class let’s add the following action:

The code should be fairly easy to follow. First of all we’re requiring the necessary libraries for Lucene. The next line we are creating the index:

app_search_user_index_file is a symfony configuration that you define in your app.yml. It defines which file you want to use for your index. /tmp/lucene.user.index works for our purposes. The second parameter tells Lucene we are creating a new index.

We then loop through all the users and for each user create a document. For all the search relevant attributes that a user might have we add a field into the document. Note the last field:

By default search is made for the “contents” field. So in this example we want people to be able to type in someone’s name, email, username without having to specify what field we’re searching for.

Find those users

Finding the users is equally straightforward. We make a new action called search:

If we have a query, open the ZSL index (note that we only have one parameter here). Run the find method to find our query and store it to the $hits array. Note that our query was cleaned with strtolower, since ZSL is case sensitive.

The template takes care of the rest:

Fairly simple… but it could use some cleaning up (enjoy).

What about new users?

Regularly reindexing might be nice in terms of having an optimized search index, but it’s lousy if you want to be able to search the network immediately when new people join. So why not automatically re-index each user every time they are created or every time one of their indexed components is changed?

This should be fairly simple by adding to the User class:

We have an attribute called $reindex. When it is false we don’t need to worry about indexes. When something significant changes, like an update to your name or email address, then we set $reindex to true. Then when we save:

We’re calling a new function called generateZSLDocument. It might look familiar:

Now, whenever a user is updated, so is our index. Additionally we can modify our reindex action:

That’s a lot easier to deal with.

…and beyond

Hope this article helps some of you jumpstart your symfony apps. Really cool, easy to implement search is here. We no longer have to stick with shoddy solutions like HT://Dig or spend time rolling our own full text search, as the symfony team diligently showed us we could. But there is a lot more ground to cover, including optimization techniques and best practices.

Let me know what you think, and if you use this in any of your apps.