The Lucene Search Index and symfony


This article follows up on sfZendPlugin, where we learned a newer way of obtaining the Zend Framework.

In this tutorial we’re going to delve into the Lucene index. Zend Search Lucene relies on building a Lucene index: a directory of files that Lucene, or any of its ports, can write to and query. In our example we’ll be creating a search for user profiles.

We’ll want to store the precise location of this index in our app.yml so we can refer to it throughout our app [1].

Here’s an example:

    all:
      search:
        user_index: /tmp/myapp.user.lucene.index

Now when we need to refer to the index we can do sfConfig::get('app_search_user_index').

Index Something

Let’s try a user search where we can find a user by their name or email address. It’s fairly simple to accomplish, and hardly requires the use of ZSL, but by using ZSL we can easily extend it to do a full-text search of a user’s profile or any other textual data.

Each “thing” stored in the index is a Lucene “document”. Each document then consists of several “fields” (Zend_Search_Lucene_Field objects). In our example, each document will be an individual user and the fields will be relevant attributes of the user (username, first name, last name, email, the text of their profile).

Initially we’ll want to populate our index. We may also want to regularly reindex all the users at once to optimize the search performance. Since reindexing involves multiple users, it makes sense to have a static reindex method in our UserPeer class [2].
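A minimal sketch of what such a reindex method might look like. It assumes the Zend Framework is on the include path, that the index location is configured under app_search_user_index, and that UserPeer is a Propel peer class generated by symfony; it is an illustration, not a definitive implementation.

```php
<?php
// Assumption: the Zend Framework is on the include path (e.g. via sfZendPlugin).
require_once 'Zend/Search/Lucene.php';

class UserPeer extends BaseUserPeer
{
  /**
   * Rebuild the user search index from scratch.
   */
  public static function reindex()
  {
    // create() starts a fresh index at the configured location
    $index = Zend_Search_Lucene::create(sfConfig::get('app_search_user_index'));

    // an empty Criteria selects every user
    $users = self::doSelect(new Criteria());

    foreach ($users as $user)
    {
      // each user knows how to describe itself as a Lucene document
      $index->addDocument($user->generateZSLDocument());
    }

    // write the pending documents to disk
    $index->commit();
  }
}
```

Loading every user in one doSelect() call is fine for a small table; for a large one, fetching the users in batches would keep memory use down.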

Very simply, we’re creating a new index, getting all the users, adding a document to the index for each user and then committing the index (to disk). The documents come from a function that may look strange, User::generateZSLDocument(). This function contains all the magic. In order to not repeat ourselves, we keep the internals of making a document for the Lucene index in the User class itself. Let’s look at what it does.
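Here is one way generateZSLDocument() could be written. The field names other than uid, and the getters (getUsername(), getFirstName(), getLastName(), getEmail(), getProfile()), are assumptions about the User model used for illustration; adjust them to match your own schema.

```php
<?php
// Assumption: the Zend Framework is on the include path (e.g. via sfZendPlugin).
require_once 'Zend/Search/Lucene.php';

class User extends BaseUser
{
  /**
   * Build the Lucene document that represents this user.
   */
  public function generateZSLDocument()
  {
    $doc = new Zend_Search_Lucene_Document();

    // Keyword: stored and indexed as a single, untokenized term
    $doc->addField(Zend_Search_Lucene_Field::Keyword('uid', $this->getId()));

    // Text: tokenized, indexed and stored, so it can be shown in results
    $doc->addField(Zend_Search_Lucene_Field::Text('username', $this->getUsername()));
    $doc->addField(Zend_Search_Lucene_Field::Text('firstname', $this->getFirstName()));
    $doc->addField(Zend_Search_Lucene_Field::Text('lastname', $this->getLastName()));
    $doc->addField(Zend_Search_Lucene_Field::Text('email', $this->getEmail()));

    // UnStored: tokenized and indexed, but the text itself is not kept in the index
    $doc->addField(Zend_Search_Lucene_Field::UnStored('profile', $this->getProfile()));

    return $doc;
  }
}
```

Using UnStored for the profile keeps the index small: the text is searchable, and the full profile can still be loaded from the database once we know which user matched.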

We’re really just dumping the relevant search terms into this document. The beauty of keeping this code internalized in the User class is that we can reuse it later if we need to index a single User at a time.

A couple of things to note. Zend_Search_Lucene_Field::Keyword allows us to store data that we can look up later. We store the User::id in a field called uid, since id is reserved by the index: on a search hit, id refers to the document’s internal number within the index, so a field of our own with that name can’t be read back directly through Zend Search Lucene.
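To illustrate why the uid field is useful, here is a short sketch of reading it back from search hits. It assumes the code runs inside a symfony context (an action, for example) with an index that has already been built; the query string 'dave' is just a placeholder.

```php
<?php
// Assumption: the Zend Framework is on the include path (e.g. via sfZendPlugin).
require_once 'Zend/Search/Lucene.php';

$index = Zend_Search_Lucene::open(sfConfig::get('app_search_user_index'));

// find() parses the query string and returns an array of hits
$hits = $index->find('dave');

$users = array();
foreach ($hits as $hit)
{
  // $hit->id is the hit's internal document number in the index,
  // so the database key comes from our own uid field instead
  $user = UserPeer::retrieveByPK($hit->uid);

  // the user may have been deleted since the last reindex
  if ($user !== null)
  {
    $users[] = $user;
  }
}
```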

In a batch script or a reindex action we can now just call UserPeer::reindex() and have a working search index for our users.
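As a rough sketch, a batch script for this could look like the following. It is modelled on the skeleton that symfony 1.0 generates for batch scripts; the application name (frontend), the environment and the file location (a batch/ directory one level below the project root) are assumptions to adapt to your project.

```php
<?php
// batch/reindex_users.php (hypothetical file name)

define('SF_ROOT_DIR',    realpath(dirname(__FILE__).'/..'));
define('SF_APP',         'frontend'); // assumed application name
define('SF_ENVIRONMENT', 'prod');
define('SF_DEBUG',       0);

require_once(SF_ROOT_DIR.DIRECTORY_SEPARATOR.'apps'.DIRECTORY_SEPARATOR.SF_APP.DIRECTORY_SEPARATOR.'config'.DIRECTORY_SEPARATOR.'config.php');

// Propel needs the database connections to be initialized
$databaseManager = new sfDatabaseManager();
$databaseManager->initialize();

// rebuild the whole user index
UserPeer::reindex();
```

Running it from the command line (php batch/reindex_users.php) also avoids the execution time limit that a web request would be subject to.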

  1. Storing things in app.yml is great for indexes that don't need to be searched in multiple applications.
  2. Since we're using a Lucene index, which has an open, documented structure, we aren't limited to just using Zend Search Lucene or Apache Lucene (Java). We can mix and match, reading and writing to the same index from either. For very large indexes (65,000+ documents), I rewrote a Java application to index all the documents at once, as PHP would time out during such a task.