<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Dave Dash</title>
 <link href="http://davedash.com/tag/stem/atom.xml" rel="self"/>
 <link href="http://davedash.com/tag/stem"/>
 <updated>2012-01-17T21:54:19-08:00</updated>
 <id>http://davedash.com/</id>
 <author>
   <name>Dave Dash</name>
   <email>dd+atom1@davedash.com</email>
 </author>

 
 <entry>
   <title>py vs php: stemming</title>
   <link href="http://davedash.com/2008/02/16/py-vs-php-stemming/"/>
   <updated>2008-02-16T00:00:00-08:00</updated>
   <id>http://davedash.com/2008/02/16/py-vs-php-stemming</id>
   <content type="html">&lt;p&gt;I've been porting some PHP to Python during SuperHappyDevHouse and was amazed at how little code I needed to write, since Python makes list manipulation a breeze.&lt;/p&gt;

&lt;p&gt;Today I was working on stemming (à la the &lt;a href=&quot;http://tartarus.org/martin/PorterStemmer/&quot;&gt;Porter Stemming algorithm&lt;/a&gt;).  &lt;a href=&quot;http://reviewsby.us/&quot;&gt;reviewsby.us&lt;/a&gt; uses stemming in its search engine to normalize queries.&lt;/p&gt;

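&lt;p&gt;Stemming turns &lt;code&gt;hello everybody how are you guy's&lt;/code&gt; into the collection &lt;code&gt;'everybodi', 'gui', 'hello'&lt;/code&gt;: stop words such as &lt;code&gt;how&lt;/code&gt;, &lt;code&gt;are&lt;/code&gt; and &lt;code&gt;you&lt;/code&gt; are dropped, and the remaining words are reduced to their stems.  To produce this in PHP I do the following:&lt;/p&gt;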

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;php&quot;&gt;
    public static function stemPhrase($phrase)
    {
        // remove apostrophes and periods, then lowercase
        $phrase = strtolower(str_replace(array('\'', '.'), '', $phrase));
        
        // split into words
        $words = str_word_count($phrase, 1);

        // ignore stop words
        $words = array_diff($words, STOP_WORDS_ARRAY);

        // stem words
        $stemmed_words = array();

        foreach ($words as $word)
        {
            $stemmed_words[] = PorterStemmer::stem($word, true);
        }

        return $stemmed_words;
    }
&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;With some magic Python:&lt;/p&gt;

&lt;div&gt;&lt;textarea name=&quot;code&quot; class=&quot;python&quot;&gt;
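def stem_phrase(phrase):
    # lowercase, strip periods and apostrophes, then split on whitespace
    words = phrase.lower().replace('.', '').replace(&quot;'&quot;, '').split()

    # ignore stop words (the set difference also drops duplicates and ordering)
    words = list(set(words) - set(STOP_WORDS))

    p = PorterStemmer()

    # stem each word; stem() takes the word plus its start and end indices
    return [p.stem(word, 0, len(word) - 1) for word in words]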

&lt;/textarea&gt;&lt;/div&gt;


&lt;p&gt;The magic here is list comprehensions.  When you first learn about them they don't seem that great, but as soon as you start coding you find yourself writing far fewer for loops.&lt;/p&gt;
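
&lt;p&gt;As a minimal sketch of that difference, the snippet below writes the same filter-and-stem step twice: once with an explicit loop and once with Python's list comprehension syntax.  The tiny &lt;code&gt;STOP_WORDS&lt;/code&gt; set and the &lt;code&gt;toy_stem&lt;/code&gt; function are hypothetical stand-ins so the example runs on its own; they are not the real stop-word list or the Porter stemmer.&lt;/p&gt;

```python
# Hypothetical stand-ins so the sketch is self-contained:
# a tiny stop-word set and a toy "stemmer" (NOT the Porter algorithm).
STOP_WORDS = {'how', 'are', 'you'}


def toy_stem(word):
    """Strip one trailing 's'; a placeholder for a real stemmer."""
    return word[:-1] if word.endswith('s') else word


def stem_with_loop(words):
    """Filter and stem by appending to a list inside an explicit loop."""
    stemmed = []
    for word in words:
        if word not in STOP_WORDS:
            stemmed.append(toy_stem(word))
    return stemmed


def stem_with_comprehension(words):
    """The same filter-and-map step as a single list comprehension."""
    return [toy_stem(word) for word in words if word not in STOP_WORDS]


words = 'hello everybody how are you guys'.split()
loop_result = stem_with_loop(words)
comprehension_result = stem_with_comprehension(words)
```

&lt;p&gt;Both functions return &lt;code&gt;['hello', 'everybody', 'guy']&lt;/code&gt; for this input; the comprehension just folds the loop, the condition and the append into one expression.&lt;/p&gt;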

&lt;p&gt;I'm sure my PHP can be cleaned up and reduced as well, but it's fun exploiting the magic of languages.&lt;/p&gt;
</content>
 </entry>
 

</feed>

