GSoC 2010 – Progress Report 2

2010
07.10

This report will outline the work done so far right before the mid terms and tally it with my original timeline.

I had been working with the core search features with the simple test harness that I had made in the last iteration. To illustrate the work done, I will go a bit into detail about Whoosh too. From the documentation: Whoosh is a fast, pure Python search engine library. And so far, its been quite fast. :)

The work flow starts from the indexer. Every email is a document that I insert into the index. Currently the indexer is plain text based. This Storage engine can be rewritten, but its somewhat tedious. So will keep this for the end and discuss it with the team.

The document is structured in the following manner:

msgid
author
date
email
subject
body
filename
period

Most of the fields should be self explanatory. ‘filename’ points to the HTML file that contains the actual Email content (from pipermail) and ‘period’ stores the time when the month based timestamp (just like the folders in pipermail). The index is stored in a directory, with pickled files. After this, the searching module of Whoosh can be accessed with custom built queries. Following points map the queries with the use cases (all of this has actually been implemented):

1. Simple Search:

Say a user enters:

linux network problem

The search engine’s tokenizer will break this down into: linux AND network AND problem. It will search in the “body” field by default and give the results.

If exact phrase match is required the user can enter:

"linux network problem"

With the quotes, The search engine will not run a tokenizer on it and match the exact string. This is similar to the exact string match in Google.

2. Specific Search:

Suppose a user wants to search for a specific user who has sent an email, or search for a word only in the subject field, he/she can enter:

author:Priya

or

subject:gsoc

This will search only in the mentioned in the index.

Though this one is more for advanced users, one can use a wild card like * for inexact matching. AND, OR, NOT keywords can also be coupled with words for a better/narrow search result set. I have tested this and it works great.

3. Fuzzy Search:

This was a bit daunting at the start, and I was wondering how I would go about this.

Bad spelling is a very common problem in search. For Mailocate, I am using the built in Whoosh dictionary for spell checking and providing suggestions to the users. For this, I build a dictionary using the corpus from the index itself. All the words from the index are added to the SpellCheck module which looks up (quite fast) and gives suggestions based only on what is there in the index. This solution worked well. For instance, entering “secnd” will give a suggestion for “second”.

The other issue was the different forms of a word. Like:

second, seconds, secondary
fundament, fundamental, fundamentally

There is a Whoosh module called whoosh.analysis.StemmingAnalyzer. This uses the Porter Stemming Algorithm to remove suffixes. This analysis module is to be used during the Schema construction so the search engine keeps it in mind when looking up for the words.

Therefore, if I am looking for ‘seconds’ and ‘second’ is in the database, it will produce a match. It works vice versa too.

I have all of the above use cases working at the moment. Mapping this with my proposal:

  1. FieldSearch – Done
  2. BooleanSearch – Done
  3. FuzzySearch – Done (subject to iterations and improvements)
  4. WildCardSearch – Done
  5. PhraseSearch – Done
  6. RelevantSearch – Pending
  7. KeywordSearch – Pending

I am still wondering of the best way to extract keywords from an email. This is usually done by extracting the highest occurring word in the document. I could also use subject for the keyword matching.

For relevant search, which I shall tackle next, I plan to use the whoosh.lang.wordnet.thesaurus. Hopefully it will also be as awesome as the other tools in Whoosh have been. :) Slightly behind schedule, but now that the base is in place, I can catch up quite fast.

Stay tuned for the link to the code/demo. I shall update this post with the Launchpad link in a day or two. :)

viagra

Tags: , , ,

Your Reply