Paul's LINQ to Lucene Questions

Sep 7, 2008 at 10:05 AM
Paul posted some questions about linq to lucene and its usage on my blog. I responded via email, but decided to post here for everyone interested.

"A couple quick questions. The Lucene docs say that for a web app, you
should 'keep your index open' across postbacks, either via the app
cache or static objects or whatever. Is that true for Linq-to-Lucene
as well?"

Yup. Index<T> contains a property called Context (of type
IndexContext). The index context instance holds the IndexModifier
and IndexSearcher Lucene instances that do all the magic under the
hood. I've tried to hide this via encapsulation in a sensible way. If
you keep your Index<T>, IndexSet or DatabaseIndexSet<TData> open and
in your Application cache for the life of the app, you can safely
query and modify the index without instantiating anything you don't
need to.
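As a rough sketch only: assuming an Index<T> constructor that takes an index directory path and a Customer mapping (check the actual linq to lucene API for the real signatures), caching the index in Global.asax might look like this:

```csharp
// Hypothetical sketch: keep a single Index<T> alive for the life of
// an ASP.NET app via Application state. The Index<Customer>
// constructor signature and Dispose call are assumptions.
public class Global : System.Web.HttpApplication
{
    protected void Application_Start(object sender, System.EventArgs e)
    {
        // Open the index once; its IndexContext holds the
        // IndexModifier/IndexSearcher instances for reuse.
        var index = new Index<Customer>(@"C:\indexes\customers");
        Application["CustomerIndex"] = index;
    }

    protected void Application_End(object sender, System.EventArgs e)
    {
        // Release the underlying Lucene objects on shutdown.
        var index = Application["CustomerIndex"] as Index<Customer>;
        if (index != null)
            index.Dispose();
    }
}
```

Pages then fetch the cached instance from `Application["CustomerIndex"]` instead of opening the index per request.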

"Second, how often would you recommend managing re-indexing? I was
contemplating writing the index on application start, and then when
new objects get added, do it through a repository object which then
also writes them to the index, rather than re-indexing the whole db
table again."

Firstly, Lucene has no ability to update documents in an index in
place. To update a document, you must delete it and then add it again.
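In raw Lucene.NET 2.x terms (the API linq to lucene wraps), the delete-then-add dance looks roughly like this; the "id" and "title" field names and the key scheme are illustrative, not part of the library:

```csharp
// Delete-then-add sketch against the Lucene.NET 2.x API.
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static void UpdateDocument(string indexPath, string id, string title)
{
    // false = open an existing index rather than create a new one.
    var modifier = new IndexModifier(indexPath, new SimpleAnalyzer(), false);
    try
    {
        // 1. Delete every document whose "id" term matches.
        modifier.DeleteDocuments(new Term("id", id));

        // 2. Re-add the document with its new field values.
        var doc = new Document();
        doc.Add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.Add(new Field("title", title, Field.Store.YES, Field.Index.TOKENIZED));
        modifier.AddDocument(doc);
    }
    finally
    {
        modifier.Close();
    }
}
```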

I've given updating the index a lot of thought (and done it twice in
production using the Lucene API), but I haven't yet come to a decision
about how to make it easier for linq to lucene users. Ideally, I'd
like to listen to the ObjectTrackingManager (or whatever it's called)
for changes and progressively update the index as SaveChanges is
called. However, this approach will have definite performance issues.
The approach you pick depends on a balance between performance (you
don't want to update your index too often, otherwise searchers are
taken out of action too often) and how up to date the index needs to be.

On the previous two projects I've worked on, the index has been small
enough to reindex the entire database (barring deactivated records,
expired records etc) every hour then swap the index files across. If
you can afford the CPU time and penalties of data being out of date by
an hour, I'd follow this approach. Please let me know if you can't and
we'll think about a smarter interim solution.
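One possible shape for that rebuild-and-swap step, using nothing but System.IO (the paths and the rebuild delegate are placeholders, not linq to lucene API):

```csharp
// Sketch: build the fresh index in a scratch directory, then rename
// directories so searchers can reopen against the new files.
using System;
using System.IO;

public static class IndexSwapper
{
    public static void RebuildAndSwap(string livePath, Action<string> rebuildInto)
    {
        string scratch = livePath + ".rebuilding";
        string retired = livePath + ".old";

        if (Directory.Exists(scratch)) Directory.Delete(scratch, true);
        rebuildInto(scratch);          // full reindex of the database

        // Swap: retire the live index, promote the fresh one.
        if (Directory.Exists(retired)) Directory.Delete(retired, true);
        if (Directory.Exists(livePath)) Directory.Move(livePath, retired);
        Directory.Move(scratch, livePath);

        // Any cached IndexSearcher must be reopened after the swap.
    }
}
```

The retired copy is kept one cycle as a cheap rollback; delete it sooner if disk space matters.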

I'm working on a batch indexing and update system at the moment, but
am running into sync issues I haven't been able to solve. This system
will allow developers to choose "how" an index set is indexed: whether
tables are indexed in parallel or sequentially, and how much to buffer
in memory during the process. The updater will do something similar:
it will keep a queue of documents that are "dirty", then periodically
or manually update each document.

"Third, I wasn't clear from your description why I would want to
tokenize, can you explain or link? "

Tokenization is necessary to let the term vector model do its magic,
and to allow individual terms within a field to be searchable.
Tokenization is actually one phase of a process called field analysis.
When a value or query must be converted into Lucene field values, the
chosen analyzer for that field/document is executed. Field analysis is
a set of composable functions that convert a string into an array of
values. Analyzers work to create the most normalized string
representation of a field, so that variants are properly matched.
SimpleAnalyzer, for example, tokenizes on non-letter characters and
lower-cases each token. Without an analyzer on the field, Lucene will
happily index the string as-is, and the original casing will be
trapped in the index.
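To make the effect concrete, here's a plain C# approximation of what SimpleAnalyzer produces for a field value (the real analyzer works through a TokenStream; this just mimics the split-and-lower-case behaviour):

```csharp
// Rough imitation of SimpleAnalyzer: break the value at non-letter
// characters, lower-case each token, drop empties.
using System;
using System.Linq;

public static class AnalyzerSketch
{
    public static string[] Analyze(string value)
    {
        return value
            .Split(value.Where(c => !char.IsLetter(c)).Distinct().ToArray(),
                   StringSplitOptions.RemoveEmptyEntries)
            .Select(token => token.ToLowerInvariant())
            .ToArray();
    }
}

// AnalyzerSketch.Analyze("Marketing Manager")
//   -> ["marketing", "manager"]
```

With both the indexed value and the query analyzed the same way, "marketing manager" and "Marketing Manager" reduce to the same terms and match.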

e.g. If ContactTitle weren't analyzed, the following query wouldn't find anything:

var managers = from c in index.Get<Customer>
               where c.ContactTitle == "marketing manager"
               select c;

Why? Because all those customers have titles of "Marketing Manager".
Upon LINQ-to-objects translation, linq to lucene will generate a
TermQuery that tries to match the field value "marketing manager" with
"Marketing Manager". For speed (and usability), Lucene uses
case-sensitive string equality. This seems very strange for simple
text fields, but becomes important when you need to index really long
fields and strangely formatted fields.
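For reference, the query above boils down to something like this raw Lucene.NET query (the field name is taken from the example; the exact translation linq to lucene performs may differ):

```csharp
// TermQuery compares the literal indexed term, character for
// character, so "marketing manager" never equals "Marketing Manager".
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class QuerySketch
{
    public static Query BuildManagerQuery()
    {
        return new TermQuery(new Term("ContactTitle", "marketing manager"));
    }
}
```

Analysis at index time (and matching analysis at query time) is what bridges that gap.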

Hope that helps,
Sep 7, 2008 at 10:08 AM
The issue of reinstantiating the data context is discussed further here - DataContext recycle discussion