Linq to Lucene in a multithreaded environment

Apr 11, 2009 at 11:08 PM
Firstly, thanks for the great job you guys have done making Lucene interoperable with LINQ. It makes such a big difference, and I will surely stay tuned for all the latest updates.

A question regarding Lucene in a multithreaded environment such as a web app:

I have a shared static index that I need to keep up to date. As I save any entities within my repositories I also add a new doc to the index or in the case of an update I'll delete and then add. If other threads are performing searches on these indexes at the same time under large volumes will I encounter any race conditions? Does Lucene use any form of locking? What, in your opinion, is the best way (i.e. reliable and performant) of keeping the indexes up to date, as close to real time as possible?

Thanks
Best Regards,
Nabil
Coordinator
Apr 12, 2009 at 7:00 AM
"I have a shared static index that I need to keep up to date. As I save any entities within my repositories I also add a new doc to the index or in the case of an update I'll delete and then add. If other threads are performing searches on these indexes at the same time under large volumes will I encounter any race conditions?"

Yes, you will. But I've added reader-writer locks around access to the underlying Lucene IndexWriter/searchers (note: Lucene itself provides no index locking). While a write is in progress, the searchers are blocked. This seems to work, but it means your searchers have to wait until adds/updates are finished. Depending on the frequency of index changes, this may be sufficient. If you expect a high volume of updates, it's best to adopt a different strategy.
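
To make the locking idea above concrete, here is a minimal, language-neutral sketch in Python. It is not the LINQ to Lucene implementation: `ReadWriteLock` and `SharedIndex` are hypothetical names, and the "index" is just an in-memory dictionary standing in for a real Lucene index. It only illustrates the behaviour described: many searches may run together, while an update (delete-then-add) takes exclusive access and makes searchers wait.

```python
import threading
from contextlib import contextmanager


class ReadWriteLock:
    """Many concurrent readers OR one exclusive writer (hypothetical helper)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0       # number of searches currently running
        self._writing = False   # True while an update holds the lock

    @contextmanager
    def read(self):
        with self._cond:
            while self._writing:            # searchers wait for the writer
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                if self._readers == 0:      # last reader out wakes writers
                    self._cond.notify_all()

    @contextmanager
    def write(self):
        with self._cond:
            while self._writing or self._readers:
                self._cond.wait()
            self._writing = True
        try:
            yield
        finally:
            with self._cond:
                self._writing = False
                self._cond.notify_all()


class SharedIndex:
    """Toy stand-in for a shared static index: doc id -> text."""

    def __init__(self):
        self._lock = ReadWriteLock()
        self._docs = {}

    def search(self, term):
        with self._lock.read():
            return sorted(k for k, text in self._docs.items() if term in text)

    def update(self, doc_id, text):
        # Delete-then-add happens as one exclusive step, so a search
        # never observes the document as missing halfway through.
        with self._lock.write():
            self._docs.pop(doc_id, None)
            self._docs[doc_id] = text
```

Note that this simple version lets a steady stream of readers starve a writer, which is acceptable only when writes are rare; a production lock would usually give waiting writers priority.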

There is a range of strategies for keeping an index in sync, as you can imagine. LINQ to Lucene doesn't force you into one; it just protects you in the simplest case. The strategy you choose depends on your performance criteria:
  • frequency of changes to the real data
  • volume of search queries, and query speed under peak update load
  • how out of sync you can afford to be (in my experience, you'll be surprised how out of date the index can be)
You'll have to prioritize these and choose a strategy accordingly.
Google, for example, makes volume of searches the highest priority. If their uber index is out of date, that's okay, as long as queries don't suffer.

Most customers I've had never want to compromise query speed, so I keep two indexes (because disk space and memory are cheap). One is hooked up to queries, the other takes the updates. Every hour or so I switch the searchers onto the updated index, then disk-copy that index into the alt index directory and use the alt index for updates again. I repeat this ad nauseam. Updates (add, delete-add or delete) are made using a buffer-or-time-limit algorithm: if the update buffer is filled, the changes are made, OR every 5 mins all pending updates are made to the alt index.
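
The two-index approach described above could be sketched roughly like this. Again, this is an illustration in Python under stated assumptions, not code from LINQ to Lucene: `BufferedUpdater` and `TwoIndexes` are hypothetical names, dictionaries stand in for the index directories, the buffer size of 100 is an arbitrary placeholder (only the 5 minute limit comes from the description), and the injectable `clock` exists only so the time limit can be exercised without waiting.

```python
import time


class BufferedUpdater:
    """Buffer-or-time-limit: flush when the buffer fills OR it gets too old."""

    def __init__(self, apply_batch, max_items=100, max_age_seconds=300.0,
                 clock=time.monotonic):
        self._apply_batch = apply_batch   # callback that writes to the alt index
        self._max_items = max_items
        self._max_age = max_age_seconds
        self._clock = clock
        self._pending = []
        self._last_flush = clock()

    def add(self, update):
        self._pending.append(update)
        self.flush_if_due()

    def flush_if_due(self):
        # In a real app a timer would also call this periodically.
        full = len(self._pending) >= self._max_items
        stale = self._clock() - self._last_flush >= self._max_age
        if self._pending and (full or stale):
            self.flush()

    def flush(self):
        batch, self._pending = self._pending, []
        self._last_flush = self._clock()
        if batch:
            self._apply_batch(batch)


class TwoIndexes:
    """Searchers read `live`; updates only ever touch `alt`."""

    def __init__(self):
        self.live = {}
        self.alt = {}

    def apply_batch(self, batch):
        for doc_id, text in batch:
            if text is None:              # (id, None) means delete
                self.alt.pop(doc_id, None)
            else:                         # add, or delete-then-add
                self.alt[doc_id] = text

    def search(self, term):
        return sorted(k for k, text in self.live.items() if term in text)

    def swap(self):
        # Point searchers at the updated index, then copy it so that
        # further updates go to a fresh alt copy.
        self.live = self.alt
        self.alt = dict(self.live)
```

For example, with `max_items=2` nothing reaches `alt` after the first `add`, both updates are applied on the second, and searches only see them after `swap()`. The sketch is single-threaded; in a web app the buffer and the swap would each need their own synchronization.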

It's VERY important that you define your perf requirements and write perf tests.
Negotiate carefully with your stakeholders and plan for the worst case. Be realistic and use real data where possible. Watch how the data grows. This is just generic perf advice, but it is essential in designing your search engine.

Please keep us informed about your specific situation. I've considered offering various indexing/sync strategies in LINQ to Lucene, but it's a wide field. Maybe your situation is simple, or more complex, or somewhere in between. It would be great to know how LINQ to Lucene can help you.



Apr 12, 2009 at 1:28 PM
In fact, in my case it is high-volume read and very low-volume write, so I guess your built-in reader-writer locks should be fine. Thanks for the warning about the perf tests. I'll keep it in mind. One more question: Does the index get fragmented over time if I am adding/removing items from it? Will it need a rebuild or optimise as part of a maintenance procedure?

Thanks
Coordinator
Apr 13, 2009 at 8:08 AM
"Does the index get fragmented over time if I am adding/removing items from it? Will it need a rebuild or optimise as part of a maintenance procedure?"

Yes. Take out a write lock when you Optimize. Let me know if it's worth exposing Optimize through either Index<T> or IndexContext.