Anything for incremental indexing?

Mar 29, 2012 at 12:43 PM

Is there any provision for incremental indexing?
We do not want to reindex all rows from SQL; we just need to add updated data. Is there any solution for this?

Developer
Mar 30, 2012 at 4:12 PM
Edited Mar 30, 2012 at 4:13 PM

Are you using a DataContext?

Try looking at the example in demo 1 (just called demo), where it simply maintains changed records.

I'm not yet completely familiar with that class; however, what I can tell you is that it's a bit of a challenge when you are constantly dealing with tables that have no standard structure and/or possibly no last-updated timestamp.

What would you suggest?

Mar 30, 2012 at 4:24 PM

I agree with you - "it's a bit of a challenge".

I have nothing to trigger the indexing process, and I do not want to use database triggers on update/insert. There is also no other way to trigger re-indexing after an update/insert. For the time being I thought of SqlCacheDependency, which can notify us of changes so we can run indexing; in that case we can use an updatedDate/timestamp column.
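A rough sketch of what I have in mind, using the ADO.NET SqlDependency API that underlies SqlCacheDependency (the table and column names are just placeholders, and Service Broker must be enabled on the database):

using System.Data.SqlClient;

public class IndexChangeListener
{
	private readonly string _connectionString;

	public IndexChangeListener(string connectionString)
	{
		_connectionString = connectionString;
		// Must be called once per AppDomain before creating dependencies
		SqlDependency.Start(_connectionString);
	}

	public void Subscribe()
	{
		using (SqlConnection connection = new SqlConnection(_connectionString))
		using (SqlCommand command = new SqlCommand(
			"SELECT Id, UpdatedDate FROM dbo.Posts", connection)) // placeholder table/columns
		{
			SqlDependency dependency = new SqlDependency(command);
			dependency.OnChange += (sender, e) =>
			{
				// Notifications fire only once, so re-subscribe first,
				// then index only rows newer than the last indexed timestamp
				Subscribe();
				ReindexChangedRows();
			};

			connection.Open();
			using (SqlDataReader reader = command.ExecuteReader())
			{
				// The command must be executed to register the notification
			}
		}
	}

	private void ReindexChangedRows()
	{
		// Query records where UpdatedDate > last indexed timestamp and
		// push just those records into the Lucene index
	}
}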

This is a really necessary solution for the industry. Thinking about industry use, we need to consider the following aspects:

The solution should be easy to plug into older projects without touching the code; a method interceptor is a good option.

It should help the user configure what to index and how, without touching code.

But as of now all of these things are not possible for me:

I am simply using attributes on classes and properties, which help me decide what to index and how.
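For example, something like this (the attribute names here are just my own illustration, not from any library):

using System;

[AttributeUsage(AttributeTargets.Class)]
public class FullTextIndexedAttribute : Attribute { }

[AttributeUsage(AttributeTargets.Property)]
public class IndexFieldAttribute : Attribute
{
	public bool Store { get; set; }    // keep the raw value in the index?
	public bool Analyze { get; set; }  // tokenize the value for full-text search?
}

[FullTextIndexed]
public class Post
{
	[IndexField(Store = true, Analyze = false)]
	public int Id { get; set; }

	[IndexField(Store = false, Analyze = true)]
	public string Body { get; set; }
}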

Another thing we are doing is triggering indexing from our business layer while saving/updating records. I need to check the behavior under heavy traffic - I am developing a social network and it will have a lot of writes - so I will decide on a locking strategy if necessary, or will restart indexing after some specific time or with a batch of records.
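The batching part could look roughly like this (a sketch only; all names are hypothetical):

using System;
using System.Collections.Concurrent;
using System.Threading;

public class BatchedIndexer
{
	private readonly ConcurrentQueue<int> _pendingIds = new ConcurrentQueue<int>();
	private readonly Timer _flushTimer;

	public BatchedIndexer(TimeSpan interval)
	{
		// Flush the accumulated batch on a fixed interval instead of
		// writing to the index on every save
		_flushTimer = new Timer(_ => Flush(), null, interval, interval);
	}

	// Called from the business layer after each save/update
	public void NotifyChanged(int recordId)
	{
		_pendingIds.Enqueue(recordId);
	}

	private void Flush()
	{
		int id;
		while (_pendingIds.TryDequeue(out id))
		{
			// Load the record by id and add/update it in the Lucene index
		}
	}
}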

Please let me know your suggestions. 

I will keep posting my findings here.

Developer
Mar 30, 2012 at 9:00 PM

So if I understand correctly:

  1. You cannot alter the code that inserts into the database
  2. You can however inject code
  3. You have a preference for a method interceptor

Ever consider the adapter pattern?

Table -> TableAdapter <- Index

So instead of passing an entity to your data layer, pass an entityIndexAdapter and also pass the same object to your indexing layer?
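A rough sketch of what I mean (all names hypothetical):

using System;

public interface IIndexable
{
	void WriteToIndex();
}

public class EntityIndexAdapter<TEntity> : IIndexable where TEntity : class
{
	private readonly Action<TEntity> _indexAction;

	public EntityIndexAdapter(TEntity entity, Action<TEntity> indexAction)
	{
		Entity = entity;
		_indexAction = indexAction;
	}

	public TEntity Entity { get; private set; }

	public void WriteToIndex()
	{
		_indexAction(Entity); // e.g. index.Add(Entity)
	}
}

// Usage: the same wrapper goes to both layers
// var adapter = new EntityIndexAdapter<Post>(post, p => index.Add(p));
// dataLayer.Save(adapter.Entity);
// adapter.WriteToIndex();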

If you are concerned about load, can I suggest indexing using a mix of the Command pattern and a queue? An MSMQ handler that reads messages from MSMQ and then triggers index updates is a very good way to keep a system scalable and responsive - you will avoid locking up your UI, as the handler (installed on a server somewhere) will be in charge of the heavy lifting.
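Roughly like this (a sketch only; the queue path and message type are made up):

using System.Messaging;

public class IndexCommand
{
	public string TableName { get; set; }
	public int RecordId { get; set; }
}

public class IndexQueueHandler
{
	private readonly MessageQueue _queue;

	public IndexQueueHandler()
	{
		_queue = new MessageQueue(@".\private$\indexCommands");
		_queue.Formatter = new XmlMessageFormatter(new[] { typeof(IndexCommand) });
	}

	// The web app just drops a lightweight command on the queue and returns
	public void Enqueue(IndexCommand command)
	{
		_queue.Send(command);
	}

	// A handler process (on a server somewhere) does the heavy lifting
	public void ProcessOne()
	{
		Message message = _queue.Receive(); // blocks until a message arrives
		IndexCommand command = (IndexCommand)message.Body;
		// Look up the record by command.RecordId and update the index here
	}
}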

Just a thought - we do this here at my day job, and it helps us provide instant responses even under the load of half a million users.

If you are curious about a good service-bus type of middleware, I'd highly recommend nServiceBus from Udi Dahan; that's what we use.

Mar 31, 2012 at 3:13 PM

Thanks for your valuable suggestions. I will evaluate all of these; I also need to consider development cost... and system complexity.

Will send more details as and when I move ahead.

Apr 3, 2012 at 8:31 PM

I also have a need for incremental indexing.

What I've tried thus far is to subclass the DatabaseIndexSet class and overload the Write method with my own version that takes an IEnumerable. I basically copied the underlying Write method from the source code and made some changes to support an IEnumerable as the source of records to write.

public interface IUniqueKey<TKey>
{ 
	TKey Key { get; }
}

/// <summary>Write the records from the given query into their respective indexes</summary>
/// <typeparam name="TTable">Table type to index</typeparam>
/// <typeparam name="TKey">Type of the table's unique key</typeparam>
public void Write<TTable, TKey>(IEnumerable<TTable> query) where TTable : class, IUniqueKey<TKey>
{
	Write<TTable, TKey>(typeof(TTable), query);
}

/// <summary>Write the records from the given query into their respective indexes</summary>
/// <param name="tableType">Table type to index</param>
/// <param name="query">Records to write into the index</param>
public void Write<TTable, TKey>(Type tableType, IEnumerable<TTable> query) where TTable : class, IUniqueKey<TKey>
{
	using (_dataContextLock.ReadLock())
	{
		IIndex<TTable> index = Get<TTable>();
		string name = tableType.Name;

		// Get the LINQ to SQL ITable instance
		ITable table = DataContext.GetTable(tableType);
		if (table == null)
			throw new ArgumentException("tableType does not belong to the DataContext");

		// get all records from the query
		IEnumerable<TTable> items = query.ToList();

		Console.WriteLine("About to write " + name + "s...");

		foreach (TTable item in items)
		{
			TTable indexItem = DataContext.Get<TTable, TKey>(item.Key);
			if (indexItem != null)
			{
				// update index element
				index.Delete(indexItem.Key);
			}
			// add new/updated index element
			index.Add(items);
			Console.WriteLine("Added " + item.Key + " " + name + "s.");
		}
	}
}

In my previous Lucene index implementation, I could update 10,000 records in less than a minute on our production servers and within a couple of minutes on my development machine. However, since implementing the routine above, 10,000 records is taking over 30 minutes and still counting... If not for periodic activity in the index directory, I'd think the system was not working.

Apr 3, 2012 at 8:33 PM

There is a possibility that the speed difference is due to keeping near-default settings that index every field (125 fields) of my table, rather than trimming it down to a couple dozen fields.

Apr 4, 2012 at 9:54 PM

I've pruned the unneeded fields from my index and set my incremental batch size to just 50 records. It still takes quite a while to process those 50 records. More disturbing, however, is that the index appears to be erased and fully replaced on each iteration of my incremental indexing routine.

public void Write<TTable, TKey>(Type tableType, IEnumerable<TTable> query) where TTable : class, IUniqueKey<TKey>
{
	if (query == null || !query.Any())
	{
		return;
	}

	using (_dataContextLock.ReadLock())
	{
		IIndex<TTable> index = Get<TTable>();
		string name = tableType.Name;

		// Get the LINQ to SQL ITable instance
		ITable table = DataContext.GetTable(tableType);
		if (table == null)
			throw new ArgumentException("tableType does not belong to the DataContext");

		// get all records from the query
		IEnumerable<TTable> items = query.ToList();

		System.Diagnostics.Debug.Write("About to write " + name + "s...");

		int i = 0;
		foreach (TTable item in items)
		{
			TTable indexItem = DataContext.Get<TTable, TKey>(item.Key);
			if (indexItem != null)
			{
				// update index element
				index.Delete(indexItem.Key);
			}
			// add new/updated index element
			index.Add(item);
			i++;
			System.Diagnostics.Debug.WriteLine("Added Index " + i.ToString() + ", Key '" + item.Key + "', Name '" + name + "'.");
		}
	}
}

Changes in code from the previous post:

  1. System.Diagnostics.Debug.WriteLine instead of Console.WriteLine
  2. "index.Add(item)" instead of "index.Add(items)" - only inserting the individual record instead of the whole collection on each iteration.

Developer
Apr 5, 2012 at 8:59 PM

What are the reference requirements for this line:

TTable indexItem = DataContext.Get<TTable, TKey>(item.Key);

I'm trying to integrate this into Lucene.Linq to see where the bottleneck is.

Apr 5, 2012 at 9:30 PM
Edited Apr 5, 2012 at 9:37 PM

The "Write" method overloads the base class "Write" method. My class derives from Lucene.Linq.Storage.EntityFramework.DatabaseIndexSet. DataContext is a public property on the base class, and was set at object instantiation with a LINQ to SQL generated data context (System.Data.Linq.DataContext). In this case, it is set to the data context for our application's SQL database. TTable is the LINQ to SQL generated entity class for one of our tables.

"Get" is an extension method we've written for the data context.  It provides a quick way to write code that retrieves a record from a table given the record's primary key.  Below is the method's declaration:

using System;
using System.Collections.Generic;
using System.Data.Linq;
using System.Data.Linq.Mapping;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;

public static TTable Get<TTable, TPrimaryKey>(this DataContext dataContext, TPrimaryKey id)
	where TTable : class
{
	Table<TTable> table = dataContext.GetTable<TTable>();

	// Find the name of the table's identity (primary key) member from the mapping
	ParameterExpression e = Expression.Parameter(typeof(TTable), "e");
	MetaType metaType = dataContext.Mapping.GetMetaType(typeof(TTable));
	string identityColumnName = metaType.IdentityMembers[0].Name;

	// Build the lambda "e => e.<identityColumn> == id" and fetch the single matching record
	PropertyInfo propInfo = typeof(TTable).GetProperty(identityColumnName);
	MemberExpression m = Expression.MakeMemberAccess(e, propInfo);
	ConstantExpression c = Expression.Constant(id, typeof(TPrimaryKey));
	BinaryExpression b = Expression.Equal(m, c);
	Expression<Func<TTable, bool>> lambda = Expression.Lambda<Func<TTable, bool>>(b, e);
	return table.SingleOrDefault(lambda);
}
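
For example, with a hypothetical Post entity whose primary key is an int:

// Returns the Post with primary key 42, or null if it doesn't exist
Post post = dataContext.Get<Post, int>(42);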

Developer
Apr 5, 2012 at 10:25 PM

I added the new class to the trunk code, but I haven't unit tested it yet.

It lives (along with the extension method) within the Lucene.Linq.Storage.EntityFramework namespace.

I called it DatabaseIndexSetIncremental.cs until someone comes up with a better name...