Writing to a Ferret index from multiple processes

Andreas S.

Hi,

What do I have to do to be able to write to a Ferret index from multiple
processes at the same time?

I was indexing a lot of documents with a script when another process
made a change to the index; suddenly all of the imported data was gone
from the index, and the import script quit with the exception
"Errno::ENOENT: No such file or directory - ./ferret_index/_1ah.fnm".

Setting auto_flush => true didn't help. Is there something else I need to do?

Andreas
 
David Balmain

Hi Andreas,

Can you show me some more code? How are you creating the index?
Perhaps you are setting :create => true in which case it will
overwrite the old index.
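
For example (simplified, using the path from your error message): the
first call below wipes any existing index every time it runs, while
:create_if_missing only builds a fresh one when none exists yet.

# destroys and recreates the index at this path on every run
index = Index::Index.new(:path => './ferret_index', :create => true)

# only creates a new index if none exists at this path yet
index = Index::Index.new(:path => './ferret_index', :create_if_missing => true)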

Dave
 
Andreas S.

David said:
Hi Andreas,

Can you show me some more code? How are you creating the index?
Perhaps you are setting :create => true in which case it will
overwrite the old index.

Dave

Oops. I am indeed using :create => true. I forgot that I set it because
create_if_missing did not work.

Sorry for the noise.

Andreas
 
David Balmain

I'm not too sure about this one. Are you by any chance explicitly
deleting the lock files when your app starts up? I've seen a few
people do that. The only way I can see doc numbers getting out of
order is if you delete the lock files. Any chance I could look at more
of your code? Is this for RForum? Perhaps I could check it out of svn.
Anyway, I hope I can help you out with this.

Dave

PS: If you are interested you should join the Ferret mailing list. You
seem to be doing some more advanced stuff judging from the bugs you're
finding. ;-)
 
Andreas S.

David said:
I'm not too sure about this one. Are you by any chance explicitly
deleting the lock files when your app starts up?
No.

I've seen a few
people do that. The only way I can see doc numbers getting out of
order is if you delete the lock files. Any chance I could look at more
of your code? Is this for RForum? Perhaps I could check it out of svn.

It is for RForum. You can see the code here:
http://rforum.andreas-s.net/trac/file/trunk/app/models/search_ferret.rb

My indexing script simply fetches all the posts from the database and
calls Post.search_handler.update(post) for each one. If another process
calls the update method while this script is running, I am getting the
exception. If you need more information to reproduce the problem, please
let me know.
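
In essence, the script does something like this (a simplified sketch;
the actual code is in search_ferret.rb above):

# simplified sketch of the reindexing loop
Post.find(:all).each do |post|
  Post.search_handler.update(post)
end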

David said:
PS: If you are interested you should join the Ferret mailing list. You
seem to be doing some more advanced stuff judging from the bugs you're
finding. ;-)

I didn't know there was a list. I will definitely join it.

Thanks for fixing the other bugs so quickly.

Andreas
 
David Balmain

Hey Andreas,

The latest version of RForum still has :create => true so I'm guessing
you haven't checked in your latest changes. Could you let me know when
you have?

Cheers,
Dave
 
Andreas S.

David said:
Hey Andreas,

The latest version of RForum still has :create => true so I'm guessing
you haven't checked in your latest changes. Could you let me know when
you have?

I have checked it in.
 
Andreas S.

Andreas said:
I have checked it in.

Btw, I tried it again on another machine, and couldn't reproduce the
"docs out of order" exception, but instead I got
RuntimeError: could not obtain lock:
/ferret_index/ferret-f62496686e637eca67e933a9cdc5eb21write.lock
 
David Balmain

Hi Andreas,

This is what I would expect to happen. What machine were you running
it on the first time? Whatever it was, Ferret's locking mechanism must
not be working on it.

Anyway, to avoid this problem you need to make sure the batch process
doesn't keep the lock for too long (about 5 seconds). I would change
the rebuild_index method to use an IndexWriter or switch auto_flush to
false. This should speed the reindexing up. I'd also add a pause in
there so other processes can get a hold of the lock if they need to.
Since you are flushing explicitly you may as well set auto_flush to
false anyway.

def index
  @index ||= Index::Index.new(:path => @path,
                              # :auto_flush => true  <= don't use this anymore
                              :default_search_field => ['subject'],
                              :key => ['id', 'class'])
end

# update will continue to work, handling the flushing explicitly
def update(post)
  index << create_doc(post)
  index.flush
end

# batch_update will keep the IndexWriter open between updates
# so it will run much faster
def batch_update(post)
  index << create_doc(post)
end

# define a flush method for use with the batch_update method
def flush
  index.flush
end

Then in your process that is doing the reindex I'd use the
batch_update method and I might even add some pauses in there.
Something like this:
MAX_ADDS_BEFORE_FLUSH = 10

def rebuild_index
  i = 0
  Post.find_all_by_deleted(0).each do |post|
    self.batch_update(post)
    i += 1
    # flush and pause periodically so other processes can grab the lock
    if (i % MAX_ADDS_BEFORE_FLUSH) == 0
      self.flush
      sleep(0.5)
    end
  end
end

These are just ideas. You'll probably come up with something better. I
think the best solution is just to keep the Ferret index in sync with
the database so that you don't need to reindex everything.
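
For example, in Rails you could push the updates into the model
callbacks. This is just an untested sketch, and the delete call assumes
your search handler has a matching delete method (I'm making that part
up):

class Post < ActiveRecord::Base
  # called by Rails after every save, so the index never goes stale
  def after_save
    Post.search_handler.update(self)
  end

  # hypothetical: assumes the search handler also has a delete method
  def after_destroy
    Post.search_handler.delete(self)
  end
end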

Let me know what kind of system you were running it on the first time
you got the "docs out of order" error. I'll see if I can find out
why the locking wasn't working.

Cheers,
Dave
 
