file locking...


bruce

Hi.

Got a bit of a question/issue that I'm trying to resolve. I'm asking this of
a few groups so bear with me.

I'm considering a situation where I have multiple processes running, and
each process is going to access a number of files in a dir. Each process
accesses a unique group of files, and then writes the group of files to
another dir. I can easily handle this by using a form of locking, where I
have the processes lock/read a file and only access the group of files in
the dir based on the open/free status of the lockfile.
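
For concreteness, here's a rough sketch of that lock-file approach, assuming a POSIX
system (fcntl.flock); the directory and lockfile names, the batch size of 10, and the
claim bookkeeping are all made up for illustration:

import fcntl, glob, os, shutil

SRC, DST = 'incoming', 'outgoing'                # made-up directory names

with open('claims.lock', 'a+') as lockfile:
    fcntl.flock(lockfile, fcntl.LOCK_EX)         # only one process picks files at a time
    lockfile.seek(0)
    claimed = set(lockfile.read().splitlines())  # files some process already took
    group = [f for f in glob.glob(os.path.join(SRC, '*')) if f not in claimed][:10]
    if group:
        lockfile.write('\n'.join(group) + '\n')  # record our claim before releasing the lock
        lockfile.flush()
    fcntl.flock(lockfile, fcntl.LOCK_UN)

for f in group:                                  # work on our group outside the lock
    shutil.copy(f, os.path.join(DST, os.path.basename(f)))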

However, the issue with the approach is that it's somewhat synchronous. I'm
looking for something that might be more asynchronous/parallel, in that I'd
like to have multiple processes each access a unique group of files from the
given dir as fast as possible.

So... Any thoughts/pointers/comments would be greatly appreciated. Any
pointers to academic research, etc., would be useful.

thanks
 

zugnush

You could do something like this, so that every process will know whether
a file "belongs" to it without prior coordination; it does mean a lot of
redundant hashing, though.

In [36]: import glob, hashlib

In [37]: pool = 11        # total number of worker processes

In [38]: process = 5      # this worker's index, 0 <= process < pool

In [39]: [f for f in glob.glob('*')
    ...:  if int(hashlib.md5(f.encode()).hexdigest(), 16) % pool == process]
Out[39]:
 

Nigel Rantor

zugnush said:
You could do something like this, so that every process will know whether
a file "belongs" to it without prior coordination; it does mean a lot of
redundant hashing, though.

In [36]: import glob, hashlib

In [37]: pool = 11        # total number of worker processes

In [38]: process = 5      # this worker's index, 0 <= process < pool

In [39]: [f for f in glob.glob('*')
    ...:  if int(hashlib.md5(f.encode()).hexdigest(), 16) % pool == process]
Out[39]:

You're also relying on the hashing being perfectly distributed; otherwise
some processes aren't going to be performing useful work even though there
is useful work to perform.

In other words, why would you rely on a scheme that limits some processes
to certain parts of the data? If we're already talking about trying to get
away without some global lock for synchronisation, this seems to go against
the original intent of the problem...
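
For example, a quick (hypothetical) check of how 20 made-up filenames spread
across the pool of 11 from above shows the skew:

import hashlib
from collections import Counter

names = ['file%03d' % i for i in range(20)]      # made-up filenames
pool = 11
buckets = Counter(int(hashlib.md5(n.encode()).hexdigest(), 16) % pool for n in names)
print(sorted(buckets.items()))                   # some buckets get several files, others none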

n
 

Lawrence D'Oliveiro

Nigel said:
In other words, why would you rely on a scheme that limits some
processes to certain parts of the data?

That could be part of the original requirements; it's not clear from the
description so far.
 

Thomas Guettler

Hi Bruce,

You can do it like Maildir [1]: you move (os.rename()) files or directories.

Maybe something like this: you have three directories, "todo", "in-process" and "done".
A process tries to os.rename() a file from todo to in-process. If that fails, some other
process got there first. When a process is done with a file, it moves the file/directory
to "done".

To avoid stressing the directories too much, it might be good to use subdirectories
like todo/NN/MM/. I think git (the version control system created by Linus Torvalds)
does something like this.
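
One hypothetical way to pick the NN/MM levels is from a hash of the name:

import hashlib, os

def shard_path(root, name):
    h = hashlib.md5(name.encode()).hexdigest()
    return os.path.join(root, h[:2], h[2:4], name)   # e.g. todo/d4/1d/<name>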


Thomas

[1] http://wiki.dovecot.org/MailboxFormat/Maildir
This page describes Maildir, including some parts of the specification that aren't needed here.
 
