Adding and deleting files in an atomic way


Chris

I need to be able to have multiple processes add and delete files in a
directory in a completely atomic way. I can't quite figure out how to do it.

I have three independent processes. One adds small files to a directory
(the "Adder"). Another merges the small files into big ones, and then
deletes the small ones (the "Merger"). A third simply reads the files in
a random way (the "Reader"). The Reader is heavily multi-threaded --
lots of reading might be going on simultaneously.

The files are read-only. There will be a maximum of a few hundred files
at any given time.

These different functions may or may not be running in the same JVM. It
is possible that multiple JVMs will be hitting the same directory,
possibly ones running on different machines all hitting a shared drive.

When a Reader thread starts to read files, the list of files must not
change until it's done. The file-reading process takes at most a second
or two. If the Merger wants to delete a file during that time, it must wait.

The Adder process must be able to notify the Reader and Merger processes
that a new file has been added. The Merger must be able to notify the
Reader that the current list of files has changed, so that the next time
the Reader starts a new thread, it uses the most current list.

I'm guessing that I might be able to do all this by having a plain text
file in the directory that lists the "current" files, and just have the
Adder and Merger processes put an exclusive file lock on it whenever the
list needs to change. The Adder and Merger can create any new files with
a .tmp extension, and then rename them in a very fast operation to make
them live.
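Roughly the kind of thing I have in mind for making a new file live, as an
untested sketch -- the directory and file names are made up, and I'm not sure
yet how well FileLock behaves on a shared network drive:

    import java.io.*;
    import java.nio.channels.*;

    public class AdderSketch {
        public static void main(String[] args) throws Exception {
            File dir = new File("/shared/data");              // made-up directory
            RandomAccessFile list = new RandomAccessFile(
                    new File(dir, "filelist.txt"), "rw");     // the "current files" list
            FileChannel ch = list.getChannel();
            FileLock lock = ch.lock();                        // exclusive lock; blocks until held
            try {
                File tmp  = new File(dir, "00123.tmp");       // small file written earlier
                File live = new File(dir, "00123.dat");
                if (!tmp.renameTo(live)) {                    // fast rename to make it live
                    throw new IOException("rename failed for " + tmp);
                }
                // ... append "00123.dat" to filelist.txt while still holding the lock ...
            } finally {
                lock.release();
                list.close();
            }
        }
    }

The Merger would grab the same lock on filelist.txt before rewriting the list
and deleting the small files.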

I haven't figured out how to handle it, though, if the system crashes
while the Merger is renaming or deleting files, or how to prevent files
from being deleted while the Reader is using them (how do we know when
the various Reader threads have finished with a file?). I'm hoping that
I won't need to implement some kind of transaction log with commit/rollback.

Any thoughts appreciated.
 

Andrey Kuznetsov

Chris wrote:
[...]

These different functions may or may not be running in the same JVM. It is
possible that multiple JVMs will be hitting the same directory, possibly
ones running on different machines all hitting a shared drive.

If they run on the same machine they could communicate through a socket;
for different machines you will need some kind of server.

Andrey
 

Gijs Peek

Chris said:
I need to be able to have multiple processes add and delete files in a
directory in a completely atomic way. I can't quite figure out how to do it.

I have three independent processes. One adds small files to a directory
(the "Adder"). Another merges the small files into big ones, and then
deletes the small ones (the "Merger"). A third simply reads the files in
a random way (the "Reader"). The Reader is heavily multi-threaded --
lots of reading might be going on simultaneously.

[...]

I think you are dealing with the producer/consumer problem here. There is
lots of information on that problem on the web. It is usually solved using
semaphores (supported in java in the class java.util.concurrent.Semaphore
since java 1.5).
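As a rough single-JVM illustration of that idea (a Semaphore does nothing
across separate JVMs or machines), readers can each take one permit while a
writer takes all of them, which gives reader/writer behaviour from a plain
Semaphore:

    import java.util.concurrent.Semaphore;

    public class FileListGate {
        private static final int MAX_READERS = 10;   // arbitrary cap on concurrent readers
        private final Semaphore permits = new Semaphore(MAX_READERS, true); // fair, so a writer isn't starved

        /** A Reader thread holds one permit while it works through the current file list. */
        public void read() throws InterruptedException {
            permits.acquire();
            try {
                // ... read the current set of files ...
            } finally {
                permits.release();
            }
        }

        /** The Adder/Merger takes every permit, i.e. waits until all readers are done. */
        public void changeFileList() throws InterruptedException {
            permits.acquire(MAX_READERS);
            try {
                // ... add new files, or merge and delete small ones ...
            } finally {
                permits.release(MAX_READERS);
            }
        }
    }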
 

Filip Larsen

Chris wrote
I need to be able to have multiple processes add and delete files in a
directory in a completely atomic way. I can't quite figure out how to do it.

[...]

I'm guessing that I might be able to do all this by having a plain text
file in the directory that lists the "current" files, and just have the
Adder and Merger processes put an exclusive file lock on it whenever the
list needs to change. The Adder and Merger can create any new files with
a .tmp extension, and then rename them in a very fast operation to make
them live.

I haven't figured out how to handle it, though, if the system crashes
while the Merger is renaming or deleting files, or how to prevent files
from being deleted while the Reader is using them (how do we know when
the various Reader threads have finished with a file?). I'm hoping that
I won't need to implement some kind of transaction log with
commit/rollback.

If you associate a unique file name pattern with *all* the different
processes or states that require exclusive access to a file, you can
signal ownership by file name and acquire or hand over ownership using
rename, and you can skip using file locks. This obviously requires that
all processes are cooperating and that file renames are atomic and
detectable on the file systems you are using. This way, file renames
work much like acquiring semaphores.

For instance, if your Merger somehow decides it wants to delete file1, it
will periodically try to rename it to file1.delete and only continue
with deleting after a successful rename. The Merger component can keep the
files that should be deleted, but which it does not yet own, in a
persistent list. After a crash the Merger can restore that list and also
delete all files that match *.delete.
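A rough sketch of that Merger step with plain java.io.File -- the names are
invented, and keep in mind that renameTo is only atomic and reliable on some
platforms and filesystems, so test it on the actual shared drive:

    import java.io.File;

    public class MergerCleanup {
        /** Claim a small file for deletion by renaming it to name.delete, then delete it. */
        static boolean tryDelete(File dir, String name) {
            File live = new File(dir, name);
            File claimed = new File(dir, name + ".delete");
            if (live.renameTo(claimed)) {        // rename succeeded: the Merger now owns it
                return claimed.delete();
            }
            return false;                        // someone else owns the file; retry later
        }

        /** After a crash, finish off any deletes that were claimed but never completed. */
        static void sweep(File dir) {
            File[] leftovers = dir.listFiles((d, n) -> n.endsWith(".delete"));
            if (leftovers != null) {
                for (File f : leftovers) {
                    f.delete();
                }
            }
        }
    }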

When a Reader wants to read a file, it likewise tries to rename the file in
a loop, and only when the rename succeeds can it read the file. If the rename
fails it probably also has to check that the file has not been deleted (i.e.
"file1.*" should match something). When the Reader is done reading, it
renames the file back to its base name, signaling that the file is free
to use.
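And the matching Reader side, again only a sketch with made-up names and an
arbitrary back-off:

    import java.io.File;

    public class ReaderClaim {
        /** Rename file -> file.reading in a loop, read it, then rename it back. */
        static void readWithClaim(File dir, String name) throws Exception {
            File base = new File(dir, name);
            File claimed = new File(dir, name + ".reading");
            while (!base.renameTo(claimed)) {
                // Rename failed: either another process owns the file, or it is gone for good.
                File[] any = dir.listFiles((d, n) -> n.startsWith(name));
                if (any == null || any.length == 0) {
                    return;                      // "file1.*" matches nothing: the file was deleted
                }
                Thread.sleep(50);                // arbitrary back-off before trying again
            }
            try {
                // ... read the file contents here ...
            } finally {
                claimed.renameTo(base);          // hand the file back to its base name
            }
        }
    }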

If you have a producer that hands over files to a consumer, you can have
the producer rename the file to signal that the file is ready for
consumption. If you only have one consumer then it can assume it owns
files renamed for it, but if you have multiple consumers they should
each use a unique rename to acquire the file first.

If you have processes that can crash and never come back, you may have
to use a clean-up process that detects files owned by such crashed
processes. One way could be to have each process regularly stamp a
unique file, e.g. "process.reader1", so that all files owned by reader1
("*.reader1") can be released if the stamp gets too old. This is similar
to the lease systems used in other distributed systems. You can even encode
the lease time in the stamp file's contents, filename, or timestamp if you
need different lease periods.
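That lease idea might look something like this, with an arbitrary one-minute
lease and the same renameTo caveats as above:

    import java.io.File;
    import java.io.IOException;

    public class Lease {
        static final long LEASE_MILLIS = 60_000;     // arbitrary one-minute lease

        /** Each process refreshes its own stamp file, e.g. "process.reader1", on a timer. */
        static void refreshStamp(File dir, String owner) throws IOException {
            File stamp = new File(dir, "process." + owner);
            if (!stamp.exists()) {
                stamp.createNewFile();
            }
            stamp.setLastModified(System.currentTimeMillis());
        }

        /** Clean-up process: if an owner's stamp is stale, give back every file it still holds. */
        static void releaseIfExpired(File dir, String owner) {
            File stamp = new File(dir, "process." + owner);
            if (!stamp.exists()
                    || System.currentTimeMillis() - stamp.lastModified() <= LEASE_MILLIS) {
                return;                              // owner is alive (or never registered)
            }
            File[] held = dir.listFiles(
                    (d, n) -> n.endsWith("." + owner) && !n.equals(stamp.getName()));
            if (held != null) {
                for (File f : held) {
                    String base = f.getName();
                    base = base.substring(0, base.length() - owner.length() - 1);
                    f.renameTo(new File(dir, base)); // release the file back to its base name
                }
            }
        }
    }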


Regards,
 

steve

Gijs Peek wrote:
I think you are dealing with the producer/consumer problem here. There is
lots of information on that problem on the web. It is usually solved using
semaphores (supported in java in the class java.util.concurrent.Semaphore
since java 1.5).

If it's different processes or different JVMs: the way I currently do this
is with a database (because it is already there), and as each stage
progresses I set a flag, well actually a counter, and each process only
looks for records with flags in a certain range.

The advantage of doing this is that if a process dies or does not
complete, the file is not flagged in the database as needing action by
the next process.

Also, the system can be restarted without loss of position or status of the
files under processing.
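A bare-bones JDBC illustration of that; the table, columns, and stage
numbers here are all invented, not anything from an actual schema:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.List;

    public class FileStatusDao {
        // Hypothetical schema: file_status(name VARCHAR PRIMARY KEY, stage INT)
        // with made-up stages: 0 = added, 1 = merged, 2 = safe to delete.
        private final Connection conn;

        FileStatusDao(Connection conn) { this.conn = conn; }

        /** The Adder registers a newly written file at stage 0. */
        void register(String name) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO file_status (name, stage) VALUES (?, 0)")) {
                ps.setString(1, name);
                ps.executeUpdate();
            }
        }

        /** Each process only looks for records whose counter is in its own range. */
        List<String> filesInStage(int from, int to) throws SQLException {
            List<String> names = new ArrayList<>();
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT name FROM file_status WHERE stage BETWEEN ? AND ?")) {
                ps.setInt(1, from);
                ps.setInt(2, to);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        names.add(rs.getString(1));
                    }
                }
            }
            return names;
        }

        /** A process bumps the counter only after its step has fully completed,
            so a crash mid-step leaves the record in the previous stage. */
        void advance(String name, int newStage) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "UPDATE file_status SET stage = ? WHERE name = ?")) {
                ps.setInt(1, newStage);
                ps.setString(2, name);
                ps.executeUpdate();
            }
        }
    }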


As you are running different processes (on different machines), you will need
some sort of server/client system. But be careful what you store in RAM, and
ensure that the system can automatically recover should anything crash, blow
up, lose power, or hit any other such problem.

Steve
 

Mark Space

Chris said:
I need to be able to have multiple processes add and delete files in a
directory in a completely atomic way. I can't quite figure out how to do it.

I have three independent processes. One adds small files to a directory

This is a classic Reader-Writer problem. Your Reader process(es) is a
Reader, and your Adder and Merger are both writers.

You should implement this, imho, with system-level file locking. I
think NIO gives you access to the system's file-locking mechanisms.

If you are going to allow *completely* different processes to access
these files, you need to think about user privileges too. For example,
if someone comes along after you and starts opening and modifying these
files in a fourth process (say, a Perl script somewhere), you need to
make sure you can lock their script out when you need to (or wait
indefinitely for a lock until the script is done).

If you are going to be the only one accessing these files, then you
could adopt some conventions instead of asserting complete control
over the whole file. For example, a common "flag" to lock a
file is to just lock the first byte. This indicates the file is in use
and no one else should use it.
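A first-byte lock with NIO might look something like the following (untested;
whether the lock is honoured over a network share depends entirely on the OS
and filesystem):

    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class FirstByteLock {
        /** Hold an exclusive lock on byte 0 of the file as an "in use" flag. */
        static void useFile(String path) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile(path, "rw");
                 FileChannel ch = raf.getChannel()) {
                FileLock lock = ch.lock(0L, 1L, false);   // exclusive lock on the first byte
                try {
                    // ... read the file (or update the file list) here ...
                } finally {
                    lock.release();
                }
            }
        }
    }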

You might be able to "lock" a whole subdirectory, which *might* prevent
new files from being created. I'm not sure about this, though; check NIO
and your OS.

You may have to modify your expectations a bit to use file locking, but
I think you'll have a more robust solution at the end.

I haven't figured out how to handle it, though, if the system crashes
while the Merger is renaming or deleting files, or how to prevent files
from being deleted while the Reader is using them (how do we know when
the various Reader threads have finished with a file?). I'm hoping that
I won't need to implement some kind of transaction log with
commit/rollback.

I think you'll have to create some sort of journal or log, and roll back
or proceed forward during recovery. This will have to be integrated
into the start-up of the system, or maybe into the start-up of the
application you are building. I can't really give you more help here,
sorry; I don't know offhand of any auto-recovery-type objects for Java.
 
