python file synchronization

silentnights · Feb 7, 2012

Hi All,

I have the following problem: I have an appliance (A) which generates
records and writes them to a file (X). The appliance is accessible
through FTP from a server (B). I have another central server (C) that
runs a Django app, and it needs to receive the records from file (X)
continuously.

The problems are as follows:
1. (A) writes to the file heavily, so copying the file will usually leave
an incomplete line at the end.
2. I have many (A)s and (B)s that I need to get data from.
3. I can't afford to lose any records from file (X).

My current implementation is as follows:
1. Server (B) copies file (X) through FTP.
2. Server (B) copies file (X) to file (Y.time_stamp), ignoring
the last line to avoid incomplete lines.
3. Server (B) periodically copies file (X) again and copies the lines,
starting from the previously ignored line, to file (Y.time_stamp).

4. Server (C) mounts the diffs_dir locally.
5. Server (C) creates file (Y.time_stamp.lock) in target_dir, then copies
file (Y.time_stamp) to the local target_dir, then deletes
(Y.time_stamp.lock).

6. A daemon running on server (C) reads the file list from target_dir
and processes those files that don't have a matching *.lock file. This
avoids reading a file until it has been completely copied.
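
Steps 2 and 3 above could be sketched roughly as follows. This is only an
illustration, not the poster's actual code: it assumes records are
newline-terminated, that file (X) only ever grows, and that the caller
remembers a byte offset between passes. The function name
copy_complete_lines and the offset bookkeeping are hypothetical.

```python
def copy_complete_lines(src_path, dst_path, offset):
    """Append to dst_path the complete lines found in src_path from byte
    `offset` onwards, and return the offset to resume from next time."""
    with open(src_path, "rb") as src:
        src.seek(offset)
        chunk = src.read()
    # Keep everything up to and including the last newline. A trailing
    # partial line is left in place and picked up on the next pass.
    end = chunk.rfind(b"\n") + 1
    if end:
        with open(dst_path, "ab") as dst:
            dst.write(chunk[:end])
    return offset + end
```

Starting each pass at the returned offset means the previously ignored
partial line is re-read once it has been completed, so no record is
skipped or duplicated under the assumptions above.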

The above is implemented and working. The problem is that it requires
so many syncs, has a high overhead, and is hard to debug.

I greatly appreciate your thoughts and suggestions.

Lastly, I want to note that I am not a programming guru, still a noob,
but I am trying to learn from the experts.

Cameron Simpson · Feb 8, 2012

| I have the following problem: I have an appliance (A) which generates
| records and writes them to a file (X). The appliance is accessible
| through FTP from a server (B). I have another central server (C) that
| runs a Django app, and it needs to receive the records from file (X)
| continuously.
|
| The problems are as follows:
| 1. (A) writes to the file heavily, so copying the file will usually leave
| an incomplete line at the end.
| 2. I have many (A)s and (B)s that I need to get data from.
| 3. I can't afford to lose any records from file (X).
[...]
| The above is implemented and working. The problem is that it requires
| so many syncs, has a high overhead, and is hard to debug.

Yep.

I would change the file discipline. Accept that FTP is slow and has no
locking. Accept that reading records from an actively growing file is
often tricky and sometimes impossible depending on the record format.
So don't. Hand off completed files regularly and keep the incomplete
file small.

Have (A) write records to a file whose name clearly shows the file to be
incomplete, e.g. "data.new". Every so often (even once a second), _if_ the
file is not empty: close it, _rename_ it to "data.timestamp" or
"data.sequence-number", and open a new "data.new" for new records.

Have the FTP client fetch only the completed files.
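
For the writer side of this scheme, here is a minimal sketch of the
write-then-rename hand-off. It assumes the program producing the records
can be changed and happens to be written in Python, which the thread does
not confirm. The class name RotatingWriter, the max_age parameter, and the
in-memory sequence counter are all hypothetical.

```python
import os
import time

class RotatingWriter:
    """Append records to 'data.new' and hand off completed files by rename."""

    def __init__(self, directory, max_age=1.0):
        self.directory = directory
        self.max_age = max_age  # seconds before a non-empty file is handed off
        # Hypothetical in-memory counter. A real writer would persist it (or
        # use timestamps) so that names are not reused after a restart.
        self.seq = 0
        self._open()

    def _open(self):
        self.path = os.path.join(self.directory, "data.new")
        self.fh = open(self.path, "a")
        self.opened = time.time()

    def write(self, record):
        self.fh.write(record + "\n")
        if time.time() - self.opened >= self.max_age:
            self.rotate()

    def rotate(self):
        """Rename a non-empty 'data.new' to 'data.<sequence-number>'."""
        self.fh.flush()
        if os.path.getsize(self.path) == 0:
            return None  # nothing to hand off yet
        self.fh.close()
        self.seq += 1
        done = os.path.join(self.directory, "data.%06d" % self.seq)
        os.rename(self.path, done)  # same directory, so a plain rename
        self._open()
        return done
```

Because a file only gets its final name after it has been closed, anything
not called "data.new" can be treated as complete by whoever fetches it.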

You can perform a similar effort for the socket daemon: look only for
completed data files. Reading the filenames from a directory is very
fast if you don't stat() them (i.e. just os.listdir). Just open and scan
any new files that appear.
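
A rough sketch of that directory scan, assuming completed files are named
"data.<something>" and the in-progress file is "data.new" as described
above. The function name new_completed_files and the `seen` set are
illustrative only.

```python
import os

def new_completed_files(directory, seen):
    """Return paths of completed data files not yet in `seen`, oldest name
    first, and record them in `seen`. Uses only os.listdir, no stat()."""
    names = sorted(
        name for name in os.listdir(directory)
        if name.startswith("data.")
        and name != "data.new"   # still being written
        and name not in seen
    )
    seen.update(names)
    return [os.path.join(directory, name) for name in names]
```

A daemon could call this once a second or so with the same `seen` set and
process whatever paths come back. The set lives in memory only, so a
restart would need some other record of what has already been processed.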

That would be my first cut.
--
Cameron Simpson <[email protected]> DoD#743
http://www.cskk.ezoshosting.com/cs/

Performing random acts of moral ambiguity.
- Jeff Miller <[email protected]>

Dennis Lee Bieber · Feb 8, 2012

| After searching more yesterday, I found that local mv is atomic, so instead
| of creating the lock files, I will copy the new diffs to tmp dir, and after
| the copy is over, mv it to actual diffs dir, that will avoid reading it
| while it's still being copied.

Are your tmp directory and your "diffs" directory on the same
physical volume? If so, "mv" is a rename operation, that only affects
the directory information. If the volumes are different, then "mv"
reverts to a copy/delete file operation.
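
One way to check this from Python, as a sketch that assumes a POSIX-style
system where st_dev identifies the filesystem a path lives on (the helper
name same_volume is made up for illustration):

```python
import os

def same_volume(path_a, path_b):
    """True if both paths are on the same filesystem (same device id)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

Note that Python's os.rename() does not fall back to copy/delete the way
"mv" does. Across filesystems it may simply fail with an OSError, which at
least makes the problem visible.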

To avoid problems in the future (say the "diffs" machine is
reconfigured with an additional drive and "tmp" is now mounted on the
new drive) you might be better off taking part of the suggestion to use
a special file name to indicate an "in-work" file...

diffs.timestamp.part

say, and when ready, just

mv diffs.timestamp.part diffs.timestamp

This leaves them in the same physical location and directory.
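
In Python the same idea could look like the sketch below. The function
name publish and its parameters are hypothetical. It copies into the
target directory under a ".part" name and renames only once the copy has
finished.

```python
import os
import shutil

def publish(src_path, dest_dir, final_name):
    """Copy src_path into dest_dir as '<final_name>.part', then rename it to
    its final name once the copy is complete."""
    part = os.path.join(dest_dir, final_name + ".part")
    final = os.path.join(dest_dir, final_name)
    shutil.copyfile(src_path, part)
    # Both names are in the same directory, so this is a rename and not a
    # copy. On POSIX systems the rename is atomic.
    os.replace(part, final)
    return final
```

A reader that ignores names ending in ".part" then never sees a
half-copied file.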
