Random write access to a file in Python

Claudio Grondi

I have a 250 Gbyte file (occupies the whole hard drive space) and want
to change only eight bytes in this file at a given offset of appr. 200
Gbyte (all other data in that file should remain unchanged).

How can I do that in Python?

Claudio Grondi
 
Tim Peters

[Claudio Grondi]
I have a 250 Gbyte file (occupies the whole hard drive space)

Then where is Python stored ;-)?
and want to change only eight bytes in this file at a given offset of appr. 200
Gbyte (all other data in that file should remain unchanged).

How can I do that in Python?

Same way you'd do it in C, really:

f = open(PATH_TO_FILE, "r+b")
f.seek(appr. 200 Gbyte)
f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
f.close()

This depends on your operating system and file system and platform C
library supporting seeks to very large offsets. For example, Windows
using NTFS does. Try it. Python should complain (raise an exception
during the seek() attempt) if your box refuses to cooperate. Use as
recent a released version of Python as you can.
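
For reference, a minimal runnable version of Tim's sketch in
present-day Python syntax (the path, offset and replacement bytes below
are placeholders for this sketch; on older Pythons use try/finally with
f.close() instead of the with-block):

PATH = "huge.bin"                 # placeholder file name
OFFSET = 200 * 1024 ** 3          # appr. 200 GByte into the file
NEW_BYTES = b"\x00" * 8           # the eight replacement bytes

# "r+b" opens for reading and writing without truncating, so only the
# eight bytes at OFFSET are changed; everything else stays untouched.
with open(PATH, "r+b") as f:
    f.seek(OFFSET)
    f.write(NEW_BYTES)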
 
Claudio Grondi

Tim said:
[Claudio Grondi]
I have a 250 Gbyte file (occupies the whole hard drive space)


Then where is Python stored ;-)?
and want to change only eight bytes in this file at a given offset of
appr. 200
Gbyte (all other data in that file should remain unchanged).

How can I do that in Python?


Same way you'd do it in C, really:

f = open(PATH_TO_FILE, "r+b")
f.seek(appr. 200 Gbyte)
f.write(A_STRING_CONTAINING_YOUR_NEW_8_BYTES)
f.close()

This depends on your operating system and file system and platform C
library supporting seeks to very large offsets. For example, Windows
using NTFS does. Try it. Python should complain (raise an exception
during the seek() attempt) if your box refuses to cooperate. Use as
recent a released version of Python as you can.

Thank you much for the quick response.

The core of my problem was ... trying to use 'wb' or 'w+b' ... (stupid
me ...)

Claudio Grondi
 
Claudio Grondi

Dennis said:
Ouch... How many times did you have to restore that massive file
from backup?
I was smart enough to try it first on a very small file to see what
would happen. The Python documentation and even a Google search for
'random file access in Python' were not helpful, as there was no
example and no hint available.

The only hint about random file access in Python I found with Google
was the table of contents of the "Python Cookbook" from O'Reilly:
http://www.oreilly.com/catalog/pythoncook2/toc.html
and some hints about random read access.

I was stupid enough to forget about 'r+' (I used it many times in
C/C++ a decade ago, but not yet in Python), thinking too much along
the Pythonic lines of:

===============================================================
if I want to write, I don't open for reading (plus or not plus)
===============================================================

Actually my file was 'only' 42 GByte, but I worded the question so as
to make it impossible to suggest the use of an intermediate file.

In the meantime I have chosen a totally new approach, since random
writing to the hard disk seems to actually move the disk head on every
seek; apparently no cache is used to reorder the pieces before they are
written to disk, so if there are many of them the only option is to
collect them in memory first and then write them out to the file. This
makes straightforward, intuitive programming a bit complicated, because
working on large files means working in chunks and throwing away some
processing results when they don't fill the gaps. I suppose I am still
not on the right path, so by the way:

Is there a ready-to-use (free, ideally open source) tool that can sort
the lines (each appr. 20 bytes long) of a XXX GByte text file (i.e. in
place), taking full advantage of the available memory to speed up the
process as much as possible?

Claudio Grondi
 
tobiah

===============================================================
if I want to write, I don't open for reading (plus or not plus)
===============================================================

I was thinking the same thing. I know Python is only following
C, which is good, but the flags are confusing. I'll bet that
when fopen() was first written, they had only 'r' and 'w'.
Then, when that was working, they thought, "but what if we
want to read and write on the same handle?" So they tacked
on the '+' to mean, "let me do the other thing too".
 
Tim Chase

Is there a ready-to-use (free, ideally open source) tool that can sort
the lines (each appr. 20 bytes long) of a XXX GByte text file (i.e. in
place), taking full advantage of the available memory to speed up the
process as much as possible?

Sounds like an occasion to use a merge-sort. The pseudo-code
would be:

Break up the file into bite-sized chunks (maybe a couple of megs each).
Sort each of them linewise.
Write them out to intermediate files.

Once you have these pieces, open each file and read the first line of
each one.

[here] Find the "earliest" of those lines according to your sort order.
Write it to your output file.
Read the next line from that particular file.
Return to [here].

There are some optimizations that can be had on this as
well...you can find the "earliest" *and* the "next earliest" of
those lines/files, and just read from the "earliest" file until
the current line of it passes "next earliest"...lather, rinse,
repeat shifting "next earliest" to be the "earliest" and then
find the new "next earliest".

I don't know if the GNU "sort" utility does this, but I've thrown
some rather large files at it and haven't choked it yet.

-tkc
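
For illustration, here is a compact Python sketch of the pseudo-code
above. It assumes plain newline-terminated lines, the default byte-wise
ordering, and a made-up chunk size; external_sort and CHUNK_BYTES are
names invented for this sketch, not part of any library:

import tempfile

CHUNK_BYTES = 2 * 1024 ** 2   # "bite-sized" chunks of a couple of megs

def external_sort(in_path, out_path):
    # Phase 1: break the input into sorted intermediate files.
    chunk_files = []
    with open(in_path, "rb") as src:
        while True:
            lines = src.readlines(CHUNK_BYTES)   # roughly CHUNK_BYTES worth
            if not lines:
                break
            lines.sort()                         # sort this chunk linewise
            tmp = tempfile.TemporaryFile()
            tmp.writelines(lines)
            tmp.seek(0)
            chunk_files.append(tmp)

    # Phase 2: repeatedly pick the "earliest" current line among the chunks.
    current = [f.readline() for f in chunk_files]
    with open(out_path, "wb") as dst:
        while any(current):
            # index of the earliest non-exhausted line
            i = min((k for k, line in enumerate(current) if line),
                    key=lambda k: current[k])
            dst.write(current[i])
            current[i] = chunk_files[i].readline()   # advance that file
    for f in chunk_files:
        f.close()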
 
Fredrik Lundh

Claudio said:
I was smart enough to try it first on a very small file to see what
would happen. The Python documentation and even a Google search for
'random file access in Python' were not helpful, as there was no
example and no hint available.

one would think that the repeated use of the word "truncate" in the
documentation for open/file would have provided the necessary hints:

"'w' for writing (truncating an existing file)"

"Modes 'r+', 'w+' and 'a+' open the file for updating
(note that 'w+' truncates the file)"

</F>
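
A quick demonstration of the difference Fredrik points out (a throwaway
sketch; 'demo.bin' is just an example file name):

with open("demo.bin", "wb") as f:      # create a small test file
    f.write(b"0123456789")

with open("demo.bin", "r+b") as f:     # 'r+b': update in place, no truncation
    f.seek(4)
    f.write(b"ABCD")

with open("demo.bin", "rb") as f:
    print(f.read())                    # b'0123ABCD89' -- the rest is preserved

with open("demo.bin", "w+b") as f:     # 'w+b': truncates the file on open
    pass

with open("demo.bin", "rb") as f:
    print(f.read())                    # b'' -- the previous content is gone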
 
Dennis Lee Bieber

Is there a ready-to-use (free, ideally open source) tool that can sort
the lines (each appr. 20 bytes long) of a XXX GByte text file (i.e. in
place), taking full advantage of the available memory to speed up the
process as much as possible?
Well... For "in place" my first requirement would have to be
fixed-length records, since you are trying to exchange chunks of data,
and that requires the source and destination to be the same size.

Unfortunately, "in place" and "advantage of available memory" tend
to be at odds.

Almost any strict sort algorithm (i.e., one that would apply in
memory) can be applied "in place" (bubble, quick, etc.), but it would
be working on a record-by-record basis -- no in-memory chunking.

Taking advantage of memory would involve a variation of the
classical sort-merge algorithm (classically illustrated using four
tape-drives <G>)...

Drive    1      2      3      4
         IN     Out1   Out2   spare

Read from IN as many records as can be handled by the in-memory sorter
(or, for simplicity, a known fixed number -- say 50). Sort the 50, write
chunk to Out1.

Read from IN another 50, sort, write to Out2.

Repeat, alternating Out1 and Out2 until EOF(IN).

Merge:
Drive    1      2      3      4
         Out1   In1    In2    Out2

Read one record from In1 and In2, compare, write "earlier" to Out1 and
read next record from whichever input you wrote from. Repeat until the
chunks are complete (50 in from each input, 100 records out).

Repeat using Out2 this time... And alternate for each chunk.

Continue:
Drive    1      2      3      4
         In1    Out1   Out2   In2

Same operation, but with 100 records each input chunk (200 output).

Repeat merging until there is only one chunk in each of In1 and In2,
meaning there will be only one chunk of twice the length in Out1 after
the final merge.


Now -- what you might be able to do is a "tag sort". If the keys are
much shorter than the data lines (though given your ~20-byte records,
this is unlikely to be effective)...

Read only the keys into memory, along with a record number (if fixed
length) or a record offset (if variable length). Sort the keys together
with their tags. Then, in key order, copy the records from the input
file to an output file. (With fixed-length records, you could swap in
place if you keep updating the in-memory tags for the swaps.)
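
A sketch of the tag sort Dennis describes, assuming fixed-length
records; REC_LEN and KEY are hypothetical values chosen for
illustration (a real key field would come from the actual record
layout):

REC_LEN = 20          # fixed record length in bytes
KEY = slice(0, 8)     # assume the sort key is the first 8 bytes of a record

def tag_sort(src_path, dst_path):
    # 1) Read only (key, record number) tags into memory.
    tags = []
    with open(src_path, "rb") as f:
        recno = 0
        while True:
            rec = f.read(REC_LEN)
            if len(rec) < REC_LEN:
                break
            tags.append((rec[KEY], recno))
            recno += 1
    # 2) Sort the small tags in memory.
    tags.sort()
    # 3) Copy the records to the output file in key order.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for _key, recno in tags:
            src.seek(recno * REC_LEN)    # random read in key order
            dst.write(src.read(REC_LEN))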
 
Paul Rubin

Claudio Grondi said:
Is there a ready-to-use (free, ideally open source) tool that can sort
the lines (each appr. 20 bytes long) of a XXX GByte text file (i.e. in
place), taking full advantage of the available memory to speed up the
process as much as possible?

Try the standard Unix/Linux sort utility. Use the --buffer-size=SIZE
to tell it how much memory to use.
 
Claudio Grondi

Paul said:
Try the standard Unix/Linux sort utility. Use the --buffer-size=SIZE
to tell it how much memory to use.
I am on Windows, and it seems that the Windows XP SP2 'sort' can work
with the file, but not without a temporary file and space for the
resulting file, so triple the space of the file to be sorted must be
provided.
The Windows XP 'sort' constantly uses appr. 300 MByte of memory and
can't use 100% of the CPU all the time, probably due to the I/O
operations via USB (25 MByte/s is the top data transfer speed I have
experienced).
I can't tell yet whether it succeeded, as the sorting of the appr. 80
GByte file with fixed-length records of 20 bytes is still in progress
(eleven hours of CPU time / 18 hours of wall-clock time so far).
I am not sure whether my own programming would be much faster in my
case than the system's own sort (I haven't yet tried setting the amount
of memory to use in the options to e.g. 1.5 GByte, as the sort help
says it is better not to specify it). My machine is a Pentium 4,
2.8 GHz with 2.0 GByte RAM.
I would be glad to hear whether the sorting time I am currently
experiencing is as expected for this kind of task, or whether there is
still much room for improvement.

Claudio Grondi
 
Paul Rubin

Claudio Grondi said:
I am on Windows, and it seems that the Windows XP SP2 'sort' can work
with the file, but not without a temporary file and space for the
resulting file, so triple the space of the file to be sorted must be
provided.

Oh, sorry, I didn't see the earlier parts of the thread. Anyway,
depending on the application, it's probably not worth the hassle of
coding something yourself, instead of just throwing more disk space at
the Unix utility. But especially if the fields are fixed size, you
could just mmap the file and then do quicksort on disk. Simplest
would be to just let the OS paging system take care of caching stuff;
if you wanted to get fancy, you could sort in memory once the sorting
regions got below a certain size.

A huge amount of stuff has been written (e.g. about half of Knuth vol
3) about how to sort. Remember too, that traditionally large-scale
sorting was done on serial media like tape drives, so random access
isn't that vital.
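
A rough sketch of the mmap idea for fixed-length records, in modern
Python 3 syntax (rec_len and the iterative quicksort are choices made
for this sketch; every comparison goes through the page cache, so this
is only sensible as an experiment):

import mmap, os

def sort_records_inplace(path, rec_len=20):
    size = os.path.getsize(path)
    assert size % rec_len == 0, "file must hold whole records"
    n = size // rec_len
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        def rec(i):                      # i-th record as bytes
            return mm[i * rec_len:(i + 1) * rec_len]

        def swap(i, j):                  # exchange two records in place
            ri, rj = rec(i), rec(j)
            mm[i * rec_len:(i + 1) * rec_len] = rj
            mm[j * rec_len:(j + 1) * rec_len] = ri

        # iterative quicksort to avoid deep recursion on huge files
        stack = [(0, n - 1)]
        while stack:
            lo, hi = stack.pop()
            if lo >= hi:
                continue
            pivot = rec((lo + hi) // 2)
            i, j = lo, hi
            while i <= j:
                while rec(i) < pivot:
                    i += 1
                while rec(j) > pivot:
                    j -= 1
                if i <= j:
                    swap(i, j)
                    i += 1
                    j -= 1
            stack.append((lo, j))
            stack.append((i, hi))
        mm.flush()                       # push changes back to the file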
 
Claudio Grondi

Paul said:
Oh, sorry, I didn't see the earlier parts of the thread. Anyway,
depending on the application, it's probably not worth the hassle of
coding something yourself, instead of just throwing more disk space at
the Unix utility. But especially if the fields are fixed size, you
could just mmap the file and then do quicksort on disk. Simplest
would be to just let the OS paging system take care of caching stuff;
if you wanted to get fancy, you could sort in memory once the sorting
regions got below a certain size.

A huge amount of stuff has been written (e.g. about half of Knuth vol
3) about how to sort. Remember too, that traditionally large-scale
sorting was done on serial media like tape drives, so random access
isn't that vital.

Does this mean that, in the case of very large files, the size of the
memory available for the sorting operation (making it possible to work
on larger chunks of data in memory) has less impact on the actual
sorting speed than the speed of the data transfer from/to the storage
device(s)?
So the most effective measure for shortening the time required to sort
very large files would be to use faster hard drives (e.g. 10,000 rpm
instead of 7,600 rpm) and faster interfaces for the data transfer
(e.g. E-IDE or S-ATA instead of USB), right?

Claudio Grondi
 
Paul Rubin

Claudio Grondi said:
Does this mean that, in the case of very large files, the size of the
memory available for the sorting operation (making it possible to work
on larger chunks of data in memory) has less impact on the actual
sorting speed than the speed of the data transfer from/to the storage
device(s)?

Transfer speed and parallelism matter most, if the CPU can keep up.
Parallelism means you want multiple drives that you can do i/o to
simultaneously without having to seek. Random access helps simulate
this, but only somewhat. Large database installations always want
lots of parallel disks for similar reasons to this.

The basic method of sorting large files has traditionally been:

1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
big as will fit in memory. Sort each piece with your favorite
in-memory sort, and write out sorted pieces that we'll designate
r(1,1),...r(1,n). The r(a,b)'s are called "runs". Use multiple
output devices (tape or disk drives) to write out the runs, i.e. each
output tape will contain its own bunch of runs.

2) Read back a bunch of the runs in parallel, i.e. say you have
written r(1,1),r(1,2),...,r(1,5) to five separate tape drives. Read
them back simultaneously and merge them (requires very little external
memory) into a new "second level" run called r(2,1). Similarly merge
r(1,6),...,r(1,10) to the second level run r(2,2), and so forth.

3) Merge the second level runs into third level runs, and so forth
recursively until there's only one giant run. That is the sorted
output.

The amount of memory determines the size of the initial "first level"
runs, which determines how many recursive merges are needed, so more
memory helps.

The number of levels of recursion also depends on the "merge order",
i.e. the number of runs merged at each merge step. You really want
each merge to read from M physical input devices that are separate
from the output device(s).

There are obviously a lot of optimizations that the above description
is missing, and there's a lot of strategy about how to organize the
merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
output, or 5 in and 3 out, or what?

Is this some kind of production deployment problem you're working on,
or do you just have a big pile of data you need to sort once? If you
need to deploy across multiple machines (i.e. you can't just buy a
giant machine if you need to do this sorting frequently), then I'd
suggest reading up on the subject, starting with Knuth vol 3.
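
A compact Python sketch of the run-and-merge scheme described above,
assuming newline-terminated lines; RUN_BYTES and MERGE_ORDER are made-up
parameters, and heapq.merge plays the role of the k-way merge that keeps
only one line per run in memory (it merges in levels, MERGE_ORDER runs
at a time, as in steps 2 and 3 above):

import heapq, tempfile

RUN_BYTES = 64 * 1024 ** 2    # size of a first-level run (fits in memory)
MERGE_ORDER = 8               # how many runs are merged per merge step

def make_runs(in_path):
    # Phase 1: produce sorted first-level runs r(1,1), r(1,2), ...
    runs = []
    with open(in_path, "rb") as src:
        while True:
            lines = src.readlines(RUN_BYTES)
            if not lines:
                break
            lines.sort()
            run = tempfile.TemporaryFile()
            run.writelines(lines)
            run.seek(0)
            runs.append(run)
    return runs

def merge_level(runs):
    # One level of merging: merge MERGE_ORDER runs at a time.
    merged = []
    for i in range(0, len(runs), MERGE_ORDER):
        group = runs[i:i + MERGE_ORDER]
        out = tempfile.TemporaryFile()
        out.writelines(heapq.merge(*group))   # k-way merge, tiny memory use
        out.seek(0)
        for r in group:
            r.close()
        merged.append(out)
    return merged

def external_sort_runs(in_path, out_path):
    runs = make_runs(in_path)
    if not runs:                              # empty input file
        open(out_path, "wb").close()
        return
    while len(runs) > 1:                      # higher-level runs until one remains
        runs = merge_level(runs)
    with open(out_path, "wb") as dst:
        dst.writelines(runs[0])               # the single final run is the output
    runs[0].close()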
 
Claudio Grondi

Paul said:
Transfer speed and parallelism matter most, if the CPU can keep up.
Parallelism means you want multiple drives that you can do i/o to
simultaneously without having to seek. Random access helps simulate
this, but only somewhat. Large database installations always want
lots of parallel disks for similar reasons to this.

The basic method of sorting large files has traditionally been:

1) Read the file in pieces p(1),p(2),...,p(n) where each piece is as
big as will fit in memory. Sort each piece with your favorite
in-memory sort, and write out sorted pieces that we'll designate
r(1,1),...r(1,n). The r(a,b)'s are called "runs". Use multiple
output devices (tape or disk drives) to write out the runs, i.e. each
output tape will contain its own bunch of runs.

2) Read back a bunch of the runs in parallel, i.e. say you have
written r(1,1),r(1,2),...,r(1,5) to five separate tape drives. Read
them back simultaneously and merge them (requires very little external
memory) into a new "second level" run called r(2,1). Similarly merge
r(1,6),...,r(1,10) to the second level run r(2,2), and so forth.

3) Merge the second level runs into third level runs, and so forth
recursively until there's only one giant run. That is the sorted
output.

The amount of memory determines the size of the initial "first level"
runs, which determines how many recursive merges are needed, so more
memory helps.

The number of levels of recursion also depends on the "merge order",
i.e. the number of runs merged at each merge step. You really want
each merge to read from M physical input devices that are separate
from the output device(s).

There are obviously a lot of optimizations that the above description
is missing, and there's a lot of strategy about how to organize the
merges, e.g. if you have 8 drives, do you use 4 for input and 4 for
output, or 5 in and 3 out, or what?
The Windows XP SP2 command-line 'sort' (sorting four gigs of 20-byte
records took 12 CPU hours and 18 wall-clock hours) has, from what I
could observe in the Task Manager, done the job in only two passes of
'copying':
1. first to a temporary file,
2. then to the final file.
In the second pass the size of the chunks processed between reading and
writing was on the order of up to tenths of megabytes, whereas in the
first pass it was on the order of up to hundreds of megabytes. I
suppose the procedure behind it was as follows:
1. decide on the size of the chunks to split the file into and on the
amount of memory required
2. process the chunks, sorting each in memory and writing it to the
temporary file
3. decide on the size of the buffers for merge-sorting the chunks into
the final file, so that they all fit into the 300 MByte of memory used
4. open as many 'pipes' as there are chunks, filling all of the pipe
buffers whenever one of them runs out of data
5. continuously switch between reading from and writing to the hard
disk, writing the results of the merge sort to the final file whenever
one of the buffers runs out of data and then refilling all of the
buffers for the next cycle (concluded from the observed scattered
reading and smooth writing)
Is this some kind of production deployment problem you're working on,
or do you just have a big pile of data you need to sort once?
The latter.
One of the intermediate reasons for doing it is an attempt to get a
better intuitive understanding of the hardware and software limits of
brute-force approaches to solving problems in the areas of AI, language
processing and data compression.
If you
need to deploy across multiple machines (i.e. you can't just buy a
giant machine if you need to do this sorting frequently), then I'd
suggest reading up on the subject, starting with Knuth vol 3.

Thank you much for your detailed response.

Claudio Grondi
 
Paul Rubin

Claudio Grondi said:
The Windows XP SP2 command-line 'sort' (sorting four gigs of 20-byte
records took 12 CPU hours and 18 wall-clock hours) has, from what I
could observe in the Task Manager, done the job in only two passes of
'copying':

That is terrible; on a reasonably fast machine these days it should
take just a few minutes.
The latter.
One of the intermediate reasons for doing it is an attempt to get a
better intuitive understanding of the hardware and software limits of
brute-force approaches to solving problems in the areas of AI, language
processing and data compression.

This makes no sense; the methods you want are from the ancient days of
"data processing", nothing to do with AI. Remember that the mainframe
computers of the 1960s that this stuff was developed on had less
computing power and memory than a modern mobile phone. But phone
companies, tax agencies, and so forth still managed to run these big
sorting tasks on those mainframes in order to send people their phone
bills, identify the tax cheaters, etc. There is a rich and wonderful
history to it.
 
Dennis Lee Bieber

I am on Windows, and it seems that the Windows XP SP2 'sort' can work
with the file, but not without a temporary file and space for the resulting

Lovely... My desktop's been running WinXP SP2 for a year and I never
even knew there was a "sort" verb on it <G>

Considering how many times I've been opening small text files in
Excel just to sort records...
 
Dennis Lee Bieber

So the most effective measure for shortening the time required to sort
very large files would be to use faster hard drives (e.g. 10,000 rpm
instead of 7,600 rpm) and faster interfaces for the data transfer
(e.g. E-IDE or S-ATA instead of USB), right?
Definitely not USB <G> (nor FireWire).

Of course, if you wanted the real feel of classical sorting, you'd
put up a bank of 3-4 9-track tape drives <G>
 
Dennis Lee Bieber

The Windows XP SP2 command-line 'sort' (sorting four gigs of 20-byte
records took 12 CPU hours and 18 wall-clock hours) has, from what I
could observe in the Task Manager, done the job in only two passes of
'copying':
1. first to a temporary file,
2. then to the final file.
In the second pass the size of the chunks processed between reading and
writing was on the order of up to tenths of megabytes, whereas in the
first pass it was on the order of up to hundreds of megabytes. I
suppose the procedure behind it was as follows:
1. decide on the size of the chunks to split the file into and on the
amount of memory required

It probably needed to allow for stack space if using a recursive
algorithm (quicksort), which, for short records, could mean more
recursion...
2. process the chunks, sorting each in memory and writing it to the
temporary file

Assuming the 300MB you mention later is the data size per chunk, a
4GB file would take around 14 "chunks"...
3. decide on the size of the buffers for merge-sorting the chunks into
the final file, so that they all fit into the 300 MByte of memory used

The merge phase for this, using lots of head seeks, only needs room
to hold 14 records -- one per chunk. It is possible they have some other
optimizations; like loading all records from a chunk that are
lower/earlier than the first record of all other chunks before writing
any data. This could perhaps help if the input had been nearly sorted to
begin with, as each chunk would be nearly "done".

 
Claudio Grondi

Paul said:
That is terrible; on a reasonably fast machine these days it should
take just a few minutes.
OK, I see - the misunderstanding is that there were 4,294,967,296
records, each 20 bytes long, which makes the actual file 85,899,345,920
bytes large (I used 'gigs' for the number of records, not the size of
the file).
Is that sorting time still not acceptable?

Claudio Grondi
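
For a rough sense of scale, a back-of-envelope estimate in Python,
using the ~25 MByte/s USB transfer rate reported earlier and assuming a
two-pass external sort that reads and writes the data once per pass
(the numbers are only illustrative):

file_bytes = 4_294_967_296 * 20        # 85,899,345,920 bytes, as above
usb_rate   = 25 * 10 ** 6              # ~25 MByte/s observed over USB
passes     = 2                         # sorted runs to a temp file, then merge
io_bytes   = file_bytes * 2 * passes   # each pass reads and writes the data
print(io_bytes / usb_rate / 3600)      # roughly 3.8 hours of streaming I/O

On that assumption the raw sequential transfer accounts for only a few
hours, which suggests that much of the observed 18 hours went into
seeking and other overhead rather than into moving data.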
 
