Multiprocessing and file I/O

Infinity77

Hi All,

I am trying to speed up some code which reads a bunch of data from
a disk file. Just for the fun of it, I thought I would try parallel
I/O, splitting the reading of the file between multiple processes.
Although I have been warned that concurrent access by multiple
processes to the same file may actually slow down the reading, I was
curious to try some timings while varying the number of processes
reading the file. I know almost nothing about multiprocessing, so I
was wondering if anyone had a very simple snippet of code that
demonstrates how to read a file using multiprocessing.

My idea was to create a "big" file by doing:

fid = open("somefile.txt", "wb")
fid.write("HELLO\n" * 10**7)
fid.close()

and then use fid.seek() to point every process I start at a
different position inside the file and have it read from there. For
example, with 4 processes and a 10 MB file, I would tell the first
process to read from byte 0 to byte 2.5 million, the second one from
2.5 million to 5 million, and so on. This is just an academic
curiosity :-D
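
Something along these lines (a rough, untested sketch; the chunk
boundaries would of course fall in the middle of lines) is roughly
what I have in mind:

import os
import time
import multiprocessing

def read_chunk(path, start, size):
    # Each process opens its own handle, seeks to its offset and
    # reads its slice of the file.
    fid = open(path, "rb")
    fid.seek(start)
    fid.read(size)
    fid.close()

def parallel_read(path, nprocs):
    total = os.path.getsize(path)
    chunk = total // nprocs
    processes = []
    for i in range(nprocs):
        start = i * chunk
        # The last process also picks up any leftover bytes.
        size = (total - start) if i == nprocs - 1 else chunk
        p = multiprocessing.Process(target=read_chunk,
                                    args=(path, start, size))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()

if __name__ == "__main__":
    start = time.time()
    parallel_read("somefile.txt", 4)
    print("4 processes: %.2f seconds" % (time.time() - start))

I would then simply vary the number of processes and compare the
timings against a single plain read of the whole file.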

Any suggestion is very welcome, either on the approach or on the
actual implementation. Thank you for your help.

Andrea.
 
Igor Katson

Infinity77 said:
I am trying to speed up some code which reads a bunch of data from
a disk file. [...] I know almost nothing about multiprocessing, so I
was wondering if anyone had a very simple snippet of code that
demonstrates how to read a file using multiprocessing.
If what you want to speed up is the processing of the file (and not
the I/O), I would have one process actually read the file and feed
the data to the other processes through a queue.
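
Something like this untested sketch, with the actual processing left
as a placeholder:

import multiprocessing

def worker(queue):
    # Consume lines until the reader sends the None sentinel.
    while True:
        line = queue.get()
        if line is None:
            break
        # ... do the actual processing of the line here ...

def main(path, nworkers=4):
    queue = multiprocessing.Queue(maxsize=10000)
    workers = [multiprocessing.Process(target=worker, args=(queue,))
               for i in range(nworkers)]
    for w in workers:
        w.start()
    # A single process (this one) reads the file sequentially and
    # feeds the lines to the workers.
    fid = open(path, "rb")
    for line in fid:
        queue.put(line)
    fid.close()
    # One sentinel per worker tells them to shut down.
    for w in workers:
        queue.put(None)
    for w in workers:
        w.join()

if __name__ == "__main__":
    main("somefile.txt")

That way the disk is still read sequentially, but the work on the
data is spread over several processes.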
 
Infinity77

Hi Igor,

If what you want to speed up is the processing of the file (and not
the I/O), I would have one process actually read the file and feed
the data to the other processes through a queue.

No, the processing of the data is fast enough, as it is very simple.
What I was asking was whether anyone could share an example of using
multiprocessing to read a file, along the lines I described above.

Andrea.
 
Infinity77

Hi Paul & All,

Take a look at this section in an article about multi-threaded
processing of large files:

http://effbot.org/zone/wide-finder.htm#a-multi-threaded-python-solution

Thank you for the pointer, I have read the article and the follow-ups
with much interest... it's unfortunate that Python is no longer in
first place, though :-D
I'll see if I can come up with a faster implementation of my
(f2py/Fortran-based) Python module using multiprocessing.
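
If I get that far, I imagine something along these lines, with
process_chunk standing in for a call into the f2py-wrapped routine
(purely hypothetical at this point):

import os
import multiprocessing

def process_chunk(args):
    path, start, size = args
    # Read one slice of the file; the real code would hand the data
    # to the Fortran routine instead of just counting bytes.
    fid = open(path, "rb")
    fid.seek(start)
    data = fid.read(size)
    fid.close()
    return len(data)

def run(path, nprocs=4):
    total = os.path.getsize(path)
    chunk = total // nprocs
    tasks = []
    for i in range(nprocs):
        start = i * chunk
        size = (total - start) if i == nprocs - 1 else chunk
        tasks.append((path, start, size))
    pool = multiprocessing.Pool(nprocs)
    results = pool.map(process_chunk, tasks)
    pool.close()
    pool.join()
    return sum(results)

if __name__ == "__main__":
    print(run("somefile.txt"))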

Thank you.

Andrea.
 
