Processing a file using multiple threads

  • Thread starter Abhishek Pratap

Abhishek Pratap

Hi Guys

My experience with Python is 2 days, and I am looking for a slick way
to use multi-threading to process a file. Here is what I would like to
do, which is somewhat similar to MapReduce in concept.

# test case

1. My input file is 10 GB.
2. I want to open 10 file handles, each handling 1 GB of the file.
3. Each file handle is processed by an individual thread using the
same function (so a total of 10 cores are assumed to be available on
the machine).
4. There will be 10 different output files.
5. Once the 10 jobs are complete, a reduce-style function will
combine the output.

Could you give some ideas ?

So given a file I would like to read it in #N chunks through #N file
handles and process each of them separately.

Best,
-Abhi
 

Gregory Ewing

Abhishek said:
3. Each file handle is processed by an individual thread using the
same function (so a total of 10 cores are assumed to be available on
the machine).

Are you expecting the processing to be CPU bound or
I/O bound?

If it's I/O bound, multiple cores won't help you, and
neither will threading, because it's the disk doing the
work, not the CPU.

If it's CPU bound, multiple threads in one Python process
won't help, because of the GIL. You'll have to fork
multiple OS processes in order to get Python code running
in parallel on different cores.
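For the CPU-bound case, here is a minimal sketch of fanning work out to OS processes with the standard-library multiprocessing module; `cpu_heavy` is a made-up stand-in for the real per-chunk work:

```python
from multiprocessing import Pool

def cpu_heavy(n):
    # stand-in for a CPU-bound per-chunk computation
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # each argument is handled in a separate OS process, so the
    # parent's GIL does not serialize the work across cores
    with Pool(processes=4) as pool:
        results = pool.map(cpu_heavy, [10, 100, 1000])
    print(results)
```

Each worker is a full OS process with its own interpreter and its own GIL, which is why this scales across cores where threads would not.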
 

aspineux

Abhishek said:
My input file is 10 GB. I want to open 10 file handles, each handling
1 GB of the file, and process each with an individual thread using the
same function. Could you give some ideas?

You can use the "multiprocessing" module instead of threads to bypass
the GIL limitation.

First cut your file into 10 roughly equal parts. If it is line based,
search for the first line boundary close to each cut. Be sure to have
a "start" and an "end" for each part: start is the offset of the first
character of the first line, and end is one line too far (== the start
of the next block).
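One way to compute those (start, end) pairs is to seek to each even cut and scan forward to the next line boundary; the helper name here is mine, a sketch:

```python
import os

def line_chunks(filename, nchunks):
    """Return (start, end) byte offsets that cover the whole file
    and always cut at line boundaries."""
    size = os.path.getsize(filename)
    offsets = [0]
    with open(filename, 'rb') as f:
        for i in range(1, nchunks):
            f.seek(i * size // nchunks)   # jump to the even cut
            f.readline()                  # advance to the next line start
            pos = f.tell()
            # skip duplicates when chunks are smaller than one line
            if offsets[-1] < pos < size:
                offsets.append(pos)
    offsets.append(size)
    return list(zip(offsets, offsets[1:]))
```

Each pair's end equals the next pair's start, so every byte of the file lands in exactly one chunk.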

Then use this function to handle each part.

def handle(filename, start, end):
    # open in binary mode so byte offsets from seek/len stay consistent
    with open(filename, 'rb') as f:
        f.seek(start)
        pos = start
        for line in f:
            if pos >= end:   # this line starts in the next block
                break
            # handle line here
            print(line)
            pos += len(line)

Do it first in a single process/thread to be sure it is OK (easier
to debug), then split into multiple processes.
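Putting the pieces together, here is a sketch of the map step with multiprocessing plus a trivial reduce that concatenates the per-chunk output files; all names, and the uppercasing "processing", are placeholders for the real work:

```python
import os
from multiprocessing import Pool

def handle(args):
    """Map step: process one byte range of the file into its own output file."""
    filename, start, end, outname = args
    with open(filename, 'rb') as f, open(outname, 'wb') as out:
        f.seek(start)
        pos = start
        for line in f:
            if pos >= end:            # this line belongs to the next chunk
                break
            out.write(line.upper())   # placeholder per-line processing
            pos += len(line)
    return outname

def map_reduce(filename, chunks, result='combined.out'):
    jobs = [(filename, s, e, '%s.part%d' % (filename, i))
            for i, (s, e) in enumerate(chunks)]
    with Pool(len(jobs)) as pool:
        parts = pool.map(handle, jobs)   # one OS process per chunk
    with open(result, 'wb') as out:      # reduce step: concatenate in order
        for part in parts:
            with open(part, 'rb') as p:
                out.write(p.read())
            os.remove(part)
    return result
```

`pool.map` preserves the order of the jobs, so the reduce step can simply concatenate the partial outputs to reconstruct the original line order.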
 

Roy Smith

aspineux said:
You can use "multiprocessing" module instead of thread to bypass the
GIL limitation.

I agree with this.
First cut your file into 10 roughly equal parts. If it is line based,
search for the first line boundary close to each cut. Be sure to have
a "start" and an "end" for each part: start is the offset of the first
character of the first line, and end is one line too far (== the start
of the next block).

How much of the total time will be I/O and how much actual processing?
Unless your processing is trivial, the I/O time will be relatively
small. In that case, you might do well to just use the unix
command-line "split" utility to split the file into pieces first, then
process the pieces in parallel. Why waste effort getting the
file-splitting-at-line-boundaries logic correct when somebody has done
it for you?
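For instance, with the portable `-l` flag of `split` (the `tr` step is just a stand-in for real per-line processing, and the file names are arbitrary):

```shell
# make a sample input of 1000 numbered lines
seq 1 1000 > big.txt

# map: split into 100-line pieces named part_aa, part_ab, ...
split -l 100 big.txt part_

# process each piece in parallel, one background job per piece
for p in part_*; do
    tr '0-9' 'A-J' < "$p" > "out_${p#part_}" &
done
wait

# reduce: concatenate the per-piece outputs in order
cat out_* > combined.out
```

Because `split` names the pieces in lexicographic order, a plain `cat` of the outputs reassembles them in the original order.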
 

Abhishek Pratap

Hi All

@Roy: split in unix sounds good, but will it be as efficient as
opening 10 different file handles on the same file? I haven't tried
it, so I'm just wondering if you have any experience with it.

Thanks for your input. Also, I was not aware of Python's GIL limitation.

My application is not I/O bound as far as I can tell. Each line is
read and then processed independently of the others. This may sound
I/O intensive, since #N files will be read, but with 10 processes
running under a parent it might not be a bottleneck.

Best,
-Abhi
 
