Re: concurrent file reading/writing using python

Discussion in 'Python' started by Steve Howell, Mar 27, 2012.

  1. Steve Howell


    On Mar 26, 3:56 pm, Abhishek Pratap <> wrote:
    > Hi Guys
    >
    > I am fwding this question from the python tutor list in the hope of
    > reaching more people experienced in concurrent disk access in python.
    >
    > I am trying to see if there are ways in which I can read a big file
    > concurrently on a multi core server and process data and write the
    > output to a single file as the data is processed.
    >
    > For example, if I have a 50Gb file, I would like to read it in parallel
    > with 10 processes/threads, each working on its own chunk of the data and
    > performing the same data-parallel computation on that chunk, collating
    > the output into a single file.
    >
    > I would appreciate your feedback. I did find some threads about this on
    > stackoverflow, but it was not clear to me what would be a good way to
    > go about implementing this.
    >


    Have you written a single-core solution to your problem? If so, can
    you post the code here?

    If CPU isn't your primary bottleneck, then you need to be careful not
    to overly complicate your solution by getting multiple cores
    involved. All the coordination might make your program slower and
    more buggy.

    If CPU is the primary bottleneck, then you might want to consider an
    approach where you only have a single thread that's reading records
    from the file, 10 at a time, and then dispatching out the calculations
    to different threads, then writing results back to disk.
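
    Something along these lines could be a starting point -- a minimal,
    untested sketch, using processes rather than threads so the GIL doesn't
    get in the way of CPU-bound work. "input.txt", "output.txt", and
    compute() are placeholders for your own file names and per-record
    calculation:

        import multiprocessing

        def compute(line):
            # Stand-in for the real CPU-bound work on one record.
            return line.upper()

        def main():
            pool = multiprocessing.Pool(processes=10)
            with open("input.txt") as src, open("output.txt", "w") as dst:
                # This process reads the file sequentially; imap hands the
                # records out to the workers in chunks of 10 and yields
                # results back in input order.
                for result in pool.imap(compute, src, chunksize=10):
                    dst.write(result)
            pool.close()
            pool.join()

        if __name__ == "__main__":
            main()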

    My approach would be something like this:

    1) Take a small sample of your dataset so that you can process it
    within 10 seconds or so using a simple, single-core program.
    2) Figure out whether you're CPU bound. A simple way to do this is
    to comment out the actual computation or replace it with a trivial
    stub. If you're CPU bound, the stubbed version will run much faster.
    If you're I/O bound, it won't run much faster (since most of the time
    is spent just reading from disk). A rough timing sketch for this step
    follows below.
    3) Figure out how to read 10 records at a time and farm out the
    records to threads. Hopefully, your program will take significantly
    less time. At this point, don't obsess over collating data. It might
    not be 10 times as fast, but it should be enough faster to be worth
    your while.
    4) If the threaded approach shows promise, make sure that you can
    still generate correct output with that approach (in other words,
    figure out synchronization and collating).

    At the end of that experiment, you should have a better feel on where
    to go next.
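
    For step 2, even a crude timing harness is enough to see the
    difference -- a rough sketch, where "sample.txt" and compute() stand in
    for your sample file and your real calculation:

        import time

        def compute(line):
            return len(line)   # replace with the real work (or a stub)

        def timed_run(do_work):
            # Read the sample file once, optionally doing the computation.
            start = time.time()
            with open("sample.txt") as src:
                for line in src:
                    if do_work:
                        compute(line)
            return time.time() - start

        if __name__ == "__main__":
            print("with computation:    %.2f s" % timed_run(True))
            print("without computation: %.2f s" % timed_run(False))

    If the two numbers are close, you're I/O bound; if the second run is
    much faster, you're CPU bound.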

    What is the nature of your computation? Maybe it would be easier to
    tune the algorithm than to figure out the multi-core optimization.
    Steve Howell, Mar 27, 2012
    #1

  2. Thanks for the advice, Dennis.

    @Steve : I haven't actually written the code. I was thinking more on
    the generic side and wanted to check whether what I had in mind made
    sense, and I now realize it can depend on the I/O. For starters I was
    just thinking about counting lines in a file without doing any
    computation, so this would be strictly I/O bound.
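
    Something as simple as this, I suppose (the file name is just a
    placeholder):

        # Strictly I/O-bound baseline: one process, one sequential pass.
        with open("big_file.txt") as f:
            line_count = sum(1 for _ in f)
        print(line_count)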

    I guess what I needed to ask was: can we improve on the existing disk
    I/O performance by reading different portions of the file using threads
    or processes? I am kind of pointing towards a MapReduce-style task on a
    file in a shared file system such as GPFS (from IBM). I realize this may
    be more suited to HDFS, but I wanted to know if people have implemented
    something similar on a normal Linux-based NFS.

    -Abhi


    On Mon, Mar 26, 2012 at 6:44 PM, Steve Howell <> wrote:
    > [snip]
    Abhishek Pratap, Mar 27, 2012
    #2

  3. On Mon, 26 Mar 2012 23:08:08 -0700, Abhishek Pratap
    <> declaimed the following in
    gmane.comp.python.general:

    > I guess what I needed to ask was: can we improve on the existing disk
    > I/O performance by reading different portions of the file using threads
    > or processes? I am kind of pointing towards a MapReduce-style task on a
    > file in a shared file system such as GPFS (from IBM). I realize this may
    > be more suited to HDFS, but I wanted to know if people have implemented
    > something similar on a normal Linux-based NFS.
    >


    At the base, /anything/ that forces seeking on a disk is going to
    have a negative impact. Pretending that the OS has nothing else
    accessing the disk, a single reader thread generates something like:

    seek to track/block and read directory information to find which blocks
    contain the file

    seek to first data track/block
    read data until end of allocated blocks on this track
    step to next track/block locations
    repeat

    If you spawn multiple threads (two, for example) you end
    up with:

    1) seek to track/block and read directory information

    1) compute offset into file
    seek to [offset] track/data location
    read block

    2) seek to track/block and read directory information

    2) compute offset into file
    seek to [offset] track/data location
    read block

    LOOP
    1) seek [back] to last read location for this thread
    if end of allocated blocks on this track, step to next
    track/block
    read block

    2) seek [back] to last read location for this thread
    ...

    1/2) repeat until end of data


    Half your I/O time becomes waiting for the drive head to seek
    and settle.

    As has been suggested, use one master thread to just read from the
    file -- sequentially, no jumping around -- and distribute the records
    (via some sort of IPC queue) to the worker processes. Depending on the
    architecture, you might use the same master to collect results and write
    them to the output file. A complication: is the output /order/ dependent
    upon the order of the input? If it is, then the collector task would
    have to block for each worker in sequence even if some have finished
    ahead of others.
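
    A rough, untested sketch of that layout using the multiprocessing
    module -- the file names, compute(), and the worker count are
    placeholders, and the order-preserving collector is only one way of
    handling the ordering complication:

        import multiprocessing as mp

        NUM_WORKERS = 4
        SENTINEL = None

        def compute(line):
            return line.upper()   # stand-in for the real per-record work

        def reader(tasks):
            # Master/reader: one sequential pass over the input, tagging
            # each record with its position so order can be restored.
            with open("input.txt") as src:
                for seqno, line in enumerate(src):
                    tasks.put((seqno, line))
            for _ in range(NUM_WORKERS):
                tasks.put(SENTINEL)

        def worker(tasks, results):
            # Pull records off the task queue until the reader says stop.
            for seqno, line in iter(tasks.get, SENTINEL):
                results.put((seqno, compute(line)))
            results.put(SENTINEL)

        def main():
            tasks = mp.Queue(maxsize=1000)  # bounded, so the reader can't race ahead
            results = mp.Queue()
            procs = [mp.Process(target=reader, args=(tasks,))]
            procs += [mp.Process(target=worker, args=(tasks, results))
                      for _ in range(NUM_WORKERS)]
            for p in procs:
                p.start()

            # Collector: buffer out-of-order results, write in input order.
            pending, next_seq, done = {}, 0, 0
            with open("output.txt", "w") as dst:
                while done < NUM_WORKERS:
                    item = results.get()
                    if item is SENTINEL:
                        done += 1
                        continue
                    seqno, result = item
                    pending[seqno] = result
                    while next_seq in pending:
                        dst.write(pending.pop(next_seq))
                        next_seq += 1

            for p in procs:
                p.join()

        if __name__ == "__main__":
            main()
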
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
    Dennis Lee Bieber, Mar 27, 2012
    #3
