Re: Multi thread reading a file

Discussion in 'Python' started by Gabriel Genellina, Jul 1, 2009.

  1. On Tue, 30 Jun 2009 22:52:18 -0300, Mag Gam <> wrote:

    > I am very new to python and I am in the process of loading a very
    > large compressed csv file into another format. I was wondering if I
    > can do this in a multi thread approach.


    Does the format conversion involve a significant processing time? If not,
    the total time is dominated by the I/O time (reading and writing the file)
    so it's doubtful you gain anything from multiple threads.

    > Here is the pseudo code I was thinking about:
    >
    > Let T = total number of lines in the file, for example 1000000 (1 million
    > lines)
    > Let B = Total number of lines in a buffer, for example 10000 lines
    >
    >
    > Create a thread to read until buffer
    > Create another thread to read buffer+buffer (so we have 2 threads
    > now). But since the file is zipped I have to wait until the first
    > thread is completed, unless someone knows of a clever technique.
    > Write the content of thread 1 into a numpy array
    > Write the content of thread 2 into a numpy array


    Can you process each line independently? Is the record order important? If
    not (or at least, some local reordering is acceptable) you may use a few
    worker threads (doing the conversion), feed them through a Queue object, put
    the converted lines into another Queue, and let another thread write the
    results onto the destination file.

    import Queue, threading, csv

    def convert(in_queue, out_queue):
        while True:
            row = in_queue.get()
            if row is None: break
            # ... convert row
            out_queue.put(converted_line)

    def write_output(out_queue):
        while True:
            line = out_queue.get()
            if line is None: break
            # ... write line to output file

    in_queue = Queue.Queue()
    out_queue = Queue.Queue()
    tlist = []
    for i in range(4):
        t = threading.Thread(target=convert, args=(in_queue, out_queue))
        t.start()
        tlist.append(t)
    output_thread = threading.Thread(target=write_output, args=(out_queue,))
    output_thread.start()

    with open("...") as csvfile:
        reader = csv.reader(csvfile, ...)
        for row in reader:
            in_queue.put(row)

    for t in tlist: in_queue.put(None) # indicate end-of-work
    for t in tlist: t.join() # wait until finished
    out_queue.put(None)
    output_thread.join() # wait until finished
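
A self-contained variant of the pipeline above for Python 3, where the module is named queue. This is only a sketch: the conversion step (upper-casing every field) and the in-memory results list are placeholders, and the run_pipeline wrapper is a hypothetical name that does not appear in the post.

```python
import csv
import io
import queue
import threading

def convert(in_queue, out_queue):
    # Worker: pull rows until the None end-of-work marker arrives.
    while True:
        row = in_queue.get()
        if row is None:
            break
        # Placeholder conversion (an assumption): upper-case every field.
        out_queue.put([field.upper() for field in row])

def write_output(out_queue, results):
    # Writer: collect converted rows until the None marker arrives.
    while True:
        line = out_queue.get()
        if line is None:
            break
        results.append(line)

def run_pipeline(csvfile, n_workers=4):
    in_queue, out_queue, results = queue.Queue(), queue.Queue(), []
    workers = [threading.Thread(target=convert, args=(in_queue, out_queue))
               for _ in range(n_workers)]
    for t in workers:
        t.start()
    writer = threading.Thread(target=write_output, args=(out_queue, results))
    writer.start()
    for row in csv.reader(csvfile):
        in_queue.put(row)
    for _ in workers:
        in_queue.put(None)   # one end-of-work marker per worker
    for t in workers:
        t.join()
    out_queue.put(None)      # all workers are done; stop the writer
    writer.join()
    return results
```

With several workers the output order is not guaranteed to match the input order, which is why the approach only fits when some reordering is acceptable.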

    --
    Gabriel Genellina
     
    Gabriel Genellina, Jul 1, 2009
    #1

  2. Gabriel Genellina wrote:
    > On Tue, 30 Jun 2009 22:52:18 -0300, Mag Gam <> wrote:
    >
    >> I am very new to python and I am in the process of loading a very
    >> large compressed csv file into another format. I was wondering if I
    >> can do this in a multi thread approach.

    >
    > Does the format conversion involve a significant processing time? If
    > not, the total time is dominated by the I/O time (reading and writing
    > the file) so it's doubtful you gain anything from multiple threads.


    Well, the OP didn't say anything about multiple processors, so multiple
    threads may not help wrt. processing time. However, if the file is large
    and the OS can schedule the I/O in a way that a seek disaster is avoided
    (although that's hard to assure with today's hard disk storage density, but
    SSDs may benefit), multiple threads reading multiple partial streams may
    still reduce the overall runtime due to increased I/O throughput.

    That said, the OP was mentioning that the data was compressed, so I doubt
    that the I/O bandwidth is a problem here. As another poster put it: why
    bother? Run a few benchmarks first to see where (and if!) things really get
    slow, and then check what to do about the real problem.
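
One way to run such a benchmark is to time the reading stage and the conversion stage separately. The helper below is a rough sketch; time_stages is a made-up name, and the conversion function is whatever the real job does.

```python
import time

def time_stages(lines, convert):
    """Return (read_seconds, convert_seconds) for one pass over `lines`."""
    start = time.perf_counter()
    rows = list(lines)            # stage 1: pull every line from the source
    read_seconds = time.perf_counter() - start

    start = time.perf_counter()
    for row in rows:              # stage 2: run the conversion alone
        convert(row)
    convert_seconds = time.perf_counter() - start
    return read_seconds, convert_seconds
```

The helper keeps all lines in memory, so for a very large file it is better to pass a sample, for example the first 100000 lines via itertools.islice. If the read time dominates, extra conversion threads are unlikely to help.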

    Stefan
     
    Stefan Behnel, Jul 1, 2009
    #2

  3. On Wed, 01 Jul 2009 12:49:31 -0300, Scott David Daniels
    <> wrote:

    > Gabriel Genellina wrote:
    >> ...
    >> def convert(in_queue, out_queue):
    >>     while True:
    >>         row = in_queue.get()
    >>         if row is None: break
    >>         # ... convert row
    >>         out_queue.put(converted_line)

    >
    > These loops work well with the two-argument version of iter,
    > which is easy to forget, but quite useful to have in your bag
    > of tricks:
    >
    > def convert(in_queue, out_queue):
    >     for row in iter(in_queue.get, None):
    >         # ... convert row
    >         out_queue.put(converted_line)


    Yep, I always forget about that variant of iter() -- very handy!

    --
    Gabriel Genellina
     
    Gabriel Genellina, Jul 2, 2009
    #3
  4. ryles Guest

    On Jul 2, 6:10 am, "Gabriel Genellina" <> wrote:
    > On Wed, 01 Jul 2009 12:49:31 -0300, Scott David Daniels
    > <> wrote:
    > > These loops work well with the two-argument version of iter,
    > > which is easy to forget, but quite useful to have in your bag
    > > of tricks:

    >
    > >      def convert(in_queue, out_queue):
    > >          for row in iter(in_queue.get, None):
    > >              # ... convert row
    > >              out_queue.put(converted_line)

    >
    > Yep, I always forget about that variant of iter() -- very handy!


    Yes, at first glance using iter() here seems quite elegant and clever.
    You might even pat yourself on the back, or treat yourself to an ice
    cream cone, as I once did. There is one subtle distinction, however.
    Please allow me to demonstrate.

    >>> import Queue
    >>>
    >>> queue = Queue.Queue()
    >>>
    >>> queue.put(1)
    >>> queue.put("la la la")
    >>> queue.put(None)
    >>>
    >>> list(iter(queue.get, None))
    [1, 'la la la']
    >>>
    >>> # Cool, it really works! I'm going to change all my old code to use this... new and *improved*
    ...
    >>> # And then one day your user inevitably does something like this.
    ...
    >>> class A(object):
    ...     def __init__(self, value):
    ...         self.value = value
    ...
    ...     def __eq__(self, other):
    ...         return self.value == other.value
    ...
    >>> queue.put(A(1))
    >>> queue.put(None)
    >>>
    >>> # And then this happens inside your 'generic' code (which probably even passed your unit tests).
    ...
    >>> list(iter(queue.get, None))
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "<stdin>", line 5, in __eq__
    AttributeError: 'NoneType' object has no attribute 'value'
    >>>
    >>> # Oh... yeah. I really *did* want 'is None' and not '== None' which iter() will do. Sorry guys!


    Please don't let this happen to you too ;)
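
A small generator can give the same loop shape while comparing by identity. The sketch below uses Python 3 names; iter_until is a hypothetical helper, not something from the standard library.

```python
import queue

def iter_until(func, sentinel):
    """Like iter(func, sentinel), but stops on identity rather than equality."""
    while True:
        item = func()
        if item is sentinel:
            return
        yield item

class A:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return self.value == other.value   # raises AttributeError for None

q = queue.Queue()
q.put(A(1))
q.put(None)
items = list(iter_until(q.get, None))      # A.__eq__ is never called
```

Because the helper never invokes __eq__ on the queued objects, a user-defined class like A cannot break the loop.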
     
    ryles, Jul 3, 2009
    #4
  5. Paul Rubin Guest

    ryles <> writes:
    > >>> # Oh... yeah. I really *did* want 'is None' and not '== None'
    > >>> which iter() will do. Sorry guys!

    >
    > Please don't let this happen to you too ;)


    None is a perfectly good value to put onto a queue. I prefer
    using a unique sentinel to mark the end of the stream:

    sentinel = object()
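
A short sketch of how such a sentinel might be used with an explicit identity test, so that None can travel through the queue as ordinary data. The names _DONE and drain are illustrative only.

```python
import queue

_DONE = object()   # unique marker; no other queued value can be this object

def drain(q):
    """Collect items from q until the _DONE marker is seen."""
    items = []
    while True:
        item = q.get()
        if item is _DONE:
            break
        items.append(item)
    return items

q = queue.Queue()
for value in (1, None, "la la la", _DONE):
    q.put(value)
collected = drain(q)   # None is kept as a regular item
```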
     
    Paul Rubin, Jul 3, 2009
    #5
  6. ryles Guest

    On Jul 2, 10:20 pm, Paul Rubin <http://> wrote:
    > ryles <> writes:
    > > >>> # Oh... yeah. I really *did* want 'is None' and not '== None'
    > > >>> which iter() will do. Sorry guys!

    >
    > > Please don't let this happen to you too ;)

    >
    > None is a perfectly good value to put onto a queue.  I prefer
    > using a unique sentinel to mark the end of the stream:
    >
    >    sentinel = object()


    I agree, this is cleaner than None. We're still in the same boat,
    though, regarding iter(). Either it's 'item == None' or 'item == object
    ()', and depending on the type, __eq__ can introduce some avoidable
    risk.

    FWIW, even object() has its disadvantages. Namely, it doesn't work for
    multiprocessing.Queue which pickles and unpickles, thus giving you a
    new object. One way to deal with this is to define a "Stopper" class
    and type check objects taken from the queue. This is not news to
    anyone who's worked with multiprocessing.Queue, though.
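
The pickling issue can be seen without starting any processes, by doing the pickle round trip by hand. In this sketch the Stopper class and the is_stop helper are only one possible shape for the type check described above.

```python
import pickle

SENTINEL = object()
# Simulate the pickle/unpickle step that multiprocessing.Queue performs.
restored = pickle.loads(pickle.dumps(SENTINEL))
identity_survives = restored is SENTINEL   # unpickling builds a new object

class Stopper:
    """Marker type: any instance means end of stream."""

def is_stop(item):
    # A type check does not depend on object identity, so it still works
    # for the fresh instance that unpickling creates on the receiving side.
    return isinstance(item, Stopper)
```

For the type check to work across processes, the Stopper class has to be importable under the same name on both sides of the queue.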
     
    ryles, Jul 3, 2009
    #6
  7. Paul Rubin Guest

    ryles <> writes:
    > >    sentinel = object()

    >
    > I agree, this is cleaner than None. We're still in the same boat,
    > though, regarding iter(). Either it's 'item == None' or 'item == object ()'


    Use "item is sentinel".
     
    Paul Rubin, Jul 3, 2009
    #7
  8. On Fri, 03 Jul 2009 00:15:40 -0300, Paul Rubin <http://> wrote:

    > ryles <> writes:
    >> >    sentinel = object()

    >>
    >> I agree, this is cleaner than None. We're still in the same boat,
    >> though, regarding iter(). Either it's 'item == None' or 'item == object
    >> ()'

    >
    > Use "item is sentinel".


    We're talking about the iter() builtin behavior, and that uses ==
    internally.

    It could have used an identity test, and that would be better for this
    specific case. But then iter(somefile.read, '') wouldn't work. A
    compromise solution is required; since one can customize the equality test
    but not an identity test, the former has a small advantage. (I don't know
    if this was the actual reason, or even if this really was a conscious
    decision, but that's why *I* would choose == to test against the sentinel
    value).
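
For reference, a sketch of the iter(somefile.read, '') idiom in Python 3, using an in-memory text stream in place of a real file. The read_blocks name and the block size are arbitrary choices for the example.

```python
import io
from functools import partial

def read_blocks(stream, size):
    """Read `stream` in blocks of `size` characters until read() returns ''."""
    # iter() compares each result with '' using ==, so the loop ends when
    # read() reports end-of-file with an empty string.
    return list(iter(partial(stream.read, size), ""))

blocks = read_blocks(io.StringIO("abcdefgh"), 3)
```

Here the equality test is what makes the idiom dependable: nothing in the language promises that read() returns the very same empty-string object as the literal.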

    --
    Gabriel Genellina
     
    Gabriel Genellina, Jul 3, 2009
    #8
  9. Paul Rubin Guest

    "Gabriel Genellina" <> writes:
    > We're talking about the iter() builtin behavior, and that uses ==
    > internally.


    Oh, I see. Drat.

    > It could have used an identity test, and that would be better for this
    > specific case. But then iter(somefile.read, '') wouldn't work.


    Yeah, it should allow supplying a predicate instead of using == on
    a value. How about (untested):

    from itertools import *
    ...
    for row in takewhile(lambda x: x is not sentinel,
                         starmap(in_queue.get, repeat(()))):
        ...
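
A runnable sketch of the predicate-based loop in Python 3. The predicate has to be "x is not sentinel", because takewhile keeps yielding only while the predicate is true; the queue contents are invented for the example.

```python
import queue
from itertools import repeat, starmap, takewhile

sentinel = object()
in_queue = queue.Queue()
for item in ("row 1", None, "row 2", sentinel):
    in_queue.put(item)

rows = []
# starmap(in_queue.get, repeat(())) calls in_queue.get() with no arguments,
# over and over; takewhile stops at the first item that fails the predicate.
for row in takewhile(lambda x: x is not sentinel,
                     starmap(in_queue.get, repeat(()))):
    rows.append(row)
```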
     
    Paul Rubin, Jul 3, 2009
    #9
  10. ryles Guest

    On Jul 2, 11:55 pm, Paul Rubin <http://> wrote:
    > Yeah, it should allow supplying a predicate instead of using == on
    > a value.  How about (untested):
    >
    >    from itertools import *
    >    ...
    >    for row in takewhile(lambda x: x is not sentinel,
    >                          starmap(in_queue.get, repeat(()))):
    >       ...


    Yeah, it's a small recipe I'm sure a lot of others have written as
    well. My old version:

    def iterwhile(callable_, predicate):
        """ Like iter() but with a predicate instead of a sentinel. """
        return itertools.takewhile(predicate, repeatfunc(callable_))

    where repeatfunc is as defined here:

    http://docs.python.org/library/itertools.html#recipes

    I wish all of these little recipes made their way into itertools or a
    like module; itertools seems a bit tightly guarded.
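
A self-contained version of the recipe might look as follows. The repeatfunc definition is written after the one in the itertools recipes section; the sample data is made up.

```python
import itertools

def repeatfunc(func, times=None, *args):
    """Repeat calls to func with the given arguments (after the itertools recipe)."""
    if times is None:
        return itertools.starmap(func, itertools.repeat(args))
    return itertools.starmap(func, itertools.repeat(args, times))

def iterwhile(callable_, predicate):
    """Like iter() but with a predicate instead of a sentinel."""
    return itertools.takewhile(predicate, repeatfunc(callable_))

values = iter([3, 1, 4, 0, 5])
# Keep calling next(values) until a zero shows up.
nonzero = list(iterwhile(lambda: next(values), lambda x: x != 0))
```

Since the predicate is arbitrary, the same helper covers both the identity test ("x is not sentinel") and the equality test that iter() performs.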
     
    ryles, Jul 3, 2009
    #10
  11. In message <1beffd94-cfe6-4cf6-
    >, ryles wrote:

    >>>> # Oh... yeah. I really *did* want 'is None' and not '== None' which
    >>>> # iter() will do. Sorry guys!

    >
    > Please don't let this happen to you too ;)


    Strange. others have got told off for using "== None" instead of "is None"
    <http://groups.google.co.nz/group/comp.lang.python/msg/a1f3170fa202af57>,
    and yet it turns out Python itself does exactly the same thing.
     
    Lawrence D'Oliveiro, Jul 5, 2009
    #11
  12. On Sun, 05 Jul 2009 12:12:22 +1200, Lawrence D'Oliveiro wrote:

    > In message <1beffd94-cfe6-4cf6-
    > >, ryles wrote:
    >
    >>>>> # Oh... yeah. I really *did* want 'is None' and not '== None' which
    >>>>> # iter() will do. Sorry guys!

    >>
    >> Please don't let this happen to you too ;)

    >
    > Strange. others have got told off for using "== None" instead of "is
    > None"
    > <http://groups.google.co.nz/group/comp.lang.python/msg/a1f3170fa202af57>,
    > and yet it turns out Python itself does exactly the same thing.


    That's not "strange", that's a bug. Did you report it to the tracker?



    --
    Steven
     
    Steven D'Aprano, Jul 5, 2009
    #12
  13. In message <025ff4f1$0$20657$>, Steven D'Aprano wrote:

    > On Sun, 05 Jul 2009 12:12:22 +1200, Lawrence D'Oliveiro wrote:
    >
    >> In message <>, ryles wrote:
    >>
    >>>>>> # Oh... yeah. I really *did* want 'is None' and not '== None' which
    >>>>>> # iter() will do. Sorry guys!
    >>>
    >>> Please don't let this happen to you too ;)

    >>
    >> Strange. others have got told off for using "== None" instead of "is
    >> None" <http://groups.google.co.nz/group/comp.lang.python/msg/a1f3170fa202af57>,
    >> and yet it turns out Python itself does exactly the same thing.

    >
    > That's not "strange", that's a bug.


    It's not a bug, as Gabriel Genellina has pointed out.
     
    Lawrence D'Oliveiro, Jul 5, 2009
    #13
