gather information from various files efficiently

Discussion in 'Python' started by Klaus Neuner, Dec 14, 2004.

  1. Klaus Neuner

    Klaus Neuner Guest

    Hello,

    I need to gather information that is contained in various files.

    Like so:

    file1:
    =====================
    foo : 1 2
    bar : 2 4
    baz : 3
    =====================

    file2:
    =====================
    foo : 5
    bar : 6
    baz : 7
    =====================

    file3:
    =====================
    foo : 4 18
    bar : 8
    =====================


    The straightforward way to solve this problem is to create a
    dictionary. Like so:


    [...]

    a, b = get_information(line)
    if a in dict.keys():
        dict[a].append(b)
    else:
        dict[a] = [b]


    Yet, I have got 43 such files. Together they are 4.1 MB in size.
    In the future, they will probably become much larger.
    At the moment, the process takes several hours. As it is a process
    that I have to run very often, I would like it to be faster.

    How could the problem be solved more efficiently?


    Klaus
     
    Klaus Neuner, Dec 14, 2004
    #1

  2. Klaus Neuner

    Keith Dart Guest

    Klaus Neuner wrote:
    > Hello,
    >
    > I need to gather information that is contained in various files.
    >
    > Like so:
    >
    > file1:
    > =====================
    > foo : 1 2
    > bar : 2 4
    > baz : 3
    > =====================
    >
    > file2:
    > =====================
    > foo : 5
    > bar : 6
    > baz : 7
    > =====================
    >
    > file3:
    > =====================
    > foo : 4 18
    > bar : 8
    > =====================
    >
    >
    > The straightforward way to solve this problem is to create a
    > dictionary. Like so:
    >
    >
    > [...]
    >
    > a, b = get_information(line)
    > if a in dict.keys():
    >     dict[a].append(b)
    > else:
    >     dict[a] = [b]


    Aye...

    the dict.keys() line creates a temporary list, and then the 'in' does a
    linear search of the list. Better would be:

    try:
        dict[a].append(b)
    except KeyError:
        dict[a] = [b]

    Since you expect the key to be there most of the time, this method is
    most efficient. You optimistically get the dictionary entry, and in the
    exceptional case where it doesn't yet exist, you add it.
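
    In context, the whole reading loop might look roughly like this (a sketch,
    not from the original post; the file name and the skipping of the "====="
    separator lines are assumptions based on the example files above):

    results = {}
    for line in open("file1"):            # one of the 43 input files
        if ":" not in line:
            continue                      # skip the "=====" separator lines
        a, b = line.split(":", 1)
        a, b = a.strip(), b.strip()
        try:
            results[a].append(b)          # usual case: key already present
        except KeyError:
            results[a] = [b]              # first occurrence of this key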






    --
    \/ \/
    (O O)
    -- --------------------oOOo~(_)~oOOo----------------------------------------
    Keith Dart <>
    public key: ID: F3D288E4
    ============================================================================
     
    Keith Dart, Dec 14, 2004
    #2

  3. Klaus Neuner

    Kent Johnson Guest

    Keith Dart wrote:
    > try:
    >     dict[a].append(b)
    > except KeyError:
    >     dict[a] = [b]


    or my favorite Python shortcut:
    dict.setdefault(a, []).append(b)
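
    For anyone new to it: setdefault(a, []) returns the existing list for key a
    if there is one, and otherwise stores the empty list and returns it, so the
    append always has a list to work on. A tiny illustration (the key names are
    arbitrary):

    d = {}
    d.setdefault("foo", []).append("1")   # "foo" missing: [] is stored, then appended to
    d.setdefault("foo", []).append("2")   # "foo" present: the fresh [] default is discarded
    print(d)                              # {'foo': ['1', '2']}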

    Kent
     
    Kent Johnson, Dec 14, 2004
    #3
  4. Keith Dart wrote:

    > Aye...
    >
    > the dict.keys() line creates a temporary list, and then the 'in' does a
    > linear search of the list. Better would be:
    >
    > try:
    >     dict[a].append(b)
    > except KeyError:
    >     dict[a] = [b]
    >
    > Since you expect the key to be there most of the time, this method is
    > most efficient. You optimistically get the dictionary entry, and in the
    > exceptional case where it doesn't yet exist, you add it.


    I wonder if

    dct.setdefault(a,[]).append(b)

    wouldn't be even faster. It saves setting up the try/except frame handling in
    Python (I assume the C implementation of dicts achieves similar results with
    much less overhead).

    Cheers,

    f

    ps. I changed dict->dct because it's generally a Bad Idea (TM) to name local
    variables after builtin types. This is for the benefit of the OP (I know you
    were just following his code conventions).
     
    Fernando Perez, Dec 14, 2004
    #4
  5. Klaus Neuner

    Keith Dart Guest

    Kent Johnson wrote:
    > Keith Dart wrote:
    >
    >> try:
    >>     dict[a].append(b)
    >> except KeyError:
    >>     dict[a] = [b]

    >
    >
    > or my favorite Python shortcut:
    > dict.setdefault(a, []).append(b)
    >
    > Kent


    Hey, when did THAT get in there? ;-) That's nice. However, the
    try..except block is a useful pattern for many similar situations that
    the OP might want to keep in mind. It is usually better than the
    following, also:

    if dct.has_key(a):
        dct[a].append(b)
    else:
        dct[a] = [b]


    Which is a pattern I have seen often.




    --
    \/ \/
    (O O)
    -- --------------------oOOo~(_)~oOOo----------------------------------------
    Keith Dart <>
    vcard: <http://www.kdart.com/~kdart/kdart.vcf>
    public key: ID: F3D288E4 URL: <http://www.kdart.com/~kdart/public.key>
    ============================================================================
     
    Keith Dart, Dec 14, 2004
    #5
  6. Keith Dart wrote:

    >>> try:
    >>>     dict[a].append(b)
    >>> except KeyError:
    >>>     dict[a] = [b]


    the drawback here is that exceptions are relatively expensive; if the
    number of collisions is small, you end up throwing and catching lots
    of exceptions. in that case, there are better ways to do this.

    >> dict.setdefault(a, []).append(b)


    the drawback here is that you create a new object for each call, but
    if the number of collisions is high, you end up throwing most of them
    away. in that case, there are better ways to do this.

    (gotta love that method name, btw. a serious candidate for the "most
    confusing name in the standard library" contest... or maybe even the
    "most confusing name in the history of python" contest...)

    > Hey, when did THAT get in there? ;-) That's nice. However, the try..except block is a useful
    > pattern for many similar situations that the OP might want to keep in mind. It is usually better
    > than the following, also:
    >
    > if dct.has_key(a):
    >     dct[a].append(b)
    > else:
    >     dct[a] = [b]


    the drawback here is that if the number of collisions is high, you end
    up doing lots of extra dictionary lookups. in that case, there are better
    ways to do this.
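
    (one such alternative, as a sketch that is not from the thread: a dict.get()
    lookup avoids both the exception machinery and the throwaway default list.
    it relies on None never being a legitimate value, which holds here because
    the values are lists.)

    lst = dct.get(a)
    if lst is None:
        lst = dct[a] = []   # first occurrence of this key
    lst.append(b)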

    </F>
     
    Fredrik Lundh, Dec 14, 2004
    #6
  7. Klaus Neuner

    Keith Dart Guest

    Fredrik Lundh wrote:
    > ...
    >> if dct.has_key(a):
    >>     dct[a].append(b)
    >> else:
    >>     dct[a] = [b]

    >
    >
    > the drawback here is that if the number of collisions is high, you end
    > up doing lots of extra dictionary lookups. in that case, there are better
    > ways to do this.


    Sigh, this reminds me of a discussion I had at my work once... It seems
    that to write optimal Python code one must understand the various
    probabilities of your data, and code according to the likely scenario. :cool:
    Now, perhaps we could write an adaptive data analyzer-code-generator... ;-)







    --
    \/ \/
    (O O)
    -- --------------------oOOo~(_)~oOOo----------------------------------------
    Keith Dart <>
    public key: ID: F3D288E4
    ============================================================================
     
    Keith Dart, Dec 14, 2004
    #7
  9. [Keith]
    > Sigh, this reminds me of a discussion I had at my work once... It seems
    > that to write optimal Python code one must understand the various
    > probabilities of your data, and code according to the likely scenario. :cool:


    s/Python //g

    --
    Richie Hindle
     
    Richie Hindle, Dec 14, 2004
    #9
  10. Klaus Neuner

    Peter Hansen Guest

    Keith Dart wrote:
    > Sigh, this reminds me of a discussion I had at my work once... It seems
    > that to write optimal Python code one must understand the various
    > probabilities of your data, and code according to the likely scenario.


    And this is different from optimizing in *any* other language
    in what way?

    -Peter
     
    Peter Hansen, Dec 14, 2004
    #10
  11. On Tue, 14 Dec 2004 10:40:56 -0500, Peter Hansen <> wrote:
    > Keith Dart wrote:
    > > Sigh, this reminds me of a discussion I had at my work once... It seems
    > > that to write optimal Python code one must understand the various
    > > probabilities of your data, and code according to the likely scenario.

    >
    > And this is different from optimizing in *any* other language
    > in what way?


    In other languages, by the time you get the bloody thing working it's
    time to ship, and you don't have to bother worrying about making it
    optimal.

    --
    Cheers,
    Simon B,
    ,
    http://www.brunningonline.net/simon/blog/
     
    Simon Brunning, Dec 14, 2004
    #11
  12. Klaus Neuner wrote:
    > The straightforward way to solve this problem is to create a
    > dictionary. Like so:
    >
    >
    > [...]
    >
    > a, b = get_information(line)
    > if a in dict.keys():
    >     dict[a].append(b)
    > else:
    >     dict[a] = [b]
    >


    So I timed the three suggestions with a few different datasets:

    > cat builddict.py

    def askpermission(d, k, v):
        if k in d:
            d[k].append(v)
        else:
            d[k] = [v]

    def askforgiveness(d, k, v):
        try:
            d[k].append(v)
        except KeyError:
            d[k] = [v]

    def default(d, k, v):
        d.setdefault(k, []).append(v)

    def test(items, func):
        d = {}
        for k, v in items:
            func(d, k, v)



    Dataset where every value causes a collision:

    > python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askpermission)"
    1000 loops, best of 3: 1.62 msec per loop

    > python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askforgiveness)"
    1000 loops, best of 3: 1.58 msec per loop

    > python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.default)"
    100 loops, best of 3: 2.03 msec per loop


    Dataset where no value causes a collision:

    > python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askpermission)"
    100 loops, best of 3: 2.29 msec per loop

    > python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askforgiveness)"
    100 loops, best of 3: 9.96 msec per loop

    > python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.default)"
    100 loops, best of 3: 2.98 msec per loop


    Dataset where one of every 5 values causes a collision:

    > python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askpermission)"
    1000 loops, best of 3: 1.82 msec per loop

    > python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askforgiveness)"
    1000 loops, best of 3: 1.79 msec per loop

    > python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.default)"
    100 loops, best of 3: 2.2 msec per loop


    So, when there are lots of collisions, you may get a small benefit from
    the try/except solution. If there are very few collisions, you probably
    would prefer the if/else solution. The setdefault solution performs about
    like the if/else solution, but is somewhat slower.

    I will probably continue to use setdefault, because I think it's
    prettier =) but if you're running into a speed bottleneck in this sort
    of situation, you might consider one of the other solutions.

    Steve
     
    Steven Bethard, Dec 14, 2004
    #12
  13. Klaus Neuner

    Jeff Shannon Guest

    Klaus Neuner wrote:

    >Yet, I have got 43 such files. Together they are 4.1 MB in size.
    >In the future, they will probably become much larger.
    >At the moment, the process takes several hours. As it is a process
    >that I have to run very often, I would like it to be faster.
    >
    >


    Others have shown how you can make your dictionary code more efficient,
    which should provide a big speed boost, especially if there are many
    keys in your dicts.

    However, if you're taking this long to read files each time, perhaps
    there's a better high-level approach than just a brute-force scan of
    every file every time. You don't say anything about where those files
    are coming from, or how they're created. Are they relatively static?
    (That is to say, are they (nearly) the same files being read on each
    run?) Do you control the process that creates the files? Given the
    right conditions, you may be able to store your data in a shelve, or
    even a proper database, saving you lots of time in parsing through these
    files on each run. Even if it's entirely new data on each run, you may
    be able to find a more efficient way of transferring data from whatever
    the source is into your program.
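
    As a rough illustration of the shelve idea (a sketch only; the file names,
    the cache name, and the mtime-based staleness check here are assumptions,
    not something specified in the thread):

    import os
    import shelve

    CACHE = "parsed.shelve"
    SOURCES = ["file1", "file2", "file3"]      # in practice, the 43 input files

    def build(sources):
        dct = {}
        for name in sources:
            for line in open(name):
                if ":" not in line:
                    continue                   # skip the "=====" separator lines
                a, b = map(str.strip, line.split(":", 1))
                dct.setdefault(a, []).extend(b.split())
        return dct

    def load():
        db = shelve.open(CACHE)
        newest = max(os.path.getmtime(name) for name in SOURCES)
        if db.get("mtime", 0) < newest:        # cache missing or older than sources
            db["data"] = build(SOURCES)
            db["mtime"] = newest
        data = db["data"]
        db.close()
        return data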

    Jeff Shannon
    Technician/Programmer
    Credit International
     
    Jeff Shannon, Dec 14, 2004
    #13
  14. Klaus Neuner

    Mike Meyer Guest

    Simon Brunning <> writes:

    > On Tue, 14 Dec 2004 10:40:56 -0500, Peter Hansen <> wrote:
    >> Keith Dart wrote:
    >> > Sigh, this reminds me of a discussion I had at my work once... It seems
    >> > that to write optimal Python code one must understand the various
    >> > probabilities of your data, and code according to the likely scenario.

    >>
    >> And this is different from optimizing in *any* other language
    >> in what way?

    >
    > In other languages, by the time you get the bloody thing working it's
    > time to ship, and you don't have to bother worrying about making it
    > optimal.


    +1 QOTW.

    <mike

    --
    Mike Meyer <> http://www.mired.org/home/mwm/
    Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.
     
    Mike Meyer, Dec 15, 2004
    #14
  15. Klaus Neuner

    Paul McGuire Guest

    "Klaus Neuner" <> wrote in message
    news:...
    > Hello,
    >
    > I need to gather information that is contained in various files.
    >
    > Like so:
    >
    > file1:
    > =====================
    > foo : 1 2
    > bar : 2 4
    > baz : 3
    > =====================
    >
    > file2:
    > =====================
    > foo : 5
    > bar : 6
    > baz : 7
    > =====================
    >
    > file3:
    > =====================
    > foo : 4 18
    > bar : 8
    > =====================
    >
    >
    > The straightforward way to solve this problem is to create a
    > dictionary. Like so:
    >
    >
    > [...]
    >
    > a, b = get_information(line)
    > if a in dict.keys():
    >     dict[a].append(b)
    > else:
    >     dict[a] = [b]
    >
    >
    > Yet, I have got 43 such files. Together they are 4.1 MB in size.
    > In the future, they will probably become much larger.
    > At the moment, the process takes several hours. As it is a process
    > that I have to run very often, I would like it to be faster.
    >
    > How could the problem be solved more efficiently?
    >
    >
    > Klaus


    You have gotten a number of suggestions on the relative improvements for
    updating your global dictionary of values. My business partner likens code
    optimization to lowering the water in a river. Performance bottlenecks
    stick out like rocks above the water line. Once you resolve one problem
    (remove the rock), you lower the water level, and the next rock/bottleneck
    appears. Have you looked at what is happening in your get_information
    method? If you are still taking long periods of time to scan through these
    files, you should look into what get_information is doing. In working with
    my pyparsing module, I've seen people scan multi-megabyte files in seconds,
    so taking hours to sift through 4 MB of data sounds like there may be other
    problems going on.
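
    (As an aside, a quick way to see where the time actually goes is the standard
    profiler. A minimal sketch, with hypothetical file names and a process_all()
    stand-in for the real parsing loop:)

    import cProfile

    def process_all(filenames):
        dct = {}
        for name in filenames:
            for line in open(name):
                if ":" not in line:
                    continue
                a, b = map(str.strip, line.split(":", 1))
                dct.setdefault(a, []).extend(b.split())
        return dct

    # Prints per-function timings, most expensive (cumulative) first.
    cProfile.run('process_all(["file1", "file2", "file3"])', sort="cumulative")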

    With input data this clean, something like:

    def get_information(line):
        return map(str.strip, line.split(":", 1))

    should do the trick. For that matter, you could get rid of the function
    call (calls are expensive in Python), and just inline this to:

    a, b = map(str.strip, line.split(":", 1))
    if a in dct:
        dct[a] += b.split()
    else:
        dct[a] = b.split()

    (I'm guessing you want to convert b values that have multiple numbers to a
    list, based on your "dict[a] = [b]" source line.)
    I also renamed dict to dct, per Fernando Perez's suggestion.
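
    Putting the pieces together, a complete pass over all of the input files
    might look roughly like this (a sketch; the glob pattern is a guess, so
    adjust it to however the 43 files are actually named):

    import glob

    dct = {}
    for filename in glob.glob("file*"):
        for line in open(filename):
            if ":" not in line:
                continue                 # skip the "=====" separator lines
            a, b = map(str.strip, line.split(":", 1))
            if a in dct:
                dct[a] += b.split()
            else:
                dct[a] = b.split()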

    -- Paul
     
    Paul McGuire, Dec 15, 2004
    #15
