My Big Dict.

Discussion in 'Python' started by Xavier, Jul 2, 2003.

  1. Xavier

    Xavier Guest

    Greetings,

    (do excuse the possibly comical subject text)

    I need advice on how I can convert a text db into a dict. Here is an
    example of what I need done.

    some example data lines in the text db go as follows:

    CODE1!DATA1 DATA2, DATA3
    CODE2!DATA1, DATA2 DATA3

    As you can see, the lines are dynamic and the data are not alike, they
    change in permission values (but that's obvious in any similar situation)

    Any idea on how I can convert 20,000+ lines of the above into the following
    protocol for use in my code?:

    TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}

    I was thinking of using AWK or something of a similar sort, but I just
    wanted to check with the list for any faster/sufficient hacks in Python
    to do such a task.

    Thanks.

    -- Xavier.

    oderint dum mutuant
    Xavier, Jul 2, 2003
    #1

  2. Hello,

    On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:

    > Greetings,
    >
    > (do excuse the possibly comical subject text)
    >
    > I need advice on how I can convert a text db into a dict. Here is an
    > example of what I need done.
    >
    > some example data lines in the text db go as follows:
    >
    > CODE1!DATA1 DATA2, DATA3
    > CODE2!DATA1, DATA2 DATA3
    >
    > As you can see, the lines are dynamic and the data are not alike, they
    > change in permission values (but that's obvious in any similar
    > situation)
    >
    > Any idea on how I can convert 20,000+ lines of the above into the
    > following protocol for use in my code?:
    >
    > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
    >
    > I was thinking of using AWK or something of a similar sort, but I
    > just wanted to check with the list for any faster/sufficient hacks
    > in Python to do such a task.


    If your data is in a string you can use a regular expression to parse
    each line, then the findall method returns a list of tuples containing
    the key and the value of each item. Finally the dict class can turn this
    list into a dict. For example:

    data_re = re.compile(r"^(\w+)!(.*)", re.MULTILINE)

    bigdict = dict(data_re.findall(data))

    On my computer the second line takes between 7 and 8 seconds to parse
    100000 lines.

    Try this:

    ------------------------------
    import re
    import time

    N = 100000

    print "Initialisation..."
    data = "".join(["CODE%d!DATA%d_1, DATA%d_2, DATA%d_3\n"%(i,i,i,i) for i
    in range(N)])

    data_re = re.compile(r"^(\w+)!(.*)", re.MULTILINE)

    print "Parsing..."
    start = time.time()
    bigdict = dict(data_re.findall(data))
    stop = time.time()

    print "%s items parsed in %s seconds"%(len(bigdict), stop-start)
    ------------------------------

    >
    > Thanks.
    >
    > -- Xavier.
    >
    > oderint dum mutuant
    >
    >
    >



    --

    (o_ Christophe Delord __o
    //\ http://christophe.delord.free.fr/ _`\<,_
    V_/_ mailto: (_)/ (_)
    Christophe Delord, Jul 2, 2003
    #2

  3. > "Christophe Delord" <> wrote in message
    > news:...
    > > Hello,
    > >
    > > On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:
    > >
    > > > Greetings,
    > > >
    > > > (do excuse the possibly comical subject text)
    > > >
    > > > I need advice on how I can convert a text db into a dict. Here is an
    > > > example of what I need done.
    > > >
    > > > some example data lines in the text db go as follows:
    > > >
    > > > CODE1!DATA1 DATA2, DATA3
    > > > CODE2!DATA1, DATA2 DATA3
    > > >
    > > > As you can see, the lines are dynamic and the data are not alike, they
    > > > change in permission values (but that's obvious in any similar
    > > > situation)
    > > >
    > > > Any idea on how I can convert 20,000+ lines of the above into the
    > > > following protocol for use in my code?:
    > > >
    > > > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
    > > >

    > >
    > > If your data is in a string you can use a regular expression to parse
    > > each line, then the findall method returns a list of tuples containing
    > > the key and the value of each item. Finally the dict class can turn this
    > > list into a dict. For example:

    >
    > and you can kill a fly with a sledgehammer. why not
    >
    > f = open('somefile.txt')
    > d = {}
    > l = f.readlines()
    > for i in l:
    >     a,b = i.split('!')
    >     d[a] = b.strip()
    >
    > or am i missing something obvious? (b/t/w the above parsed 20000+ lines
    > on a celeron 500 in less than a second.)


    Your code looks good Christophe. Just two little things to be aware of:
    1) if you use split like this, then each line must contain one and only one
    '!', which means (in particular) that empty lines will bomb, and also data
    must not contain any '!' or else you'll get an exception such as
    "ValueError: unpack list of wrong size". If your data may contain '!',
    then consider slicing up each line in a different way.
    2) if your file is really huge, then you may want to fill up your dictionary
    as you're reading the file, instead of reading everything in a list and then
    building your dictionary (hence using up twice the memory).
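
    To illustrate both points at once, here is a rough sketch (the filename
    'somefile.txt' is only a placeholder) that builds the dictionary while
    streaming the file, and that tolerates blank lines and extra '!'
    characters inside the data:

    d = {}
    f = open('somefile.txt')        # placeholder filename
    for line in f:                  # stream the file instead of readlines()
        line = line.strip()
        i = line.find('!')          # find() returns -1 instead of raising
        if i < 0:                   # skip empty or malformed lines
            continue
        d[line[:i]] = line[i+1:]    # everything after the first '!'
    f.close()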

    But apart from these details, I agree with Christophe that this is the way
    to go.

    Aurélien
    Aurélien Géron, Jul 2, 2003
    #3
  4. John Hunter

    John Hunter Guest

    >>>>> "Russell" == Russell Reagan <> writes:

    drs> f = open('somefile.txt')
    drs> d = {}
    drs> l = f.readlines()
    drs> for i in l:
    drs>     a,b = i.split('!')
    drs>     d[a] = b.strip()


    I would make one minor modification to this. If the file were *really
    long*, you could run into trouble trying to hold it in memory. I
    find the following a little cleaner (with python 2.2), and it doesn't
    require putting the whole file in memory. A file instance is an
    iterator (http://www.python.org/doc/2.2.1/whatsnew/node4.html) which
    will call readline as needed:

    d = {}
    for line in file('sometext.dat'):
        key,val = line.split('!')
        d[key] = val.strip()

    Or if you are not worried about putting it in memory, you can use list
    comprehensions for speed

    d = dict([ line.split('!') for line in file('somefile.text')])
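
    Note that this one-liner keeps the trailing newline on each value,
    unlike the loop above, which strips it. If that matters, one possible
    variant (still a sketch, same hypothetical filename) strips as it goes:

    d = dict([(k, v.strip()) for k, v in
              [line.split('!') for line in file('somefile.text')]])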

    Russell> I have just started learning Python, and I have never
    Russell> used dictionaries in Python, and despite the fact that
    Russell> you used mostly non-descriptive variable names, I can
    Russell> still read your code perfectly and know exactly what it
    Russell> does. I think I could use dictionaries now, just from
    Russell> looking at your code snippet. Python rules :)

    Truly.

    JDH
    John Hunter, Jul 2, 2003
    #4
  5. "Aurélien Géron" <> wrote in message news:<bdua4i$18el$>...
    > "drs" wrote...
    > > "Christophe Delord" <> wrote in message
    > > news:...
    > > > Hello,
    > > >
    > > > On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:

    <snip>
    > > > > I need advice on how I can convert a text db into a dict. Here is an
    > > > > example of what I need done.
    > > > >
    > > > > some example data lines in the text db goes as follows:
    > > > >
    > > > > CODE1!DATA1 DATA2, DATA3
    > > > > CODE2!DATA1, DATA2 DATA3

    <snip>
    > > > > Any idea on how I can convert 20,000+ lines of the above into the
    > > > > following protocol for use in my code?:
    > > > >
    > > > > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
    > > > >
    > > >
    > > > If your data is in a string you can use a regular expression to parse
    > > > each line, then the findall method returns a list of tuples containing
    > > > the key and the value of each item. Finally the dict class can turn this
    > > > list into a dict. For example:

    <example snipped>
    > >
    > > and you can kill a fly with a sledgehammer. why not
    > >
    > > f = open('somefile.txt')
    > > d = {}
    > > l = f.readlines()
    > > for i in l:
    > >     a,b = i.split('!')
    > >     d[a] = b.strip()

    <snip>
    > Your code looks good Christophe. Just two little things to be aware of:


    I think I'm right in saying that Christophe's approach was the one
    using the 're' module, which has been snipped, whereas the approach
    above using split was by "drs".

    > 1) if you use split like this, then each line must contain one and only one
    > '!', which means (in particular) that empty lines will bomb, and also data
    > must not contain any '!' or else you'll get an exception such as
    > "ValueError: unpack list of wrong size". If your data may contain '!',
    > then consider slicing up each line in a different way.


    If this is a problem, use a combination of the count and index methods
    to find the first '!', and use slices. For example, if you don't mind a
    two-line list comp:

    d=dict([(l[:l.index('!')],l[l.index('!')+1:-1])
            for l in file('test.txt') if l.count('!')])

    > 2) if your file is really huge, then you may want to fill up your dictionary
    > as you're reading the file, instead of reading everything in a list and then
    > building your dictionary (hence using up twice the memory).

    Agreed.

    The above list comprehension has the disadvantages that it counts the
    '!' characters in every line and that it reads the whole file in at
    once. Assuming there are going to be more data lines than not, this is
    much faster:

    d={}
    for l in file("test.txt"):
        try: i=l.index('!')
        except ValueError: continue
        d[l[:i]]=l[i+1:]

    It's often much faster to ask forgiveness than permission. I measure
    it about twice as fast as the 're' method, and about four times as
    fast as the list comp above.
    HTH,
    Paul

    >
    > But apart from these details, I agree with Christophe that this is the way
    > to go.
    >
    > Aurélien
    Paul Simmonds, Jul 2, 2003
    #5
  6. Paul Simmonds wrote:
    ....

    I'm not trying to intrude on this thread, but I was just
    struck by the list comprehension below, so this is
    about readability.

    > If this is a problem, use a combination of count and index methods to
    > find the first, and use slices. For example, if you don't mind
    > two-lined list comps:
    >
    > d=dict([(l[:l.index('!')],l[l.index('!')+1:-1])
    >         for l in file('test.txt') if l.count('!')])


    With every respect, this looks pretty much like another
    P-language. The pure existence of list comprehensions
    does not force you to use them everywhere :)

    ....

    compared to this:
    ....

    > d={}
    > for l in file("test.txt"):
    >     try: i=l.index('!')
    >     except ValueError: continue
    >     d[l[:i]]=l[i+1:]


    which is both faster in this case and easier to read.

    About speed: I'm not sure with the current Python
    version, but it might be worth trying to go without
    the exception:

    d={}
    for l in file("test.txt"):
        i=l.find('!')
        if i >= 0:
            d[l[:i]]=l[i+1:]

    and then you might even consider splitting on the first
    "!", but I didn't do any timings:

    d={}
    for l in file("test.txt"):
        try:
            key, value = l.split("!", 1)
        except ValueError: continue
        d[key] = value


    cheers -- chris

    --
    Christian Tismer :^) <mailto:>
    Mission Impossible 5oftware : Have a break! Take a ride on Python's
    Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
    14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
    work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776
    PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
    whom do you want to sponsor today? http://www.stackless.com/
    Christian Tismer, Jul 2, 2003
    #6
  7. Christian Tismer <> wrote in message news:<>...
    > Paul Simmonds wrote:
    > ...
    > I'm not trying to intrude this thread, but was just
    > struck by the list comprehension below, so this is
    > about readability.

    <snipped>
    > >
    > > d=dict([(l[:l.index('!')],l[l.index('!')+1:-1])
    > >         for l in file('test.txt') if l.count('!')])

    >
    > With every respect, this looks pretty much like another
    > P-language. The pure existance of list comprehensions
    > does not try to force you to use it everywhere :)
    >


    Quite right. I think that mutation came from the fact that I was
    thinking in C all day. Still, I don't even write C like that...it
    should be put to sleep ASAP.

    <snip>
    > > d={}
    > > for l in file("test.txt"):
    > >     try: i=l.index('!')
    > >     except ValueError: continue
    > >     d[l[:i]]=l[i+1:]

    >
    > About speed: I'm not sure with the current Python
    > version, but it might be worth trying to go without
    > the exception:
    >
    > d={}
    > for l in file("test.txt"):
    >     i=l.find('!')
    >     if i >= 0:
    >         d[l[:i]]=l[i+1:]
    >
    > and then you might even consider to split on the first
    > "!", but I didn't do any timings:
    >
    > d={}
    > for l in file("test.txt"):
    >     try:
    >         key, value = l.split("!", 1)
    >     except ValueError: continue
    >     d[key] = value
    >

    Just when you think you know a language, an optional argument you've
    never used pops up to make your life easier. Thanks for pointing that
    out.
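
    For anyone else who hadn't met it, the optional second argument is the
    maximum number of splits to perform, so only the first '!' counts:

    >>> "CODE1!DATA1 DATA2, DATA3!more".split("!", 1)
    ['CODE1', 'DATA1 DATA2, DATA3!more']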

    I've done some timings on the functions above, here are the results:

    Python 2.2.1, 200000-line file (all data lines)
    try/except with split: 3.08s
    if with slicing: 2.32s
    try/except with slicing: 2.34s

    So slicing seems quicker than split, and using if instead of
    try/except appears to speed it up a little more. I don't know how much
    faster the current version of the interpreter would be, but I doubt
    the ranking would change much.

    Paul
    Paul Simmonds, Jul 3, 2003
    #7
  8. Paul Simmonds wrote:

    [some alternative implementations]

    > I've done some timings on the functions above, here are the results:
    >
    > Python 2.2.1, 200000-line file (all data lines)
    > try/except with split: 3.08s
    > if with slicing: 2.32s
    > try/except with slicing: 2.34s
    >
    > So slicing seems quicker than split, and using if instead of
    > try/except appears to speed it up a little more. I don't know how much
    > faster the current version of the interpreter would be, but I doubt
    > the ranking would change much.


    Interesting. I doubt that split() itself is slow; instead
    I believe that the pure fact that you are calling a function
    instead of using a syntactic construct makes things slower,
    since method lookup is not so cheap. Unfortunately, split()
    cannot be cached in a local variable, since it is obtained
    as a new method of the line every time. On the other hand,
    the same holds for the find method...
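
    You can, however, cache the unbound method once and pass each string
    to it explicitly, which is what test_index2 in the attached program
    does with str.index. A tiny sketch:

    idx = str.index                # unbound method, looked up only once
    i = idx("CODE1!DATA1", '!')    # same as "CODE1!DATA1".index('!')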

    Well, I wrote a test program and figured out that the test
    results were very dependent on the order of calling the
    functions! This means the results are not independent,
    probably due to memory usage.
    Here some results on Win32, testing repeatedly...

    D:\slpdev\src\2.2\src\PCbuild>python -i \python22\py\testlines.py
    >>> test()
    function test_index for 200000 lines took 1.064 seconds.
    function test_find for 200000 lines took 1.402 seconds.
    function test_split for 200000 lines took 1.560 seconds.
    >>> test()
    function test_index for 200000 lines took 1.395 seconds.
    function test_find for 200000 lines took 1.502 seconds.
    function test_split for 200000 lines took 1.888 seconds.
    >>> test()
    function test_index for 200000 lines took 1.416 seconds.
    function test_find for 200000 lines took 1.655 seconds.
    function test_split for 200000 lines took 1.755 seconds.
    >>>


    For that reason, I added a command line mode for testing
    single functions, with these results:

    D:\slpdev\src\2.2\src\PCbuild>python \python22\py\testlines.py index
    function test_index for 200000 lines took 1.056 seconds.

    D:\slpdev\src\2.2\src\PCbuild>python \python22\py\testlines.py find
    function test_find for 200000 lines took 1.092 seconds.

    D:\slpdev\src\2.2\src\PCbuild>python \python22\py\testlines.py split
    function test_split for 200000 lines took 1.255 seconds.

    The results look much more reasonable; the index thing still
    seems to be optimum.

    Then I added another test, using the unbound str.index function,
    which was again a bit faster.
    Finally, I moved the try..except clause out of the game by using an
    explicit, restartable iterator; see the attached program.

    D:\slpdev\src\2.2\src\PCbuild>python \python22\py\testlines.py index3
    function test_index3 for 200000 lines took 0.997 seconds.

    As a side result, split seems to be unnecessarily slow.

    cheers - chris
    --
    Christian Tismer :^) <mailto:>
    Mission Impossible 5oftware : Have a break! Take a ride on Python's
    Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
    14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
    work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776
    PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
    whom do you want to sponsor today? http://www.stackless.com/


    import sys, time

    def test_index(data):
        d={}
        for l in data:
            try: i=l.index('!')
            except ValueError: continue
            d[l[:i]]=l[i+1:]
        return d

    def test_find(data):
        d={}
        for l in data:
            i=l.find('!')
            if i >= 0:
                d[l[:i]]=l[i+1:]
        return d

    def test_split(data):
        d={}
        for l in data:
            try:
                key, value = l.split("!", 1)
            except ValueError: continue
            d[key] = value
        return d

    def test_index2(data):
        d={}
        idx = str.index
        for l in data:
            try: i=idx(l, '!')
            except ValueError: continue
            d[l[:i]]=l[i+1:]
        return d

    def test_index3(data):
        d={}
        idx = str.index
        it = iter(data)
        while 1:
            try:
                for l in it:
                    i=idx(l, '!')
                    d[l[:i]]=l[i+1:]
                else:
                    return d
            except ValueError: continue


    def make_data(n=200000):
        return [ "this is some silly key %d!and that some silly value" % i
                 for i in xrange(n) ]

    def test(funcnames, n=200000):
        if sys.platform == "win32":
            default_timer = time.clock
        else:
            default_timer = time.time

        data = make_data(n)
        for name in funcnames.split():
            fname = "test_"+name
            f = globals()[fname]
            t = default_timer()
            f(data)
            t = default_timer() - t
            print "function %-10s for %d lines took %0.3f seconds." % (fname, n, t)

    if __name__ == "__main__":
        funcnames = "index find split index2 index3"
        if len(sys.argv) > 1:
            funcnames = " ".join(sys.argv[1:])
        test(funcnames)
    Christian Tismer, Jul 5, 2003
    #8
