My Big Dict.

Discussion in 'Python' started by Xavier, Jul 2, 2003.

  1. Xavier

    Xavier Guest

    Greetings,

    (do excuse the possibly comical subject text)

    I need advice on how I can convert a text db into a dict. Here is an
    example of what I need done.

    some example data lines in the text db go as follows:

    CODE1!DATA1 DATA2, DATA3
    CODE2!DATA1, DATA2 DATA3

    As you can see, the lines are dynamic and the data are not alike, they
    change in permission values (but that's obvious in any similar situation)

    Any idea on how I can convert 20,000+ lines of the above into the following
    protocol for use in my code?:

    TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}

    I was thinking of using AWK or something of a similar sort, but I just
    wanted to check with the list for any faster/sufficient hacks in Python
    to do such a task.

    Thanks.

    -- Xavier.

    oderint dum mutuant
    Xavier, Jul 2, 2003
    #1

  2. Hello,

    On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:

    > Greetings,
    >
    > (do excuse the possibly comical subject text)
    >
    > I need advice on how I can convert a text db into a dict. Here is an
    > example of what I need done.
    >
    > some example data lines in the text db go as follows:
    >
    > CODE1!DATA1 DATA2, DATA3
    > CODE2!DATA1, DATA2 DATA3
    >
    > As you can see, the lines are dynamic and the data are not alike, they
    > change in permission values (but that's obvious in any similar
    > situation)
    >
    > Any idea on how I can convert 20,000+ lines of the above into the
    > following protocol for use in my code?:
    >
    > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
    >
    > I was thinking of using AWK or something of a similar sort, but I
    > just wanted to check with the list for any faster/sufficient hacks
    > in Python to do such a task.


    If your data is in a string you can use a regular expression to parse
    each line, then the findall method returns a list of tuples containing
    the key and the value of each item. Finally the dict class can turn this
    list into a dict. For example:

    data_re = re.compile(r"^(\w+)!(.*)", re.MULTILINE)

    bigdict = dict(data_re.findall(data))

    On my computer the second line takes between 7 and 8 seconds to parse
    100000 lines.

    Try this:

    ------------------------------
    import re
    import time

    N = 100000

    print "Initialisation..."
    data = "".join(["CODE%d!DATA%d_1, DATA%d_2, DATA%d_3\n"%(i,i,i,i) for i
    in range(N)])

    data_re = re.compile(r"^(\w+)!(.*)", re.MULTILINE)

    print "Parsing..."
    start = time.time()
    bigdict = dict(data_re.findall(data))
    stop = time.time()

    print "%s items parsed in %s seconds"%(len(bigdict), stop-start)
    ------------------------------

    >
    > Thanks.
    >
    > -- Xavier.
    >
    > oderint dum mutuant
    >
    >
    >



    --

    (o_ Christophe Delord __o
    //\ http://christophe.delord.free.fr/ _`\<,_
    V_/_ mailto: (_)/ (_)
    Christophe Delord, Jul 2, 2003
    #2

  3. > "Christophe Delord" <> wrote in message
    > news:...
    > > Hello,
    > >
    > > On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:
    > >
    > > > Greetings,
    > > >
    > > > (do excuse the possibly comical subject text)
    > > >
    > > > I need advice on how I can convert a text db into a dict. Here is an
    > > > example of what I need done.
    > > >
    > > > some example data lines in the text db go as follows:
    > > >
    > > > CODE1!DATA1 DATA2, DATA3
    > > > CODE2!DATA1, DATA2 DATA3
    > > >
    > > > As you can see, the lines are dynamic and the data are not alike, they
    > > > change in permission values (but that's obvious in any similar
    > > > situation)
    > > >
    > > > Any idea on how I can convert 20,000+ lines of the above into the
    > > > following protocol for use in my code?:
    > > >
    > > > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
    > > >

    > >
    > > If your data is in a string you can use a regular expression to parse
    > > each line, then the findall method returns a list of tuples containing
    > > the key and the value of each item. Finally the dict class can turn this
    > > list into a dict. For example:

    >
    > and you can kill a fly with a sledgehammer. why not
    >
    > f = open('somefile.txt')
    > d = {}
    > l = f.readlines()
    > for i in l:
    >     a,b = i.split('!')
    >     d[a] = b.strip()
    >
    > or am i missing something obvious? (b/t/w the above parsed 20000+ lines
    > on a celeron 500 in less than a second.)


    Your code looks good Christophe. Just two little things to be aware of:
    1) if you use split like this, then each line must contain one and only one
    '!', which means (in particular) that empty lines will bomb, and also data
    must not contain any '!' or else you'll get an exception such as
    "ValueError: unpack list of wrong size". If your data may contain '!',
    then consider slicing up each line in a different way.
    2) if your file is really huge, then you may want to fill up your dictionary
    as you're reading the file, instead of reading everything in a list and then
    building your dictionary (hence using up twice the memory).
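
    To illustrate both points at once, here is a rough sketch (the filename
    'somefile.txt' is only a placeholder) that builds the dictionary while
    streaming the file, and that tolerates blank lines and extra '!'
    characters inside the data:

    d = {}
    f = open('somefile.txt')        # placeholder filename
    for line in f:                  # stream the file instead of readlines()
        line = line.strip()
        i = line.find('!')          # find() returns -1 instead of raising
        if i < 0:                   # skip empty or malformed lines
            continue
        d[line[:i]] = line[i+1:]    # everything after the first '!'
    f.close()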

    But apart from these details, I agree with Christophe that this is the way
    to go.

    Aurélien
    Aurélien Géron, Jul 2, 2003
    #3
  4. John Hunter

    John Hunter Guest

    >>>>> "Russell" == Russell Reagan <> writes:

    drs> f = open('somefile.txt')
    drs> d = {}
    drs> l = f.readlines()
    drs> for i in l:
    drs>     a,b = i.split('!')
    drs>     d[a] = b.strip()


    I would make one minor modification to this. If the file were *really
    long*, you could run into trouble trying to hold it in memory. I
    find the following a little cleaner (with python 2.2), and it doesn't
    require putting the whole file in memory. A file instance is an
    iterator (http://www.python.org/doc/2.2.1/whatsnew/node4.html) which
    will call readline as needed:

    d = {}
    for line in file('sometext.dat'):
        key,val = line.split('!')
        d[key] = val.strip()

    Or if you are not worried about putting it in memory, you can use list
    comprehensions for speed

    d = dict([ line.split('!') for line in file('somefile.text')])
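
    Note that this one-liner keeps the trailing newline on each value,
    unlike the loop above, which strips it. If that matters, one possible
    variant (still a sketch, same hypothetical filename) strips as it goes:

    d = dict([(k, v.strip()) for k, v in
              [line.split('!') for line in file('somefile.text')]])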

    Russell> I have just started learning Python, and I have never
    Russell> used dictionaries in Python, and despite the fact that
    Russell> you used mostly non-descriptive variable names, I can
    Russell> still read your code perfectly and know exactly what it
    Russell> does. I think I could use dictionaries now, just from
    Russell> looking at your code snippet. Python rules :)

    Truly.

    JDH
    John Hunter, Jul 2, 2003
    #4
  5. "Aurélien Géron" <> wrote in message news:<bdua4i$18el$>...
    > "drs" wrote...
    > > "Christophe Delord" <> wrote in message
    > > news:...
    > > > Hello,
    > > >
    > > > On Wed, 2 Jul 2003 00:13:26 -0400, Xavier wrote:

    <snip>
    > > > > I need advice on how I can convert a text db into a dict. Here is an
    > > > > example of what I need done.
    > > > >
    > > > > some example data lines in the text db goes as follows:
    > > > >
    > > > > CODE1!DATA1 DATA2, DATA3
    > > > > CODE2!DATA1, DATA2 DATA3

    <snip>
    > > > > Any idea on how I can convert 20,000+ lines of the above into the
    > > > > following protocol for use in my code?:
    > > > >
    > > > > TXTDB = {'CODE1': 'DATA1 DATA2, DATA3', 'CODE2': 'DATA1, DATA2 DATA3'}
    > > > >
    > > >
    > > > If your data is in a string you can use a regular expression to parse
    > > > each line, then the findall method returns a list of tuples containing
    > > > the key and the value of each item. Finally the dict class can turn this
    > > > list into a dict. For example:

    <example snipped>
    > >
    > > and you can kill a fly with a sledgehammer. why not
    > >
    > > f = open('somefile.txt')
    > > d = {}
    > > l = f.readlines()
    > > for i in l:
    > >     a,b = i.split('!')
    > >     d[a] = b.strip()

    <snip>
    > Your code looks good Christophe. Just two little things to be aware of:


    I think I'm right in saying that Christophe's approach was the one
    using the 're' module, which has been snipped, whereas the approach
    above using split was by "drs".

    > 1) if you use split like this, then each line must contain one and only one
    > '!', which means (in particular) that empty lines will bomb, and also data
    > must not contain any '!' or else you'll get an exception such as
    > "ValueError: unpack list of wrong size". If your data may contain '!',
    > then consider slicing up each line in a different way.


    If this is a problem, use a combination of the count and index methods
    to find the first '!', and use slices. For example, if you don't mind a
    two-line list comp:

    d=dict([(l[:l.index('!')],l[l.index('!')+1:-1])
            for l in file('test.txt') if l.count('!')])

    > 2) if your file is really huge, then you may want to fill up your dictionary
    > as you're reading the file, instead of reading everything in a list and then
    > building your dictionary (hence using up twice the memory).

    Agreed.

    The above list comprehension has the disadvantages that it counts the
    '!' characters in every line and that it reads the whole file in at
    once. Assuming there are going to be more data lines than not, this is
    much faster:

    d={}
    for l in file("test.txt"):
        try: i=l.index('!')
        except ValueError: continue
        d[l[:i]]=l[i+1:]

    It's often much faster to ask forgiveness than permission. I measure
    it about twice as fast as the 're' method, and about four times as
    fast as the list comp above.
    HTH,
    Paul

    >
    > But apart from these details, I agree with Christophe that this is the way
    > to go.
    >
    > Aurélien
    Paul Simmonds, Jul 2, 2003
    #5
  6. Paul Simmonds wrote:
    ....

    I'm not trying to intrude on this thread, but I was just
    struck by the list comprehension below, so this is
    about readability.

    > If this is a problem, use a combination of count and index methods to
    > find the first, and use slices. For example, if you don't mind
    > two-lined list comps:
    >
    > d=dict([(l[:l.index('!')],l[l.index('!')+1:-1])
    >         for l in file('test.txt') if l.count('!')])


    With every respect, this looks pretty much like another
    P-language. The pure existence of list comprehensions
    does not force you to use them everywhere :)

    ....

    compared to this:
    ....

    > d={}
    > for l in file("test.txt"):
    >     try: i=l.index('!')
    >     except ValueError: continue
    >     d[l[:i]]=l[i+1:]


    which is both faster in this case and easier to read.

    About speed: I'm not sure with the current Python
    version, but it might be worth trying to go without
    the exception:

    d={}
    for l in file("test.txt"):
        i=l.find('!')
        if i >= 0:
            d[l[:i]]=l[i+1:]

    and then you might even consider splitting on the first
    "!", but I didn't do any timings:

    d={}
    for l in file("test.txt"):
        try:
            key, value = l.split("!", 1)
        except ValueError: continue
        d[key] = value


    cheers -- chris

    --
    Christian Tismer :^) <mailto:>
    Mission Impossible 5oftware : Have a break! Take a ride on Python's
    Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
    14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
    work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776
    PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
    whom do you want to sponsor today? http://www.stackless.com/
    Christian Tismer, Jul 2, 2003
    #6
  7. Christian Tismer <> wrote in message news:<>...
    > Paul Simmonds wrote:
    > ...
    > I'm not trying to intrude this thread, but was just
    > struck by the list comprehension below, so this is
    > about readability.

    <snipped>
    > >
    > > d=dict([(l[:l.index('!')],l[l.index('!')+1:-1])
    > >         for l in file('test.txt') if l.count('!')])

    >
    > With every respect, this looks pretty much like another
    > P-language. The pure existance of list comprehensions
    > does not try to force you to use it everywhere :)
    >


    Quite right. I think that mutation came from the fact that I was
    thinking in C all day. Still, I don't even write C like that...it
    should be put to sleep ASAP.

    <snip>
    > > d={}
    > > for l in file("test.txt"):
    > >     try: i=l.index('!')
    > >     except ValueError: continue
    > >     d[l[:i]]=l[i+1:]

    >
    > About speed: I'm not sure with the current Python
    > version, but it might be worth trying to go without
    > the exception:
    >
    > d={}
    > for l in file("test.txt"):
    >     i=l.find('!')
    >     if i >= 0:
    >         d[l[:i]]=l[i+1:]
    >
    > and then you might even consider to split on the first
    > "!", but I didn't do any timings:
    >
    > d={}
    > for l in file("test.txt"):
    >     try:
    >         key, value = l.split("!", 1)
    >     except ValueError: continue
    >     d[key] = value
    >

    Just when you think you know a language, an optional argument you've
    never used pops up to make your life easier. Thanks for pointing that
    out.
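
    For anyone else who hadn't met it, the optional second argument is the
    maximum number of splits to perform, so only the first '!' counts:

    >>> "CODE1!DATA1 DATA2, DATA3!more".split("!", 1)
    ['CODE1', 'DATA1 DATA2, DATA3!more']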

    I've done some timings on the functions above, here are the results:

    Python 2.2.1, 200000-line file (all data lines)
    try/except with split: 3.08s
    if with slicing: 2.32s
    try/except with slicing: 2.34s

    So slicing seems quicker than split, and using if instead of
    try/except appears to speed it up a little more. I don't know how much
    faster the current version of the interpreter would be, but I doubt
    the ranking would change much.

    Paul
    Paul Simmonds, Jul 3, 2003
    #7
  8. Paul Simmonds wrote:

    [some alternative implementations]

    > I've done some timings on the functions above, here are the results:
    >
    > Python 2.2.1, 200000-line file (all data lines)
    > try/except with split: 3.08s
    > if with slicing: 2.32s
    > try/except with slicing: 2.34s
    >
    > So slicing seems quicker than split, and using if instead of
    > try/except appears to speed it up a little more. I don't know how much
    > faster the current version of the interpreter would be, but I doubt
    > the ranking would change much.


    Interesting. I doubt that split() itself is slow; instead
    I believe that the pure fact that you are calling a function
    instead of using a syntactic construct makes things slower,
    since method lookup is not so cheap. Unfortunately, split()
    cannot be cached in a local variable, since it is obtained
    as a new method of the line every time. On the other hand,
    the same holds for the find method...
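
    You can, however, cache the unbound method once and pass each string
    to it explicitly, which is what test_index2 in the attached program
    does with str.index. A tiny sketch:

    idx = str.index                # unbound method, looked up only once
    i = idx("CODE1!DATA1", '!')    # same as "CODE1!DATA1".index('!')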

    Well, I wrote a test program and figured out that the test
    results were very dependent on the order of calling the
    functions! This means the results are not independent,
    probably due to memory usage.
    Here some results on Win32, testing repeatedly...

    D:\slpdev\src\2.2\src\PCbuild>python -i \python22\py\testlines.py
    >>> test()
    function test_index for 200000 lines took 1.064 seconds.
    function test_find for 200000 lines took 1.402 seconds.
    function test_split for 200000 lines took 1.560 seconds.
    >>> test()
    function test_index for 200000 lines took 1.395 seconds.
    function test_find for 200000 lines took 1.502 seconds.
    function test_split for 200000 lines took 1.888 seconds.
    >>> test()
    function test_index for 200000 lines took 1.416 seconds.
    function test_find for 200000 lines took 1.655 seconds.
    function test_split for 200000 lines took 1.755 seconds.
    >>>


    For that reason, I added a command line mode for testing
    single functions, with these results:

    D:\slpdev\src\2.2\src\PCbuild>python \python22\py\testlines.py index
    function test_index for 200000 lines took 1.056 seconds.

    D:\slpdev\src\2.2\src\PCbuild>python \python22\py\testlines.py find
    function test_find for 200000 lines took 1.092 seconds.

    D:\slpdev\src\2.2\src\PCbuild>python \python22\py\testlines.py split
    function test_split for 200000 lines took 1.255 seconds.

    The results look much more reasonable; the index thing still
    seems to be optimum.

    Then I added another test, using the unbound str.index function,
    which was again a bit faster.
    Finally, I moved the try..except clause out of the game by using an
    explicit, restartable iterator; see the attached program.

    D:\slpdev\src\2.2\src\PCbuild>python \python22\py\testlines.py index3
    function test_index3 for 200000 lines took 0.997 seconds.

    As a side result, split seems to be unnecessarily slow.

    cheers - chris
    --
    Christian Tismer :^) <mailto:>
    Mission Impossible 5oftware : Have a break! Take a ride on Python's
    Johannes-Niemeyer-Weg 9a : *Starship* http://starship.python.net/
    14109 Berlin : PGP key -> http://wwwkeys.pgp.net/
    work +49 30 89 09 53 34 home +49 30 802 86 56 pager +49 173 24 18 776
    PGP 0x57F3BF04 9064 F4E1 D754 C2FF 1619 305B C09C 5A3B 57F3 BF04
    whom do you want to sponsor today? http://www.stackless.com/


    import sys, time

    def test_index(data):
        d={}
        for l in data:
            try: i=l.index('!')
            except ValueError: continue
            d[l[:i]]=l[i+1:]
        return d

    def test_find(data):
        d={}
        for l in data:
            i=l.find('!')
            if i >= 0:
                d[l[:i]]=l[i+1:]
        return d

    def test_split(data):
        d={}
        for l in data:
            try:
                key, value = l.split("!", 1)
            except ValueError: continue
            d[key] = value
        return d

    def test_index2(data):
        d={}
        idx = str.index
        for l in data:
            try: i=idx(l, '!')
            except ValueError: continue
            d[l[:i]]=l[i+1:]
        return d

    def test_index3(data):
        d={}
        idx = str.index
        it = iter(data)
        while 1:
            try:
                for l in it:
                    i=idx(l, '!')
                    d[l[:i]]=l[i+1:]
                else:
                    return d
            except ValueError: continue


    def make_data(n=200000):
        return [ "this is some silly key %d!and that some silly value" % i
                 for i in xrange(n) ]

    def test(funcnames, n=200000):
        if sys.platform == "win32":
            default_timer = time.clock
        else:
            default_timer = time.time

        data = make_data(n)
        for name in funcnames.split():
            fname = "test_"+name
            f = globals()[fname]
            t = default_timer()
            f(data)
            t = default_timer() - t
            print "function %-10s for %d lines took %0.3f seconds." % (fname, n, t)

    if __name__ == "__main__":
        funcnames = "index find split index2 index3"
        if len(sys.argv) > 1:
            funcnames = " ".join(sys.argv[1:])
        test(funcnames)
    Christian Tismer, Jul 5, 2003
    #8
