gather information from various files efficiently

Klaus Neuner

Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================


The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


Yet, I have got 43 such files. Together they are 4.1 MB in size.
In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

How could the problem be solved more efficiently?


Klaus
 
Keith Dart

Klaus said:
Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================


The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


Aye...

the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

Since you expect the key to be there most of the time, this method is
most efficient. You optimistically get the dictionary entry, and in the
exceptional case where it doesn't yet exist you add it.
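
[A minimal sketch of how this pattern might sit in the OP's file-reading loop; the filename pattern and the "key : values" line format are assumptions based on the sample files above:]

import glob

dct = {}
for name in glob.glob("file*"):                  # assumed filename pattern
    for line in open(name):
        line = line.strip()
        if not line or line.startswith("="):     # skip blank and separator lines
            continue
        a, b = [s.strip() for s in line.split(":", 1)]
        try:
            dct[a].extend(b.split())             # collect all values for this key
        except KeyError:
            dct[a] = b.split()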






--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <[email protected]>
public key: ID: F3D288E4
============================================================================
 
Fernando Perez

Keith said:
Aye...

the dict.keys() line creates a temporary list, and then the 'in' does a
linear search of the list. Better would be:

try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]

Since you expect the key to be there most of the time, this method is
most efficient. You optimistically get the dictionary entry, and in the
exceptional case where it doesn't yet exist you add it.


I wonder if

dct.setdefault(a,[]).append(b)

wouldn't be even faster. It saves setting up the try/except frame handling in
python (I assume the C implementation of dicts achieves similar results with
much less overhead).

Cheers,

f

ps. I changed dict->dct because it's generally a Bad Idea (TM) to name local
variables after builtin types. This is for the benefit of the OP (I know you were
just following his code conventions).
 
Keith Dart

Kent said:
Keith said:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]



or my favorite Python shortcut:
dict.setdefault(a, []).append(b)

Kent


Hey, when did THAT get in there? ;-) That's nice. However, the
try..except block is a useful pattern for many similar situations that
the OP might want to keep in mind. It is usually better than the
following, also:

if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]


Which is a pattern I have seen often.




--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <[email protected]>
vcard: <http://www.kdart.com/~kdart/kdart.vcf>
public key: ID: F3D288E4 URL: <http://www.kdart.com/~kdart/public.key>
============================================================================
 
Fredrik Lundh

Keith said:
try:
    dict[a].append(b)
except KeyError:
    dict[a] = [b]


the drawback here is that exceptions are relatively expensive; if the
number of collisions is small, you end up throwing and catching lots
of exceptions. in that case, there are better ways to do this.

dict.setdefault(a, []).append(b)

the drawback here is that you create a new object for each call, but
if the number of collisions is high, you end up throwing most of them
away. in that case, there are better ways to do this.

(gotta love that method name, btw. a serious candidate for the "most
confusing name in the standard library" contest... or maybe even the
"most confusing name in the history of python" contest...)
Hey, when did THAT get in there? ;-) That's nice. However, the try..except block is a useful
pattern for many similar situations that the OP might want to keep in mind. It is usually better
than the following, also:

if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]


the drawback here is that if the number of collisions is high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.
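
[One of those better ways, for readers on Python 2.5 or later (collections.defaultdict did not exist when this thread was written), is to let the dictionary build the missing list itself:]

from collections import defaultdict

dct = defaultdict(list)        # missing keys automatically get a fresh empty list
for a, b in [("foo", "1"), ("foo", "2"), ("bar", "6")]:    # toy data
    dct[a].append(b)           # one lookup, no exception handling, no throwaway lists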

</F>
 
Keith Dart

Fredrik said:
...
if dct.has_key(a):
    dct[a].append(b)
else:
    dct[a] = [b]



the drawback here is that if the number of collisions are high, you end
up doing lots of extra dictionary lookups. in that case, there are better
ways to do this.


Sigh, this reminds me of a discussion I had at my work once... It seems
that to write optimal Python code one must understand the various probabilities
of your data, and code according to the likely scenario. :cool: Now, perhaps
we could write an adaptive data analyzer-code-generator... ;-)







--
\/ \/
(O O)
-- --------------------oOOo~(_)~oOOo----------------------------------------
Keith Dart <[email protected]>
public key: ID: F3D288E4
============================================================================
 
Richie Hindle

[Keith]
Sigh, this reminds me of a discussion I had at my work once... It seems
that to write optimal Python code one must understand the various probabilities
of your data, and code according to the likely scenario. :cool:

s/Python //g
 
Peter Hansen

Keith said:
Sigh, this reminds me of a discussion I had at my work once... It seems
that to write optimal Python code one must understand the various probabilities
of your data, and code according to the likely scenario.

And this is different from optimizing in *any* other language
in what way?

-Peter
 
Simon Brunning

And this is different from optimizing in *any* other language
in what way?

In other languages, by the time you get the bloody thing working it's
time to ship, and you don't have to bother worrying about making it
optimal.
 
Steven Bethard

Klaus said:
The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


So I timed the three suggestions with a few different datasets:
> cat builddict.py
def askpermission(d, k, v):
    if k in d:
        d[k].append(v)
    else:
        d[k] = [v]

def askforgiveness(d, k, v):
    try:
        d[k].append(v)
    except KeyError:
        d[k] = [v]

def default(d, k, v):
    d.setdefault(k, []).append(v)

def test(items, func):
    d = {}
    for k, v in items:
        func(d, k, v)



Dataset where every value causes a collision:
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.62 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.58 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(100, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.03 msec per loop


Dataset where no value causes a collision:
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askpermission)"
100 loops, best of 3: 2.29 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.askforgiveness)"
100 loops, best of 3: 9.96 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.98 msec per loop


Dataset with only five distinct keys (so nearly every value causes a collision):
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askpermission)"
1000 loops, best of 3: 1.82 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.askforgiveness)"
1000 loops, best of 3: 1.79 msec per loop
> python -m timeit -s "import builddict as bd" "bd.test([(i % 5, i) for i in range(1000)], bd.default)"
100 loops, best of 3: 2.2 msec per loop


So, when there are lots of collisions, you may get a small benefit from
the try/except solution. If there are very few collisions, you would
probably prefer the if/else solution. The setdefault solution performs
about the same as the if/else solution, but is somewhat slower.

I will probably continue to use setdefault, because I think it's
prettier =) but if you're running into a speed bottleneck in this sort
of situation, you might consider one of the other solutions.

Steve
 
Jeff Shannon

Klaus said:
Yet, I have got 43 such files. Together they are 4.1 MB in size.
In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

Others have shown how you can make your dictionary code more efficient,
which should provide a big speed boost, especially if there are many
keys in your dicts.

However, if you're taking this long to read files each time, perhaps
there's a better high-level approach than just a brute-force scan of
every file every time. You don't say anything about where those files
are coming from, or how they're created. Are they relatively static?
(That is to say, are they (nearly) the same files being read on each
run?) Do you control the process that creates the files? Given the
right conditions, you may be able to store your data in a shelve, or
even a proper database, saving you lots of time parsing through these
files on each run. Even if it's entirely new data on each run, you may
be able to find a more efficient way of transferring data from whatever
the source is into your program.
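
[A minimal sketch of the shelve idea, with made-up names; dct stands in for the dictionary built by the parsing code:]

import shelve

dct = {"foo": ["1", "2", "5"], "bar": ["2", "4", "6"]}   # stands in for the parsed data

# One-time (or whenever the source files change): store the parsed results.
db = shelve.open("gathered_data")        # made-up shelf filename
for key, values in dct.items():
    db[key] = values
db.close()

# Later runs can reopen the shelf instead of re-parsing all 43 files.
db = shelve.open("gathered_data")
foo_values = db.get("foo", [])
db.close()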

Jeff Shannon
Technician/Programmer
Credit International
 
Mike Meyer

Simon Brunning said:
In other languages, by the time you get the bloody thing working it's
time to ship, and you don't have to bother worrying about making it
optimal.

+1 QOTW.

<mike
 
Paul McGuire

Klaus Neuner said:
Hello,

I need to gather information that is contained in various files.

Like so:

file1:
=====================
foo : 1 2
bar : 2 4
baz : 3
=====================

file2:
=====================
foo : 5
bar : 6
baz : 7
=====================

file3:
=====================
foo : 4 18
bar : 8
=====================


The straightforward way to solve this problem is to create a
dictionary. Like so:


[...]

a, b = get_information(line)
if a in dict.keys():
    dict[a].append(b)
else:
    dict[a] = [b]


Yet, I have got 43 such files. Together they are 4.1 MB in size.
In the future, they will probably become much larger.
At the moment, the process takes several hours. As it is a process
that I have to run very often, I would like it to be faster.

How could the problem be solved more efficiently?


Klaus


You have gotten a number of suggestions on the relative improvements for
updating your global dictionary of values. My business partner likens code
optimization to lowering the water in a river. Performance bottlenecks
stick out like rocks in a river. Once you resolve one problem
(remove the rock), you lower the water level, and the next rock/bottleneck
appears. Have you looked at what is happening in your get_information
method? If you are still taking long periods of time to scan through these
files, you should look into what get_information is doing. In working with
my pyparsing module, I've seen people scan multimegabyte files in seconds,
so taking hours to sift through 4 MB of data sounds like there may be other
problems going on.

With this clean a code input, something like:

def get_information(line):
    return map(str.strip, line.split(":", 1))

should do the trick. For that matter, you could get rid of the function
call (calls are expensive in Python), and just inline this to :

a, b = map(str.strip, line.split(":", 1))
if a in dct:
    dct[a] += b.split()
else:
    dct[a] = b.split()

(I'm guessing you want to convert b values that have multiple numbers to a
list, based on your "dict[a] = [b]" source line.)
I also renamed dict to dct, per Fernando Perez's suggestion.
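
[For example, assuming a line that looks like the samples above, the inlined parse behaves like this:]

line = "foo : 4 18"
a, b = map(str.strip, line.split(":", 1))
# a == "foo", b == "4 18", and b.split() == ["4", "18"]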

-- Paul
 
