Efficient processing of large nuumeric data file

David Sanders · Jan 18, 2008

Hi,

I am processing large files of numerical data. Each line is either a
single (positive) integer, or a pair of positive integers, where the
second represents the number of times that the first number is
repeated in the data -- this is to avoid generating huge raw files,
since one particular number is often repeated in the data generation
step.

My question is how to process such files efficiently to obtain a
frequency histogram of the data (how many times each number occurs in
the data, taking into account the repetitions). My current code is as
follows:

-------------------
#!/usr/bin/env python
# Counts the occurrences of integers in a file and makes a histogram
of them
# Allows for a second field which gives the number of counts of each
datum

import sys
args = sys.argv
num_args = len(args)

if num_args < 2:
print "Syntaxis: count.py archivo"
sys.exit();

name = args[1]
file = open(name, "r")

hist = {} # dictionary for histogram
num = 0

for line in file:
data = line.split()
first = int(data[0])

if len(data) == 1:
count = 1
else:
count = int(data[1]) # more than one repetition

if first in hist: # add the information to the histogram
hist[first]+=count
else:
hist[first]=count

num+=count

keys = hist.keys()
keys.sort()

print "# i fraction hist"
for i in keys:
print i, float(hist)/num, hist
---------------------

The data files are large (~100 million lines), and this code takes a
long time to run (compared to just doing wc -l, for example).

Am I doing something very inefficient? (Any general comments on my
pythonic (or otherwise) style are also appreciated!) Is
"line.split()" efficient, for example?

Is a dictionary the right way to do this? In any given file, there is
an upper bound on the data, so it seems to me that some kind of array
(numpy?) would be more efficient, but the upper bound changes in each
file.

Thanks and best wishes,
David.

George Sakkis · Jan 18, 2008

Hi,

I am processing large files of numerical data. Each line is either a
single (positive) integer, or a pair of positive integers, where the
second represents the number of times that the first number is
repeated in the data -- this is to avoid generating huge raw files,
since one particular number is often repeated in the data generation
step.

My question is how to process such files efficiently to obtain a
frequency histogram of the data (how many times each number occurs in
the data, taking into account the repetitions). My current code is as
follows:

-------------------
#!/usr/bin/env python
# Counts the occurrences of integers in a file and makes a histogram
of them
# Allows for a second field which gives the number of counts of each
datum

import sys
args = sys.argv
num_args = len(args)

if num_args < 2:
print "Syntaxis: count.py archivo"
sys.exit();

name = args[1]
file = open(name, "r")

hist = {} # dictionary for histogram
num = 0

for line in file:
data = line.split()
first = int(data[0])

if len(data) == 1:
count = 1
else:
count = int(data[1]) # more than one repetition

if first in hist: # add the information to the histogram
hist[first]+=count
else:
hist[first]=count

num+=count

keys = hist.keys()
keys.sort()

print "# i fraction hist"
for i in keys:
print i, float(hist)/num, hist
---------------------

The data files are large (~100 million lines), and this code takes a
long time to run (compared to just doing wc -l, for example).

Am I doing something very inefficient? (Any general comments on my
pythonic (or otherwise) style are also appreciated!) Is
"line.split()" efficient, for example?

Without further information, I don't see anything particularly
inefficient. What may help here is if you have any a priori knowledge
about the data, specifically:

- How often does a single number occur compared to a pair of numbers ?
E.g. if a single number is much more common that a pair, you can avoid
split() most of the times:
try: first,count = int(line), 1
except ValueError:
first,count = map(int,line.split())

Similarly if the pair is much more frequent than the single number,
just invert the above so that the common case is in the 'try' block
and the infrequent in 'except'. However if the two cases have similar
frequency, or if you have no a priori knowledge, try/except will
likely be slower.

- What proportion of the first numbers is unique ? If it's small
enough, a faster way to update the dict is:
try: hist[first]+=count
except KeyError:
hist[first] = 1

Is a dictionary the right way to do this? In any given file, there is
an upper bound on the data, so it seems to me that some kind of array
(numpy?) would be more efficient, but the upper bound changes in each
file.

Click to expand...

Yes, dict is the right data structure; since Python 2.5,
collections.defaultdict is an alternative. numpy is good for
processing numeric data once they are already in arrays, not for
populating them.

George

Matimus · Jan 18, 2008

Hi,

I am processing large files of numerical data. Each line is either a
single (positive) integer, or a pair of positive integers, where the
second represents the number of times that the first number is
repeated in the data -- this is to avoid generating huge raw files,
since one particular number is often repeated in the data generation
step.

My question is how to process such files efficiently to obtain a
frequency histogram of the data (how many times each number occurs in
the data, taking into account the repetitions). My current code is as
follows:

-------------------
#!/usr/bin/env python
# Counts the occurrences of integers in a file and makes a histogram
of them
# Allows for a second field which gives the number of counts of each
datum

import sys
args = sys.argv
num_args = len(args)

if num_args < 2:
print "Syntaxis: count.py archivo"
sys.exit();

name = args[1]
file = open(name, "r")

hist = {} # dictionary for histogram
num = 0

for line in file:
data = line.split()
first = int(data[0])

if len(data) == 1:
count = 1
else:
count = int(data[1]) # more than one repetition

if first in hist: # add the information to the histogram
hist[first]+=count
else:
hist[first]=count

num+=count

keys = hist.keys()
keys.sort()

print "# i fraction hist"
for i in keys:
print i, float(hist)/num, hist
---------------------

The data files are large (~100 million lines), and this code takes a
long time to run (compared to just doing wc -l, for example).

Am I doing something very inefficient? (Any general comments on my
pythonic (or otherwise) style are also appreciated!) Is
"line.split()" efficient, for example?

Is a dictionary the right way to do this? In any given file, there is
an upper bound on the data, so it seems to me that some kind of array
(numpy?) would be more efficient, but the upper bound changes in each
file.

My first suggestion is to wrap your code in a function. Functions run
much faster in python than module level code, so that will give you a
speed up right away. My second suggestion is to look into using
defaultdict for your histogram. A dictionary is a very appropriate way
to store this data. There has been some mention of a bag type, which
would do exactly what you need, but unfortunately there is not a built
in bag type (yet). I would write it something like this:

from collections import defaultdict

def get_hist(file_name):
hist = defaultdict(int)
f = open(filename,"r")
for line in f:
vals = line.split()
val = int(vals[0])
try: # don't look to see if you will cause an error,
# just cause it and then deal with it
cnt = int(vals[1])
except IndexError:
cnt = 1
hist[val] += cnt
return hist

HTH

Matt

Paul Rubin · Jan 18, 2008

David Sanders said:
The data files are large (~100 million lines), and this code takes a
long time to run (compared to just doing wc -l, for example).

wc is written in carefully optimized C and will almost certainly
run faster than any python program.

Am I doing something very inefficient? (Any general comments on my
pythonic (or otherwise) style are also appreciated!) Is
"line.split()" efficient, for example?

Your implementation's efficiency is not too bad. Stylistically it's
not quite fluent but there's nothing to really criticize--you may
develop a more concise style with experience, or maybe not.
One small optimization you could make is to use collections.defaultdict
to hold the counters instead of a regular dict, so you can get rid of
the test for whether a key is in the dict.

Keep an eye on your program's memory consumption as it runs. The
overhead of a pair of python ints and a dictionary cell to hold them
is some dozens of bytes at minimum. If you have a lot of distinct
keys and not enough memory to hold them all in the large dict, your
system may be thrashing. If that is happening, the two basic
solutions are 1) buy more memory; or, 2) divide the input into smaller
pieces, attack them separately, and merge the results.

If I were writing this program and didn't have to run it too often,
I'd probably use the unix "sort" utility to sort the input (that
utility does an external disk sort if the input is large enough to
require it) then make a single pass over the sorted list to count up
each group of keys (see itertools.groupby for a convenient way to do
that).

Dennis Lee Bieber · Jan 18, 2008

My question is how to process such files efficiently to obtain a
frequency histogram of the data (how many times each number occurs in
the data, taking into account the repetitions). My current code is as
follows:

No idea if the using .get() is faster, but it may reduce some if
nesting...

-------------------
#!/usr/bin/env python
# Counts the occurrences of integers in a file and makes a histogram
of them
# Allows for a second field which gives the number of counts of each
datum

import sys
args = sys.argv
num_args = len(args)

I might just use the full qualified name sys.argv rather than a
local reference...

if num_args < 2:

import sys
if len(sys.argv) < 2:

print "Syntaxis: count.py archivo"
sys.exit();

name = args[1]
file = open(name, "r")

#don't use 'file' -- that is a built-in function
fin = open(sys.argv[1], "r")

hist = {} # dictionary for histogram
num = 0

for line in file:

for line in fin:

data = line.split()
first = int(data[0])

data = [int(itm) for itm in line.split()] #convert all

if len(data) == 1:
count = 1
else:
count = int(data[1]) # more than one repetition

if first in hist: # add the information to the histogram
hist[first]+=count
else:
hist[first]=count

if len(data) == 1:
hist[data[0]] = hist.get(data[0], 0) + 1
num += 1
else:
hist[data[0]] = hist.get(data[0], 0) + data[1]
num += data[1]

num+=count

keys = hist.keys()
keys.sort()

print "# i fraction hist"
for i in keys:
print i, float(hist)/num, hist
---------------------

said:

Is a dictionary the right way to do this? In any given file, there is
an upper bound on the data, so it seems to me that some kind of array
(numpy?) would be more efficient, but the upper bound changes in each
file.

Click to expand...

Presuming the key integer is always positive, it might be possible
to use a list with some logic to extend the length of the list as
needed...

hist = [0] #initialize with index 0
....
data = [int(itm) for itm in line.split()]
if data[0] >= len(hist):
hist.extend([0] * (data[0] - len(hist) + 1)
if len(data) == 1:
hist[data[0]] += 1
else:
hist[data[0]] += data[1]
....
num = sum(hist) #I think sum is a built-in
....
for i, hcnt in enumerate(hist):
if hcnt != 0:
print i, float(hcnt)/num, hcnt

#need to test if enumerate is faster than

for i in xrange(len(hist)):
if hist != 0:
print i, float(hist)/num, hist
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/

Tim Chase · Jan 18, 2008

for line in file:

The first thing I would try is just doing a

for line in file:
pass

to see how much time is consumed merely by iterating over the
file. This should give you a baseline from which you can base
your timings

data = line.split()
first = int(data[0])

if len(data) == 1:
count = 1
else:
count = int(data[1]) # more than one repetition

Well, some experiments I might try:

try:
first, count = map(int, data)
except:
first = int(data[0])
count = 1

or possibly

first = int(data[0])
try:
count = int(data[1])
except:
count = 0

or even

# pad it to contain at least two items
# then slice off the first two
# and then map() calls to int()
first, count = map(int,(data + [1])[:2])

I don't know how efficient len() is (if it's internally linearly
counting the items in data, or if it's caching the length as data
is created/assigned/modifed) and how that efficiency compares to
try/except blocks, map() or int() calls.

I'm not sure any of them is more or less "pythonic", but they
should all do the same thing.

if first in hist: # add the information to the histogram
hist[first]+=count
else:
hist[first]=count

This might also be written as

hist[first] = hist.get(first, 0) + count

Is a dictionary the right way to do this? In any given file, there is
an upper bound on the data, so it seems to me that some kind of array
(numpy?) would be more efficient, but the upper bound changes in each
file.

I'm not sure an array would net you great savings here, since the
upper-bound seems to be an unknown. If "first" has a known
maximum (surely, the program generating this file has an idea to
the range of allowed values), you could just create an array the
length of the span of numbers, initialized to zero, which would
reduce the hist.get() call to just

hist[first] += count

and then you'd iterate over hist (which would already be sorted
because it's in index order) and use those where count != 0 to
avoid the holes.

Otherwise, your code looks good...the above just riff on various
ways of rewriting your code in case one nets you extra
time-savings per loop.

-tkc

Paul Rubin · Jan 18, 2008

Tim Chase said:
first = int(data[0])
try:
count = int(data[1])
except:
count = 0

By the time you're down to this kind of thing making a difference,
it's probably more important to compile with pyrex or psyco.

Steven D'Aprano · Jan 18, 2008

I don't know how efficient len() is (if it's internally linearly
counting the items in data, or if it's caching the length as data is
created/assigned/modifed)

It depends on what argument you pass to len().

Lists, tuples and dicts (and maybe strings?) know how long they are, so
len() takes essentially constant time.

Custom classes are free to define __len__ anyway they like.

class MyList(list):
"""Just like the built-in list, but stupider."""
def __len__(self):
# Untested, for obvious reasons.
import random
import sys
while True:
guess = random.randrange(0, sys.maxint)
count = 0
for i in self:
count += 1
if count == guess:
return count

Caching __len__ is left as an exercise to the reader...

Steven D'Aprano · Jan 18, 2008

wc is written in carefully optimized C and will almost certainly run
faster than any python program.

However, wc -l doesn't do the same thing as what the Original Poster is
trying to do. There is little comparison between counting the number of
lines and building a histogram, except that both tasks have to see each
line. Naturally the second task will take longer compared to wc.

("Why does it take so long to make a three-tier wedding cake? I can boil
an egg in three minutes!!!")

bearophileHUGS · Jan 18, 2008

Matt:

from collections import defaultdict

def get_hist(file_name):
hist = defaultdict(int)
f = open(filename,"r")
for line in f:
vals = line.split()
val = int(vals[0])
try: # don't look to see if you will cause an error,
# just cause it and then deal with it
cnt = int(vals[1])
except IndexError:
cnt = 1
hist[val] += cnt
return hist

But usually in tight loops exceptions slow down the Python code, so
this is quite faster (2.5 times faster with Psyco, about 2 times
without, with about 30% of lines with a space in it):

import psyco
from collections import defaultdict

def get_hist(file_name):
hist = defaultdict(int)

for line in open(file_name):
if " " in line:
pair = line.split()
hist[int(pair[0])] += int(pair[1])
else:
hist[int(line)] += 1

return hist

psyco.bind(get_hist)

It doesn't work if lines may contain spurious spaces...

Bye,
bearophile

bearophileHUGS · Jan 18, 2008

....and just for fun this D code is about 3.2 times faster than the
Psyco version for the same dataset (30% lines with a space):

import std.stdio, std.conv, std.string, std.stream;

int[int] get_hist(string file_name) {
int[int] hist;

foreach(string line; new BufferedFile(file_name)) {
int pos = find(line, ' ');
if (pos == -1)
hist[toInt(line)]++;
else
hist[toInt(line[0 .. pos])] += toInt(line[pos+1 .. $]);
}

return hist;
}

void main(string[] args) {
writefln( get_hist(args[1]).length );
}

Bye,
bearophile

David Sanders · Jan 19, 2008

Hi,

I am processing large files of numerical data. Each line is either a
single (positive) integer, or a pair of positive integers, where the
second represents the number of times that the first number is
repeated in the data -- this is to avoid generating huge raw files,
since one particular number is often repeated in the data generation
step.

My question is how to process such files efficiently to obtain a
frequency histogram of the data (how many times each number occurs in
the data, taking into account the repetitions). My current code is as
follows:

Many thanks to all for the very detailed and helpful replies. I'm
glad to see I was on the right track, but more happy to have learnt
some different approaches.

Thanks and best wishes,
David.

Jorgen Grahn · Jan 20, 2008

Hi,

I am processing large files of numerical data. Each line is either a
single (positive) integer, or a pair of positive integers, where the
second represents the number of times that the first number is
repeated in the data -- this is to avoid generating huge raw files,
since one particular number is often repeated in the data generation
step.

My question is how to process such files efficiently to obtain a
frequency histogram of the data (how many times each number occurs in
the data, taking into account the repetitions). My current code is as
follows:
....

The data files are large (~100 million lines), and this code takes a
long time to run (compared to just doing wc -l, for example).

I don't know if you are in control of the *generation* of data, but
I think it's often better and more convenient to pipe the raw data
through 'gzip -c' (i.e. gzip-compress it before it hits the disk)
than to figure out a smart application-specific compression scheme.

Maybe if you didn't have a homegrown file format, there would have
been readymade histogram utilities? Or at least a good reason to
spend the time writing an optimized C version.

/Jorgen

What risks of data corruption are reduced by divide large size PST files?	0	Oct 13, 2025
Why should I split large MBOX files?	0	Mar 31, 2026
How do I fix Outlook PST file errors quickly?	0	Apr 15, 2026
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023
lists as an efficient implementation of large two-dimensionalarrays(!)	0	Feb 2, 2010
How to Convert PST to EML File Easily?	5	Mar 26, 2026
Processing large CSV files - how to maximise throughput?	11	Oct 24, 2013
Unable to Split Password-Protected PST File Help Needed	0	Feb 28, 2026

Efficient processing of large nuumeric data file

David Sanders

George Sakkis

Matimus

Paul Rubin

Dennis Lee Bieber

Tim Chase

Paul Rubin

Steven D'Aprano

Steven D'Aprano

bearophileHUGS

bearophileHUGS

David Sanders

Jorgen Grahn

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads