Efficient processing of large numeric data files

Discussion in 'Python' started by David Sanders, Jan 18, 2008.

  1. Hi,

    I am processing large files of numerical data. Each line is either a
    single (positive) integer, or a pair of positive integers, where the
    second represents the number of times that the first number is
    repeated in the data -- this is to avoid generating huge raw files,
    since one particular number is often repeated in the data generation
    step.
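
    For concreteness, a few hypothetical lines in this format (the values
    are made up) -- a bare "7" means that 7 occurred once, while
    "3 1000000" means that 3 was repeated 1000000 times:

    7
    3 1000000
    7
    12 5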

    My question is how to process such files efficiently to obtain a
    frequency histogram of the data (how many times each number occurs in
    the data, taking into account the repetitions). My current code is as
    follows:

    -------------------
    #!/usr/bin/env python
    # Counts the occurrences of integers in a file and makes a histogram of them
    # Allows for a second field which gives the number of counts of each datum

    import sys
    args = sys.argv
    num_args = len(args)

    if num_args < 2:
        print "Syntaxis: count.py archivo"
        sys.exit()

    name = args[1]
    file = open(name, "r")

    hist = {}   # dictionary for histogram
    num = 0

    for line in file:
        data = line.split()
        first = int(data[0])

        if len(data) == 1:
            count = 1
        else:
            count = int(data[1])   # more than one repetition

        if first in hist:   # add the information to the histogram
            hist[first] += count
        else:
            hist[first] = count

        num += count

    keys = hist.keys()
    keys.sort()

    print "# i fraction hist"
    for i in keys:
        print i, float(hist[i])/num, hist[i]
    ---------------------

    The data files are large (~100 million lines), and this code takes a
    long time to run (compared to just doing wc -l, for example).

    Am I doing something very inefficient? (Any general comments on my
    pythonic (or otherwise) style are also appreciated!) Is
    "line.split()" efficient, for example?

    Is a dictionary the right way to do this? In any given file, there is
    an upper bound on the data, so it seems to me that some kind of array
    (numpy?) would be more efficient, but the upper bound changes in each
    file.

    Thanks and best wishes,
    David.
     
    David Sanders, Jan 18, 2008
    #1

  2. On Jan 18, 12:15 pm, David Sanders <> wrote:

    > Hi,
    >
    > I am processing large files of numerical data. Each line is either a
    > single (positive) integer, or a pair of positive integers, where the
    > second represents the number of times that the first number is
    > repeated in the data -- this is to avoid generating huge raw files,
    > since one particular number is often repeated in the data generation
    > step.
    >
    > <snip code>
    >
    > Am I doing something very inefficient? (Any general comments on my
    > pythonic (or otherwise) style are also appreciated!) Is
    > "line.split()" efficient, for example?


    Without further information, I don't see anything particularly
    inefficient. What may help here is if you have any a priori knowledge
    about the data, specifically:

    - How often does a single number occur compared to a pair of numbers?
    E.g. if a single number is much more common than a pair, you can avoid
    split() most of the time:

    try: first, count = int(line), 1
    except ValueError:
        first, count = map(int, line.split())

    Similarly if the pair is much more frequent than the single number,
    just invert the above so that the common case is in the 'try' block
    and the infrequent one in 'except'. However if the two cases have
    similar frequency, or if you have no a priori knowledge, try/except
    will likely be slower.

    - What proportion of the first numbers is unique? If it's small
    enough, a faster way to update the dict is:

    try: hist[first] += count
    except KeyError:
        hist[first] = count

    > Is a dictionary the right way to do this? In any given file, there is
    > an upper bound on the data, so it seems to me that some kind of array
    > (numpy?) would be more efficient, but the upper bound changes in each
    > file.


    Yes, dict is the right data structure; since Python 2.5,
    collections.defaultdict is an alternative. numpy is good for
    processing numeric data once they are already in arrays, not for
    populating them.
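
    A minimal sketch of that defaultdict alternative, assuming Python 2.5
    (the file name is a placeholder, not from the original code):

    from collections import defaultdict

    file_name = "data.txt"                # placeholder input file
    hist = defaultdict(int)               # missing keys start at 0
    for line in open(file_name):
        data = line.split()
        count = int(data[1]) if len(data) == 2 else 1
        hist[int(data[0])] += count       # no explicit membership test needed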

    George
     
    George Sakkis, Jan 18, 2008
    #2

  3. David Sanders

    Matimus Guest

    On Jan 18, 9:15 am, David Sanders <> wrote:
    > Hi,
    >
    > I am processing large files of numerical data. Each line is either a
    > single (positive) integer, or a pair of positive integers, where the
    > second represents the number of times that the first number is
    > repeated in the data [...]
    >
    > <snip code>
    >
    > Is a dictionary the right way to do this? In any given file, there is
    > an upper bound on the data, so it seems to me that some kind of array
    > (numpy?) would be more efficient, but the upper bound changes in each
    > file.



    My first suggestion is to wrap your code in a function. Functions run
    much faster in python than module level code, so that will give you a
    speed up right away. My second suggestion is to look into using
    defaultdict for your histogram. A dictionary is a very appropriate way
    to store this data. There has been some mention of a bag type, which
    would do exactly what you need, but unfortunately there is no built-in
    bag type (yet). I would write it something like this:

    from collections import defaultdict

    def get_hist(file_name):
        hist = defaultdict(int)
        f = open(file_name, "r")
        for line in f:
            vals = line.split()
            val = int(vals[0])
            try:  # don't look to see if you will cause an error,
                  # just cause it and then deal with it
                cnt = int(vals[1])
            except IndexError:
                cnt = 1
            hist[val] += cnt
        return hist
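
    A hypothetical way to call this and print the normalized histogram in
    the same format as the original script (not part of the original post):

    import sys

    hist = get_hist(sys.argv[1])
    num = float(sum(hist.values()))
    print "# i fraction hist"
    for i in sorted(hist):
        print i, hist[i] / num, hist[i]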

    HTH

    Matt
     
    Matimus, Jan 18, 2008
    #3
  4. David Sanders

    Paul Rubin Guest

    David Sanders <> writes:
    > The data files are large (~100 million lines), and this code takes a
    > long time to run (compared to just doing wc -l, for example).


    wc is written in carefully optimized C and will almost certainly
    run faster than any python program.

    > Am I doing something very inefficient? (Any general comments on my
    > pythonic (or otherwise) style are also appreciated!) Is
    > "line.split()" efficient, for example?


    Your implementation's efficiency is not too bad. Stylistically it's
    not quite fluent but there's nothing to really criticize--you may
    develop a more concise style with experience, or maybe not.
    One small optimization you could make is to use collections.defaultdict
    to hold the counters instead of a regular dict, so you can get rid of
    the test for whether a key is in the dict.

    Keep an eye on your program's memory consumption as it runs. The
    overhead of a pair of python ints and a dictionary cell to hold them
    is some dozens of bytes at minimum. If you have a lot of distinct
    keys and not enough memory to hold them all in the large dict, your
    system may be thrashing. If that is happening, the two basic
    solutions are 1) buy more memory; or, 2) divide the input into smaller
    pieces, attack them separately, and merge the results.
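
    A minimal sketch of the merge step for option 2, assuming each piece
    has already been reduced to its own dict of counts (the function name
    here is made up, not from the thread):

    def merge_hists(partial_hists):
        total = {}
        for part in partial_hists:
            for key, cnt in part.iteritems():
                total[key] = total.get(key, 0) + cnt   # combine counts from each piece
        return total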

    If I were writing this program and didn't have to run it too often,
    I'd probably use the unix "sort" utility to sort the input (that
    utility does an external disk sort if the input is large enough to
    require it) then make a single pass over the sorted list to count up
    each group of keys (see itertools.groupby for a convenient way to do
    that).
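
    A minimal sketch of that final counting pass, assuming the input has
    already been sorted numerically (e.g. with "sort -n") and written to a
    hypothetical file sorted.txt:

    from itertools import groupby

    def first_and_count(line):
        data = line.split()
        return int(data[0]), (int(data[1]) if len(data) == 2 else 1)

    hist = {}
    pairs = (first_and_count(line) for line in open("sorted.txt"))
    for key, group in groupby(pairs, lambda p: p[0]):
        hist[key] = sum(cnt for _, cnt in group)   # total the repeat counts in each run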
     
    Paul Rubin, Jan 18, 2008
    #4
  5. On Fri, 18 Jan 2008 09:15:58 -0800 (PST), David Sanders
    <> declaimed the following in comp.lang.python:
    >
    > My question is how to process such files efficiently to obtain a
    > frequency histogram of the data (how many times each number occurs in
    > the data, taking into account the repetitions). My current code is as
    > follows:
    >

    No idea if using .get() is faster, but it may reduce some of the
    if-nesting...

    > -------------------
    > #!/usr/bin/env python
    > # Counts the occurrences of integers in a file and makes a histogram
    > of them
    > # Allows for a second field which gives the number of counts of each
    > datum
    >
    > import sys
    > args = sys.argv
    > num_args = len(args)
    >

    I might just use the fully qualified name sys.argv rather than a
    local reference...
    > if num_args < 2:


    import sys
    if len(sys.argv) < 2:

    > print "Syntaxis: count.py archivo"
    > sys.exit();
    >
    > name = args[1]
    > file = open(name, "r")
    >

    #don't use 'file' -- that is a built-in function
    fin = open(sys.argv[1], "r")

    > hist = {} # dictionary for histogram
    > num = 0
    >
    > for line in file:

    for line in fin:

    > data = line.split()
    > first = int(data[0])
    >

    data = [int(itm) for itm in line.split()] #convert all


    > if len(data) == 1:
    >     count = 1
    > else:
    >     count = int(data[1]) # more than one repetition
    >
    > if first in hist: # add the information to the histogram
    >     hist[first] += count
    > else:
    >     hist[first] = count
    >

    if len(data) == 1:
        hist[data[0]] = hist.get(data[0], 0) + 1
        num += 1
    else:
        hist[data[0]] = hist.get(data[0], 0) + data[1]
        num += data[1]

    > num += count
    >
    > keys = hist.keys()
    > keys.sort()
    >
    > print "# i fraction hist"
    > for i in keys:
    >     print i, float(hist[i])/num, hist[i]
    > ---------------------


    <snip>
    > Is a dictionary the right way to do this? In any given file, there is
    > an upper bound on the data, so it seems to me that some kind of array
    > (numpy?) would be more efficient, but the upper bound changes in each
    > file.


    Presuming the key integer is always positive, it might be possible
    to use a list with some logic to extend the length of the list as
    needed...

    hist = [0]   # initialize with index 0
    ....
    data = [int(itm) for itm in line.split()]
    if data[0] >= len(hist):
        hist.extend([0] * (data[0] - len(hist) + 1))
    if len(data) == 1:
        hist[data[0]] += 1
    else:
        hist[data[0]] += data[1]
    ....
    num = sum(hist)   # I think sum is a built-in
    ....
    for i, hcnt in enumerate(hist):
        if hcnt != 0:
            print i, float(hcnt)/num, hcnt

    # need to test if enumerate is faster than

    for i in xrange(len(hist)):
        if hist[i] != 0:
            print i, float(hist[i])/num, hist[i]
    --
    Wulfraed Dennis Lee Bieber KD6MOG

    HTTP://wlfraed.home.netcom.com/
    (Bestiaria Support Staff: )
    HTTP://www.bestiaria.com/
     
    Dennis Lee Bieber, Jan 18, 2008
    #5
  6. David Sanders

    Tim Chase Guest

    > for line in file:

    The first thing I would try is just doing a

    for line in file:
        pass

    to see how much time is consumed merely by iterating over the
    file. This should give you a baseline for your timings.
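
    A hypothetical way to take that baseline measurement (the file name is
    a placeholder):

    import time

    start = time.time()
    for line in open("data.txt"):          # placeholder file name
        pass                               # just iterate, do no other work
    print "bare iteration:", time.time() - start, "seconds"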

    > data = line.split()
    > first = int(data[0])
    >
    > if len(data) == 1:
    >     count = 1
    > else:
    >     count = int(data[1]) # more than one repetition


    Well, some experiments I might try:

    try:
        first, count = map(int, data)
    except:
        first = int(data[0])
        count = 1

    or possibly

    first = int(data[0])
    try:
        count = int(data[1])
    except:
        count = 1

    or even

    # pad it to contain at least two items,
    # then slice off the first two,
    # and then map() them through int()
    first, count = map(int, (data + [1])[:2])

    I don't know how efficient len() is (if it's internally linearly
    counting the items in data, or if it's caching the length as data
    is created/assigned/modified) and how that efficiency compares to
    try/except blocks, map() or int() calls.

    I'm not sure any of them is more or less "pythonic", but they
    should all do the same thing.

    > if first in hist: # add the information to the histogram
    >     hist[first] += count
    > else:
    >     hist[first] = count


    This might also be written as

    hist[first] = hist.get(first, 0) + count

    > Is a dictionary the right way to do this? In any given file, there is
    > an upper bound on the data, so it seems to me that some kind of array
    > (numpy?) would be more efficient, but the upper bound changes in each
    > file.


    I'm not sure an array would net you great savings here, since the
    upper-bound seems to be an unknown. If "first" has a known
    maximum (surely, the program generating this file has an idea to
    the range of allowed values), you could just create an array the
    length of the span of numbers, initialized to zero, which would
    reduce the hist.get() call to just

    hist[first] += count

    and then you'd iterate over hist (which would already be sorted
    because it's in index order) and use those where count != 0 to
    avoid the holes.
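
    If the program generating the data can report its maximum value, a
    hypothetical numpy version of that idea might look like this
    (max_value and the file name are assumptions, not from the thread):

    import numpy

    max_value = 1000000                      # assumed known upper bound on the data
    hist = numpy.zeros(max_value + 1, dtype=numpy.int64)

    for line in open("data.txt"):            # placeholder file name
        data = line.split()
        count = int(data[1]) if len(data) == 2 else 1
        hist[int(data[0])] += count

    num = float(hist.sum())
    print "# i fraction hist"
    for i in hist.nonzero()[0]:              # skip the holes
        print i, hist[i] / num, hist[i]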

    Otherwise, your code looks good...the above are just riffs on various
    ways of rewriting your code in case one nets you extra time-savings
    per loop.

    -tkc
     
    Tim Chase, Jan 18, 2008
    #6
  7. David Sanders

    Paul Rubin Guest

    Tim Chase <> writes:
    > first = int(data[0])
    > try:
    >     count = int(data[1])
    > except:
    >     count = 1


    By the time you're down to this kind of thing making a difference,
    it's probably more important to compile with pyrex or psyco.
     
    Paul Rubin, Jan 18, 2008
    #7
  8. On Fri, 18 Jan 2008 12:06:56 -0600, Tim Chase wrote:

    > I don't know how efficient len() is (if it's internally linearly
    > counting the items in data, or if it's caching the length as data is
    > created/assigned/modified)


    It depends on what argument you pass to len().

    Lists, tuples and dicts (and maybe strings?) know how long they are, so
    len() takes essentially constant time.

    Custom classes are free to define __len__ anyway they like.

    class MyList(list):
        """Just like the built-in list, but stupider."""
        def __len__(self):
            # Untested, for obvious reasons.
            import random
            import sys
            while True:
                guess = random.randrange(0, sys.maxint)
                count = 0
                for i in self:
                    count += 1
                if count == guess:
                    return count


    Caching __len__ is left as an exercise to the reader...



    --
    Steven
     
    Steven D'Aprano, Jan 18, 2008
    #8
  9. On Fri, 18 Jan 2008 09:58:57 -0800, Paul Rubin wrote:

    > David Sanders <> writes:
    >> The data files are large (~100 million lines), and this code takes a
    >> long time to run (compared to just doing wc -l, for example).

    >
    > wc is written in carefully optimized C and will almost certainly run
    > faster than any python program.


    However, wc -l doesn't do the same thing as what the Original Poster is
    trying to do. There is little comparison between counting the number of
    lines and building a histogram, except that both tasks have to see each
    line. Naturally the second task will take longer compared to wc.

    ("Why does it take so long to make a three-tier wedding cake? I can boil
    an egg in three minutes!!!")


    --
    Steven
     
    Steven D'Aprano, Jan 18, 2008
    #9
  10. David Sanders

    Guest

    Matt:

    > from collections import defaultdict
    >
    > def get_hist(file_name):
    >     hist = defaultdict(int)
    >     f = open(file_name, "r")
    >     for line in f:
    >         vals = line.split()
    >         val = int(vals[0])
    >         try:  # don't look to see if you will cause an error,
    >               # just cause it and then deal with it
    >             cnt = int(vals[1])
    >         except IndexError:
    >             cnt = 1
    >         hist[val] += cnt
    >     return hist



    But usually exceptions in tight loops slow down Python code, so the
    following is quite a bit faster (2.5 times faster with Psyco, about 2
    times faster without, on input where about 30% of the lines contain a
    space):

    import psyco
    from collections import defaultdict

    def get_hist(file_name):
        hist = defaultdict(int)

        for line in open(file_name):
            if " " in line:
                pair = line.split()
                hist[int(pair[0])] += int(pair[1])
            else:
                hist[int(line)] += 1

        return hist

    psyco.bind(get_hist)

    It doesn't work if lines may contain spurious spaces...

    Bye,
    bearophile
     
    , Jan 19, 2008
    #10
  11. David Sanders

    Guest

    ....and just for fun, this D code is about 3.2 times faster than the
    Psyco version for the same dataset (30% of lines with a space):


    import std.stdio, std.conv, std.string, std.stream;

    int[int] get_hist(string file_name) {
        int[int] hist;

        foreach (string line; new BufferedFile(file_name)) {
            int pos = find(line, ' ');
            if (pos == -1)
                hist[toInt(line)]++;
            else
                hist[toInt(line[0 .. pos])] += toInt(line[pos+1 .. $]);
        }

        return hist;
    }

    void main(string[] args) {
        writefln( get_hist(args[1]).length );
    }


    Bye,
    bearophile
     
    , Jan 19, 2008
    #11
  12. On Jan 18, 11:15 am, David Sanders <> wrote:
    > Hi,
    >
    > I am processing large files of numerical data. Each line is either a
    > single (positive) integer, or a pair of positive integers, where the
    > second represents the number of times that the first number is
    > repeated in the data [...]


    Many thanks to all for the very detailed and helpful replies. I'm
    glad to see I was on the right track, but more happy to have learnt
    some different approaches.

    Thanks and best wishes,
    David.
     
    David Sanders, Jan 19, 2008
    #12
  13. David Sanders

    Jorgen Grahn Guest

    On Fri, 18 Jan 2008 09:15:58 -0800 (PST), David Sanders <> wrote:
    > Hi,
    >
    > I am processing large files of numerical data. Each line is either a
    > single (positive) integer, or a pair of positive integers, where the
    > second represents the number of times that the first number is
    > repeated in the data -- this is to avoid generating huge raw files,
    > since one particular number is often repeated in the data generation
    > step.


    ....

    > The data files are large (~100 million lines), and this code takes a
    > long time to run (compared to just doing wc -l, for example).


    I don't know if you are in control of the *generation* of data, but
    I think it's often better and more convenient to pipe the raw data
    through 'gzip -c' (i.e. gzip-compress it before it hits the disk)
    than to figure out a smart application-specific compression scheme.

    Maybe if you didn't have a homegrown file format, there would have
    been readymade histogram utilities? Or at least a good reason to
    spend the time writing an optimized C version.
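
    For what it's worth, a minimal sketch of feeding such gzip-compressed
    data to the same kind of histogram loop from Python (the file name is
    a placeholder):

    import gzip

    hist = {}
    for line in gzip.open("data.txt.gz"):    # decompress transparently while reading
        data = line.split()
        count = int(data[1]) if len(data) == 2 else 1
        hist[int(data[0])] = hist.get(int(data[0]), 0) + count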

    /Jorgen

    --
    // Jorgen Grahn <grahn@ Ph'nglui mglw'nafh Cthulhu
    \X/ snipabacken.se> R'lyeh wgah'nagl fhtagn!
     
    Jorgen Grahn, Jan 20, 2008
    #13
