number of different lines in a file

Discussion in 'Python' started by r.e.s., May 18, 2006.

  1. r.e.s.

    r.e.s. Guest

    I have a million-line text file with 100 characters per line,
    and simply need to determine how many of the lines are distinct.

    On my PC, this little program just goes to never-never land:

    def number_distinct(fn):
        f = file(fn)
        x = f.readline().strip()
        L = []
        while x<>'':
            if x not in L:
                L = L + [x]
            x = f.readline().strip()
        return len(L)

    Would anyone care to point out improvements?
    Is there a better algorithm for doing this?
    r.e.s., May 18, 2006
    #1

  2. Larry Bates

    Larry Bates Guest

    r.e.s. wrote:
    > I have a million-line text file with 100 characters per line,
    > and simply need to determine how many of the lines are distinct.
    >
    > On my PC, this little program just goes to never-never land:
    >
    > def number_distinct(fn):
    >     f = file(fn)
    >     x = f.readline().strip()
    >     L = []
    >     while x<>'':
    >         if x not in L:
    >             L = L + [x]
    >         x = f.readline().strip()
    >     return len(L)
    >
    > Would anyone care to point out improvements?
    > Is there a better algorithm for doing this?


    Sounds like homework, but I'll bite.

    def number_distinct(fn):
        hash_dict = {}
        total_lines = 0
        for line in open(fn, 'r'):
            total_lines += 1
            key = hash(line.strip())
            if hash_dict.has_key(key): continue
            hash_dict[key] = 1
        return total_lines, len(hash_dict.keys())

    if __name__ == "__main__":
        fn = 'c:\\test.txt'
        total_lines, distinct_lines = number_distinct(fn)
        print "Total lines=%i, distinct lines=%i" % (total_lines, distinct_lines)
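
    A hedged variant of the same idea: in rare cases two different lines can
    share a hash() value, which would make the count come out slightly low,
    so one can key the dictionary on the stripped line itself instead (a
    sketch added here, not from the original post):

    def number_distinct2(fn):
        total_lines = 0
        seen = {}
        for line in open(fn, 'r'):
            total_lines += 1
            seen[line.strip()] = 1   # the line itself is the key, so two distinct lines never merge
        return total_lines, len(seen)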


    -Larry Bates
    Larry Bates, May 18, 2006
    #2

  3. Ben Finney

    Ben Finney Guest

    "r.e.s." <> writes:

    > I have a million-line text file with 100 characters per line,
    > and simply need to determine how many of the lines are distinct.


    I'd generalise it by allowing the caller to pass any iterable set of
    items. A file handle can be iterated this way, but so can any
    sequence or iterable.

    def count_distinct(seq):
        """ Count the number of distinct items """
        counts = dict()
        for item in seq:
            if not item in counts:
                counts[item] = 0
            counts[item] += 1
        return len(counts)

    >>> infile = file('foo.txt')
    >>> for line in file('foo.txt'):
    ...     print line,
    ...
    abc
    def
    ghi
    abc
    ghi
    def
    xyz
    abc
    abc
    def

    >>> infile = file('foo.txt')
    >>> print count_distinct(infile)
    5

    --
    \ "A man may be a fool and not know it -- but not if he is |
    `\ married." -- Henry L. Mencken |
    _o__) |
    Ben Finney
    Ben Finney, May 18, 2006
    #3
  4. r.e.s. wrote:

    > I have a million-line text file with 100 characters per line,
    > and simply need to determine how many of the lines are distinct.
    >
    > On my PC, this little program just goes to never-never land:
    >
    > def number_distinct(fn):
    >     f = file(fn)
    >     x = f.readline().strip()
    >     L = []
    >     while x<>'':
    >         if x not in L:
    >             L = L + [x]
    >         x = f.readline().strip()
    >     return len(L)


    ouch.

    > Would anyone care to point out improvements?
    > Is there a better algorithm for doing this?


    try this:

    def number_distinct(fn):
        return len(set(s.strip() for s in open(fn)))

    </F>
    Fredrik Lundh, May 18, 2006
    #4
  5. Bill Pursell

    Bill Pursell Guest

    r.e.s. wrote:
    > I have a million-line text file with 100 characters per line,
    > and simply need to determine how many of the lines are distinct.
    >
    > On my PC, this little program just goes to never-never land:
    >
    > def number_distinct(fn):
    >     f = file(fn)
    >     x = f.readline().strip()
    >     L = []
    >     while x<>'':
    >         if x not in L:
    >             L = L + [x]
    >         x = f.readline().strip()
    >     return len(L)
    >
    > Would anyone care to point out improvements?
    > Is there a better algorithm for doing this?


    Have you tried
    cat file | sort | uniq | wc -l ?
    sort might choke on the large file, and this isn't Python, but it
    might work. You might try breaking the file into smaller pieces,
    maybe based on the first character, and then processing them
    separately. The time killer is probably the "x not in L" line,
    since L is getting very large. By subdividing the problem up
    front, that cost becomes much more manageable.
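
    A rough sketch of that bucketing idea (an added illustration, assuming
    the lines can be split by their first character into per-bucket
    temporary files; the directory and file names are made up):

    import os

    def count_distinct_bucketed(fn, tmpdir='buckets'):
        # Pass 1: partition lines into bucket files keyed on their first character.
        os.mkdir(tmpdir)
        buckets = {}
        for line in open(fn):
            line = line.strip()
            key = line[:1] or '_'                    # all empty lines share one bucket
            if key not in buckets:
                path = os.path.join(tmpdir, 'bucket_%d' % ord(key))
                buckets[key] = open(path, 'w')
            buckets[key].write(line + '\n')
        for f in buckets.values():
            f.close()

        # Pass 2: each bucket is much smaller, so it can be de-duplicated in memory.
        total = 0
        for key in buckets:
            path = os.path.join(tmpdir, 'bucket_%d' % ord(key))
            total += len(set(open(path)))
        return total
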
    Bill Pursell, May 18, 2006
    #5
  6. Tim Chase

    Tim Chase Guest

    > I have a million-line text file with 100 characters per line,
    > and simply need to determine how many of the lines are distinct.


    A few ideas:

    1) the shell way:

    bash$ sort file.in | uniq | wc -l

    This doesn't strip whitespace...a little sed magic would
    strip off whitespace for you:

    bash$ sed 'regexp' file.in | sort | uniq | wc -l

    where 'regexp' is something like this atrocity

    's/^[[:space:]]*\(\([[:space:]]*[^[:space:]][^[:space:]]*\)*\)[[:space:]]*$/\1/'

    (If your sed supports "\s" and "\S" for "whitespace" and
    "nonwhitespace", it makes the expression a lot less hairy:

    's/^\s*\(\(\s*\S\S*\)*\)\s*$/\1/'

    and, IMHO, a little easier to read. There might be a
    nice/concise perl one-liner for this too)

    2) use a python set:

    s = set()
    for line in open("file.in"):
        s.add(line.strip())
    return len(s)

    3) compact #2:

    return len(set([line.strip() for line in file("file.in")]))

    or, if stripping the lines isn't a concern, it can just be

    return len(set(file("file.in")))

    The logic in the set keeps track of ensuring that no
    duplicates get entered.

    Depending on how many results you *expect*, this could
    become cumbersome, as you have to have every unique line in
    memory. A stream-oriented solution can be kinder on system
    resources, but would require that the input be sorted first.
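
    A small sketch of such a stream-oriented count (an added illustration,
    assuming the input file has already been sorted, e.g. by the external
    sort utility; the file name is just an example):

    def count_distinct_sorted(fn):
        count = 0
        prev = None
        for line in open(fn):
            line = line.strip()
            if line != prev:     # sorted input: a change in value means a new distinct line
                count += 1
                prev = line
        return count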

    -tkc
    Tim Chase, May 18, 2006
    #6
  7. r.e.s. wrote:
    > I have a million-line text file with 100 characters per line,
    > and simply need to determine how many of the lines are distinct.
    >
    > On my PC, this little program just goes to never-never land:
    >
    > def number_distinct(fn):
    >     f = file(fn)
    >     x = f.readline().strip()
    >     L = []
    >     while x<>'':
    >         if x not in L:
    >             L = L + [x]
    >         x = f.readline().strip()
    >     return len(L)
    >
    > Would anyone care to point out improvements?
    > Is there a better algorithm for doing this?


    Take a look at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560

    It is a python approach to the uniq command on *nix.
    Andrew Robert, May 18, 2006
    #7
  8. r.e.s.

    r.e.s. Guest

    "Tim Chase" <> wrote ...
    > 2) use a python set:
    >
    > s = set()
    > for line in open("file.in"):
    >     s.add(line.strip())
    > return len(s)
    >
    > 3) compact #2:
    >
    > return len(set([line.strip() for line in file("file.in")]))
    >
    > or, if stripping the lines isn't a concern, it can just be
    >
    > return len(set(file("file.in")))
    >
    > The logic in the set keeps track of ensuring that no
    > duplicates get entered.
    >
    > Depending on how many results you *expect*, this could
    > become cumbersome, as you have to have every unique line in
    > memory. A stream-oriented solution can be kinder on system
    > resources, but would require that the input be sorted first.


    Thank you (and all the others who responded!) -- set() does
    the trick, reducing the job to about a minute. I may play
    later with the other alternatives people mentioned (dict(),
    hash(),...), just out of curiosity. I take your point about
    the "expected number", which in my case was around 0-10 (as
    it turned out, there were no dups).

    BTW, the first thing I tried was Fredrik Lundh's program:

    def number_distinct(fn):
        return len(set(s.strip() for s in open(fn)))

    which worked without the square brackets. Interesting that
    omitting them doesn't seem to matter.
    r.e.s., May 19, 2006
    #8
  9. pac

    pac Guest

    A generator expression can "share" the parenthesis of a function call.
    The syntax is explained in PEP 289, which is also in "What's new" in
    the Python 2.4 docs.

    Nice line of code!
    pac, May 19, 2006
    #9
  10. r.e.s. wrote:

    > BTW, the first thing I tried was Fredrik Lundh's program:
    >
    > def number_distinct(fn):
    >     return len(set(s.strip() for s in open(fn)))
    >
    > which worked without the square brackets. Interesting that
    > omitting them doesn't seem to matter.


    a for loop inside square brackets is a "list comprehension", and the
    result is a list. if you use a list comprehension inside a function
    call, the full list is built *before* the function is called. in this
    case, this would mean that the entire file would be read into memory
    before the set was constructed.

    if you change the square brackets to ordinary parentheses, you get a
    generator expression instead:

    http://pyref.infogami.com/generator-expressions

    the generator expression results in an iterator object that calculates
    the values one by one. if you pass it to a function that expects an
    iterator, that function will end up "running the for loop itself", and
    no extra storage is needed. (in this case, you still need memory to
    hold the set, of course, so the difference between a list comprehension
    and a generator expression will only matter if you have lots of duplicates).

    finally, a syntax shortcut lets you remove the parentheses if the
    generator expression is the only argument in a function call, as in the
    above example.
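
    a minimal side-by-side of the two forms (an added illustration; fn
    stands for the file name used earlier in the thread):

    # list comprehension: the complete list of stripped lines is built in
    # memory before set() ever sees it
    count = len(set([s.strip() for s in open(fn)]))

    # generator expression: set() pulls the stripped lines one at a time,
    # so only the set of distinct lines has to fit in memory
    count = len(set(s.strip() for s in open(fn)))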

    </F>
    Fredrik Lundh, May 19, 2006
    #10
  11. Fredrik Lundh wrote:

    >a for loop inside square brackets is a "list comprehension", and the
    >result is a list. if you use a list comprehension inside a function
    >call, the full list is built *before* the function is called. in this
    >case, this would mean that the entire file would be read into memory
    >before the set was constructed.
    >
    >if you change the square brackets to ordinary parentheses, you get a
    >generator expression instead:
    >
    > http://pyref.infogami.com/generator-expressions
    >
    >the generator expression results in an iterator object that calculates
    >the values one by one. if you pass it to a function that expects an
    >iterator, that function will end up "running the for loop itself", and
    >no extra storage is needed. (in this case, you still need memory to
    >hold the set, of course, so the difference between a list comprehension
    >and a generator expression will only matter if you have lots of duplicates).
    >
    >

    This is interesting. I wonder how this compares to uniq in
    performance?

    I actually had this problem a couple of weeks ago when I discovered
    that my son's .Xsession file was 26 GB and had filled the disk
    partition (!). Apparently some games he was playing were spewing
    out a lot of errors, and I wanted to find out which ones were at fault.

    Basically, uniq died on this task (well, it probably was working, but
    it had not completed after more than 10 hours). I was using it
    something like this:

    cat Xsession.errors | uniq > Xsession.uniq

    It never occurred to me to use the Python dict/set approach. Now I
    wonder if it would've worked better somehow. Of course my file was
    26,000 times larger than the one in this problem, and definitely would
    not fit in memory. I suspect that there were as many as a million
    duplicates for some messages in that file. Would the generator
    version above have helped me out, I wonder?

    Unfortunately, I deleted the file, so I can't really try it out. I suppose
    I could create synthetic data with the logging module to try it out.
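
    A quick sketch of generating such a synthetic, duplicate-heavy log with
    the logging module (an added illustration; the file name and messages
    are made up):

    import logging, random

    logging.basicConfig(filename="Xsession.synthetic",
                        format="%(message)s",      # write only the message text, like a raw error log
                        level=logging.INFO)
    messages = ["libGL error: failed to load driver",
                "ALSA lib pcm.c: underrun occurred",
                "KIO: slave died unexpectedly"]
    for i in xrange(10 ** 6):                      # roughly a million lines, mostly duplicates
        logging.info(random.choice(messages))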

    Cheers,
    Terry

    --
    Terry Hancock ()
    Anansi Spaceworks http://www.AnansiSpaceworks.com
    Terry Hancock, May 19, 2006
    #11
  12. Ben Stroud

    Ben Stroud Guest


    > It never occurred to me to use the Python dict/set approach. Now I
    > wonder if it would've worked better somehow. Of course my file was
    > 26,000 times larger than the one in this problem, and definitely would
    > not fit in memory. I suspect that there were as many as a million
    > duplicates for some messages in that file. Would the generator
    > version above have helped me out, I wonder?


    You could use a dbm file approach, which would provide an external
    (on-disk) dict/set interface through Python bindings. This would use less memory.

    1. Add records to dbm as keys
    2. dbm (if configured correctly) will only keep unique keys
    3. Count keys
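
    A minimal sketch of that approach using the standard anydbm module
    (Python 2, as elsewhere in this thread; the database file name is just
    an example):

    import anydbm

    def count_distinct_dbm(fn, dbfile='distinct.db'):
        db = anydbm.open(dbfile, 'n')    # 'n': always start from a new, empty database
        try:
            for line in open(fn):
                db[line.strip()] = '1'   # writing a duplicate key simply overwrites it
            return len(db.keys())
        finally:
            db.close()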

    Cheers,
    Ben
    Ben Stroud, May 19, 2006
    #12
  13. Tim Chase

    Tim Chase Guest

    > I actually had this problem a couple of weeks ago when I
    > discovered that my son's .Xsession file was 26 GB and had
    > filled the disk partition (!). Apparently some games he was
    > playing were spewing out a lot of errors, and I wanted to find
    > out which ones were at fault.
    >
    > Basically, uniq died on this task (well, it probably was
    > working, but had not completed after more than 10 hours). I was
    > using it something like this:
    >
    > cat Xsession.errors | uniq > Xsession.uniq


    A couple things I noticed that may play into matters:

    1) uniq is a dedicated tool for the task of uniquely identifying
    *neighboring* lines in the file. It doesn't get much faster than
    that, *if* that's your input. This leads to #4 below.

    2) (uneventfully?) you have a superfluous use of cat. I don't
    know if that's bogging matters down, but you can just use

    uniq < Xsession.errors > Xsession.uniq

    which would save you from having each line touched twice...once
    by cat, and once by uniq.

    3) as "uniq" doesn't report on its progress, if it's processing a
    humongous 26 gig file, it may just sit there churning for a long
    time before finishing. It looks like it may have taken >10hr :)

    4) "uniq" requires sorted input. Unless you've sorted your
    Xsession.errors before-hand, your output isn't likely to be as
    helpful. The python set/generator scheme may work well to keep
    you from having to sort matters first--especially if you only
    have a fairly scant handful of unique errors.

    5) I presume wherever you were writing Xsession.uniq had enough
    space...you mentioned your son filling your HDD. It may gasp,
    wheeze and die if there wasn't enough space...or it might just
    hang. I'd hope it would be smart enough to gracefully report
    "out of disk-space" errors in the process.

    6) unless I'm experiencing trouble, I just tend to keep my
    .xsession-errors file as a soft-link to /dev/null, especially as
    (when I use KDE rather than Fluxbox) KDE likes to spit out
    mountains of KIO file errors. It's easy enough to unlink it and
    let it generate the file if needed.

    7) With a file this large, you most certainly want to use a
    generator scheme rather than trying to load each of the lines in
    the file :) (Note to Bruno...yes, *this* would be one of those
    places you mentioned to me earlier about *not* using readlines()
    ;)

    If you're using 2.3.x, and don't have 2.4's nice syntax for

    len(set(line.strip() for line in file("xsession.errors")))

    you should be able to bypass reading the whole file into memory
    (and make use of sets) with

    from sets import Set as set
    s = set()
    for line in file("xsession.errors"):
        s.add(line.strip())
    return len(s)

    In your case, you likely don't have to call strip() and can just
    get away with adding each line to the set.

    Just a few ideas for the next time you have a multi-gig
    Xsession.errors situation :)

    -tkc
    Tim Chase, May 19, 2006
    #13
  14. Kaz Kylheku

    Kaz Kylheku Guest

    Bill Pursell wrote:
    > Have you tried
    > cat file | sort | uniq | wc -l ?


    The standard input file descriptor of sort can be attached directly to
    a file. You don't need a file catenating process in order to feed it:

    sort < file | uniq | wc -l

    Sort has the uniq functionality built in:

    sort -u < file | wc -l

    > sort might choke on the large file, and this isn't python, but it
    > might work.


    Solid implementations of sort can use external storage for large files,
    and perform a poly-phase type sort, rather than doing the entire sort
    in memory.

    I seem to recall that GNU sort does something like this, using
    temporary files.

    Naively written Python code is a lot more likely to choke on a large
    data set.

    > You might try breaking the file into
    > smaller peices, maybe based on the first character, and then
    > process them seperately.


    No, the way this is done is simply to read the file and insert the data
    into an ordered data structure until memory fills up. After that, you
    keep reading the file and inserting, but each time you insert, you
    remove the smallest element and write it out to the segment file. You
    keep doing this until it's no longer possible to extract a smallest
    element which is greater than everything already written to the
    file. When that happens, you start a new file. That does not happen
    until you have filled memory at least twice. So for instance with half
    a gig of RAM, you can produce merge segments on the order of a gig.
    Kaz Kylheku, May 19, 2006
    #14
  15. Kaz Kylheku

    Kaz Kylheku Guest

    Bill Pursell wrote:
    > Have you tried
    > cat file | sort | uniq | wc -l ?


    The standard input file descriptor of sort can be attached directly to
    a file. You don't need a file catenating process in order to feed it:

    sort < file | uniq | wc -l

    And sort also takes a filename argument:

    sort file | uniq | wc -l

    And sort has the uniq functionality built in:

    sort -u file | wc -l

    Really, all this piping between little utilities is inefficient
    bullshit, isn't it! All that IPC through the kernel, copying the data.

    Why can't sort also count the damn lines?

    There should be one huge utility which can do it all in a single
    address space.

    > sort might choke on the large file, and this isn't python, but it
    > might work.


    Solid implementations of sort can use external storage for large files,
    and perform a poly-phase type sort, rather than doing the entire sort
    in memory.

    I seem to recall that GNU sort does something like this, using
    temporary files.

    Naively written Python code is a lot more likely to choke on a large
    data set.

    > You might try breaking the file into
    > smaller peices, maybe based on the first character, and then
    > process them seperately.


    No, the way this is done is simply to read the file and insert the data
    into an ordered data structure until memory fills up. After that, you
    keep reading the file and inserting, but each time you insert, you
    remove the smallest element and write it out to the segment file. You
    keep doing this until it's no longer possible to extract a smallest
    element which is greater than everything already written to the
    file. When that happens, you start a new file. That does not happen
    until you have filled memory at least twice. So for instance with half
    a gig of RAM, you can produce merge segments on the order of a gig.
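
    A compact sketch of that replacement-selection scheme in Python, using
    heapq (an added illustration of the technique described above, not code
    from the thread; the segment file names and the memory_limit figure are
    arbitrary stand-ins for "until memory fills up"):

    import heapq, itertools

    def make_merge_segments(fn, prefix='segment', memory_limit=100000):
        """Split fn into sorted run files using replacement selection."""
        infile = open(fn)
        heap = list(itertools.islice(infile, memory_limit))   # fill memory once
        heapq.heapify(heap)
        pending = []                             # lines too small for the current run
        seg_num = 0
        out = open('%s_%d' % (prefix, seg_num), 'w')
        while heap:
            smallest = heapq.heappop(heap)
            out.write(smallest)                  # each run file comes out sorted
            line = infile.readline()
            if line:
                if line >= smallest:
                    heapq.heappush(heap, line)   # still fits in the current run
                else:
                    pending.append(line)         # must wait for the next run
            if not heap and pending:
                out.close()                      # current run exhausted; start a new one
                seg_num += 1
                out = open('%s_%d' % (prefix, seg_num), 'w')
                heap, pending = pending, []
                heapq.heapify(heap)
        out.close()
        return seg_num + 1

    The resulting runs can then be merged (e.g. with sort -m, or a k-way
    merge in Python) and the distinct lines counted in one streaming pass.
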
    Kaz Kylheku, May 19, 2006
    #15
  16. Paddy

    Paddy Guest

    If the log has a lot of repeated lines in its original state then
    running uniq twice, once up front to reduce what needs to be sorted,
    might be quicker?

    uniq log_file | sort | uniq | wc -l

    - Pad.
    Paddy, May 19, 2006
    #16
  17. Paul McGuire

    Paul McGuire Guest

    "Paddy" <> wrote in message
    news:...
    > If the log has a lot of repeated lines in its original state then
    > running uniq twice, once up front to reduce what needs to be sorted,
    > might be quicker?
    >
    > uniq log_file | sort| uniq | wc -l
    >
    > - Pad.
    >


    Why would the second running of uniq remove any additional lines that
    weren't removed in the first pass?

    For that matter, if this is a log file, won't every line have a timestamp,
    making duplicates extremely unlikely?

    -- Paul
    Paul McGuire, May 19, 2006
    #17
  18. On 2006-05-19, Kaz Kylheku <> wrote:

    > There should be one huge utility which can do it all in a single
    > address space.


    Sure, as long as it can do all of everything you'll ever need
    to do, you're set! It would be the One True Program.

    Isn't that what Emacs is supposed to be?

    --
    Grant Edwards (grante at visi.com)
    Yow! My mind is making ashtrays in Dayton...
    Grant Edwards, May 19, 2006
    #18
  19. On 2006-05-19, Paul McGuire <._bogus_.com> wrote:

    >> If the log has a lot of repeated lines in its original state then
    >> running uniq twice, once up front to reduce what needs to be sorted,
    >> might be quicker?
    >>
    >> uniq log_file | sort| uniq | wc -l
    >>
    >> - Pad.

    >
    > Why would the second running of uniq remove any additional lines that
    > weren't removed in the first pass?


    Because uniq only removes _adjacent_ identical lines.
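
    A quick way to see this from Python (a small added illustration; the
    sample data is made up): itertools.groupby, like uniq, collapses only
    adjacent duplicates, so unsorted input still leaves repeats.

    >>> from itertools import groupby
    >>> lines = ['abc', 'abc', 'def', 'abc']
    >>> len([key for key, group in groupby(lines)])          # unsorted: 'abc' is counted twice
    3
    >>> len([key for key, group in groupby(sorted(lines))])  # sort first and it matches set()
    2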

    > For that matter, if this is a log file, won't every line have a timestamp,
    > making duplicates extremely unlikely?


    Probably.

    --
    Grant Edwards (grante at visi.com)
    Yow! If our behavior is strict, we do not need fun!
    Grant Edwards, May 19, 2006
    #19
  20. Kaz Kylheku

    Kaz Kylheku Guest

    Paddy wrote:
    > If the log has a lot of repeated lines in its original state then
    > running uniq twice, once up front to reduce what needs to be sorted,
    > might be quicker?


    Having the uniq and sort steps integrated in a single piece of software
    allows for the most optimization opportunities.

    The sort utility, under -u, could squash duplicate lines on the input
    side /and/ the output side.

    > uniq log_file | sort| uniq | wc -l


    Now you have two more pipeline elements, two more tasks running, and
    four more copies of the data being made as it travels through two extra
    pipes in the kernel.

    Or, only two more copies if you are lucky enough to have a "zero copy"
    pipe implementation which allows data to go from the writer's buffer
    directly to the reader's one without intermediate kernel buffering.
    Kaz Kylheku, May 19, 2006
    #20
