number of different lines in a file


r.e.s.

I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.

On my PC, this little program just goes to never-never land:

def number_distinct(fn):
    f = file(fn)
    x = f.readline().strip()
    L = []
    while x<>'':
        if x not in L:
            L = L + [x]
        x = f.readline().strip()
    return len(L)

Would anyone care to point out improvements?
Is there a better algorithm for doing this?
 

Larry Bates

r.e.s. said:
I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.

On my PC, this little program just goes to never-never land:

def number_distinct(fn):
    f = file(fn)
    x = f.readline().strip()
    L = []
    while x<>'':
        if x not in L:
            L = L + [x]
        x = f.readline().strip()
    return len(L)

Would anyone care to point out improvements?
Is there a better algorithm for doing this?

Sounds like homework, but I'll bite.

def number_distinct(fn):
    hash_dict={}
    total_lines=0
    for line in open(fn, 'r'):
        total_lines+=1
        key=hash(line.strip())
        if hash_dict.has_key(key): continue
        hash_dict[key]=1

    return total_lines, len(hash_dict.keys())

if __name__=="__main__":
    fn='c:\\test.txt'
    total_lines, distinct_lines=number_distinct(fn)
    print "Total lines=%i, distinct lines=%i" % (total_lines, distinct_lines)


-Larry Bates
 

Ben Finney

r.e.s. said:
I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.

I'd generalise it by allowing the caller to pass any iterable set of
items. A file handle can be iterated this way, but so can any
sequence or iterable.

def count_distinct(seq):
    """ Count the number of distinct items """
    counts = dict()
    for item in seq:
        if not item in counts:
            counts[item] = 0
        counts[item] += 1
    return len(counts)
(A short interpreter demo followed: it printed a ten-line sample, "abc def ghi abc ghi def xyz abc abc def", then called count_distinct on those lines, which returned 5.)
 

Fredrik Lundh

r.e.s. said:
I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.

On my PC, this little program just goes to never-never land:

def number_distinct(fn):
    f = file(fn)
    x = f.readline().strip()
    L = []
    while x<>'':
        if x not in L:
            L = L + [x]
        x = f.readline().strip()
    return len(L)

ouch.

Would anyone care to point out improvements?
Is there a better algorithm for doing this?

try this:

def number_distinct(fn):
    return len(set(s.strip() for s in open(fn)))

</F>
 

Bill Pursell

r.e.s. said:
I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.

On my PC, this little program just goes to never-never land:

def number_distinct(fn):
    f = file(fn)
    x = f.readline().strip()
    L = []
    while x<>'':
        if x not in L:
            L = L + [x]
        x = f.readline().strip()
    return len(L)

Would anyone care to point out improvements?
Is there a better algorithm for doing this?

Have you tried

cat file | sort | uniq | wc -l ?

sort might choke on the large file, and this isn't Python, but it
might work. You might try breaking the file into smaller pieces,
maybe based on the first character, and then process them
separately. The time killer is probably the "x not in L" line,
since L is getting very large. By subdividing the problem up front,
each piece's list stays small and the membership test stays cheap.
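
A rough sketch of that subdividing idea (the bucket directory and filenames here are made up for illustration): write each line into a per-first-character bucket file, then dedupe each bucket on its own, so no single in-memory structure ever has to hold every distinct line.

import os

def count_distinct_by_buckets(fn, bucket_dir='buckets'):
    if not os.path.isdir(bucket_dir):
        os.mkdir(bucket_dir)
    buckets = {}
    # pass 1: distribute lines into one file per leading character
    for line in open(fn):
        line = line.strip()
        if not line:
            continue
        key = line[0]
        if key not in buckets:
            path = os.path.join(bucket_dir, 'bucket_%03d' % ord(key))
            buckets[key] = open(path, 'w')
        buckets[key].write(line + '\n')
    for f in buckets.values():
        f.close()
    # pass 2: each bucket is small enough to dedupe in memory on its own
    total = 0
    for key in buckets:
        path = os.path.join(bucket_dir, 'bucket_%03d' % ord(key))
        total += len(set(l.strip() for l in open(path)))
    return total

Because every line lands in exactly one bucket, the per-bucket distinct counts simply add up to the overall count.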
 

Tim Chase

I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.

A few ideas:

1) the shell way:

bash$ sort file.in | uniq | wc -l

This doesn't strip whitespace...a little sed magic would
strip off whitespace for you:

bash$ sed 'regexp' file.in | sort | uniq | wc -l

where 'regexp' is something like this atrocity

's/^[[:space:]]*\(\([[:space:]]*[^[:space:]][^[:space:]]*\)*\)[[:space:]]*$/\1/'

(If your sed supports "\s" and "\S" for "whitespace" and
"nonwhitespace", it makes the expression a lot less hairy:

's/^\s*\(\(\s*\S\S*\)*\)\s*$/\1/'

and, IMHO, a little easier to read. There might be a
nice/concise perl one-liner for this too)

2) use a python set:

s = set()
for line in open("file.in"):
    s.add(line.strip())
return len(s)

3) compact #2:

return len(set([line.strip() for line in file("file.in")]))

or, if stripping the lines isn't a concern, it can just be

return len(set(file("file.in")))

The set takes care of ensuring that no
duplicates get entered.

Depending on how many results you *expect*, this could
become cumbersome, as you have to have every unique line in
memory. A stream-oriented solution can be kinder on system
resources, but would require that the input be sorted first.
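
A minimal sketch of that stream-oriented approach, assuming the input has already been sorted (say, by the shell's sort), so duplicates are guaranteed to be adjacent and only the previous line needs to be kept in memory:

def count_distinct_sorted(sorted_lines):
    count = 0
    prev = None
    for line in sorted_lines:
        line = line.strip()
        if line != prev:     # in sorted input, a line differing from its predecessor is a new value
            count += 1
            prev = line
    return count

# e.g. count_distinct_sorted(open("file.sorted"))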

-tkc
 

Andrew Robert

r.e.s. said:
I have a million-line text file with 100 characters per line,
and simply need to determine how many of the lines are distinct.

On my PC, this little program just goes to never-never land:

def number_distinct(fn):
    f = file(fn)
    x = f.readline().strip()
    L = []
    while x<>'':
        if x not in L:
            L = L + [x]
        x = f.readline().strip()
    return len(L)

Would anyone care to point out improvements?
Is there a better algorithm for doing this?

Take a look at http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52560

It is a python approach to the uniq command on *nix.
 

r.e.s.

Tim Chase said:
2) use a python set:

s = set()
for line in open("file.in"):
    s.add(line.strip())
return len(s)

3) compact #2:

return len(set([line.strip() for line in file("file.in")]))

or, if stripping the lines isn't a concern, it can just be

return len(set(file("file.in")))

The set takes care of ensuring that no
duplicates get entered.

Depending on how many results you *expect*, this could
become cumbersome, as you have to have every unique line in
memory. A stream-oriented solution can be kinder on system
resources, but would require that the input be sorted first.

Thank you (and all the others who responded!) -- set() does
the trick, reducing the job to about a minute. I may play
later with the other alternatives people mentioned (dict(),
hash(),...), just out of curiosity. I take your point about
the "expected number", which in my case was around 0-10 (as
it turned out, there were no dups).

BTW, the first thing I tried was Fredrik Lundh's program:

def number_distinct(fn):
    return len(set(s.strip() for s in open(fn)))

which worked without the square brackets. Interesting that
omitting them doesn't seem to matter.
 

pac

A generator expression can "share" the parentheses of a function call.
The syntax is explained in PEP 289, which is also in "What's new" in
the Python 2.4 docs.

Nice line of code!
 

Fredrik Lundh

r.e.s. said:
BTW, the first thing I tried was Fredrik Lundh's program:

def number_distinct(fn):
    return len(set(s.strip() for s in open(fn)))

which worked without the square brackets. Interesting that
omitting them doesn't seem to matter.

a for loop inside square brackets is a "list comprehension", and the
result is a list. if you use a list comprehension inside a function
call, the full list is built *before* the function is called. in this
case, this would mean that the entire file would be read into memory
before the set was constructed.

if you change the square brackets to ordinary parentheses, you get a
generator expression instead:

http://pyref.infogami.com/generator-expressions

the generator expression results in an iterator object that calculates
the values one by one. if you pass it to a function that expects an
iterator, that function will end up "running the for loop itself", and
no extra storage is needed. (in this case, you still need memory to
hold the set, of course, so the difference between a list comprehension
and a generator expression will only matter if you have lots of duplicates).

finally, a syntax shortcut lets you remove the parentheses if the
generator expression is the only argument in a function call, as in the
above example.
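
a tiny interpreter sketch (mine, not from the post) of that behaviour: the set constructor ends up pulling values out of the generator one at a time, as they are produced.

>>> def noisy():
...     for value in "aab":
...         print "producing", value
...         yield value
...
>>> len(set(noisy()))
producing a
producing a
producing b
2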

</F>
 

Terry Hancock

Fredrik said:
a for loop inside square brackets is a "list comprehension", and the
result is a list. if you use a list comprehension inside a function
call, the full list is built *before* the function is called. in this
case, this would mean that the entire file would be read into memory
before the set was constructed.

if you change the square brackets to ordinary parentheses, you get a
generator expression instead:

http://pyref.infogami.com/generator-expressions

the generator expression results in an iterator object that calculates
the values one by one. if you pass it to a function that expects an
iterator, that function will end up "running the for loop itself", and
no extra storage is needed. (in this case, you still need memory to
hold the set, of course, so the difference between a list comprehension
and a generator expression will only matter if you have lots of duplicates).

This is interesting. I wonder how this compares to uniq in
performance?

I actually had this problem a couple of weeks ago when I discovered
that my son's .Xsession file was 26 GB and had filled the disk
partition (!). Apparently some games he was playing were spewing
out a lot of errors, and I wanted to find out which ones were at fault.

Basically, uniq died on this task (well, it probably was still working, but
had not completed after over 10 hours). I was using it something like
this:

cat Xsession.errors | uniq > Xsession.uniq

It never occurred to me to use the Python dict/set approach. Now I
wonder if it would've worked better somehow. Of course my file was
26,000 X larger than the one in this problem, and definitely would
not fit in memory. I suspect that there were as many as a million
duplicates for some messages in that file. Would the generator
version above have helped me out, I wonder?

Unfortunately, I deleted the file, so I can't really try it out. I suppose
I could create synthetic data with the logging module to try it out.

Cheers,
Terry
 

Ben Stroud

It never occurred to me to use the Python dict/set approach. Now I
wonder if it would've worked better somehow. Of course my file was
26,000 X larger than the one in this problem, and definitely would
not fit in memory. I suspect that there were as many as a million
duplicates for some messages in that file. Would the generator
version above have helped me out, I wonder?

You could use a dbm file approach, which would provide an external
dict/set interface through Python bindings. This would use less memory.

1. Add records to dbm as keys
2. dbm (if configured correctly) will only keep unique keys
3. Count keys
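
A rough sketch of that idea using Python 2's anydbm module (the database filename here is arbitrary): the keys live on disk, so memory use stays small no matter how many distinct lines there are.

import anydbm

def count_distinct_dbm(fn, dbfile='distinct.db'):
    db = anydbm.open(dbfile, 'n')    # 'n' always starts a fresh database
    for line in open(fn):
        db[line.strip()] = '1'       # duplicate keys simply overwrite each other
    count = len(db.keys())
    db.close()
    return count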

Cheers,
Ben
 

Tim Chase

> I actually had this problem a couple of weeks ago when I
> discovered that my son's .Xsession file was 26 GB and had
> filled the disk partition (!). Apparently some games he was
> playing were spewing out a lot of errors, and I wanted to find
> out which ones were at fault.
>
> Basically, uniq died on this task (well, it probably was
> working, but not completed after over 10 hours). I was using
> it something like this:
>
> cat Xsession.errors | uniq > Xsession.uniq

A couple things I noticed that may play into matters:

1) uniq is a dedicated tool for the task of uniquely identifying
*neighboring* lines in the file. It doesn't get much faster than
that, *if* that's your input. This leads to #4 below.

2) (incidentally) you have a superfluous use of cat. I don't
know if that's bogging matters down, but you can just use

uniq < Xsession.errors > Xsession.uniq

which would save you from having each line touched twice...once
by cat, and once by uniq.

3) as "uniq" doesn't report on its progress, if it's processing a
humongous 26 gig file, it may just sit there churning for a long
time before finishing. It looks like it may have taken >10hr :)

4) "uniq" requires sorted input. Unless you've sorted your
Xsession.errors before-hand, your output isn't likely to be as
helpful. The python set/generator scheme may work well to keep
you from having to sort matters first--especially if you only
have a fairly scant handful of unique errors.

5) I presume wherever you were writing Xsession.uniq had enough
space...you mentioned your son filling your HDD. It may gasp,
wheeze and die if there wasn't enough space...or it might just
hang. I'd hope it would be smart enough to gracefully report
"out of disk-space" errors in the process.

6) unless I'm experiencing trouble, I just tend to keep my
.xsession-errors file as a soft-link to /dev/null, especially as
(when I use KDE rather than Fluxbox) KDE likes to spit out
mountains of KIO file errors. It's easy enough to unlink it and
let it generate the file if needed.

7) With a file this large, you most certainly want to use a
generator scheme rather than trying to load each of the lines in
the file :) (Note to Bruno...yes, *this* would be one of those
places you mentioned to me earlier about *not* using readlines()
;)

If you're using 2.3.x, and don't have 2.4's nice syntax for

len(set(line.strip() for line in file("xsession.errors")))

you should be able to bypass reading the whole file into memory
(and make use of sets) with

from sets import Set as set
s = set()
for line in file("xsession.errors"):
    s.add(line.strip())
return len(s)

In your case, you likely don't have to call strip() and can just
get away with adding each line to the set.

Just a few ideas for the next time you have a multi-gig
Xsession.errors situation :)

-tkc
 

Kaz Kylheku

Bill said:
Have you tried
cat file | sort | uniq | wc -l ?

The standard input file descriptor of sort can be attached directly to
a file. You don't need a file catenating process in order to feed it:

sort < file | uniq | wc -l

And sort also takes a filename argument:

sort file | uniq | wc -l

And sort has the uniq functionality built in:

sort -u file | wc -l

Really, all this piping between little utilities is inefficient
bullshit, isn't it! All that IPC through the kernel, copying the data.

Why can't sort also count the damn lines?

There should be one huge utility which can do it all in a single
address space.

sort might choke on the large file, and this isn't Python, but it
might work.

Solid implementations of sort can use external storage for large files,
and perform a poly-phase type sort, rather than doing the entire sort
in memory.

I seem to recall that GNU sort does something like this, using
temporary files.

Naively written Python code is a lot more likely to choke on a large
data set.

You might try breaking the file into
smaller pieces, maybe based on the first character, and then
process them separately.

No, the way this is done is simply to read the file and insert the data
into an ordered data structure until memory fills up. After that, you
keep reading the file and inserting, but each time you insert, you
remove the smallest element and write it out to the segment file. You
keep doing it until it's no longer possible to extract a smallest
element which is greater than all that have been already written to the
file. When that happens, you start a new file. That does not happen
until you have filled memory at least twice. So for instance with half
a gig of RAM, you can produce merge segments on the order of a gig.
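
A rough sketch of that run-generation scheme (an illustration, not code from the thread), using a heap as the ordered in-memory structure: the smallest element is repeatedly written to the current run, and an incoming line smaller than the last one written is held back for the next run. The sorted runs would then be merged in a later pass.

import heapq

def sorted_runs(lines, memory_limit=100000):
    lines = iter(lines)
    heap = []
    # fill memory with the first memory_limit lines
    for line in lines:
        heap.append(line)
        if len(heap) >= memory_limit:
            break
    heapq.heapify(heap)
    run, next_run = [], []
    while heap:
        smallest = heapq.heappop(heap)
        run.append(smallest)        # in a real external sort this goes to the current segment file
        try:
            incoming = lines.next()
        except StopIteration:
            incoming = None
        if incoming is not None:
            if incoming >= smallest:
                heapq.heappush(heap, incoming)   # still fits into the current run
            else:
                next_run.append(incoming)        # smaller than what was just written: defer it
        if not heap and next_run:
            yield run                            # the current run cannot grow any further
            run, heap, next_run = [], next_run, []
            heapq.heapify(heap)
    if run:
        yield run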
 

Paddy

If the log has a lot of repeated lines in its original state then
running uniq twice, once up front to reduce what needs to be sorted,
might be quicker?

uniq log_file | sort | uniq | wc -l

- Pad.
 

Paul McGuire

Paddy said:
If the log has a lot of repeated lines in its original state then
running uniq twice, once up front to reduce what needs to be sorted,
might be quicker?

uniq log_file | sort | uniq | wc -l

- Pad.

Why would the second running of uniq remove any additional lines that
weren't removed in the first pass?

For that matter, if this is a log file, won't every line have a timestamp,
making duplicates extremely unlikely?

-- Paul
 

Grant Edwards

There should be one huge utility which can do it all in a single
address space.

Sure, as long as it can do all of everything you'll ever need
to do, you're set! It would be the One True Program.

Isn't that what Emacs is supposed to be?
 

Grant Edwards

Why would the second running of uniq remove any additional lines that
weren't removed in the first pass?

Because uniq only removes _adjacent_ identical lines.

For that matter, if this is a log file, won't every line have a timestamp,
making duplicates extremely unlikely?

Probably.
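
In Python terms (a sketch, not from the thread), uniq behaves like itertools.groupby on the raw stream: only adjacent duplicates collapse, which is why sorting first matters.

>>> import itertools
>>> lines = ['abc', 'abc', 'def', 'abc']
>>> [key for key, group in itertools.groupby(lines)]           # like plain uniq
['abc', 'def', 'abc']
>>> [key for key, group in itertools.groupby(sorted(lines))]   # like sort | uniq
['abc', 'def']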
 

Kaz Kylheku

Paddy said:
If the log has a lot of repeated lines in its original state then
running uniq twice, once up front to reduce what needs to be sorted,
might be quicker?

Having the uniq and sort steps integrated in a single piece of software
allows for the most optimization opportunities.

The sort utility, under -u, could squash duplicate lines on the input
side /and/ the output side.

uniq log_file | sort | uniq | wc -l

Now you have two more pipeline elements, two more tasks running, and
four more copies of the data being made as it travels through two extra
pipes in the kernel.

Or, only two more copies if you are lucky enough to have a "zero copy"
pipe implementation which allows data to go from the writer's buffer
directly to the reader's one without intermediate kernel buffering.
 
