# Re: count

Discussion in 'Python' started by Vilya Harvey, Jul 8, 2009.

1. ### Vilya Harvey

2009/7/8 Dhananjay <>:
> I wanted to sort column 2 in ascending order and I read whole file in array
> "data" and did the following:
>
> data.sort(key = lambda fields: (fields[2]))
>
> I have sorted column 2, however I want to count the numbers in the column 2.
> i.e. I want to know, for example, how many repeats of say '3' (first row,
> 2nd column in above data) are there in column 2.

One thing: indexes in Python start from 0, so the second column has an
index of 1, not 2. In other words, it should be
data.sort(key = lambda fields: fields[1])

With that out of the way, the following will print out a count of each
unique item in the second column:

from itertools import groupby
for x, g in groupby([fields[1] for fields in data]):
    print x, len(tuple(g))

Hope that helps,
Vil.

Vilya Harvey, Jul 8, 2009
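The groupby approach above relies on the column already being sorted, because groupby only merges adjacent equal items. A minimal Python 3 sketch of the whole sort-then-count step follows; the sample rows and the helper name count_column are made up for illustration, since the original poster's data was not shown in full.

```python
from itertools import groupby

# Hypothetical sample rows; the second column (index 1) is the one to count.
data = [
    ["a", 3, "x"],
    ["b", 1, "y"],
    ["c", 3, "z"],
    ["d", 2, "w"],
]

def count_column(rows, index):
    """Return (value, count) pairs for one column, ordered by value."""
    # Sort first: groupby only groups runs of adjacent equal items.
    column = sorted(row[index] for row in rows)
    return [(value, sum(1 for _ in group)) for value, group in groupby(column)]

counts = count_column(data, 1)
for value, count in counts:
    print(value, count)
```

If the order of the output does not matter, collections.Counter would give the same counts without sorting; it is mentioned here only as an alternative, not as something proposed in the thread.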

2. ### Bearophile

Vilya Harvey:
> from itertools import groupby
> for x, g in groupby([fields[1] for fields in data]):
>     print x, len(tuple(g))

Avoid that len(tuple(g)), use something like the following, it's lazy
and saves some memory.

def leniter(iterator):
    """leniter(iterator): return the length of a given
    iterator, consuming it, without creating a list.
    Never use it with infinite iterators.

    >>> leniter()
    Traceback (most recent call last):
      ...
    TypeError: leniter() takes exactly 1 argument (0 given)
    >>> leniter([])
    0
    >>> leniter([1])
    1
    >>> leniter(iter([1]))
    1
    >>> leniter(x for x in xrange(100) if x%2)
    50
    >>> from itertools import groupby
    >>> [(leniter(g), h) for h,g in groupby("aaaabccaadeeee")]
    [(4, 'a'), (1, 'b'), (2, 'c'), (2, 'a'), (1, 'd'), (4, 'e')]

    >>> def foo0():
    ...     if False: yield 1
    >>> leniter(foo0())
    0

    >>> def foo1(): yield 1
    >>> leniter(foo1())
    1
    """
    # This code is faster than: sum(1 for _ in iterator)
    if hasattr(iterator, "__len__"):
        return len(iterator)
    nelements = 0
    for _ in iterator:
        nelements += 1
    return nelements

Bye,
bearophile

Bearophile, Jul 8, 2009
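For readers on Python 3, where xrange no longer exists, the same idea can be sketched as below. This is a port written for illustration, not Bearophile's own code; the parameter name iterable and the variables run_lengths and odd_count are choices made here.

```python
from itertools import groupby

def leniter(iterable):
    """Count the items in an iterable without building a list or tuple.

    Consumes iterators; never call it on an infinite one.
    """
    # Sized containers already know their length, so ask them directly.
    if hasattr(iterable, "__len__"):
        return len(iterable)
    count = 0
    for _ in iterable:
        count += 1
    return count

# Run lengths of consecutive characters, as in the doctest above.
run_lengths = [(leniter(group), key) for key, group in groupby("aaaabccaadeeee")]
print(run_lengths)

# Counting a generator expression: the odd numbers below 100.
odd_count = leniter(x for x in range(100) if x % 2)
print(odd_count)
```

The group objects produced by groupby have no __len__, so in this use the function always takes the counting loop.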

3. ### Paul Rubin

Bearophile <> writes:
> >     print x, len(tuple(g))
>
> Avoid that len(tuple(g)), use something like the following

print x, sum(1 for _ in g)

Paul Rubin, Jul 8, 2009
4. ### Aahz

In article <>,
Bearophile <> wrote:
>Vilya Harvey:
>>
>> from itertools import groupby
>> for x, g in groupby([fields[1] for fields in data]):
>>     print x, len(tuple(g))
>
>Avoid that len(tuple(g)), use something like the following, it's lazy
>and saves some memory.

The question is whether it saves time, have you tested it?
--
Aahz () <*> http://www.pythoncraft.com/

"as long as we like the same operating system, things are cool." --piranha

Aahz, Jul 8, 2009
5. ### Paul Rubin

(Aahz) writes:
> >Avoid that len(tuple(g)), use something like the following, it's lazy
> >and saves some memory.

> The question is whether it saves time, have you tested it?

len(tuple(xrange(100000000))) ... hmm.

Paul Rubin, Jul 8, 2009
6. ### Aahz

In article <>,
Paul Rubin <http://> wrote:
> (Aahz) writes:
>>>
>>>Avoid that len(tuple(g)), use something like the following, it's lazy
>>>and saves some memory.

>>
>> The question is whether it saves time, have you tested it?

>
>len(tuple(xrange(100000000))) ... hmm.

When dealing with small N, O() can get easily swamped by the constant
factors. How often do you deal with more than a hundred fields?

Aahz, Jul 8, 2009
7. ### Paul Rubin

(Aahz) writes:
> When dealing with small N, O() can get easily swamped by the constant
> factors. How often do you deal with more than a hundred fields?

The number of fields in the OP's post was not stated. Expecting it to
be less than 100 seems like an ill-advised presumption. If N is
unknown, speed-tuning the case where N is small at the expense of
consuming monstrous amounts of memory when N is large sounds
somewhere between a premature optimization and a nasty bug.

Paul Rubin, Jul 8, 2009
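The memory side of this argument can be checked directly. The sketch below assumes Python 3 and uses the standard tracemalloc module to compare peak allocations of the two counting styles; the function names and the value of n are illustrative choices, not something from the thread.

```python
import tracemalloc

def count_with_tuple(n):
    # Materializes all n items before counting them.
    return len(tuple(range(n)))

def count_with_sum(n):
    # Holds only one item at a time while counting.
    return sum(1 for _ in range(n))

def peak_bytes(func, n):
    """Return the peak number of bytes allocated while func(n) runs."""
    tracemalloc.start()
    func(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

n = 200000
tuple_peak = peak_bytes(count_with_tuple, n)
sum_peak = peak_bytes(count_with_sum, n)
print("tuple peak bytes:", tuple_peak)
print("sum peak bytes:  ", sum_peak)
```

The tuple version's peak grows with n, while the generator version's stays roughly constant, which is the trade-off being argued over here.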
8. ### J. Clifford Dyer

On Wed, 2009-07-08 at 14:45 -0700, Paul Rubin wrote:
> (Aahz) writes:
> > >Avoid that len(tuple(g)), use something like the following, it's lazy
> > >and saves some memory.

> > The question is whether it saves time, have you tested it?

>
> len(tuple(xrange(100000000))) ... hmm.

timer.py
--------
from datetime import datetime

def tupler(n):
    return len(tuple(xrange(n)))

def summer(n):
    return sum(1 for x in xrange(n))

def test_func(f, n):
    print f.__name__,
    start = datetime.now()
    print f(n)
    end = datetime.now()
    print "Start: %s" % start
    print "End: %s" % end
    print "Duration: %s" % (end - start,)

if __name__ == '__main__':
    test_func(summer, 10000000)
    test_func(tupler, 10000000)
    test_func(summer, 100000000)
    test_func(tupler, 100000000)

$ python timer.py
summer 10000000
Start: 2009-07-08 22:02:13.216689
End: 2009-07-08 22:02:15.855931
Duration: 0:00:02.639242
tupler 10000000
Start: 2009-07-08 22:02:15.856122
End: 2009-07-08 22:02:16.743153
Duration: 0:00:00.887031
summer 100000000
Start: 2009-07-08 22:02:16.743863
End: 2009-07-08 22:02:49.372756
Duration: 0:00:32.628893
Killed
$

Note that "Killed" did not come from anything I did. The tupler just
bombed out when the tuple got too big for it to handle. Tupler was
faster for as large an input as it could handle, as well as for small
inputs (test not shown).

J. Clifford Dyer, Jul 9, 2009
9. ### Bearophile

Paul Rubin:
> print x, sum(1 for _ in g)

Don't use that, use my function. If g has a __len__ you are wasting
time. And sum(1 ...) is (on my PC) slower.

J. Clifford Dyer:
> if __name__ == '__main__':
>     test_func(summer, 10000000)
>     test_func(tupler, 10000000)
>     test_func(summer, 100000000)
>     test_func(tupler, 100000000)

Have you forgotten my function?

Bye,
bearophile

Bearophile, Jul 9, 2009
10. ### J. Cliff Dyer

Bearophile wins! (This only times the loop itself. It doesn't check
for __len__)

summer:5
0:00:00.000051
bearophile:5
0:00:00.000009
summer:50
0:00:00.000030
bearophile:50
0:00:00.000013
summer:500
0:00:00.000077
bearophile:500
0:00:00.000053
summer:5000
0:00:00.000575
bearophile:5000
0:00:00.000473
summer:50000
0:00:00.005583
bearophile:50000
0:00:00.004625
summer:500000
0:00:00.055834
bearophile:500000
0:00:00.046137
summer:5000000
0:00:00.426734
bearophile:5000000
0:00:00.349573
summer:50000000
0:00:04.180920
bearophile:50000000
0:00:03.652311
summer:500000000
0:00:42.647885
bearophile:500000000
0:00:35.190550

On Thu, 2009-07-09 at 04:04 -0700, Bearophile wrote:
> Paul Rubin:
> > print x, sum(1 for _ in g)

>
> Don't use that, use my function. If g has a __len__ you are wasting
> time. And sum(1 ...) is (on my PC) slower.
>
>
> J. Clifford Dyer:
> > if __name__ == '__main__':
> >     test_func(summer, 10000000)
> >     test_func(tupler, 10000000)
> >     test_func(summer, 100000000)
> >     test_func(tupler, 100000000)

>
> Have you forgotten my function?
>
> Bye,
> bearophile

J. Cliff Dyer, Jul 9, 2009
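The script behind these last timings was not posted. A rough Python 3 sketch of how the three counting styles could be compared with the standard timeit module is shown below; the function names, the benchmark helper, and the repeat count are all assumptions made for illustration. As in the post above, it times only the counting itself and passes a plain iterator, so a __len__ shortcut never applies.

```python
import timeit

def count_with_tuple(iterable):
    return len(tuple(iterable))

def count_with_sum(iterable):
    return sum(1 for _ in iterable)

def count_with_loop(iterable):
    # The loop from leniter, without the __len__ check.
    n = 0
    for _ in iterable:
        n += 1
    return n

def benchmark(n, repeats=5):
    """Return the best-of-repeats time in seconds for each counter."""
    results = {}
    for func in (count_with_tuple, count_with_sum, count_with_loop):
        # Sanity check: every variant must agree on the count.
        assert func(iter(range(n))) == n
        times = timeit.repeat(lambda: func(iter(range(n))),
                              number=1, repeat=repeats)
        results[func.__name__] = min(times)
    return results

if __name__ == "__main__":
    for n in (5, 5000, 500000):
        for name, seconds in benchmark(n).items():
            print("%s:%d %.6f" % (name, n, seconds))
```

Taking the minimum of several repeats reduces noise from other processes. Absolute numbers will differ from the 2009 figures, and the ranking may too, so the results are worth measuring on the interpreter actually in use.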