How to count lines in a text file?

Ling Lee

Hi all.

I'm trying to write a program that:
1) Asks me which file I want to count the lines in, then counts the
lines and writes the answer out.

2) I made the first part like this:

in_file = raw_input("What is the name of the file you want to open: ")
in_file = open("test.txt","r")
text = in_file.read()

3) I think that I have to use a for loop (something like: for line in text:
count += 1).
Or maybe I have to create a def, something like def loop(line,
count), but I'm not sure how to do this properly.
And then perhaps use the readlines() function, but again I'm not quite sure how
to do this. So does one of you have a good idea?

Thanks for all help
 
Ling Lee

Oh I just did it.

Just used the line:

print "%d lines in your chosen file" % len(open("test.txt").readlines())

Thanks though :)
 
Phil Frost

Yes, you need a for loop, and a count variable. You can count in several
ways. File objects are iterable, and they iterate over the lines in the
file. readlines() returns a list of the lines, which will have the same
effect, but because it builds the entire list in memory first, it uses
more memory. Example:

########

filename = raw_input('file? ')
file = open(filename)

lines = 0
for line in file:
    # line is ignored here, but it contains each line of the file,
    # including the newline
    lines += 1

print '%r has %r lines' % (filename, lines)

########

Another alternative is to use the standard POSIX program "wc" with the
-l option, but this isn't Python.
 
Alex Martelli

Ling Lee said:
Oh I just did it.

Just used the line:

print "%d lines in your chosen file" % len(open("test.txt").readlines())

Thanks though :)

You're welcome;-). However, this approach reads all of the file into
memory at once. If you must be able to deal with humungous files, too
big to fit in memory at once, try something like:

numlines = 0
for line in open('text.txt'): numlines += 1


Alex
 
Roland Heiber

Ling said:
Hi all.

I'm trying to write a program that:
1) Ask me what file I want to count number of lines in, and then counts the
lines and writes the answear out.

2) I made the first part like this:

in_file = raw_input("What is the name of the file you want to open: ")
in_file = open("test.txt","r")
text = in_file.read()

3) I think that I have to use a for loop ( something like: for line in text:
count +=1)
Or maybee I have to do create a def: something like: ( def loop(line,
count)), but not sure how to do this properly.
And then perhaps use the readlines() function, but again not quite sure how
to do this. So do one of you have a good idea.

Thanks for all help

text = in_file.readlines()
print len(text)

HtH, Roland
 
Ling Lee

Thanks for your replies :)

I just ran the program with a different file name, and it only counts the
number of lines in the file named test.txt. I'll give it another try
with your input...

Thanks again... for the fast reply... Hope I get it right this time :)
 
Erik Heneryd

Phil said:
another alternative is to use the standard posix program "wc" with the
-l option, but this isn't Python.

Not the same thing. wc -l counts newline bytes, not "real" lines.
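
To make the distinction concrete, here is a minimal sketch. It is written in Python 3 syntax (the thread itself uses Python 2), and the two helper names are made up for illustration. It counts the same bytes both ways: once by counting newline bytes, as wc -l does, and once by iterating over lines, as a file object does.

```python
import io

def count_newline_bytes(data):
    # Count newline bytes, which is what `wc -l` reports.
    return data.count(b"\n")

def count_iterated_lines(data):
    # Count the lines a file-like object yields when iterated over.
    return sum(1 for _ in io.BytesIO(data))

with_newline = b"one\ntwo\n"
without_newline = b"one\ntwo"   # the last line has no terminating newline

for sample in (with_newline, without_newline):
    print(repr(sample),
          "newline bytes:", count_newline_bytes(sample),
          "iterated lines:", count_iterated_lines(sample))
```

The two counts agree when the data ends with a newline and differ by one when the final line is unterminated.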


Erik
 
Brian van den Broek

Ling Lee said unto the world upon 2004-09-20 09:36:
Thanks for your replies :)

I just ran the program with a different file name, and it only counts the
number of lines in the file named test.txt. I'll give it another try
with your input...

Thanks again... for the fast reply... Hope I get it right this time :)

Hi Ling Lee,

you've got:

in_file = raw_input("What is the name of the file you want to open: ")
in_file = open("test.txt","r")

What this does is take the user input and assign it to the name "in_file",
and then promptly reassign the name "in_file" to the result of
open("test.txt","r").

So, you never make use of the input, and keep asking it to open test.txt
instead.

Try something like:

in_file_name = raw_input("What is the file you want to open: ")
in_file = open(in_file_name,"r")

Also, and I say this as a fellow newbie, you might want to check out the
Tutor list: <http://mail.python.org/pipermail/tutor/>

HTH,

Brian vdB
 
Andrew Dalke

Ling said:
2) I made the first part like this:

in_file = raw_input("What is the name of the file you want to open: ")
in_file = open("test.txt","r")
text = in_file.read()

You have two different objects related to the file.
One is the filename (the result of calling raw_input) and
the other is the file handle (the result of calling open).
You are using the same variable name for both of them. You
really should make them different.

First you get the file name and reference it by the variable
named 'in_file'. Next you use another filename ("test.txt")
for the open call. This returns a file handle, but not
a file handle to the file named in 'in_file'.

You then change things so that 'in_file' no longer refers
to the filename but now refers to the file handle.

A nicer solution is to use one variable name for the name
(like "in_filename") and another for the handle (you can
keep "in_file" if you want to). In the following I
reformatted it so the example fits in under 80 columns:

in_filename = raw_input("What is the name of the file "
                        "you want to open: ")
in_file = open(in_filename,"r")
text = in_file.read()


Now the in_file.read() reads all of the file into memory. There
are several ways to count the number of lines. The first is
to count the number of newline characters. Because the newline
character is special, it's most often written as what's called
an escape code. In this case, "\n". Others are backspace ("\b"),
bell ("\a"), and backslash ("\\"), since otherwise there's
no way to get the single character "\".

Here's how to count the number of newlines in the text:

num_lines = text.count("\n")

print "There are", num_lines, "lines in", in_filename


This will work for almost every file except for one where
the last line doesn't end with a newline. It's rare, but
it does happen. To fix that you need to see if the
text ends with a newline and if it doesn't then add one
more to the count


num_lines = text.count("\n")
# an empty file has no lines, so only adjust non-empty text
if text and not text.endswith("\n"):
    num_lines = num_lines + 1

print "There are", num_lines, "lines in", in_filename

3) I think that I have to use a for loop ( something like
for line in text: count +=1)

Something like that will work. When you say "for xxxx in string"
it loops through every character in the string, and not
every line. What you need is some way to get the lines.

One solution is to use the 'splitlines' method of strings.
This knows how to deal with the "final line doesn't end with
a newline" case and return a list of all the lines. You
can use it like this

count = 0
for line in text.splitlines():
    count = count + 1

or, since splitlines() returns a list of lines you can
also do

count = len(text.splitlines())

It turns out that reading lines from a file is very common.
When you say "for xxx in file" it loops through every line
in the file. This is not a list so you can't say

len(open(in_filename, "r")) # DOES NOT WORK

instead you need to have the explicit loop, like this

count = 0
for line in open(in_filename, "r"):
    count = count + 1

An advantage to this approach is that it doesn't read
the whole file into memory. That's only a problem
if you have a large file. Try counting the number of
lines in a 1.5 GB file!
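
For very large files, another way to keep memory use bounded is to read fixed-size blocks and count the newline bytes in each one. The following is only a sketch under stated assumptions: it uses Python 3 syntax (the thread is Python 2), the function name and the 1 MB default block size are arbitrary choices made for illustration, and it counts newline characters, so an unterminated last line is not included.

```python
import os
import tempfile

def count_newlines_chunked(path, chunk_size=1024 * 1024):
    # Read the file in binary blocks of at most chunk_size bytes,
    # so memory use does not grow with the size of the file.
    count = 0
    with open(path, "rb") as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break  # end of file
            count += chunk.count(b"\n")
    return count

# Small demonstration on a temporary file with three terminated lines.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as out:
    out.write(b"a\nb\nc\n")
print(count_newlines_chunked(demo_path, chunk_size=2))
os.remove(demo_path)
```

The tiny chunk_size in the demonstration is there only to exercise the loop across several reads.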

By the way, "r" is the default mode for open.
Most people omit it from the parameter list and just use

open(in_filename)

Hope this helped!

By the way, you might want to look at the "Beginner's
Guide to Python" page at http://python.org/topics/learn/ .
It has pointers to resources that might help, including
the tutor mailing list meant for people like you who
are learning to program in Python.

Andrew
 
Christos TZOTZIOY Georgiou

Oh I just did it.

Just used the line:

print "%d lines in your chosen file" % len(open("test.txt").readlines())

Thanks though :)
[Alex]
You're welcome;-). However, this approach reads all of the file into
memory at once. If you must be able to deal with humungous files, too
big to fit in memory at once, try something like:

numlines = 0
for line in open('text.txt'): numlines += 1

And a short story of premature optimisation follows...

Saw the plain code above and instantly the programmer's instinct of
optimisation came into action... we all know that C loops are faster
than python loops, right? So I spent 2 minutes of my time to write the
following 'clever' function:

def count_lines(filename):
    fp = open(filename)
    count = 1 + max(enumerate(fp))[0]
    fp.close()
    return count

Proud of my programming skills, I timed it against another function
containing Alex's code. Guess what? My code was slower... (and I should
put a try: except ValueError: clause to cater for empty files)

Of course, on second thought, the reason must be that enumerate
generates one tuple for every line in the file; in any case, I'll mark
this rule:

C loops are *always* faster than python loops, unless the loop does
something useful ;-) in the latter case, timeit.py is your friend.
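
The empty-file case mentioned above could be handled roughly as follows. This is a sketch in Python 3 syntax (the thread is Python 2), not a claim about how the function was actually revised; the name count_lines_safe is made up for illustration.

```python
import os
import tempfile

def count_lines_safe(filename):
    # enumerate() yields (index, line) pairs; the largest pair carries the
    # index of the last line, so adding 1 gives the number of lines.
    with open(filename) as fp:
        try:
            return 1 + max(enumerate(fp))[0]
        except ValueError:
            # max() raises ValueError when given an empty sequence,
            # which is what an empty file produces.
            return 0

# Demonstration on two temporary files: one empty, one with three lines.
for content in ("", "a\nb\nc\n"):
    fd, path = tempfile.mkstemp(text=True)
    with os.fdopen(fd, "w") as out:
        out.write(content)
    print(repr(content), "->", count_lines_safe(path))
    os.remove(path)
```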
 
Alex Martelli

Christos TZOTZIOY Georgiou said:
And a short story of premature optimisation follows...

Thanks for sharing!

def count_lines(filename):
    fp = open(filename)
    count = 1 + max(enumerate(fp))[0]
    fp.close()
    return count

Cute, actually!

containing Alex' code. Guess what? My code was slower... (and I should
put a try: except ValueError: clause to cater for empty files)

Of course, on second thought, the reason must be that enumerate
generates one tuple for every line in the file; in any case, I'll mark

I thought built-ins could recycle their tuples, sometimes, but you may
in fact be right (we should check with Raymond Hettinger, though).

With 2.4, I measure 30 msec with your approach, and 24 with mine, to
count the 45425 lines of /usr/share/dict/words on my Linux box
(admittedly not a great example of 'humungous file'); and similarly
kjv.txt, a King James' Bible (31103 lines, but 10 times the size of the
words file), 41 with yours, 36 with mine. They're pretty close. At
least they beat len(file(...).readlines()), which takes 33 on words, 62
on kjv.txt...

If one is really in a hurry counting lines, a dedicated C extension
might help. E.g.:

static PyObject *count(PyObject *self, PyObject *args)
{
    PyObject* seq;
    PyObject* item;
    int result;

    /* get one argument as an iterator */
    if(!PyArg_ParseTuple(args, "O", &seq))
        return 0;
    seq = PyObject_GetIter(seq);
    if(!seq)
        return 0;

    /* count items */
    result = 0;
    while((item=PyIter_Next(seq))) {
        result += 1;
        Py_DECREF(item);
    }

    /* clean up and return result */
    Py_DECREF(seq);
    return Py_BuildValue("i", result);
}

Using this count-items-in-iterable thingy, words takes 10 msec, kjv
takes 26.

Happier news is that one does NOT have to learn C to gain this.
Consider the Pyrex file:

def count(seq):
    cdef int i
    it = iter(seq)
    i = 0
    for x in it:
        i = i + 1
    return i

pyrexc'ing this and building the Python extension from the resulting C
file gives just about the same performance as the pure-C coding: 10 msec
on words, 26 on kjv, within 1% of the pure-C version (there is a
systematic speedup of a bit less than 1% for the C-coded function).

And if one doesn't even want to use pyrex? Why, that's what psyco is
for...:

import psyco
def count(seq):
    it = iter(seq)
    i = 0
    for x in it:
        i = i + 1
    return i
# bind the function itself, not its parameter name
psyco.bind(count)

Again to the same level of precision, the SAME numbers, 10 and 26 msec
(actually, in this case the less-than-1% systematic bias is in favour of
psyco compared to pure-C coding...!-)


So: your instinct that C-coded loops are faster wasn't too badly off...
and you can get the same performance (just about) with Pyrex or (on an
intel or compatible processor, only -- sigh) with psyco.


Alex
 
Bengt Richter

You're welcome;-). However, this approach reads all of the file into
memory at once. If you must be able to deal with humungous files, too
big to fit in memory at once, try something like:

numlines = 0
for line in open('text.txt'): numlines += 1

I don't have 2.4, but how would that compare with a generator expression like (untested)

sum(1 for line in open('text.txt'))

or, if you _are_ willing to read in the whole file,

open('text.txt').read().count('\n')

Regards,
Bengt Richter
 
Alex Martelli

Bengt Richter said:
I don't have 2.4

2.4a3 is freely available for download and everybody's _encouraged_ to
download it and try it out -- come on, don't be the last one to!-)

but how would that compare with a generator expression like (untested)

sum(1 for line in open('text.txt'))

or, if you _are_ willing to read in the whole file,

open('text.txt').read().count('\n')

I'm not on the same machine as when I ran the other timing measurements
(including pyrex &c) but here's the results on this one machine...:

$ wc /usr/share/dict/words
234937 234937 2486825 /usr/share/dict/words
$ python2.4 ~/cb/timeit.py "numlines=0
for line in file('/usr/share/dict/words'): numlines+=1"
10 loops, best of 3: 3.08e+05 usec per loop
$ python2.4 ~/cb/timeit.py "file('/usr/share/dict/words').read().count('\n')"
10 loops, best of 3: 2.72e+05 usec per loop
$ python2.4 ~/cb/timeit.py "len(file('/usr/share/dict/words').readlines())"
10 loops, best of 3: 3.25e+05 usec per loop
$ python2.4 ~/cb/timeit.py "sum(1 for line in file('/usr/share/dict/words'))"
10 loops, best of 3: 4.42e+05 usec per loop

Last but not least...:

$ python2.4 ~/cb/timeit.py -s'import cou' "cou.cou(file('/usr/share/dict/words'))"
10 loops, best of 3: 2.05e+05 usec per loop

where cou.pyx is the pyrex program I've already shown on the other
subthread. Using the count.c I've also shown takes 2.03e+05 usec.
(Can't try psyco here, not an intel-like cpu).


Summary: "sum(1 for ...)" is no speed demon; the plain loop is best
among the pure-python approaches for files that can't fit in memory. If
the file DOES fit in memory, read().count('\n') is faster, but
len(...readlines()) is slower. Pyrex rocks, essentially removing the
need for C-coded extensions (less than a 1% advantage) -- and so does
psyco, but not if you're using a Mac (quick, somebody gift Armin Rigo
with a Mac before it's too late...!!!).
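
For anyone wanting to repeat this kind of comparison on their own machine, here is a rough harness built on the standard timeit module. It is a sketch only: it uses Python 3 syntax (the measurements above were made with Python 2.4), the function names and the generated sample file are made up for illustration, and the numbers it prints will depend on the interpreter and hardware, so they say nothing about the figures quoted above.

```python
import os
import tempfile
import timeit

def loop_count(path):
    # Plain for loop over the file object.
    numlines = 0
    with open(path) as fp:
        for line in fp:
            numlines += 1
    return numlines

def genexp_count(path):
    # Generator expression fed to sum().
    with open(path) as fp:
        return sum(1 for line in fp)

def readlines_count(path):
    # Builds the full list of lines in memory.
    with open(path) as fp:
        return len(fp.readlines())

def read_count(path):
    # Reads the whole file as one string and counts newline characters.
    with open(path) as fp:
        return fp.read().count("\n")

APPROACHES = [loop_count, genexp_count, readlines_count, read_count]

def time_all(path, number=10, repeat=3):
    # For each approach, keep the best of `repeat` runs, where each run
    # calls the function `number` times; report seconds per call.
    results = {}
    for func in APPROACHES:
        timings = timeit.repeat(lambda: func(path), number=number, repeat=repeat)
        results[func.__name__] = min(timings) / number
    return results

# Demonstration on a generated file of 50000 short lines.
fd, sample_path = tempfile.mkstemp(text=True)
with os.fdopen(fd, "w") as out:
    out.write("word\n" * 50000)
for name, seconds in time_all(sample_path).items():
    print("%-16s %.6f s per call" % (name, seconds))
os.remove(sample_path)
```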


Alex
 
Andrew Dalke

Bengt said:
or, if you _are_ willing to read in the whole file,

open('text.txt').read().count('\n')

Except the last line might not have a terminal newline.

Andrew
 
Andrew Dalke

Alex said:
If one is really in a hurry counting lines, a dedicated C extension
might help. E.g.:

static PyObject *count(PyObject *self, PyObject *args) ...
Using this count-items-in-iterable thingy

There's been a few times I've wanted a function like
this. I keep expecting that len(iterable) will work,
but of course it doesn't.

Would itertools.len(iterable) be useful? More likely
the name collision with len itself would be a problem,
so perhaps itertools.length(iterable).
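
A pure-Python stand-in for such a helper might look like the sketch below. The name length is hypothetical (no such function exists in itertools), the code is written in Python 3 syntax, and, like the C version, it consumes the iterator it is given.

```python
def length(iterable):
    # Count the items by exhausting the iterable; nothing is stored,
    # so this also works for inputs too large to hold in a list.
    return sum(1 for _ in iterable)

letters = iter("abc")
print(length(letters))   # counts the three items
print(length(letters))   # the iterator is now exhausted, so this counts none
```

The second call illustrates the risk of a consuming len(): once counted, the items are gone.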


BTW, I saw itertools.count and figured that might be
it. Nope. And don't try the following

:)

Andrew
 
Alex Martelli

Andrew Dalke said:
There's been a few times I've wanted a function like

Me too, that's why I wrote the C and Pyrex versions :).

this. I keep expecting that len(iterable) will work,
but of course it doesn't.

Yep -- it would probably be too risky to have len(...) consume a whole
iterator, beginning users wouldn't expect that and might get burnt.

Would itertools.len(iterable) be useful? More likely
the name collision with len itself would be a problem,
so perhaps itertools.length(iterable).

Unfortunately, itertools's functions are there to produce iterators, not
to consume them. I doubt Raymond Hettinger, itertools' guru, would
approve of changing that (though one could surely ask him, and if he
surprised me, I guess the change might get in).

There's currently no good single place for 'accumulators', i.e.
consumers of iterators which produce scalars or thereabouts -- sum, max,
and min, are built-ins; other useful accumulators can be found in heapq
(because they're implemented via a heap...)... and there's nowhere to
put the obviously needed "trivial" accumulators, such as average,
median, variance, count...

A "stats" module was proposed, but also shot down (presumably people
have more ambitious ideas about 'statistics' than these simple
accumulators, alas -- I'm not sure exactly what the problem was).
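
For what it's worth, later Python versions (3.4 and up) do ship a statistics module in the standard library that covers several of these accumulators. A small sketch, assuming Python 3.4 or newer and using made-up sample lines:

```python
import statistics

# Hypothetical sample data: the lengths of a few lines of text.
sample_lines = ["a\n", "bcd\n", "ef\n"]
line_lengths = [len(line) for line in sample_lines]

print("count:   ", len(line_lengths))
print("mean:    ", statistics.mean(line_lengths))
print("median:  ", statistics.median(line_lengths))
print("variance:", statistics.variance(line_lengths))  # sample variance
```

The data is collected into a list first because each of these functions would otherwise consume a one-shot iterator.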


Alex
 
Alex Martelli

Andrew Dalke said:
Except the last line might not have a terminal newline.

....and wc would then not count that non-line as a line, so why should
we...? Witness...:

$ echo -n 'bu'>em
$ wc em
0 1 2 em

zero lines, one word, two characters: seems right to me.


Alex
 
Andrew Dalke

Alex said:
....and wc would then not count that non-line as a line, so why should
we...? Witness...:


'Cause that's what Python does. Witness:

% echo -n 'bu' | python -c \
? 'import sys; print len(sys.stdin.readlines())'
1

;)

Andrew
 