Find the number of lines in a text file

Chris Brat

Hi,

I need to find the total number of lines in a text file so that I can
skip the header and filler information and process just the body.

I've done some searching and can't find a class or method that does
this directly, and the solutions I've found require one of:

- reading each line of the file (using a BufferedReader) and
incrementing a line counter for each line, or
- using a LineNumberReader directly and reading the result of its
getLineNumber() method once the entire file has been read, or
- searching for the EOL characters and counting these, or
- using a RandomAccessFile, seeking to the end of the file and
dividing the total number of bytes by the number of bytes expected in
each line (I believe this relies on a guarantee that each line will have
the same number of characters).
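
For reference, the first option in the list can be sketched roughly as below. This is only an illustration: the class name LineCounter and the method countLines are made up for the example, and the method takes any Reader so that it can be tried without a file on disk.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

class LineCounter {
    /** Counts lines by reading them one at a time; the whole input is read once. */
    static int countLines(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            int count = 0;
            // readLine() returns null at end of input; a final line
            // without a line terminator is still returned as a line.
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            // Wrapped so callers need no checked-exception handling in this sketch.
            throw new UncheckedIOException(e);
        }
    }
}
```

For a real file the argument could be a FileReader. Whichever reader is used, every character of the input is still read.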

I don't like the idea of counting EOL characters or having to read the
entire file twice (once to get the number of lines and the
second time to do my actual processing).


Does anyone have another or better solution?

Thanks,
Chris
 
Ingo R. Homann

Hi,

Chris said:
Hi,

I need to find the total number of lines in a text file so that I can
skip the header and filler information and process just the body.

I've done some searching and can't find a class or method that does
this directly and the solutions I've found require either :

- reading each line of the file (using the BufferedReader) and
incrementing a line counter for each line, or
- using the LineNumberReader directly and using its result from its
getLineNumber() method once the entire file is read, or

Note that this is the same. (IIRC LineNumberReader internally does
exactly the same as your first suggestion)
- searching for the eol characters and counting these.

Note that this is also nearly the same - especially in 'runtime-complexity'.
- using a RandomAccessFile, seeking to the end of the file and
dividing the total number of bytes by the number of bytes expected in
each line (I believe this relies on a guarantee that each line will have
the same number of characters).

Of course this relies on this guarantee! Is it really guaranteed? If
yes, this is the best idea. BTW: AFAIK you do not need a RAF for that -
java.io.File has the method you need as well (length(), which returns
the file size in bytes).
I don't like the idea of counting EOL characters or having to read the
entire file twice (once to get the number of lines and the
second time to do my actual processing).
Does anyone have another or better solution?

Depends on what *exactly* you want to do, on the format of the file,
and on what you mean by "skip the header and filler information and
process just the body" - isn't it possible to do everything by reading
the file only once? (Do "headers" and "fillers" have a certain prefix? ...?)

Ciao,
Ingo
 
Andrew Thompson

Chris Brat wrote:
....
I don't like the idea of counting EOL characters or having to read the
entire file twice (once to get the number of lines and the
second time to do my actual processing). ...
Does anyone have another or better solution?

'Use a file-system that counts them for you'?

(Which is my way of saying: other tools that provide
a line count are doing something like "counting the EOLs"
internally, even if they might imply otherwise and obscure
the details.)

Note that if you *know* that further processing is required
on the file(s), it probably makes more sense to read the
lines into an array on the first pass.

(And a LineNumberReader or similar might be the best
way to sort out those EOLs.)
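
For comparison, reading everything on the first pass could look like the sketch below. It assumes the file fits comfortably in memory and is UTF-8 encoded (the default charset for Files.readAllLines); the class name InMemoryBody and both method names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class InMemoryBody {
    /** Returns the lines between the header and the footer (a view, not a copy). */
    static List<String> body(List<String> lines, int headerLines, int footerLines) {
        // Clamp both ends so that short files yield an empty body instead of an exception.
        int from = Math.min(headerLines, lines.size());
        int to = Math.max(from, lines.size() - footerLines);
        return lines.subList(from, to);
    }

    /** Reads the whole file into memory once, then trims header and footer. */
    static List<String> readBody(Path file, int headerLines, int footerLines) throws IOException {
        return body(Files.readAllLines(file), headerLines, footerLines);
    }
}
```

The cost is that every line is held in memory at once, which is the trade-off worth measuring before ruling it in or out.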

Andrew T.
 
Chris Brat

Hi Ingo,

I effectively want to skip a known number of lines immediately at the
beginning of a file and immediately at the end of it. The header is not
a problem to skip, but the footer is.

Sorry, I actually meant 'footer' and not 'filler'.

The scenario is: the user defines that the first 6 lines and the last 3
lines of the file are in an unknown format and must be ignored - this
means that they must not be checked for validity or processed.

Regards,
Chris
 
Ingo R. Homann

Hi,

Chris said:
The scenario is: the user defines that the first 6 lines and the last 3
lines of the file are in an unknown format and must be ignored - this
means that they must not be checked for validity or processed.

I think the simplest idea would be to buffer 3 lines...

Ciao,
Ingo
 
Chris Brat

Hi Andrew,
'Use a file-system that counts them for you'?
Unfortunately I do not maintain the environment and it is very possible
that the OS and everything associated with it may change in the future
without my knowledge.
Note that if you *know* that further processing is required
on the file(s), it probably makes more sense to read the
lines into an array on the first pass.
True - do you think this is a good idea with a file of 30 000+ lines
though?
I don't think the memory expense is worth the few extra seconds.

To be honest I was hoping that someone knew of an OSS lib (like commons
IO) or a method I didn't know of in the java.io package that already
did this.

Thanks for the input though.

Regards,
Chris
 
Ingo R. Homann

Hi,

Chris said:
To be honest I was hoping that someone knew of an OSS lib (like commons
IO) or a method I didn't know of in the java.io package that already
did this.

Well, if the filesystem/OS does not cache this information, how should a
library get the information without reading the whole file? It would have
to be 'magic'!

Ciao,
Ingo
 
bugbear

Chris said:
Hi Ingo,

I effectively want to skip a known number of lines immediately at the
beginning of a file and immediately at the end of it. The header is not
a problem to skip, but the footer is.

Sorry, I actually meant 'footer' and not 'filler'.

The scenario is: the user defines that the first 6 lines and the last 3
lines of the file are in an unknown format and must be ignored - this
means that they must not be checked for validity or processed.

In that case you certainly don't need to count the total
number of lines in the file.

Simply count the first 6 lines moving forward,
then seek to the end and count the last 3 lines backwards.

RandomAccessFile may be useful to you.

You now have the offsets within the file that define
your "valid zone".

Either work with these, or create
an IO decorator that presents that subset of the
file as a stream/reader object.
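
A rough sketch of how those two offsets might be found with RandomAccessFile follows. It rests on assumptions not stated in the thread: lines end in '\n' (which also covers "\r\n"), and the encoding is one where the byte 0x0A only ever means a line feed (ASCII, ISO-8859-1, UTF-8, but not UTF-16). The class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class BodyOffsets {
    /** Offset of the first byte after the first {@code lines} line feeds. */
    static long skipLinesForward(RandomAccessFile raf, int lines) throws IOException {
        raf.seek(0);
        int seen = 0;
        int b;
        while (seen < lines && (b = raf.read()) != -1) {
            if (b == '\n') seen++;
        }
        return raf.getFilePointer();
    }

    /** Offset at which the last {@code lines} lines begin, scanning back from the end. */
    static long skipLinesBackward(RandomAccessFile raf, int lines) throws IOException {
        long pos = raf.length();
        if (lines <= 0 || pos == 0) return pos;
        // A trailing '\n' terminates the last line rather than starting a new one.
        raf.seek(pos - 1);
        if (raf.read() == '\n') pos--;
        int seen = 0;
        while (pos > 0) {
            raf.seek(pos - 1);
            // Each '\n' found marks the start (at pos) of one more footer line.
            if (raf.read() == '\n' && ++seen == lines) return pos;
            pos--;
        }
        return 0; // the file has no more than `lines` lines
    }
}
```

The body is then the byte range from skipLinesForward(raf, 6) up to, but not including, skipLinesBackward(raf, 3). The single-byte reads are unbuffered and slow, which is tolerable here only because the header and footer are a handful of lines.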

BugBear
 
Simon

Chris said:
I effectively want to skip a known number of lines immediately at the
beginning of a file and immediately at the end of it. The header is not
a problem to skip, but the footer is.

Sorry, I actually meant 'footer' and not 'filler'.

The scenario is: the user defines that the first 6 lines and the last 3
lines of the file are in an unknown format and must be ignored - this
means that they must not be checked for validity or processed.

Use a queue, e.g. a List<String>:

1. Skip the header.
2. Create an empty queue to hold lines.
3. Read 3 lines into the queue.
4. As long as there are more lines:
   - read one line and append it to the queue,
   - take the first line out of the queue and process it.
5. Throw away the 3 remaining lines in the queue.
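
The steps above can be sketched in Java roughly as follows. This is an illustration, not code from the thread: the class name BodyLines, the method forEachBodyLine and its parameters are made up for the example, and an ArrayDeque stands in for the List<String> mentioned above.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.function.Consumer;

class BodyLines {
    /** Passes every line except the header and footer lines to {@code handler}. */
    static void forEachBodyLine(Iterator<String> lines, int headerLines, int footerLines,
                                Consumer<String> handler) {
        // Step 1: skip the header.
        for (int i = 0; i < headerLines && lines.hasNext(); i++) {
            lines.next();
        }
        // Steps 2 to 4: keep the most recent footerLines lines queued. Once the
        // queue holds more than that, its oldest entry cannot belong to the footer.
        Deque<String> pending = new ArrayDeque<>(footerLines + 1);
        while (lines.hasNext()) {
            pending.addLast(lines.next());
            if (pending.size() > footerLines) {
                handler.accept(pending.removeFirst());
            }
        }
        // Step 5: whatever is still queued is the footer and is simply dropped.
    }
}
```

With a file, the iterator could come from BufferedReader.lines().iterator(), so the file is read once and never more than footerLines + 1 lines are held in memory.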

Cheers,
Simon
 
Chris Brat

That's brilliant !!

Thanks Simon.
Use a queue, e.g. a List<String>:

1. Skip the header
2. Create a queue containing the lines.
3. Read 3 lines into the queue
4. As long as there are more lines
- read one line and append it to the queue.
- take the first line out of the queue and process it
5. Throw away the 3 remaining lines in the queue.

Cheers,
Simon
 
Andrew Thompson

Chris Brat wrote:
....
True - do you think this is a good idea with a file of 30 000+ lines
though?

I would need to run some tests (as I suggest
you do, since I do not 'need to know'*)

* For this current environment, in which I have no need to
parse text files of such length.
I don't think the memory expense is worth the few extra seconds.

The results might surprise you (they might not,
as well). In situations as fundamental as this,
it pays to do a quick test, though.

OTOH - Ingo raised some interesting points re. the
file format. There might be some significant 'cheating'
you can do if the files are of 'fixed line length'.

Andrew T.
 
EJP

File file = ...;
LineNumberReader lnr = new LineNumberReader(new FileReader(file));
// skip() still reads every character it skips, counting line terminators as it goes
lnr.skip(file.length()-1);
int lines = lnr.getLineNumber();
lnr.close();
 
Ingo R. Homann

Hi,
File file = ...;
LineNumberReader lnr = new LineNumberReader(new FileReader(file));
lnr.skip(file.length()-1);
int lines = lnr.getLineNumber();

Bad idea - that's exactly what Chris wanted to avoid! (Or what do you
think this code does internally?)

Ciao,
Ingo
 
EJP

Ingo said:
Bad idea - that's exactly what Chris wanted to avoid! (Or what do you
think this code does internally?)

I don't think he *can* avoid it actually, and thanks, I know exactly what
the code does internally too.
 
Ingo R. Homann

Hi EJP,
I don't think he *can* avoid it actually, and thanks, I know exactly what
the code does internally too.

Then, I think it would be a good idea to tell this to the OP, because I
imagine that he does not know exactly what the code does internally.

I think he might find it a very interesting idea and will give it a try
just to find out that it is a bad idea and that it is exactly what he
wanted to avoid. ;-)

Ciao,
Ingo
 
Chris Brat

Ingo,

I was asking for other possibly better solutions because none of those
that I found myself seemed like the best way to do it.

Please don't make comments on my behalf - I appreciate any suggestions
by contributors.

EJP, I tested your solution and it gives a 300 ms performance
improvement on a 40 MB file.

Regards,
Chris
 
Ingo R. Homann

Hi,

Chris said:
I was asking for other possibly better solutions...
EJP, I tested your solution and it gives [no real] performance
improvement on a 40 MB file.

Well, internally, it does *exactly* what you wanted to avoid.
Without asking anyone and without testing anything, just by thinking
a bit about the problem, I can tell you:

THERE IS NO WAY TO GET THE NUMBER OF LINES IN A FILE WITHOUT
READING THE WHOLE FILE.

Sorry for shouting, but that is a fact.

However, your (*other*) problem (reading a file only once, but skipping
the last three lines) can be solved another way, as Simon and I mentioned.

Ciao,
Ingo
 
Michael Rauscher

Hi Ingo ;)
EJP, I tested your solution and it gives [no real] performance
improvement on a 40 MB file.

Well, internally, it does *exactly* what you wanted to avoid.
Without asking anyone and without testing anything, just by thinking
a bit about the problem, I can tell you:

THERE IS NO WAY TO GET THE NUMBER OF LINES IN A FILE WITHOUT
READING THE WHOLE FILE.

Sorry for shouting, but that is a fact.

No reason to shout. What happened is just what you had already predicted:

<quote>
I think he might find it a very interesting idea and will give it a try
just to find out that it is a bad idea and that it is exactly what he
wanted to avoid. ;-)
</quote>

LOL
Michael
 
Tor Iver Wilhelmsen

Ingo R. Homann said:
THERE IS NO WAY TO GET THE NUMBER OF LINES IN A FILE WITHOUT
READING THE WHOLE FILE.

Exception: if it is known that the file has a fixed line (record) size in
bytes, and the line separator is known, then the number of lines =
file.length()/(recordSize+separatorSize)
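
As a small worked sketch of that arithmetic (the class and method names are hypothetical, and the rounding is a slight variation on the formula so that a final record with no trailing separator is still counted):

```java
class FixedRecordCount {
    /**
     * Number of fixed-size records in a file of the given size.
     * Assumes every record is exactly recordSizeBytes long and is followed
     * by a separator of separatorSizeBytes, except possibly the last one.
     */
    static long lineCount(long fileSizeBytes, int recordSizeBytes, int separatorSizeBytes) {
        long stride = (long) recordSizeBytes + separatorSizeBytes;
        // Adding one separator before dividing makes both layouts
        // (with or without a final separator) give the same count.
        return (fileSizeBytes + separatorSizeBytes) / stride;
    }
}
```

For example, 80-byte records separated by "\r\n" give a stride of 82 bytes, so a file of 8200 bytes holds 100 lines, and so does one of 8198 bytes whose last line has no separator. The file size itself can be taken from java.io.File.length() without reading the contents.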
 
