Read a file line by line with a maximum number of characters per line

Hugo

Hello,

I want to read a file line by line. I first used the readLine() method,
which returns a string, but if the line contains too many characters
I end up with an OutOfMemory exception. I could use the
read(buffer[], maxChars) method, but this method does not take the end
of the line into account, so my buffer could contain more than a
single line. I would like the benefit of both methods, meaning
a method which returns a string representing one line of the file (like
readLine()), but with a maximum number of characters per line (like
read(buffer[], maxChars)).

I tried to read the file character by character, searching for "\r\n"
and with a maximum number of method calls, but it takes too much time
to read the file.

Does anyone have an idea how to proceed?

Thanks a lot.

Hugo
 
Thomas Weidenfeller

Hugo said:
I tried to read the file character by character searching for "\r\n"
and with a maximum number of method calls but it takes too much time
to read the file.

That's unlikely. If readLine() was fast enough for you, char-by-char
reading should be fast enough for you, too, because this is what
readLine() does internally to find the line ends.

Hugo said:
Does anyone have an idea how to proceed?

I suggest you revise your code.

/Thomas
 
Matt Humphrey

Hugo said:
<snip>

There's an implicit contradiction in what you're asking for. You want to
process the data by lines, but some lines are too big to be processed.
You're going to have to give up one of these. Before you make that decision,
however, you should ask yourself a few questions.

1) Is the OutOfMemory exception really being caused by an input line that is
too large? Will such lines be common or expected and must your program
defend against them? Are the lines supposed to be less than a particular
length such that a very long one constitutes an invalid input file?

2) How important is it that your data be processed by lines? Are you
scanning for something in particular? or are you just counting lines as you
go? Is each line parsed independently or scanned for data? As in part 1,
will there never be valid data after a particular length?

3) You say that reading and searching is too slow, but are you using a
BufferedReader? Also, what do you mean by "slow"? Your tests that run
using readLine() simply fail with an exception, so perhaps the file is so
large that "slow" is normal.

I would guess that, realistically, you're going to have to give up the idea
of processing data by lines in order to protect your program from input
files that consist of 2.4 GB of data with no carriage returns at all.

To do this you have to change your input system so that it is not line
oriented but uses some other structure such as words or phrases.
You say you've tried but that it takes too much time to search for the
end of line. Consider this: the readLine() method must also search for
(stop at) the end of line, and if it can do it with reasonable performance,
so can you. The answer is probably in how you buffer the data. I would
recommend you look at a design centered on reading (buffering) a large chunk
and tokenizing it according to whatever you're looking for. This tokenizer
would refill the buffer when it gets low and handle the two unpleasant cases
of a line (or whatever you're looking for) either spanning multiple blocks
or there being several within one block. It may be possible for your
tokenizer to simply read a character at a time from a BufferedReader and
for you to scan for what you're looking for.
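
The block-buffering design described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the class name ChunkLineTokenizer, the method forEachLine, and the blockSize and maxChars parameters are hypothetical, and cutting an over-long line at maxChars (rather than rejecting the file) is an assumption about what the caller wants.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

/** Hypothetical sketch: scans input block by block and hands out capped lines. */
final class ChunkLineTokenizer {

    /**
     * Reads blocks of blockSize chars, splits them on "\n", "\r" or "\r\n",
     * and passes each line (cut at maxChars) to sink.
     */
    static void forEachLine(Reader in, int blockSize, int maxChars,
                            Consumer<String> sink) throws IOException {
        char[] block = new char[blockSize];
        StringBuilder line = new StringBuilder();
        boolean pending = false; // characters seen since the last terminator
        boolean skipLF = false;  // previous char was '\r', possibly in the previous block
        int n;
        while ((n = in.read(block, 0, block.length)) != -1) {
            for (int i = 0; i < n; i++) {
                char c = block[i];
                if (skipLF) {
                    skipLF = false;
                    if (c == '\n') {
                        continue; // second half of "\r\n"
                    }
                }
                if (c == '\r' || c == '\n') {
                    sink.accept(line.toString());
                    line.setLength(0);
                    pending = false;
                    skipLF = (c == '\r');
                } else {
                    pending = true;
                    if (line.length() < maxChars) {
                        line.append(c); // characters past the cap are dropped
                    }
                }
            }
        }
        if (pending) {
            sink.accept(line.toString()); // last line without a terminator
        }
    }
}
```

Because the line buffer lives outside the block loop, a line that spans several blocks is accumulated across them, and a block holding several lines produces several calls to sink. Memory use is bounded by blockSize plus maxChars regardless of how long a line is.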

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
Will Hartung

Thomas Weidenfeller said:
That's unlikely. If readLine() was fast enough for you, char-by-char
reading should be fast enough for you, too. Because this is what
readLine() does internally to find the line ends.

Really?

I would think since they're using a buffered reader, they'd load blocks of
data in big gulps and then scan it. That's what I would do.

Regards,

Will Hartung
(e-mail address removed)
 
Steve Horsley

Matt Humphrey wrote:

<lots of good advice snipped>

I would like to add: If you are looking for line endings,
remember that BufferedReader.readLine accepts line endings
of any of the following sequences:
"\n"
"\r"
"\r\n"

I advise that you try and emulate this.
It will save you much grief one day.

Steve
 
Thomas Weidenfeller

Will said:
Really?

I suggest you read the source code.

Will said:
I would think since they're using a buffered reader, they'd load blocks of
data in big gulps and then scan it.

They read one char after the other from the buffer, check it, and run a
small state machine to handle \r\n.
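
A minimal sketch of that kind of state machine, combined with the length cap Hugo asked for. This is not the JDK source; the class BoundedLineReader and its maxChars parameter are hypothetical names, and silently truncating a long line is an assumed policy.

```java
import java.io.IOException;
import java.io.Reader;

/** Hypothetical helper: like readLine(), but keeps at most maxChars per line. */
final class BoundedLineReader {
    private final Reader in;      // pass a BufferedReader so read() is cheap
    private final int maxChars;
    private boolean skipLF = false; // set after '\r': a following '\n' is part of it

    BoundedLineReader(Reader in, int maxChars) {
        this.in = in;
        this.maxChars = maxChars;
    }

    /** Returns the next line, truncated to maxChars, or null at end of input. */
    String readLine() throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean sawAnything = false;
        int c;
        while ((c = in.read()) != -1) {
            if (skipLF) {
                skipLF = false;
                if (c == '\n') {
                    continue; // second half of "\r\n", belongs to the previous line
                }
            }
            sawAnything = true;
            if (c == '\r') {
                skipLF = true;
                return sb.toString();
            }
            if (c == '\n') {
                return sb.toString();
            }
            if (sb.length() < maxChars) {
                sb.append((char) c); // extra characters are read but not stored
            }
        }
        return sawAnything ? sb.toString() : null;
    }
}
```

Remembering the '\r' in a flag avoids the mark()/reset() calls, and the StringBuilder never grows beyond maxChars characters, so a line of any length costs a bounded amount of memory.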

/Thomas
 
Hugo

Matt Humphrey said:
<snip>

1) Is the OutOfMemory exception really being caused by an input line that is
too large? Will such lines be common or expected and must your program
defend against them? Are the lines supposed to be less than a particular
length such that a very long one constitutes an invalid input file?

When my file is about 3 MB on a single line, I get an OutOfMemory
error.
These large lines are not expected and do not correspond to normal
behaviour, but they may happen and I must protect my system against
them.

Matt Humphrey said:
2) How important is it that your data be processed by lines? Are you
scanning for something in particular? Or are you just counting lines as you
go? Is each line parsed independently or scanned for data? As in part 1,
will there never be valid data after a particular length?

It is important to process the file by line because the user may want to
look for a particular keyword at a particular position in the file
(column, line). Yes, each line read is sent to a scanner, one after the
other.

Matt Humphrey said:
3) You say that reading and searching is too slow, but are you using a
BufferedReader? Also, what do you mean by "slow"? Your tests that run
using readLine() simply fail with an exception, so perhaps the file is so
large that "slow" is normal.

Yes, I am using a BufferedReader. "Slow" means 45 minutes for a 3 MB
file when I read it char by char without using readLine()!

Here is the code I use to read my file char by char with a maximum
number of characters read:

private String readLineWithMaxSize(BufferedReader br) throws IOException {
    String finalLine = null;
    int readCharacter = -1;
    char[] lineChars = new char[204800];
    boolean bufferFull = false;
    if (br != null) {
        int index = 0;
        readCharacter = br.read();
        // If the read character does not correspond to a new line or to
        // an end of file, we treat it.
        while (readCharacter != -1 && readCharacter != '\r'
                && readCharacter != '\n') {
            // if the buffer is not full, we add the character to the
            // array of characters
            if (!bufferFull) {
                lineChars[index] = (char) readCharacter;
                index++;
                bufferFull = index >= lineChars.length;
            }
            readCharacter = br.read();
        }
        // If the read character is \r and the next one is \n, we skip it.
        if (readCharacter == '\r') {
            br.mark(2);
            int nextReadCharacter = br.read();
            if (nextReadCharacter != '\n') {
                br.reset();
            }
        }
        // We construct a string representing the line from the buffer of
        // characters read
        if (index != 0) {
            finalLine = new String(lineChars);
        } else if (readCharacter == '\r' || readCharacter == '\n') {
            finalLine = "";
        }
    }
    return finalLine;
}
 
Owen Jacobson

Steve Horsley said:
Matt Humphrey wrote:

<lots of good advice snipped>

I would like to add: If you are looking for line endings,
remember that BufferedReader.readLine accepts line endings
of any of the following sequences:
"\n"
"\r"
"\r\n"

I advise that you try and emulate this.
It will save you much grief one day.

(This is because the various platforms Java runs on haven't, historically,
agreed on a line terminator/separator.)

Just out of curiosity, and because I'm about to go to bed and therefore
don't want to start coding, how many lines is the pathological sequence:

"\r\r\n\r\r\n\n\r\r\n"?
()(??)()(??)()()(??) <-- helpful markers
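
One way to settle the question is to let BufferedReader.readLine() count. Under the rule quoted above, a lone "\r", a lone "\n" and a "\r\n" pair each end one line, so the sequence splits as \r | \r\n | \r | \r\n | \n | \r | \r\n, which matches the seven marker groups. The class and method names below are hypothetical, chosen only for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

/** Hypothetical helper that counts lines the way readLine() sees them. */
final class LineTerminatorCount {

    /** Returns how many lines BufferedReader.readLine() reports for text. */
    static int countLines(String text) throws IOException {
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            int count = 0;
            while (br.readLine() != null) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        // Seven terminators, so seven (empty) lines: prints 7
        System.out.println(countLines("\r\r\n\r\r\n\n\r\r\n"));
    }
}
```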
 
Matt Humphrey

Hugo said:
<snip>

I compiled your code and it ran fine for me. I wrote a program that creates
a test file with 1 short line, a line of 3.5 MB and a final short line. On my
1.7 GHz Windows 2000 machine with Java 1.4.2_03 and no special -Xmx memory
setting, your code above runs in less than a second. I wrote a
similar version based on StringBuffer that returns the complete 3.4 MB string
and it works perfectly fine also. Note that your code above has a serious
problem: every non-empty string it returns will be 204800 characters long.
You won't need many of these for your program to run out of memory.
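
A test along these lines could look like the sketch below. It is not the program used for the measurement above: the class LongLineTimingTest and its methods are hypothetical, it assumes Java 7 or later for java.nio.file, and the timing will of course vary by machine.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical timing harness for char-by-char reading of one very long line. */
final class LongLineTimingTest {

    /** Writes a short line, one line of longLineChars 'x' chars, and a final short line. */
    static Path writeTestFile(int longLineChars) throws IOException {
        Path file = Files.createTempFile("longline", ".txt");
        try (BufferedWriter w = Files.newBufferedWriter(file, StandardCharsets.US_ASCII)) {
            w.write("first\r\n");
            for (int i = 0; i < longLineChars; i++) {
                w.write('x');
            }
            w.write("\r\nlast\r\n");
        }
        return file;
    }

    /** Reads the file one character at a time and counts the '\n' characters. */
    static int countNewlines(Path file) throws IOException {
        int newlines = 0;
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.US_ASCII)) {
            int c;
            while ((c = br.read()) != -1) {
                if (c == '\n') {
                    newlines++;
                }
            }
        }
        return newlines;
    }

    public static void main(String[] args) throws IOException {
        Path file = writeTestFile(3_500_000);
        long start = System.nanoTime();
        int lines = countNewlines(file);
        long millis = (System.nanoTime() - start) / 1_000_000;
        System.out.println(lines + " lines read char by char in " + millis + " ms");
        Files.delete(file);
    }
}
```

If a run like this finishes quickly while the real program takes minutes, the time is probably being spent somewhere other than in the read() calls.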

As for the speed problem, I think it will be something with the file and the
OS rather than with Java.

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
Hugo

<snip>
<more snip>
<more more snip>

Thank you for your answer.
If my code works for you, it seems that I may have missed something in
the code which calls this method. I will check that. On the other
hand, I don't understand why you say that this method will always return
strings 204800 characters long. I mean, in the while loop, I check whether
the read character is an end-of-line or not. So my array of characters
is not always full. Is there something I don't understand here? If I
initialize my array at 204800 characters, does it mean the string I
will construct from it will contain 204800 characters, even if the
array is not full??

Thanks a lot for your answers.

Hugo.
 
Matt Humphrey

Hugo said:
<snip>

If I initialize my array at 204800 characters, does it mean the string I
will construct from it will contain 204800 characters, even if the
array is not full??

Arrays are fixed-length and always have the declared length, so
new String(lineChars) copies all 204800 elements, not just the ones you
filled in.
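
A small illustration of the difference, using the String(char[], int, int) constructor to copy only the filled part of the array. The class CharArrayString and the helper fromFilledPart are hypothetical names for this sketch; in the method posted earlier, the equivalent change would be to pass 0 and index along with lineChars.

```java
/** Hypothetical demo: whole-array copy versus copying only the filled prefix. */
final class CharArrayString {

    /** Builds a string from the first count characters of buffer only. */
    static String fromFilledPart(char[] buffer, int count) {
        return new String(buffer, 0, count);
    }

    public static void main(String[] args) {
        char[] lineChars = new char[204800];
        "abc".getChars(0, 3, lineChars, 0); // fill only the first three slots
        int index = 3;

        // The whole array is copied, unused '\u0000' slots included: prints 204800
        System.out.println(new String(lineChars).length());
        // Only the filled prefix is copied: prints 3
        System.out.println(fromFilledPart(lineChars, index).length());
    }
}
```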

Cheers,
Matt Humphrey (e-mail address removed) http://www.iviz.com/
 
