Read a file line by line with a maximum number of characters per line

Discussion in 'Java' started by Hugo, Oct 14, 2004.

  1. Hugo

    Hugo Guest

    Hello,

    I want to read a file line by line. I first used the readLine() method,
    which returns a String, but if a line contains too many characters
    I end up with an OutOfMemoryError. I could use the
    read(buffer[], maxChars) method, but it does not take the end of the
    line into account, so my buffer could contain more than a
    single line. I would like the benefit of both: a method that returns
    a String representing one line of the file (like
    readLine()), but with a maximum number of characters per line (like
    read(buffer[], maxChars)).

    I tried reading the file character by character, searching for "\r\n"
    and with a maximum number of method calls, but it takes too much time
    to read the file.

    Does anyone have an idea how to proceed?

    Thanks a lot.

    Hugo
     
    Hugo, Oct 14, 2004
    #1

  2. Re: Read a file line by line with a maximum number of characters per line

    Hugo wrote:
    > I tried to read the file character by character searching for "\r\n"
    > and with a maximum number of method calls but it takes too much time
    > to read the file.


    That's unlikely. If readLine() was fast enough for you, char-by-char
    reading should be fast enough for you, too, because this is what
    readLine() does internally to find the line ends.

    > Could someone have an idea how to proceed ?


    I suggest you revise your code.

    /Thomas
     
    Thomas Weidenfeller, Oct 14, 2004
    #2

  3. "Hugo" <> wrote in message
    news:...
    > Hello,
    >
    > I want to read a file line by line. I first used the readLine() method
    > which returns a string, but if the line contains too much characters
    > I'm ending with an OutOfMemory exception. I could use the
    > read(buffer[], maxChars) method, but this method does not take in
    > account the end of the line. So my buffer could contain more than a
    > single line. I would like to use benefit of both methods, meaning
    > using a method which return a string representing a file's line (like
    > readLine()), but with a maximum characters per line (like
    > read(buffer[], maxChars)).
    >
    > I tried to read the file character by character searching for "\r\n"
    > and with a maximum number of method calls but it takes too much time
    > to read the file.
    >
    > Could someone have an idea how to proceed ?


    There's an implicit contradiction in what you're asking for. You want to
    process the data by lines, but some lines are too big to be processed.
    You're going to have to give up one of these. Before you make that decision,
    however, you should ask yourself a couple of questions.

    1) Is the OutOfMemory exception really being caused by an input line that is
    too large? Will such lines be common or expected and must your program
    defend against them? Are the lines supposed to be less than a particular
    length such that a very long one constitutes an invalid input file?

    2) How important is it that your data be processed by lines? Are you
    scanning for something in particular? or are you just counting lines as you
    go? Is each line parsed independently or scanned for data? As in part 1,
    will there never be valid data after a particular length?

    3) You say that reading and searching is too slow, but are you using a
    BufferedReader? Also, what do you mean by "slow"? Your tests that run
    using readLine() simply fail with an exception, so perhaps the file is so
    large that "slow" is normal.

    I would guess that, realistically, you're going to have to give up the idea
    of processing data by lines in order to protect your program from input
    files that consist of 2.4 GB of data with no carriage returns at all.

    To do this you have to change your input system so that it is not line
    oriented but uses some other structure such as words or phrases,
    etc. You say you've tried but that it takes too much time to search for the
    end of line. Consider this: the readLine() method must also search for (and
    stop at) the end of line, and if it can do it with reasonable performance,
    so can you. The answer is probably in how you buffer the data. I would
    recommend you look at a design centered on reading (buffering) a large chunk
    and tokenizing it according to whatever you're looking for. This tokenizer
    would refill the buffer when it gets low and handle the two unpleasant cases
    of a line (or whatever you're looking for) either spanning multiple blocks
    or there being several within one block. It may be possible for your
    tokenizer to simply read a character at a time from a BufferedReader and
    for you to scan for what you're looking for.
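
The block-buffered tokenizer described above can be sketched as follows. This is only an illustration in present-day Java, not code from the thread: the class name `ChunkedLineReader`, the helper `splitLines`, the 8192-char block size and the use of `StringBuilder` are all assumptions. It refills a fixed buffer when the buffer runs out, so a line may span several blocks and several lines may sit in one block, and it keeps at most `maxChars` characters of each line while still consuming the rest of the line.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: splits a Reader into lines, keeping at most maxChars per line. */
class ChunkedLineReader {
    private final Reader in;
    private final char[] chunk = new char[8192]; // block size is an arbitrary choice
    private int pos = 0;                          // next unread index in chunk
    private int limit = 0;                        // number of valid chars in chunk
    private boolean skipLF = false;               // true right after a '\r' ended a line

    ChunkedLineReader(Reader in) {
        this.in = in;
    }

    /** Returns the next line, truncated to maxChars, or null at end of input. */
    String readLine(int maxChars) throws IOException {
        StringBuilder line = new StringBuilder();
        boolean sawAnything = false;
        while (true) {
            if (pos == limit) {                   // buffer used up: refill it
                limit = in.read(chunk, 0, chunk.length);
                pos = 0;
                if (limit <= 0) {                 // end of input
                    limit = 0;
                    return sawAnything ? line.toString() : null;
                }
            }
            char c = chunk[pos++];
            if (skipLF) {
                skipLF = false;
                if (c == '\n') {
                    continue;                     // second half of a "\r\n" pair
                }
            }
            sawAnything = true;
            if (c == '\n') {
                return line.toString();
            }
            if (c == '\r') {
                skipLF = true;                    // a following '\n' belongs to this line end
                return line.toString();
            }
            if (line.length() < maxChars) {
                line.append(c);                   // characters past the cap are dropped
            }
        }
    }

    /** Convenience for in-memory text: collects every (truncated) line into a list. */
    static List<String> splitLines(String text, int maxChars) {
        ChunkedLineReader r = new ChunkedLineReader(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            for (String s; (s = r.readLine(maxChars)) != null; ) {
                out.add(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(splitLines("ab\r\ncdefgh\n\nxyz", 3));
    }
}
```

With the input "ab\r\ncdefgh\n\nxyz" and a cap of 3, the lines come back as "ab", "cde", "" and "xyz". Memory use per line is bounded by the cap plus the block size, whatever the length of the line in the file.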

    Cheers,
    Matt Humphrey http://www.iviz.com/
     
    Matt Humphrey, Oct 14, 2004
    #3
  4. Will Hartung

    Will Hartung Guest

    "Thomas Weidenfeller" <> wrote in message
    news:ckm497$68o$...
    > Hugo wrote:
    > > I tried to read the file character by character searching for "\r\n"
    > > and with a maximum number of method calls but it takes too much time
    > > to read the file.

    >
    > That's unlikely. If readLine() was fast enough for you, char-by-char
    > reading should be fast enough for you, too. Because this is what
    > readLine() does internally to find the line ends.


    Really?

    I would think since they're using a buffered reader, they'd load blocks of
    data in big gulps and then scan it. That's what I would do.

    Regards,

    Will Hartung
    ()
     
    Will Hartung, Oct 14, 2004
    #4
  5. Re: Read a file line by line with a maximum number of characters per line

    Matt Humphrey wrote:

    <lots of good advice snipped>

    I would like to add: If you are looking for line endings,
    remember that BufferedReader.readLine accepts line endings
    of any of the following sequences:
    "\n"
    "\r"
    "\r\n"

    I advise that you try and emulate this.
    It will save you much grief one day.

    Steve
     
    Steve Horsley, Oct 14, 2004
    #5
  6. Re: Read a file line by line with a maximum number of characters per line

    Will Hartung wrote:
    > Really?


    I suggest you read the source code.

    > I would think since they're using a buffered reader, they'd load blocks of
    > data in big gulps and then scan it.


    They read one char after the other from the buffer, check it, and run a
    small state machine to handle \r\n.

    /Thomas
     
    Thomas Weidenfeller, Oct 15, 2004
    #6
  7. Hugo

    Hugo Guest

    "Matt Humphrey" <> wrote in message news:<>...
    > "Hugo" <> wrote in message
    > news:...
    > > Hello,
    > >
    > > I want to read a file line by line. I first used the readLine() method
    > > which returns a string, but if the line contains too much characters
    > > I'm ending with an OutOfMemory exception. I could use the
    > > read(buffer[], maxChars) method, but this method does not take in
    > > account the end of the line. So my buffer could contain more than a
    > > single line. I would like to use benefit of both methods, meaning
    > > using a method which return a string representing a file's line (like
    > > readLine()), but with a maximum characters per line (like
    > > read(buffer[], maxChars)).
    > >
    > > I tried to read the file character by character searching for "\r\n"
    > > and with a maximum number of method calls but it takes too much time
    > > to read the file.
    > >
    > > Could someone have an idea how to proceed ?

    >
    > There's an implicit contradiction in what you're asking for. You want to
    > process the data by lines, but some lines are too big to be processed.
    > You're going to have to give up one of these. Before you make that decision,
    > however, you should ask yourself a couple of questions.
    >
    > 1) Is the OutOfMemory exception really being caused by an input line that is
    > too large? Will such lines be common or expected and must your program
    > defend against them? Are the lines supposed to be less than a particular
    > length such that a very long one constitutes an invalid input file?


    When my file is about 3 MB on a single line, I get an OutOfMemory
    error.
    These large lines are not expected and do not correspond to normal
    behaviour, but they may happen and I must protect my system against
    them.

    > 2) How important is it that your data be processed by lines? Are you
    > scanning for something in particular? or are you just counting lines as you
    > go? Is each line parsed independently or scanned for data? As in part 1,
    > will there never be valid data after a particular length?


    It is important to be processed by line because the user may want to
    look for a particular keyword at a particular position in the file
    (column, line). Yes, each read line is sent to a scanner one after the
    other.

    > 3) You say that reading and searching is too slow, but are you using a
    > BufferedReader? Also, what do you mean by "slow" as your tests that run
    > using readline simply fail with an exception, perhaps the file is so large
    > that "slow" is normal.


    Yes, I am using a BufferedReader. Slow means 45 minutes for a 3 MB
    file when I read it char by char without using readLine()!

    > I would guess that, realistically, you're going to have to give up the idea
    > of processing data by lines in order to protect your program from input
    > files that consists of 2.4Gb of data with no carriage returns at all.
    >
    > To do this you have to change your input system so that it is not line
    > oriented but uses some other structure such as words or phrases,
    > etc. You say you've tried but that it takes too much time to search for the
    > end of line. Consider this: the readLine() method must also search for (and
    > stop at) the end of line, and if it can do it with reasonable performance,
    > so can you. The answer is probably in how you buffer the data. I would
    > recommend you look at a design centered on reading (buffering) a large chunk
    > and tokenizing it according to whatever you're looking for. This tokenizer
    > would refill the buffer when it gets low and handle the two unpleasant cases
    > of a line (or whatever you're looking for) either spanning multiple blocks
    > or there being several within one block. It may be possible for your
    > tokenizer to simply read a character at a time from a BufferedReader and
    > for you to scan for what you're looking for.
    >
    > Cheers,
    > Matt Humphrey http://www.iviz.com/



    Here is the code I use to read my file char by char with a maximum
    number of characters read:

    private String readLineWithMaxSize(BufferedReader br) throws IOException {
        String finalLine = null;
        int readCharacter = -1;
        char[] lineChars = new char[204800];
        boolean bufferFull = false;
        if (br != null) {
            int index = 0;
            readCharacter = br.read();
            // If the read character does not correspond to a new line or to
            // an end of file, we treat it.
            while (readCharacter != -1 && readCharacter != '\r'
                    && readCharacter != '\n') {
                // if the buffer is not full, we add the character to the
                // array of characters
                if (!bufferFull) {
                    lineChars[index] = (char) readCharacter;
                    index++;
                    bufferFull = index >= lineChars.length;
                }
                readCharacter = br.read();
            }
            // If the read character is \r and the next one is \n, we skip it.
            if (readCharacter == '\r') {
                br.mark(2);
                int nextReadCharacter = br.read();
                if (nextReadCharacter != '\n') {
                    br.reset();
                }
            }
            // We construct a string representing the line from the buffer of
            // characters read
            if (index != 0) {
                finalLine = new String(lineChars);
            } else if (readCharacter == '\r' || readCharacter == '\n') {
                finalLine = "";
            }
        }
        return finalLine;
    }
     
    Hugo, Oct 15, 2004
    #7
  8. On Thu, 14 Oct 2004 22:30:28 +0100, Steve Horsley wrote:

    > Matt Humphrey wrote:
    >
    > <lots of good advice snipped>
    >
    > I would like to add: If you are looking for line endings,
    > remember that BufferedReader.readLine accepts line endings
    > of any of the following sequences:
    > "\n"
    > "\r"
    > "\r\n"
    >
    > I advise that you try and emulate this.
    > It will save you much grief one day.


    (This is because the various platforms Java runs on haven't, historically,
    agreed on a line terminator/separator.)

    Just out of curiosity, and because I'm about to go to bed and therefore
    don't want to start coding, how many lines is the pathological sequence:

    "\r\r\n\r\r\n\n\r\r\n"?
    ()(??)()(??)()()(??) <-- helpful markers
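
One way to settle the question is to hand the sequence to BufferedReader.readLine() and count what comes back. The sketch below does that; the class name `PathologicalLines` and the helper `countLines` are made up for the illustration. Since readLine() treats "\n", "\r", and a "\r" immediately followed by "\n" as one terminator each, the sequence splits as \r | \r\n | \r | \r\n | \n | \r | \r\n, which gives seven empty lines and matches the seven groups in the markers above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class PathologicalLines {
    /** Counts how many lines BufferedReader.readLine() returns for the given text. */
    static int countLines(String text) {
        int lines = 0;
        try (BufferedReader r = new BufferedReader(new StringReader(text))) {
            while (r.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(countLines("\r\r\n\r\r\n\n\r\r\n")); // prints 7
    }
}
```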

    --
    Some say the Wired doesn't have political borders like the real world,
    but there are far too many nonsense-spouting anarchists or idiots who
    think that pranks are a revolution.
     
    Owen Jacobson, Oct 15, 2004
    #8
  9. "Hugo" <> wrote in message
    news:...
    > "Matt Humphrey" <> wrote in message

    news:<>...
    > > "Hugo" <> wrote in message
    > > news:...
    > > > Hello,


    <snip>

    <more snip>

    > Here is the code I use to read my file char by char with a maximum
    > number of characters read:
    >
    > private String readLineWithMaxSize(BufferedReader br) throws IOException {
    >     String finalLine = null;
    >     int readCharacter = -1;
    >     char[] lineChars = new char[204800];
    >     boolean bufferFull = false;
    >     if (br != null) {
    >         int index = 0;
    >         readCharacter = br.read();
    >         // If the read character does not correspond to a new line or to
    >         // an end of file, we treat it.
    >         while (readCharacter != -1 && readCharacter != '\r'
    >                 && readCharacter != '\n') {
    >             // if the buffer is not full, we add the character to the
    >             // array of characters
    >             if (!bufferFull) {
    >                 lineChars[index] = (char) readCharacter;
    >                 index++;
    >                 bufferFull = index >= lineChars.length;
    >             }
    >             readCharacter = br.read();
    >         }
    >         // If the read character is \r and the next one is \n, we skip it.
    >         if (readCharacter == '\r') {
    >             br.mark(2);
    >             int nextReadCharacter = br.read();
    >             if (nextReadCharacter != '\n') {
    >                 br.reset();
    >             }
    >         }
    >         // We construct a string representing the line from the buffer of
    >         // characters read
    >         if (index != 0) {
    >             finalLine = new String(lineChars);
    >         } else if (readCharacter == '\r' || readCharacter == '\n') {
    >             finalLine = "";
    >         }
    >     }
    >     return finalLine;
    > }


    I compiled your code and it ran fine for me. I wrote a program that creates
    a test file with 1 short line, a line of 3.5 MB and a final short line. Your
    code above on my 1.7 GHz Windows 2000 machine with Java 1.4.2_03 with no
    special memory expansion -Xmx set runs in less than a second. I wrote a
    similar version based on StringBuffer that returns the complete 3.4 MB string
    and it works perfectly fine also. Note that your code above has a serious
    problem: every string it returns will be 204800 characters long. You won't
    need many of these for your program to run out of memory.

    As for the speed problem, I think it will be something with the file and the
    OS rather than with Java.
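
A test along the lines described above could look like the following sketch. It is not the program used in the post: the class name `LongLineBenchmark`, the file contents, and the use of a temporary file are assumptions, and it is written in present-day Java rather than 1.4. It writes a short line, one very long line, and a final short line, then times a character-by-character read through a BufferedReader.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.UncheckedIOException;

class LongLineBenchmark {
    /** Writes a short line, one line of longLineChars 'x' characters, and a last short line. */
    static File writeTestFile(int longLineChars) {
        try {
            File f = File.createTempFile("longline", ".txt");
            f.deleteOnExit();
            try (BufferedWriter w = new BufferedWriter(new FileWriter(f))) {
                w.write("short line\n");
                for (int i = 0; i < longLineChars; i++) {
                    w.write('x');
                }
                w.write("\nlast line\n");
            }
            return f;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads the file one character at a time through a BufferedReader and counts the characters. */
    static long countChars(File f) {
        long n = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(f))) {
            while (r.read() != -1) {
                n++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return n;
    }

    public static void main(String[] args) {
        File f = writeTestFile(3_500_000);
        long start = System.nanoTime();
        long n = countChars(f);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(n + " chars read in " + elapsedMs + " ms");
    }
}
```

If a run like this finishes quickly while the real application takes minutes, the read loop itself is unlikely to be the bottleneck, and the calling code or the environment deserves a closer look.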

    Cheers,
    Matt Humphrey http://www.iviz.com/
     
    Matt Humphrey, Oct 15, 2004
    #9
  10. Hugo

    Hugo Guest

    <snip>
    <more snip>
    <more more snip>

    Thank you for your answer.
    If my code works for you, it seems that I may have missed something in
    the code which calls this method. I will check that. On the other
    hand, I don't understand why you say that this method will always return
    strings 204800 characters long. I mean, in the while loop, I check whether
    the read character is an end-of-line or not. So my array of characters
    is not always full. Is there something I don't understand here? If I
    initialize my array at 204800 characters, does it mean the string I
    construct from it will contain 204800 characters, even if the
    array is not full?

    Thanks a lot for your answers.

    Hugo.
     
    Hugo, Oct 18, 2004
    #10
  11. "Hugo" <> wrote in message
    news:...
    > <snip>
    > <more snip>
    > <more more snip>
    >
    > Thank you for your answer.
    > If my code works for you, it seems that I may have missed something in
    > the code which calls this method. I will check that. On the other
    > hand, I don't understand why you say that this method will always return
    > strings 204800 characters long. I mean, in the while loop, I check whether
    > the read character is an end-of-line or not. So my array of characters
    > is not always full. Is there something I don't understand here? If I
    > initialize my array at 204800 characters, does it mean the string I
    > construct from it will contain 204800 characters, even if the
    > array is not full?


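    Arrays are fixed-length and always have the declared length, and
    new String(lineChars) copies the entire array, unused slots included.
    Use new String(lineChars, 0, index) to keep only the characters you
    actually read.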
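
To make the point about fixed-length arrays concrete, here is a small sketch (the class name `StringFromArray` and the helper `lineFrom` are hypothetical). It fills only the first few slots of a 204800-char array and compares the String built from the whole array with one built from the filled part only, using the standard `String(char[], int, int)` constructor.

```java
class StringFromArray {
    /** Builds a String from only the first 'used' characters of the buffer. */
    static String lineFrom(char[] buf, int used) {
        return new String(buf, 0, used);
    }

    public static void main(String[] args) {
        char[] buf = new char[204800];   // every slot starts out as the NUL character
        int used = 0;
        for (char c : "hello".toCharArray()) {
            buf[used++] = c;
        }

        String whole = new String(buf);      // copies all 204800 chars, mostly NULs
        String line = lineFrom(buf, used);   // copies only the 5 chars that were filled

        System.out.println(whole.length()); // prints 204800
        System.out.println(line.length());  // prints 5
    }
}
```

In the method from post #7, this would mean building the result from the first `index` characters of `lineChars` instead of from the whole array.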

    Cheers,
    Matt Humphrey http://www.iviz.com/
     
    Matt Humphrey, Oct 18, 2004
    #11