Reading huge text files one line at a time....

Discussion in 'Java' started by Brock Heinz, Nov 23, 2004.

  1. Brock Heinz

    Brock Heinz Guest

    Hello All,

    I've done quite a bit of research on this one and I'm still stumped.
    I have an application that reads a text file (up to 100MB in size) one
    line at a time, converts the line to XML using Castor (each line is a
    specific record) and then sends a JMS message for that line. After
    validating the file one line at a time (never reading the entire
    contents into memory), I am then confident I can perform the Castor
    transformation / send operation. I'm doing something like the
    following:

    BufferedReader reader = new BufferedReader(new FileReader(validFile));
    //for each line in the file
    for (String line; (line = reader.readLine()) != null;) {
        //perform transformation and send
        IMessage message = transformer.createMessage(line, msgSelector);
        sendMessage(message);
        messageSentCount++;
        //perform cleanup / logging every 500th message
        if (messageSentCount % 500 == 0) {
            log.debug("sent message: "+messageSentCount);
            log.debug(" - Garbage collecting.");
            try {
                this.finalize();
            } catch (Throwable t) {
                log.warn("Could not finalize - keep on reading anyhow");
            }
        }
    }
    reader.close();


    Does anyone see any problems with reading the files one line at a time
    in this manner (using the readLine() method)? I seem to hit an
    OutofMemoryException right around line 315,000. Is the readLine()
    method internally not efficient to use?

    In the archives I've seen the approach of reading chunks of the file
    with a buffer, and then determining each line by searching for carriage
    returns or line breaks. Anyone have any thoughts on this?

    Any help would be greatly appreciated.

    Thanks,
    Brock
     
    Brock Heinz, Nov 23, 2004
    #1

  2. Brock Heinz

    thirdrock Guest

    Brock Heinz wrote:

    > Hello All,


    >
    > BufferedReader reader = new BufferedReader(new FileReader(validFile));
    > //for each line in the file
    > for (String line; (line = reader.readLine()) != null;) {
    > //perform transformation and send
    > IMessage message = transformer.createMessage(line, msgSelector);


    What object type is transformer?

    > sendMessage(message);
    > messageSentCount++;
    > //perform cleanup / logging every 500th message
    > if (messageSentCount % 500 == 0) {
    > log.debug("sent message: "+messageSentCount);
    > log.debug(" - Garbage collecting.");
    > try {
    > this.finalize();

    What is this?
    Where is 'message' garbage collected?

    > } catch (Throwable t) {
    > log.warn("Could not finalize - keep on reading anyhow");
    > }
    > }
    > }
    > reader.close();
    >
    >
    > Does anyone see any problems with reading the files one line at a time
    > in this manner (using the readLine() method)? I seem to hit an
    > OutofMemoryException right around line 315,000.


    That would tend to indicate that you are running out of memory.

    > Is the readLine()
    > method internally not efficient to use?

    What makes you think it is the readLine() method that is sucking up all
    of the memory?

    >
    > In the archives I've seen the approach of reading chunks of the file
    > with a buffer, and then determining each line by searching for carriage
    > returns or line breaks.


    That will only help once you have determined that readLine() is the
    cause of the problem.

    Ian
     
    thirdrock, Nov 23, 2004
    #2

  3. Brock Heinz

    EricF Guest

    In article <>, (Brock Heinz) wrote:
    >Hello All,
    >
    >I've done quite a bit of research on this one and I'm still stumped.
    >I have an application that reads a text file (up to 100MB in size) one
    >line at a time, converts the line to XML using Castor (each line is a
    >specific record) and then sends a JMS message for that line. After
    >validating the file one line at a time (never reading the entire
    >contents into memory), I am then confident I can perform the Castor
    >transformation / send operation. I'm doing something like the
    >following:
    >
    >BufferedReader reader = new BufferedReader(new FileReader(validFile));
    >//for each line in the file
    >for (String line; (line = reader.readLine()) != null;) {
    > //perform transformation and send
    > IMessage message = transformer.createMessage(line, msgSelector);
    > sendMessage(message);
    > messageSentCount++;
    > //perform cleanup / logging every 500th message
    > if (messageSentCount % 500 == 0) {
    > log.debug("sent message: "+messageSentCount);
    > log.debug(" - Garbage collecting.");
    > try {
    > this.finalize();
    > } catch (Throwable t) {
    > log.warn("Could not finalize - keep on reading anyhow");
    > }
    > }
    >}
    >reader.close();
    >
    >
    >Does anyone see any problems with reading the files one line at a time
    >in this manner (using the readLine() method)? I seem to hit an
    >OutofMemoryException right around line 315,000. Is the readLine()
    >method internally not efficient to use?
    >
    >In the archives I've seen the approach of reading chunks of the file
    >with a buffer, and then determining each line by searching for carriage
    >returns or line breaks. Anyone have any thoughts on this?
    >
    >Any help would be greatly appreciated.
    >
    >Thanks,
    >Brock


    I don't think the problem is with readLine(). You have a memory leak.

    Is the finalize call really doing anything?

    Try setting any variables to null when you are through with them at the end of
    the for loop. Particularly message.

    Eric
     
    EricF, Nov 23, 2004
    #3
  4. "Brock Heinz" <> wrote in message
    news:...
    > Hello All,
    >
    > I've done quite a bit of research on this one and I'm still stumped.
    > I have an application that reads a text file (up to 100MB in size) one
    > line at a time, converts the line to XML using Castor (each line is a
    > specific record) and then sends a JMS message for that line. After
    > validating the file one line at a time (never reading the entire
    > contents into memory), I am then confident I can perform the Castor
    > transformation / send operation. I'm doing something like the
    > following:
    >
    > BufferedReader reader = new BufferedReader(new FileReader(validFile));
    > //for each line in the file
    > for (String line; (line = reader.readLine()) != null;) {
    > //perform transformation and send
    > IMessage message = transformer.createMessage(line, msgSelector);
    > sendMessage(message);
    > messageSentCount++;
    > //perform cleanup / logging every 500th message
    > if (messageSentCount % 500 == 0) {
    > log.debug("sent message: "+messageSentCount);
    > log.debug(" - Garbage collecting.");
    > try {
    > this.finalize();
    > } catch (Throwable t) {
    > log.warn("Could not finalize - keep on reading anyhow");
    > }
    > }
    > }
    > reader.close();


    What happens with the IMessage object after it is sent?
     
    Boudewijn Dijkstra, Nov 23, 2004
    #4
  5. Brock Heinz wrote:

    > I've done quite a bit of research on this one and I'm still stumped.
    > I have an application that reads a text file (up to 100MB in size) one
    > line at a time, converts the line to XML using Castor (each line is a
    > specific record) and then sends a JMS message for that line. After
    > validating the file one line at a time (never reading the entire
    > contents into memory), I am then confident I can perform the Castor
    > transformation / send operation. I'm doing something like the
    > following:


    I'm not much interested in analyzing "something like" what you're doing,
    as there is a reasonably good chance that the ways it differs from what
    you are *actually* doing include the source of your problem. Post a
    compilable example that exhibits the (mis-)behavior that is troubling you.

    > BufferedReader reader = new BufferedReader(new FileReader(validFile));
    > //for each line in the file
    > for (String line; (line = reader.readLine()) != null;) {
    > //perform transformation and send
    > IMessage message = transformer.createMessage(line, msgSelector);
    > sendMessage(message);
    > messageSentCount++;
    > //perform cleanup / logging every 500th message
    > if (messageSentCount % 500 == 0) {
    > log.debug("sent message: "+messageSentCount);
    > log.debug(" - Garbage collecting.");
    > try {
    > this.finalize();


    Even though I'm not very keen to analyze your code, I can't help
    commenting on this. You should _never_ invoke an object's finalize()
    method from user code. It is for the use of the GC. If you have
    cleanup code that you want to execute periodically then put it in its
    own method; it is OK for finalize() to invoke such a method, if need be.
    (It is better, however, to not rely on the finalizer for anything.)
    At best, putting such code into finalize() is potentially confusing.
    Overriding finalize() at all has an effect on GC of instances of the
    relevant class, although how serious the implications are will depend on
    a wide variety of factors.
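
As an illustration of the "own method" advice above, here is a minimal sketch. The BatchingSender class, its in-memory pending list, and the releaseResources() name are all hypothetical; they stand in for whatever the real sender holds on to and are not taken from the original application. The point is only that periodic cleanup lives in an ordinary method that the loop calls directly, so finalize() is never involved.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sender that holds on to messages until an explicit cleanup step. */
class BatchingSender {
    private final List<String> pending = new ArrayList<>();
    private int sentCount = 0;

    void send(String message) {
        pending.add(message); // stand-in for real per-message state
        sentCount++;
    }

    /** Ordinary cleanup method: safe to call from user code, unlike finalize(). */
    void releaseResources() {
        pending.clear();
    }

    int pendingSize() {
        return pending.size();
    }

    int sentCount() {
        return sentCount;
    }

    /** Sends every line and runs the cleanup step after each full batch. */
    static BatchingSender process(BufferedReader reader, int batchSize) throws IOException {
        BatchingSender sender = new BatchingSender();
        String line;
        while ((line = reader.readLine()) != null) {
            sender.send(line);
            if (sender.sentCount() % batchSize == 0) {
                sender.releaseResources();
            }
        }
        return sender;
    }
}
```

With five input lines and a batch size of 2, the cleanup runs after the second and fourth lines, leaving one message pending at the end.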

    > } catch (Throwable t) {
    > log.warn("Could not finalize - keep on reading anyhow");
    > }


    And I have to comment on that, too. It's almost never a good idea to
    write such generic catch blocks. That will catch all manner of checked
    and unchecked Exceptions, as well as all Errors, and ignore them. At
    the very, very least you should log the Throwable's message. Much
    better, however, is to only catch the specific exceptions that you have
    reason to expect may be thrown. You can be reasonably confident that
    you know how to handle those appropriately, but you have no reason for
    confidence that you know how to handle any other Throwable.
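
A rough sketch of the narrower style described above, under some assumptions: the LineCounter class and its -1 return convention are invented for illustration, java.util.logging stands in for whichever logging library the original code used, and try-with-resources needs Java 7 or later, which is newer than this thread. It catches only IOException, the one failure a read loop is expected to produce, and passes the exception itself to the logger so its message and stack trace are recorded.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.logging.Level;
import java.util.logging.Logger;

class LineCounter {
    private static final Logger LOG = Logger.getLogger(LineCounter.class.getName());

    /** Returns the number of lines read, or -1 if reading failed. */
    static int countLines(Reader source) {
        int count = 0;
        // try-with-resources closes the reader even when readLine() throws
        try (BufferedReader reader = new BufferedReader(source)) {
            while (reader.readLine() != null) {
                count++;
            }
            return count;
        } catch (IOException e) {
            // Catch only the expected failure and log the exception itself.
            LOG.log(Level.WARNING, "Read failed after " + count + " lines", e);
            return -1;
        }
    }
}
```

Anything other than IOException (an OutOfMemoryError, for example) still propagates to the caller instead of being silently swallowed.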

    > }
    > }
    > reader.close();
    >
    >
    > Does anyone see any problems with reading the files one line at a time
    > in this manner (using the readLine() method)? I seem to hit an
    > OutofMemoryException right around line 315,000. Is the readLine()
    > method internally not efficient to use?


    That would be an OutOfMemoryError. If you are getting one then it
    probably means that your program is caching objects (messages, strings,
    something) somehow. It might, however, mean that your input is corrupt,
    and at some point contains a very long sequence of bytes without a line
    delimiter -- the system could be trying to construct a multi-megabyte
    String object or JMS message.
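
One way to test the corrupt-input theory is to cap the line length while reading. The sketch below is only an assumption about how that could look: BoundedLineReader and its maxChars parameter are hypothetical names, and the line handling is simplified (every '\r' is dropped and only '\n' ends a line, which differs from BufferedReader.readLine(), where a lone '\r' also terminates a line). Pass it a BufferedReader so the single-character reads stay cheap.

```java
import java.io.IOException;
import java.io.Reader;

class BoundedLineReader {
    /**
     * Reads one line of at most maxChars characters.
     * Returns null at end of input and throws IOException for an overlong
     * line instead of building an arbitrarily large String.
     */
    static String readLine(Reader in, int maxChars) throws IOException {
        int c = in.read();
        if (c == -1) {
            return null; // end of input before any character
        }
        StringBuilder sb = new StringBuilder();
        while (c != -1 && c != '\n') {
            if (c != '\r') { // simplification: carriage returns are dropped
                if (sb.length() >= maxChars) {
                    throw new IOException("Line longer than " + maxChars + " characters");
                }
                sb.append((char) c);
            }
            c = in.read();
        }
        return sb.toString();
    }
}
```

If the file really does contain a multi-megabyte run without a delimiter, this fails fast with a clear message rather than with an OutOfMemoryError somewhere downstream.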

    > In the archives I've seen the approach of reading chunks of the file
    > with a buffer, and then determining each line by searching for carriage
    > returns or line breaks. Anyone have any thoughts on this?


    Your BufferedReader does that for you already.


    John Bollinger
     
    John C. Bollinger, Nov 23, 2004
    #5
  6. Brock Heinz

    Brock Heinz Guest

    "Boudewijn Dijkstra" <> wrote in message news:<41a2fa8f$0$44097$>...
    > "Brock Heinz" <> wrote in message
    > news:...
    > > Hello All,
    > >
    > > I've done quite a bit of research on this one and I'm still stumped.
    > > I have an application that reads a text file (up to 100MB in size) one
    > > line at a time, converts the line to XML using Castor (each line is a
    > > specific record) and then sends a JMS message for that line. After
    > > validating the file one line at a time (never reading the entire
    > > contents into memory), I am then confident I can perform the Castor
    > > transformation / send operation. I'm doing something like the
    > > following:
    > >
    > > BufferedReader reader = new BufferedReader(new FileReader(validFile));
    > > //for each line in the file
    > > for (String line; (line = reader.readLine()) != null;) {
    > > //perform transformation and send
    > > IMessage message = transformer.createMessage(line, msgSelector);
    > > sendMessage(message);
    > > messageSentCount++;
    > > //perform cleanup / logging every 500th message
    > > if (messageSentCount % 500 == 0) {
    > > log.debug("sent message: "+messageSentCount);
    > > log.debug(" - Garbage collecting.");
    > > try {
    > > this.finalize();
    > > } catch (Throwable t) {
    > > log.warn("Could not finalize - keep on reading anyhow");
    > > }
    > > }
    > > }
    > > reader.close();
    >
    > What happens with the IMessage object after it is sent?


    The message is set to null in the sendMessage() method.

    My initial thought was that readLine() was inefficient compared to
    other I/O strategies, but after running the same test without sending
    any messages it appears as though that is not the source of my memory
    woes... I'll keep digging, and if I turn up anything interesting and
    worth posting - I'll share it here.
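
The isolation test described above (read the file, send nothing) might look roughly like this. It is a sketch, not the code that was actually run: ReadOnlyProbe and reportEvery are made-up names, and the heap figure from Runtime is only an approximation because it includes garbage that has not been collected yet.

```java
import java.io.BufferedReader;
import java.io.IOException;

class ReadOnlyProbe {
    /** Approximate heap in use, in bytes (includes not-yet-collected garbage). */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Reads and discards every line, reporting heap use every reportEvery lines. */
    static long countLines(BufferedReader reader, int reportEvery) throws IOException {
        long lines = 0;
        while (reader.readLine() != null) {
            lines++;
            if (lines % reportEvery == 0) {
                long usedMb = usedHeapBytes() / (1024 * 1024);
                System.out.println(lines + " lines read, approx. used heap: " + usedMb + " MB");
            }
        }
        return lines;
    }
}
```

If the reported figure stays roughly flat over the whole file, the read loop is not what is retaining memory, and attention can move to the transformation and send steps.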

    Brock
     
    Brock Heinz, Nov 23, 2004
    #6
  7. Brock Heinz

    Ann Guest


    >
    > My initial thought was that readLine() was inefficient compared to
    > other I/O strategies, but after running the same test without sending
    > any messages it appears as though that is not the source of my memory
    > woes... I'll keep digging, and if I turn up anything interesting and
    > worth posting - I'll share it here.
    >
    > Brock


    But since a String is imutible, doesn't Java have to
    create a new String for 'line' each time readline() is
    executed?
     
    Ann, Nov 23, 2004
    #7
  8. Brock Heinz

    Eric Sosman Guest

    Ann wrote:
    >>My initial thought was that readLine() was inefficient compared to
    >>other I/O strategies, but after running the same test without sending
    >>any messages it appears as though that is not the source of my memory
    >>woes... I'll keep digging, and if I turn up anything interesting and
    >>worth posting - I'll share it here.
    >>
    >>Brock

    >
    >
    > But since a String is imutible, doesn't Java have to
    > create a new String for 'line' each time readline() is
    > executed?


    Strings are immutable (note the spelling), but
    not immortal. The Strings created by readLine() are
    subject to garbage collection when they are no longer
    referenced, just like any other objects.

    --
     
    Eric Sosman, Nov 23, 2004
    #8
  9. Brock Heinz

    Brock Heinz Guest

    "John C. Bollinger" <> wrote in message
    > > BufferedReader
    > > try {
    > > this.finalize();
    >


    > Even though I'm not very keen to analyze your code, I can't help
    > commenting on this. You should _never_ invoke an object's finalize()
    > method from user code. It is for the use of the GC. If you have
    > cleanup code that you want to execute periodically then put it in its
    > own method; it is OK for finalize() to invoke such a method, if need be.


    I had considered this, but since the app is running in a J2EE server,
    I wasn't sure what the consequences of calling System.gc() would be.
    Really, by programmatically triggering any type of garbage
    collection, I am just placing a bandaid over a gash.

    > (It is better, however, to not rely on the finalizer for anything.)
    > At best, putting such code into finalize() is potentially confusing.
    > Overriding finalize() at all has an effect on GC of instances of the
    > relevant class, although how serious the implications are will depend on
    > a wide variety of factors.
    >
    > > } catch (Throwable t) {
    > > log.warn("Could not finalize - keep on reading anyhow");
    > > }

    >
    > And I have to comment on that, too. It's almost never a good idea to
    > write such generic catch blocks.


    I agree, but the finalize() method throws 'Throwable' :)

    This is an instance where, regardless of any exceptions thrown while
    trying to 'finalize', I wanted to stay within the for block and
    continue to process the messages.

    > That will catch all manner of checked
    > and unchecked Exceptions, as well as all Errors, and ignore them. At
    > the very, very least you should log the Throwable's message. Much
    > better, however, is to only catch the specific exceptions that you have
    > reason to expect may be thrown.


    Again, I agree. I didn't send you the entire method. The try/catch
    block that I had pasted into my post was nested in a larger try/catch
    where I would catch specific exceptions and I could react accordingly.


    >You can be reasonably confident that
    > you know how to handle those appropriately, but you have no reason for
    > confidence that you know how to handle any other Throwable.
    >
    > > }
    > > }
    > > reader.close();
    > >
    > >
    > > Does anyone see any problems with reading the files one line at a time
    > > in this manner (using the readLine() method)? I seem to hit an
    > > OutofMemoryException right around line 315,000. Is the readLine()
    > > method internally not efficient to use?

    >
    > That would be an OutOfMemoryError. If you are getting one then it
    > probably means that your program is caching objects (messages, strings,
    > something) somehow. It might, however, mean that your input is corrupt,
    > and at some point contains a very long sequence of bytes without a line
    > delimiter -- the system could be trying to construct a multi-megabyte
    > String object or JMS message.


    After more researching into the problem, I finally cornered the issue.
    The true source of the problem wasn't me validating / parsing the
    file. The source of the problem was in the third party messaging
    framework we were using.

    > > In the archives I've seen the approach of reading chunks of the file
    > > with a buffer, and then determining each line by searching for carriage
    > > returns or line breaks. Anyone have any thoughts on this?

    >
    > Your BufferedReader does that for you already.
    >
    >
    > John Bollinger
    >


    Thanks for the feedback, John!

    Brock
     
    Brock Heinz, Nov 23, 2004
    #9
