Java and huge XML file to be parsed

Discussion in 'Java' started by Katrin Tomanek, Jun 17, 2004.

  1. Hi everybody,

    I've got a really big XML File (about 215 MBytes), which I have to parse.

    So, my question is: what would be the best solution: DOM, SAX, JDOM?
    Anything else? And is it possible at all to parse XML files this
    huge?

    I already tried JDOM with the JVM heap set to 512 MB, but after
    one hour I still got an out-of-memory error.

    I thought that maybe SAX might be better, since it is not tree-based.
    What do you think, given a 215 MB file?

    OK, I am happy about every answer and hint I can get. Thanks in advance.

    Katrin
     
    Katrin Tomanek, Jun 17, 2004
    #1

  2. Go for SAX....
     
    Araxes Tharsis, Jun 18, 2004
    #2

  3. Katrin Tomanek

    Sudsy Guest

    Katrin Tomanek wrote:
    > Hi everybody,
    >
    > I've got a really big XML File (about 215 MBytes), which I have to parse.


    SAX is really your only option. DOM has to build the document in memory.
    Even if you have a 64-bit processor with GBs of virtual memory...
    SAX is also good if you need to process data "on-the-fly"; DOM requires
    the document to be complete before the parser returns.
    Different tools for different scenarios.
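
    To make the streaming point concrete, here is a minimal sketch of a SAX pass that keeps only a few counters in memory and never the document tree. The class name SaxElementCounter and the sample XML are made up for illustration; for the real 215 MB file you would pass a FileInputStream instead of the in-memory stream.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not from the thread: counts element names with SAX.
class SaxElementCounter {

    // Streams the document once and counts how often each element name occurs.
    // Only the counters are held in memory, never the whole document.
    static Map<String, Integer> countElements(InputStream in) {
        final Map<String, Integer> counts = new HashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                counts.merge(qName, 1, Integer::sum);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, handler);
        } catch (Exception e) {
            // Wrapped so callers do not have to deal with checked exceptions.
            throw new IllegalStateException("parsing failed", e);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Tiny stand-in document; a real run would open the large file.
        String xml = "<records><record id=\"1\"/><record id=\"2\"/></records>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(countElements(in));
    }
}
```

    The handler is called as the parser reads, so memory use depends on what the handler stores rather than on the file size.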
     
    Sudsy, Jun 18, 2004
    #3
  4. Katrin Tomanek () wrote:
    : Hi everybody,

    : I've got a really big XML File (about 215 MBytes), which I have to parse.

    : So, my question is: what would be the best solution: DOM, SAX, JDOM ???
    : Anything else ? And is it possible at all to parse this huge kinda XML
    : files ?

    : I already tried JDOM, i did set my jvm to 512 MB of RAM, but still after
    : one hour I got an out-of-memory exception.

    : I thought that maybe SAX might be better, since it is not tree-based.
    : What do you think according to 215 MB files ?

    : ok, i am happy about every answer and hint i can get, thanx in advance

    I would think that this is exactly the sort of situation for which
    SAX is intended.
     
    Malcolm Dew-Jones, Jun 18, 2004
    #4
  5. Katrin Tomanek

    Roedy Green Guest

    On Thu, 17 Jun 2004 23:49:28 +0200, Katrin Tomanek
    <> wrote or quoted :

    >I've got a really big XML File (about 215 MBytes), which I have to parse.

    ARRGH. That file is probably 20 times the size it would be if stored
    in some sensible format. It will take 100 times as long to parse as
    some sensible binary format.


    PHOOEY ON XML! I knew this insanity would happen.

    See http://mindprod.com/jgloss/xml.html


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 18, 2004
    #5
  6. Katrin Tomanek

    Stefan Ram Guest

    Stefan Ram, Jun 18, 2004
    #6
  7. Katrin Tomanek

    Roedy Green Guest

    On 18 Jun 2004 01:20:16 GMT, -berlin.de (Stefan Ram) wrote
    or quoted :

    >|· It uses HTML's fluffy system of entities such as &nbsp;
    >
    > "&nbsp;" has no specific meaning in XML:


    If you can discern that from that endlessly recursive XML spec, more
    power to you.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 18, 2004
    #7
  8. Katrin Tomanek

    Sudsy Guest

    Roedy Green wrote:
    ....
    > ARRGH. That file is probably 20 times the size it would be if stored
    > in some sensible format. It will take 100 times as long to parse as
    > some sensible binary format.
    >
    >
    > PHOOEY ON XML! I knew this insanity would happen.


    C'mon, Roedy: XML has a place in the overall scheme of things. I
    wouldn't use it for database replication, and 215 MB seems a tad
    excessive, but at least it's a lingua franca for inter-connected
    systems. We can be free of the bonds of proprietary formats and
    encoded approaches like EDI. Try modifying those with a simple
    text editor!
     
    Sudsy, Jun 18, 2004
    #8
  9. Katrin Tomanek

    Roedy Green Guest

    On Thu, 17 Jun 2004 22:25:00 -0400, Sudsy <>
    wrote or quoted :

    >Try modifying those with a simple
    >text editor!


    Why use an ancient tool like that? It is like doing data entry with
    NOTEPAD. For heaven's sake. Surely we could create an editor that
    created, edited and searched a compact XML-like representation that
    made it IMPOSSIBLE to create syntax errors and almost correct data.

    It is not as though we failed to notice what a MESS HTML became from
    lack of such a representation. The idiots took the worst features of
    HTML.

    It is amazing that such an IDIOTIC format caught on.

    It is proof of man's attraction to the trashy -- along with McDonald's
    fast food success.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 18, 2004
    #9
  10. Katrin Tomanek

    Sudsy Guest

    Roedy Green wrote:
    > On Thu, 17 Jun 2004 22:25:00 -0400, Sudsy <>
    > wrote or quoted :
    >>Try modifying those with a simple
    >>text editor!

    > Why use an ancient tool like that? It is like doing data entry with
    > NOTEPAD. For heaven's sake. Surely we could create an editor that
    > created, edited and searched a compact XML-like representation that
    > made it IMPOSSIBLE to create syntax errors and almost correct data.


    Again, look to the genesis of the specification. While nobody can be
    reasonably expected to mentally decode base64 content, the basis for
    XML is that it is human-readable. As such, it is editable using the
    most basic tools.
    You seem to be promoting tools which operate at a much higher level
    rather than the LCD (lowest-common denominator).
    But can everyone afford to shell out for the latest version of
    Macromedia Flash MX? It's priced between US$500 and US$700, depending
    on whether you choose the basic or "Professional" version. Should you
    expect everyone to pony up that kind of money?
    Ever look at how much it costs to create/serve RealPlayer or
    QuickTime streaming?
    If you want to make bags of money and promote your own proprietary
    format/protocol (kind of reminds me of the M$ "commoditization" of
    established network protocols) then be my guest.
    I stand by my assertion: XML provides a platform-neutral exchange
    framework.
    FWIW, Web Services (and, by definition, SOA) utilizes a foundation
    of XML.
    So although you might detest it, XML has a place in the "bigger
    picture" and is one of the prime candidates for bridging to the
    "dark side", also known as .NET (tm, sm, whatever...)
     
    Sudsy, Jun 18, 2004
    #10
  11. Thanx for your answers folks,

    I will go for SAX, or at least try ;-)

    Katrin
     
    Katrin Tomanek, Jun 18, 2004
    #11
  12. Katrin Tomanek

    Tim Ward Guest

    "Katrin Tomanek" <> wrote in message
    news:cat3l9$3ia$05$-online.com...
    >
    > I've got a really big XML File (about 215 MBytes), which I have to parse.


    Why is it in XML, how often does it change, and what do you have to do with
    it when you've parsed it (and other such problem scoping questions, such as,
    why are you assuming that the solution is some Java code)? Once that is
    known there's a whole range of possible solutions including but not limited
    to:

    (1) Java and SAX.
    (2) Convert it to a proper database first, then do the queries in SQL.
    (3) A DOM approach in C++.
    (4) ...

    --
    Tim Ward
    Brett Ward Limited - www.brettward.co.uk
     
    Tim Ward, Jun 18, 2004
    #12
  13. Re: Java and huge XML file to be parsed (new problems, now with SAX)

    Hi again,

    ....coming up with new problems.

    After most people told me to solve the problem with SAX, I did
    that and got a new problem.
    I have a very simple SAXParser with a DefaultHandler, nothing special.
    When I just try to go through the whole 215 MB file, I get an error that
    reads roughly like this in English:
    org.xml.sax.SAXParseException: The parser has reached the limit of
    "64.000" for entity expansion that was set by the application.

    (Sorry for the rough translation; for some strange reason I get a German
    error message saying:
    org.xml.sax.SAXParseException: Der Parser hat den von der Anwendung
    gesetzten Grenzwert "64.000" für die Erweiterung der Entität erreicht.)

    Does anyone have an idea what this means, how I could change this value,
    and why this error occurs?

    thx again...
    Katrin


    > Hi everybody,
    >
    > I've got a really big XML File (about 215 MBytes), which I have to parse.
    >
    > So, my question is: what would be the best solution: DOM, SAX, JDOM ???
    > Anything else ? And is it possible at all to parse this huge kinda XML
    > files ?
    >
    > I already tried JDOM, i did set my jvm to 512 MB of RAM, but still after
    > one hour I got an out-of-memory exception.
    >
    > I thought that maybe SAX might be better, since it is not tree-based.
    > What do you think according to 215 MB files ?
    >
    > ok, i am happy about every answer and hint i can get, thanx in advance
    >
    > Katrin
     
    Katrin Tomanek, Jun 18, 2004
    #13
  14. Katrin Tomanek

    Guest

    Re: Java and huge XML file to be parsed (new problems, now with SAX)

    Katrin Tomanek wrote:
    > Hi again,
    >
    > org.xml.sax.SAXParseException: The parser has reached the limit of
    > "64.000" for entity expansion that was set by the application.
    >
    > (sorry for the bad translation, for some strange reason i get a german
    > error message saying:
    > org.xml.sax.SAXParseException: Der Parser hat den von der Anwendung
    > gesetzten Grenzwert "64.000" für die Erweiterung der Entität erreicht.)
    >
    > does anyone have an idea what this means, how i could change this value
    > and why this error occurs ?


    Peace be unto you.




    Your XML file contains entity references, and the parser has hit its
    limit of 64,000 entity expansions.

    Solution:

    java -Xms512m -Xmx512m -DentityExpansionLimit=512000 ThreadMessages
    Author: DrClap
    http://forum.java.sun.com/thread.jsp?forum=34&thread=515796&tstart=60&trange=15

    or

    System.setProperty("entityExpansionLimit", "512000");
    Author: jatiin
    http://forum.java.sun.com/thread.jsp?forum=34&thread=515796&tstart=60&trange=15

    "The entityExpansionLimit system property lets existing applications
    constrain the total number of entity expansions without recompiling
    the code. The parser throws a fatal error once it has reached the
    entity expansion limit. (By default, no limit is set.)

    To set the entity expansion limit using the system property, use
    an option like the following on the java command line:
    -DentityExpansionLimit=100000"
    http://java.sun.com/webservices/docs/1.2/jaxp/ReleaseNotes.html
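
    As a rough sketch of the programmatic variant, the property can be set before the parser is created. The class name EntityLimitDemo and the sample document are made up for illustration. Whether the property name "entityExpansionLimit" is honored depends on the JDK/JAXP version in use; as far as I know, newer JDKs document the limit under the name jdk.xml.entityExpansionLimit, so treat the name as an assumption to check against your runtime.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the thread.
class EntityLimitDemo {

    // Parses the XML text and returns all character data the handler saw,
    // with entity references already expanded by the parser.
    static String parseText(String xml) {
        // Property name as quoted in the post above; set it before the
        // parser is created. Support for this name varies by JDK version.
        System.setProperty("entityExpansionLimit", "512000");

        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrapped so callers do not have to deal with checked exceptions.
            throw new IllegalStateException("parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Each &who; below counts as one entity expansion.
        String xml = "<!DOCTYPE d [<!ENTITY who \"Katrin\">]><d>&who; and &who;</d>";
        System.out.println(parseText(xml)); // prints "Katrin and Katrin"
    }
}
```

    Every entity reference in the document counts toward the limit, which is why a very large file with many references can trip it even though no single entity is big.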

    Have a good evening.
     
    , Jun 19, 2004
    #14
  15. Katrin Tomanek

    Roedy Green Guest

    On Mon, 21 Jun 2004 07:14:59 GMT-5, (Dale
    King) wrote or quoted :

    >
    >And what specifically is wrong with allowing someone to edit it
    >with the simplest of tools? That isn't even an option with a
    >binary format.


    Because that introduces the option of error. If you use the proper
    tool you don't litter the Internet with malformed files.

    Look at the mess HTML is in because we allow hand editing and
    publishing. If HTML had to go through a processor before being
    published it would be very unlikely you would have malformed published
    files, and browsers would not have to deal with such crap.

    You don't use notepad to edit your Oracle files. You should not be
    using it on any other form of structured data either. It is like using
    a word processor to do your accounting. You defeat the possible error
    checking.



    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 21, 2004
    #15
  16. Katrin Tomanek

    Roedy Green Guest

    On Mon, 21 Jun 2004 07:32:12 GMT-5, (Dale
    King) wrote or quoted :

    >I find that to be a gross exaggeration, but neither of us has
    >hard data. I would also say that the development time for coding
    >the parser and editor for a binary format is 100 times that of
    >using XML.


    Of course, BUT in a sane world XML would be a binary format and there
    would be generic parsers available. Then you would solve three of
    XML's biggest problems:

    1. fluffiness.
    2. malformed files being passed around.
    3. complicated parsers just to read it. You want something much faster
    and simpler for handheld units.


    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 21, 2004
    #16
  17. Katrin Tomanek

    Sudsy Guest

    Roedy Green wrote:
    <snip>
    > You don't use notepad to edit your Oracle files. <snip>


    You're right: I use vi.
     
    Sudsy, Jun 21, 2004
    #17
  18. Katrin Tomanek

    Stan Berka Guest

    Looks like StAX would be a good choice. Am not sure where to find an
    implementation, though.
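
    For readers finding this later: StAX has shipped with Java SE as javax.xml.stream since Java 6, so no separate download is needed there. A minimal sketch of the pull style follows; the class name StaxElementCounter and the sample input are made up for illustration, and a real run would read from a file stream rather than a string.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not from the thread: counts elements with StAX.
class StaxElementCounter {

    // Pulls events one at a time and counts start tags with the given name.
    // The application drives the loop, unlike SAX where the parser calls back.
    static int countElements(String xml, String name) {
        int count = 0;
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && name.equals(reader.getLocalName())) {
                    count++;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            // Wrapped so callers do not have to deal with checked exceptions.
            throw new IllegalStateException("parsing failed", e);
        }
        return count;
    }

    public static void main(String[] args) {
        String xml = "<records><record/><record/><other/></records>";
        System.out.println(countElements(xml, "record")); // prints 2
    }
}
```

    Like SAX, this reads the input as a stream, so memory use does not grow with the file size unless the application itself stores what it reads.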

    Stan Berka

    "Tim Ward" <> wrote in message news:<>...
    > "Katrin Tomanek" <> wrote in message
    > news:cat3l9$3ia$05$-online.com...
    > >
    > > I've got a really big XML File (about 215 MBytes), which I have to parse.

    >
    > Why is it in XML, how often does it change, and what do you have to do with
    > it when you've parsed it (and other such problem scoping questions, such as,
    > why are you assuming that the solution is some Java code)? Once that is
    > known there's a whole range of possible solutions including but not limited
    > to:
    >
    > (1) Java and SAX.
    > (2) Convert it to a proper database first, then do the queries in SQL.
    > (3) A DOM approach in C++.
    > (4) ...
     
    Stan Berka, Jun 21, 2004
    #18
  19. Roedy Green sez:
    > On Mon, 21 Jun 2004 07:14:59 GMT-5, (Dale
    > King) wrote or quoted :
    >
    >>
    >>And what specifically is wrong with allowing someone to edit it
    >>with the simplest of tools? That isn't even an option with a
    >>binary format.

    >
    > Because that introduces the option of error. If you use the proper
    > tool you don't litter the Internet with malformed files.
    >
    > Look at the mess HTML is in because we allow hand editing and
    > publishing. If HTML had to go through a processor before being
    > published it would be very unlikely you would have malformed published
    > files, and browsers would not have to deal with such crap.


    LOL. Roedy, either you've never looked at output of any "HTML processor",
    or you're posting from a parallel universe.

    Dima
    --
    ....the mainstream products of major vendors largely ignore these demonstrated
    technologies... [Instead, their customers] are left with several ineffective
    solutions collected under marketing titles like "defense in depth".
    -- Thirty Years Later: Lessons from the Multics Security Evaluation
     
    Dimitri Maziuk, Jun 22, 2004
    #19
  20. Katrin Tomanek

    Roedy Green Guest

    On Tue, 22 Jun 2004 17:04:30 +0000 (UTC), Dimitri Maziuk
    <dima@127.0.0.1> wrote or quoted :

    >
    >LOL. Roedy, either you've never looked at output of any "HTML processor",
    >or you're posting from a parallel universe.


    You are missing my point. I believe that for both XML and HTML, the
    thing actually posted should be a binary format. No one would ever
    read or edit it directly, and it would be preparsed and guaranteed to
    meet the spec. Anything hand-coded with notepad is guaranteed to have
    some errors. Even though I validate my HTML daily, you will always
    find some HTML errors in there, and also some quasi errors that I tell
    the verifier to ignore. My site is very clean compared with most.

    See http://mindprod.com/jgloss/xml.html and
    http://mindprod.com/projects/htmlcompactor.html for the sort of
    formats I had in mind.


    When you want to view the HTML/XML you use a viewer or editor.
    Traditionalists could fluff it up to something like conventional HTML or
    XML for viewing. I would prefer something more graphic like a JTree or
    WYSIWYG.

    How many of you are old enough to remember WordStar? It was
    conceptually easy to understand because you embedded visible tags in
    your text. Then Word came along and hid the tags, and just let you
    think in terms of the final outcome. It drove everyone mad at first
    since Word did such a bad job of the internal tags, but in the long
    run the impossibility of getting invalid or unbalanced tags won out.

    XML is just about data, so you don't have that same problem. With
    HTML it would be a lot easier to collapse and clean up a preparsed tree.

    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 22, 2004
    #20