large xml file...

Discussion in 'Java' started by boris, Aug 23, 2011.

  1. boris

    boris Guest

    Hi all,
    I need to process a large XML file and dump some documents to a different
    file based on the content of some elements.

    Let's say I need to check the content of <text3> and dump the whole <doc> to
    a different file:

    <doc>
    <text1>
    <text2>
    <text3> ... etc

    </doc>

    I'm trying to do this using SAX. Are there any examples of how to do this?
    Is SAX OK for this task?
    Thanks.
    boris, Aug 23, 2011
    #1

  2. boris

    Ian Shef Guest

    boris <> wrote in news:j2uqp4$n8h$1
    @speranza.aioe.org:

    > hi all,
    > I need to process large xml file and dump some documents to a different
    > file based on content of some elements.
    >
    > let's say I need to check content of <text3> and dump the whole <doc> to
    > a different file:
    >
    > <doc>
    > <text1>
    > <text2>
    > <text3> ... etc
    >
    > </doc>
    >
    > I'm trying to do this using sax. Are there any examples how to do this?
    > Is using sax ok for this task?
    > thanks.
    >
    >
    >


    What you are asking is unclear to me.
    Do you mean that <text3> will determine whether you dump the whole <doc> to
    another file?
    Do you mean that <text3> will determine what file the whole <doc> will be
    dumped to?
    Or do you mean that the whole <doc> will be dumped to some other file, and
    while you are at it, <text3> will also be checked and reported in some way?

    Can you read the "large xml file" twice?
    Can you put the whole "large xml file" (or at least the part preceding
    <text3>) into memory?
    Can you copy the "large xml file" to another file while it is being
    processed?

    Sorry about the questions, but I need clarification. I have used SAX and
    may be able to provide enlightenment. SAX has its uses, but is not so good
    when 'memory' is involved unless _you_ provide the memory. SAX appears to
    excel when processing can take place in a single pass with very little
    looking backwards. Consequently, it does not use as much memory as some
    other methods.
    Ian Shef, Aug 23, 2011
    #2

  3. boris

    boris Guest

    On 08/22/2011 08:43 PM, Ian Shef wrote:
    > Do you mean that <text3> will determine whether you dump the whole <doc>
    > to another file?

    yes


    > Can you read the "large xml file" twice?

    I would like to read it once.

    > Can you put the whole "large xml file" (or at least the part preceding
    > <text3>) into memory?

    no.
    boris, Aug 23, 2011
    #3
  4. boris

    boris Guest

    > On 08/22/2011 08:43 PM, Ian Shef wrote:

    > > Can you put the whole "large xml file" (or at least the part preceding
    > > <text3>) into memory?

    > no.


    No, I can load the whole file. 1 doc is not a problem...
    boris, Aug 23, 2011
    #4
  5. boris

    Arne Vajhøj Guest

    On 8/22/2011 8:05 PM, boris wrote:
    > I need to process large xml file and dump some documents to a different
    > file based on content of some elements.
    >
    > let's say I need to check content of <text3> and dump the whole <doc> to
    > a different file:
    >
    > <doc>
    > <text1>
    > <text2>
    > <text3> ... etc
    >
    > </doc>
    >
    > I'm trying to do this using sax. Are there any examples how to do this?
    > Is using sax ok for this task?


    SAX or StAX seem like the most obvious choices given the context.

    Any textbook SAX example should lead you to working code.

    I can post some code, but I doubt that it will show anything
    that various books and tutorials do not.

    Arne
    Arne Vajhøj, Aug 23, 2011
    #5
  6. boris

    Ian Shef Guest

    boris <> wrote in
    news:j2utnu$t1q$:

    >> On 08/22/2011 08:43 PM, Ian Shef wrote:

    >
    >> > Can you put the whole "large xml file" (or at least the part
    >> > preceding <text3>) into memory?

    >> no.

    >
    > No, I can load the whole file. 1 doc is not a problem...
    >
    >
    >
    >


    As you are processing, you can save the XML yourself (e.g. as a List of
    String_s).

    Based on the result of evaluating <text3>, you can choose to:

    Open an output file, copy the List of String_s to the output file, and copy
    any succeeding XML to the output file, or discard the List and discontinue
    processing.

    Alternatively, you can save the XML to a file as you process it. When you
    evaluate <text3>, you can choose to continue saving to the file, or delete
    the file and discontinue processing.
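
    The buffering approach described above can be sketched roughly as follows.
    This is only an illustration under stated assumptions: the class name
    DocFilter, the "needle" substring test on <text3>, and the hand-written
    escape() helper are all hypothetical and not from the thread. Each <doc> is
    re-serialized into a StringBuilder as the SAX events arrive, and is written
    out only if <text3> matched. Because the parser hands over already-decoded
    characters, the sketch re-escapes them itself when buffering.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: buffers one <doc> at a time and keeps it only if
// the text of <text3> contains a given substring.
public class DocFilter extends DefaultHandler {
    private final Writer out;
    private final String needle;
    private final StringBuilder doc = new StringBuilder();   // XML of the current <doc>
    private final StringBuilder text3 = new StringBuilder(); // text of the current <text3>
    private boolean inDoc, inText3, keep;

    public DocFilter(Writer out, String needle) {
        this.out = out;
        this.needle = needle;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (qName.equals("doc")) { inDoc = true; keep = false; doc.setLength(0); }
        if (!inDoc) return;
        if (qName.equals("text3")) { inText3 = true; text3.setLength(0); }
        doc.append('<').append(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            doc.append(' ').append(atts.getQName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        doc.append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!inDoc) return;
        String s = new String(ch, start, length);
        if (inText3) text3.append(s);
        doc.append(escape(s)); // the parser decoded entities, so encode them again
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (!inDoc) return;
        doc.append("</").append(qName).append('>');
        if (qName.equals("text3")) {
            inText3 = false;
            if (text3.toString().contains(needle)) keep = true;
        }
        if (qName.equals("doc")) {
            inDoc = false;
            if (keep) {
                try {
                    out.write(doc.toString());
                    out.write('\n');
                } catch (IOException e) {
                    throw new SAXException(e);
                }
            }
        }
    }

    // Minimal escaping that is valid for both text content and quoted attributes.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Convenience wrapper for small inputs; a real run would pass file streams.
    public static String filter(String xml, String needle) throws Exception {
        StringWriter sw = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new DocFilter(sw, needle));
        return sw.toString();
    }
}
```

    Only one <doc> is held in memory at a time, so the size of the whole file
    does not matter. The sketch ignores namespaces, comments and processing
    instructions, and assumes <doc> elements are not nested.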
    Ian Shef, Aug 23, 2011
    #6
  7. boris

    boris Guest

    On 08/22/2011 09:59 PM, Arne Vajhøj wrote:
    > On 8/22/2011 8:05 PM, boris wrote:
    >> I need to process large xml file and dump some documents to a different
    >> file based on content of some elements.
    >>
    >> let's say I need to check content of <text3> and dump the whole <doc> to
    >> a different file:
    >>
    >> <doc>
    >> <text1>
    >> <text2>
    >> <text3> ... etc
    >>
    >> </doc>
    >>
    >> I'm trying to do this using sax. Are there any examples how to do this?
    >> Is using sax ok for this task?

    >
    > SAX or StAX seems as the most obvious choices given the context.
    >
    > Any textbook SAX example should lead you to working code.
    >
    > I can post some code, but I doubt that it will show anything
    > various books and tutorials does not.
    >
    > Arne
    >
    >

    I tried to accumulate the whole XML (<doc>...</doc>) as a string using
    SAX, but in that case the parser has already processed all special
    characters, so they arrive as plain characters and not as "predefined
    entities" like &quot;

    Using StAX, I get correct XML if I print events right away, but if I
    store them in a collection and print them later, I don't get the same result.
    boris, Aug 24, 2011
    #7
  8. boris <> wrote:
    > Using stax, I get correct xml, if I print events right away, but I if I
    > store them in collection and print them later , I don't get the same result.


    That sounds more like a bug in your code for "storing" and "printing later"
    than a problem with stax itself. ;)
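
    For comparison, here is a minimal sketch of storing StAX events and writing
    them later. It assumes the event API (XMLEventReader / XMLEventWriter),
    where each XMLEvent is a separate object that can be kept in a list. With
    the cursor API (XMLStreamReader) the reader itself is the current event, so
    keeping references to it would not preserve earlier state. Whether that is
    what went wrong in the code discussed here is not known. The class name
    StaxDocFilter and the substring test are hypothetical.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical sketch: buffer the events of each <doc>, replay them to a
// writer only when <text3> contained the wanted substring.
public class StaxDocFilter {
    public static String filter(String xml, String needle) throws Exception {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter sw = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(sw);

        List<XMLEvent> buffer = new ArrayList<>(); // events of the current <doc>
        StringBuilder text3 = new StringBuilder();
        boolean inDoc = false, inText3 = false, keep = false;

        while (reader.hasNext()) {
            XMLEvent e = reader.nextEvent();
            if (e.isStartElement()) {
                String name = e.asStartElement().getName().getLocalPart();
                if (name.equals("doc")) {
                    inDoc = true;
                    keep = false;
                    buffer.clear();
                } else if (name.equals("text3")) {
                    inText3 = true;
                    text3.setLength(0);
                }
            }
            if (inDoc) buffer.add(e);
            if (inText3 && e.isCharacters()) text3.append(e.asCharacters().getData());
            if (e.isEndElement()) {
                String name = e.asEndElement().getName().getLocalPart();
                if (name.equals("text3")) {
                    inText3 = false;
                    if (text3.toString().contains(needle)) keep = true;
                } else if (name.equals("doc")) {
                    inDoc = false;
                    if (keep) {
                        for (XMLEvent stored : buffer) writer.add(stored);
                    }
                }
            }
        }
        writer.flush();
        writer.close();
        reader.close();
        return sw.toString();
    }
}
```

    The serializer decides how characters such as " are written, so the output
    may differ textually from the input while still being equivalent XML.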
    Andreas Leitgeb, Aug 24, 2011
    #8
  9. boris

    Arne Vajhøj Guest

    On 8/24/2011 2:40 PM, boris wrote:
    > On 08/22/2011 09:59 PM, Arne Vajhøj wrote:
    >> On 8/22/2011 8:05 PM, boris wrote:
    >>> I need to process large xml file and dump some documents to a different
    >>> file based on content of some elements.
    >>>
    >>> let's say I need to check content of <text3> and dump the whole <doc> to
    >>> a different file:
    >>>
    >>> <doc>
    >>> <text1>
    >>> <text2>
    >>> <text3> ... etc
    >>>
    >>> </doc>
    >>>
    >>> I'm trying to do this using sax. Are there any examples how to do this?
    >>> Is using sax ok for this task?

    >>
    >> SAX or StAX seems as the most obvious choices given the context.
    >>
    >> Any textbook SAX example should lead you to working code.
    >>
    >> I can post some code, but I doubt that it will show anything
    >> various books and tutorials does not.


    > I tried to accumulate the whole xml(<doc>...</doc>) as string using sax,
    > but in this case all special characters are processed by parser
    > and are just characters and not "predefined entities" like &quot;
    >
    > Using stax, I get correct xml, if I print events right away, but I if I
    > store them in collection and print them later , I don't get the same
    > result.


    Any correct XML parser should convert the XML &quot; to a " in
    a Java String.

    Any correct XML formatter/serializer should convert it back again
    when generating new XML.

    Arne
    Arne Vajhøj, Aug 25, 2011
    #9
  10. Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:

    > Any correct XML parser should convert the XML &quot; to a " in
    > a Java String.
    >
    > Any correct XML formatter/serializer should convert it back again
    > when generating new XML.


    I think any sane XML serializer should not output " as &quot; in
    text content.

    --
    Stanimir
    Stanimir Stamenkov, Aug 25, 2011
    #10
  11. On 25/08/2011 05:57, Stanimir Stamenkov wrote:
    > Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:
    >
    >> Any correct XML parser should convert the XML &quot; to a " in
    >> a Java String.
    >>
    >> Any correct XML formatter/serializer should convert it back again
    >> when generating new XML.

    >
    > I think any sane XML serializer should not output " as &quot; in text
    > content.
    >


    If you use an XML parser to read '<foo delimiter="&quot;">...' you will
    get a structure with an attribute with a value of '"'.

    If you serialise that structure back to XML again, I would hope to get
    '<foo delimiter="&quot;">...' again. Am I wrong?
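
    A small round-trip check along these lines, as a sketch only: it uses the
    JAXP DOM parser and an identity Transformer as the serializer, and the
    class name RoundTrip is hypothetical. The exact characters the serializer
    emits for the attribute depend on the implementation, so the sketch
    compares the parsed attribute value after the round trip rather than the
    raw output text.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch: parse, serialize, parse again, compare attribute values.
public class RoundTrip {
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static String serialize(Document doc) throws Exception {
        // A Transformer created without a stylesheet performs an identity copy.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter sw = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(sw));
        return sw.toString();
    }

    public static void main(String[] args) throws Exception {
        Document first = parse("<foo delimiter=\"&quot;\">x</foo>");
        String xml = serialize(first);
        Document second = parse(xml);
        System.out.println(xml); // serialized form, implementation-dependent
        System.out.println(second.getDocumentElement().getAttribute("delimiter"));
    }
}
```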

    --
    RGB
    RedGrittyBrick, Aug 25, 2011
    #11
  12. Thu, 25 Aug 2011 10:39:17 +0100, /RedGrittyBrick/:
    > On 25/08/2011 05:57, Stanimir Stamenkov wrote:
    >> Wed, 24 Aug 2011 19:10:26 -0400, /Arne Vajhøj/:
    >>
    >>> Any correct XML parser should convert the XML &quot; to a " in
    >>> a Java String.
    >>>
    >>> Any correct XML formatter/serializer should convert it back again
    >>> when generating new XML.

    >>
    >> I think any sane XML serializer should not output " as &quot; in text
    >> content.

    >
    > If you use an XML parser to read '<foo delimiter="&quot;">...' you
    > will get a structure with an attribute with a value of '"'.
    >
    > If you serialise that structure back to XML again, I would hope to
    > get '<foo delimiter="&quot;">...' again. Am I wrong?


    The serializer may choose (or be configured) to output:

    <foo delimiter='"'>...

    But my point was text content, not attribute values:

    <foo>&quot;</foo>

    and then:

    <foo>"</foo>

    --
    Stanimir
    Stanimir Stamenkov, Aug 26, 2011
    #12
