programming: SAX and get content between open and close tag?

Discussion in 'XML' started by Rui Maciel, Jul 6, 2006.

  1. Rui Maciel

    Rui Maciel Guest

    Is it possible to, using the SAX approach, extract the XML content between
    an opening and closing tag as if it was a continuous string of text?

    For example, let's say we have the following document:

    <first>
    <second>
    <alpha>foo</alpha>
    <beta>bar</beta>
    </second>
    </first>

    Is it possible to directly extract the content between <first> and </first>
    as if it was a text string?


    Thanks in advance
    Rui Maciel
    --
    Running Kubuntu 6.06 with KDE 3.5.3 and proud of it.
    Rui Maciel, Jul 6, 2006
    #1

  2. Rui Maciel wrote:

    > <first>
    > <second>
    > <alpha>foo</alpha>
    > <beta>bar</beta>
    > </second>
    > </first>
    >
    > Is it possible to directly extract the content between <first> and </first>
    > as if it was a text string?


    I think it's not possible. Do you expect the tags inside
    <first> to appear as text as well? Or do you expect only the
    character data between the tags to appear? What
    _exactly_ do you expect to be the result of your example?
    Juergen Kahrs, Jul 6, 2006
    #2

  3. Rui Maciel

    Rui Maciel Guest

    Juergen Kahrs wrote:

    > I think it's not possible. Do you expect the tags inside
    > <first> to appear as text as well? Or do you expect only the
    > character data between the tags to appear? What
    > exactly do you expect to be the result of your example?


    What I had in mind was to extract the literal text which is enclosed in the
    <first> and </first> tags, where the child tags would appear also as if
    they were text. To put it in other words, extract the XML subsection
    enclosed by the <first> and </first> tags.

    Is it possible?


    Thanks and best regards
    Rui Maciel
    --
    Running Kubuntu 6.06 with KDE 3.5.3 and proud of it.
    Rui Maciel, Jul 6, 2006
    #3
  4. Rui Maciel wrote:
    > Is it possible to directly extract the content between <first> and </first>
    > as if it was a text string?


    Not using standard SAX. Run those events back through a SAX serializer
    to regenerate the text from them.
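
    A minimal sketch of that re-serialization idea, assuming Python's standard
    xml.sax module rather than Qt's classes. The names SubtreeSerializer and
    inner_xml are made up for illustration; XMLGenerator is the serializer
    shipped in xml.sax.saxutils.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class SubtreeSerializer(xml.sax.ContentHandler):
    """Forward every event that occurs inside the target element to a serializer."""

    def __init__(self, target, out):
        super().__init__()
        self.target = target
        self.depth = 0  # > 0 while the parser is inside the target element
        self.writer = XMLGenerator(out, encoding="utf-8")

    def startElement(self, name, attrs):
        # Write first, then count: the target's own start tag is skipped.
        if self.depth > 0:
            self.writer.startElement(name, attrs)
        if name == self.target:
            self.depth += 1

    def endElement(self, name):
        # Count first, then write: the target's own end tag is skipped.
        if name == self.target:
            self.depth -= 1
        if self.depth > 0:
            self.writer.endElement(name)

    def characters(self, content):
        if self.depth > 0:
            self.writer.characters(content)


def inner_xml(document, target):
    """Return the re-serialized content between <target> and </target>."""
    out = io.StringIO()
    xml.sax.parseString(document.encode("utf-8"), SubtreeSerializer(target, out))
    return out.getvalue()


if __name__ == "__main__":
    doc = "<first><second><alpha>foo</alpha><beta>bar</beta></second></first>"
    print(inner_xml(doc, "first"))
```

    The result is equivalent to the source, not byte-identical: comments are
    not forwarded by this handler, and details such as attribute quoting or
    the spelling of entity references come from the serializer, not the input.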
    Joe Kesselman, Jul 6, 2006
    #4
  5. Rui Maciel

    Guest

    Joe Kesselman wrote:
    > Not using standard SAX. Run those events back through a SAX serializer
    > to regenerate the text from them.


    I see what you mean. But that seems to be a bit redundant, doesn't it?
    I mean, run an XML text through a parser, decompose it and then generate
    the exact same information from the parser's information... It looks
    like too much trouble just to end up practically where we were before.
    It would be a lot simpler if it was possible to extract the original
    content which is enclosed by certain tags.


    Rui Maciel
    Rui Maciel, Jul 6, 2006
    #5
  6. Rui Maciel wrote:
    > It would be a lot simpler if it was possible to extract the original
    > content which is enclosed by certain tags.


    The parser has to grovel through all the bytes anyway, to make sure it
    has found the correct matching close-tag.

    And this is a relatively uncommon case. Normally if folks are reading an
    XML document at all, it's because they want its meaning, not its markup.
    (For example, note that the meaning of the text is indeterminate without
    knowing what namespace declarations it inherits from its surrounding
    context.)

    There are special cases where this could be useful... but SAX is
    designed for the most general cases.
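
    A small illustration of the namespace point above, assuming Python's
    standard xml.etree.ElementTree. The document, the urn:example:x namespace
    and the helper parses_alone are invented for the example. The raw text cut
    out from between the tags is no longer parseable on its own, because the
    prefix it uses was declared on the surrounding element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the prefix "x" is declared on <first>, not on its child.
DOCUMENT = (
    '<first xmlns:x="urn:example:x">'
    "<x:alpha>foo</x:alpha>"
    "</first>"
)


def parses_alone(fragment):
    """Return True if the fragment is a well-formed, namespace-valid document."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Cut out the literal text between the start tag and the end tag of <first>.
fragment = DOCUMENT[DOCUMENT.index(">") + 1 : DOCUMENT.rindex("</first>")]

if __name__ == "__main__":
    print(fragment)
    print(parses_alone(DOCUMENT))  # the whole document parses
    print(parses_alone(fragment))  # the extracted text does not: unbound prefix
```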
    Joe Kesselman, Jul 6, 2006
    #6
  7. Greger

    Greger Guest

    Rui Maciel wrote:

    >
    > Joe Kesselman wrote:
    >> Not using standard SAX. Run those events back through a SAX serializer
    >> to regenerate the text from them.

    >
    > I see what you mean. But that seems to be a bit redundant, doesn't it?
    > I mean, run an XML text through a parser, decompose it and then generate
    > the exact same information from the parser's information... It looks
    > like too much trouble just to end up practically where we were before.
    > It would be a lot simpler if it was possible to extract the original
    > content which is enclosed by certain tags.
    >
    >
    > Rui Maciel

    http://www.saxproject.org/quickstart.html
    for Java. What language do you use?
    --
    Qx RSS Reader 1.2.6a released
    RSS Reader for Linux.
    http://www.gregerhaga.net/qxrss-1.2.6-dox
    Greger, Jul 6, 2006
    #7
  8. Rui Maciel

    Guest

    Greger wrote:

    > http://www.saxproject.org/quickstart.html
    > for Java. What language do you use?


    I'm using C++ at the moment with Qt's XML library.

    That site seems rather nice. I'll read it to see if I can finally get the
    hang of this XML parsing thing.


    Thanks for your help
    Rui Maciel
    Rui Maciel, Jul 6, 2006
    #8
  9. William Park

    William Park Guest

    Rui Maciel wrote:
    > Juergen Kahrs wrote:
    >
    > > I think it's not possible. Do you expect the tags inside
    > > <first> to appear as text as well? Or do you expect only the
    > > character data between the tags to appear? What
    > > exactly do you expect to be the result of your example?

    >
    > What I had in mind was to extract the literal text which is enclosed in the
    > <first> and </first> tags, where the child tags would appear also as if
    > they were text. To put it in other words, extract the XML subsection
    > enclosed by the <first> and </first> tags.
    >
    > Is it possible?


    If the <first> tag is not nested, then treat the XML file as one long string.
    So, find the first <first>, then find the first </first>. Otherwise,
    you have to do some bookkeeping.
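
    A minimal sketch of the string-search approach for the non-nested case,
    written in Python for illustration. The function name between_tags is
    hypothetical, and the sketch assumes the start tag carries no attributes.

```python
def between_tags(text, name):
    """Return the raw text between the first <name> and the next </name>.

    Assumes <name> is not nested inside itself and has no attributes.
    Returns None if either tag is missing.
    """
    open_tag = f"<{name}>"
    close_tag = f"</{name}>"

    start = text.find(open_tag)
    if start == -1:
        return None
    start += len(open_tag)  # skip past the start tag itself

    end = text.find(close_tag, start)
    if end == -1:
        return None
    return text[start:end]


if __name__ == "__main__":
    doc = (
        "<first>\n<second>\n<alpha>foo</alpha>\n"
        "<beta>bar</beta>\n</second>\n</first>\n"
    )
    print(between_tags(doc, "first"))
```

    With a nested <first>, the search stops at the inner </first> and returns
    a truncated fragment, which is where the bookkeeping comes in.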

    --
    William Park, Toronto, Canada
    ThinFlash: Linux thin-client on USB key (flash) drive
    http://home.eol.ca/~parkw/thinflash.html
    BashDiff: Super Bash shell
    http://freshmeat.net/projects/bashdiff/
    William Park, Jul 6, 2006
    #9
  10. William Park wrote:
    > If the <first> tag is not nested, then treat the XML file as one long string.
    > So, find the first <first>, then find the first </first>. Otherwise,
    > you have to do some bookkeeping.


    In other words, text-based rather than XML-based processing, the
    "desperate PERL hacker" solution. Doable. Ugly. Sometimes worth
    considering, but often means you're asking the wrong questions or
    optimizing the wrong things.

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Jul 7, 2006
    #10
  11. Malcolm Dew-Jones wrote:
    > You have to provide the my_print_xxx_as_text routines, and of course the
    > above is completely pseudo code, but I think you might get the idea.


    That's the "reserialize SAX events into text form" solution, which Rui
    was objecting to.


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Jul 7, 2006
    #11
  12. Joe Kesselman wrote:
    : William Park wrote:
    : > If the <first> tag is not nested, then treat the XML file as one long string.
    : > So, find the first <first>, then find the first </first>. Otherwise,
    : > you have to do some bookkeeping.

    : In other words, text-based rather than XML-based processing, the
    : "desperate PERL hacker" solution. Doable. Ugly. Sometimes worth
    : considering, but often means you're asking the wrong questions or
    : optimizing the wrong things.

    No, I think he means that your SAX event handler code does something like
    the following:

    global variable first_depth = 0;

    sub start_element( the_element_as_an_object )
    {
        # Count the <first> tag before printing, so the tag itself is included.
        if (the_element_as_an_object->its_name == 'first')
        {
            first_depth++;
        }

        if (first_depth > 0)
        {
            my_print_element_as_text( the_element_as_an_object );
        }
    }

    sub end_element( the_element_end_as_an_object )
    {
        # Print before decrementing, so the closing </first> is included too.
        if (first_depth > 0)
        {
            my_print_element_end_as_text( the_element_end_as_an_object );
        }

        if (the_element_end_as_an_object->its_name == 'first')
        {
            first_depth--;
        }
    }

    sub handle_everything_else( the_thing_as_an_object )
    {
        if (first_depth > 0)
        {
            my_print_thing_as_text( the_thing_as_an_object );
        }
    }


    You have to provide the my_print_xxx_as_text routines, and of course the
    above is completely pseudo code, but I think you might get the idea.
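
    One way the pseudocode above might look as runnable code, assuming
    Python's standard xml.sax module. The class FirstPrinter and the function
    extract are hypothetical names, and the hand-written "print as text" steps
    use escape and quoteattr from xml.sax.saxutils. Only elements and
    character data are handled here; comments and processing instructions
    are ignored.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class FirstPrinter(xml.sax.ContentHandler):
    """Collect the text form of every event at or below the target element."""

    def __init__(self, target="first"):
        super().__init__()
        self.target = target
        self.first_depth = 0  # how many target elements are currently open
        self.parts = []       # pieces of output text, joined at the end

    def startElement(self, name, attrs):
        if name == self.target:
            self.first_depth += 1
        if self.first_depth > 0:
            # Rebuild the start tag, quoting attribute values safely.
            attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            self.parts.append(f"<{name}{attr_text}>")

    def endElement(self, name):
        if self.first_depth > 0:
            self.parts.append(f"</{name}>")
        if name == self.target:
            self.first_depth -= 1

    def characters(self, content):
        if self.first_depth > 0:
            # Re-escape &, < and > so the output is well-formed again.
            self.parts.append(escape(content))


def extract(document, target="first"):
    """Return the target element, tags included, rebuilt from SAX events."""
    handler = FirstPrinter(target)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return "".join(handler.parts)


if __name__ == "__main__":
    doc = '<root><first id="1"><alpha>a &amp; b</alpha></first><other/></root>'
    print(extract(doc))
```

    As in the pseudocode, the counter is incremented before printing and
    decremented after, so the <first> and </first> tags are part of the output.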
    Malcolm Dew-Jones, Jul 7, 2006
    #12
  13. Greger

    Greger Guest

    Rui Maciel wrote:

    >
    > Greger wrote:
    >
    >> http://www.saxproject.org/quickstart.html
    >> for Java. What language do you use?

    >
    > I'm using C++ at the moment with Qt's XML library.
    >
    > That site seems rather nice. I'll read it to see if I can finally get the
    > hang of this XML parsing thing.
    >
    >
    > Thanks for your help
    > Rui Maciel

    I have never used SAX myself (I use the libxml2 tree API in my project), but what
    you'd probably need to do is "trigger" the function that processes the
    contents of a tag when the tag type you are looking for occurs.
    Better: see the Qt documentation. I am sure there are simple ways to achieve
    what you are trying to do.
    --
    Qx RSS Reader 1.2.6a released
    RSS Reader for Linux.
    http://www.gregerhaga.net/qxrss-1.2.6-dox
    Greger, Jul 7, 2006
    #13