Search for string, then extract entire XML element where it appears. How?

Discussion in 'XML' started by mandibdc@gmail.com, Jun 30, 2006.

  1. Guest

    I need to extract some elements from a very large XML file. Because of
    the size, I'd like to work with it on my Linux machine as a text file.

    Basically, I am going to have a list of specific strings I'm searching
    for. For each string, I need to search through the XML file, and when
    I find that string (in the tag <code>), copy the entire <item> XML
    element that the code appears in, into another text file.

    The XML document is comprised of a bunch of <item> elements:

    <?xml version="1.0" encoding="UTF-8"?>
    <item>
    <property1>100</property1>
    <property2>
    <id>0</id>
    <code>ThisIsTheStringINeedToMatch</code>
    </property2>
    <keyword>
    <value>value1</value>
    <value>value2</value>
    </keyword>
    <color>
    <type>21</type>
    <shade>1</shade>
    </color>
    </item>

    How would you approach this? I can write a script to find each code,
    but I'm not sure how to then search forwards/backwards to extract the
    enclosing <item> element.

    Thanks!

    M
     
    , Jun 30, 2006
    #1

  2. Re: Search for string, then extract entire XML element where it appears. How?

    wrote:
    > Basically, I am going to have a list of specific strings I'm searching
    > for. For each string, I need to search through the XML file, and when
    > I find that string (in the tag <code>), copy the entire <item> XML
    > element that the code appears in, into another text file.
    >
    > How would you approach this?


    Using which tool?

    In XPath, including XSLT, use ancestor::item to find the enclosing item
    element.

    If you're operating on the DOM APIs, simply iterate your way up the
    parents looking for that item element... or use the filtered traversal
    mechanisms, if your DOM supports them.
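
A minimal sketch of the DOM approach in Python's standard xml.dom.minidom, offered only as an illustration. The <items> wrapper element, the sample codes, and the helper names enclosing_item and items_with_code are assumptions made for the example; they are not from the original post.

```python
from xml.dom import minidom

# Hypothetical sample data; an <items> root is assumed because a
# well-formed XML document needs exactly one root element.
SAMPLE = """<items>
<item><property2><id>0</id><code>ABC</code></property2></item>
<item><property2><id>1</id><code>XYZ</code></property2></item>
</items>"""


def enclosing_item(node):
    """Walk up the parentNode links until an <item> element is found."""
    while node is not None:
        if node.nodeType == node.ELEMENT_NODE and node.tagName == "item":
            return node
        node = node.parentNode
    return None  # reached the top of the document without finding one


def items_with_code(xml_text, wanted):
    """Return the serialized <item> elements whose <code> text is in wanted."""
    doc = minidom.parseString(xml_text)
    found = []
    for code in doc.getElementsByTagName("code"):
        # Join the plain text children of <code>.
        text = "".join(
            child.data
            for child in code.childNodes
            if child.nodeType == child.TEXT_NODE
        )
        if text.strip() in wanted:
            item = enclosing_item(code)
            if item is not None:
                found.append(item.toxml())
    return found


if __name__ == "__main__":
    for xml in items_with_code(SAMPLE, {"XYZ"}):
        print(xml)
```

Note that a DOM holds the whole document in memory, so this only suits files that fit comfortably in RAM.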

    If you're working in SAX... SAX can't run backward, so it's up to you to
    do some sort of buffering so you can re-scan once you recognize the item
    as being one you're interested in.
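
One way the SAX buffering could look, sketched with Python's standard xml.sax. It assumes <item> elements are not nested inside each other and re-serializes each item's events into a buffer that is kept only if its <code> matches. The class name ItemFilter, the helper filter_items, and the sample data are hypothetical.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class ItemFilter(xml.sax.ContentHandler):
    """Buffer each <item> as text; keep it only if its <code> is wanted."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.matches = []
        self._buf = None       # text buffer for the current <item>
        self._gen = None       # re-serializes events into the buffer
        self._in_code = False
        self._code_text = []
        self._keep = False

    def startElement(self, name, attrs):
        if name == "item":     # assumption: items are never nested
            self._buf = io.StringIO()
            self._gen = XMLGenerator(self._buf, encoding="utf-8")
            self._keep = False
        if self._gen is not None:
            self._gen.startElement(name, attrs)
            if name == "code":
                self._in_code = True
                self._code_text = []

    def characters(self, content):
        if self._gen is not None:
            self._gen.characters(content)
        if self._in_code:
            self._code_text.append(content)

    def endElement(self, name):
        if self._gen is None:
            return             # outside any <item>; nothing is buffered
        self._gen.endElement(name)
        if name == "code":
            self._in_code = False
            if "".join(self._code_text).strip() in self.wanted:
                self._keep = True
        elif name == "item":
            if self._keep:
                self.matches.append(self._buf.getvalue())
            self._buf = self._gen = None   # drop the buffer either way


def filter_items(xml_bytes, wanted):
    """Parse xml_bytes and return the buffered text of matching items."""
    handler = ItemFilter(wanted)
    xml.sax.parseString(xml_bytes, handler)
    return handler.matches


if __name__ == "__main__":
    data = (b"<items><item><property2><code>ABC</code></property2></item>"
            b"<item><property2><code>XYZ</code></property2></item></items>")
    print(filter_items(data, {"XYZ"}))
```

Only one item's worth of text is held at any time, so memory use stays bounded by the largest single <item>.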
     
    Joe Kesselman, Jun 30, 2006
    #2

  3. Guest

    Re: Search for string, then extract entire XML element where it appears. How?

    I was hoping to just write a text parsing script using perl, for
    example...

    But I'm open to suggestions as to how most effectively to extract data
    from this large file.

    Joe Kesselman wrote:
    > wrote:
    > > Basically, I am going to have a list of specific strings I'm searching
    > > for. For each string, I need to search through the XML file, and when
    > > I find that string (in the tag <code>), copy the entire <item> XML
    > > element that the code appears in, into another text file.
    > >
    > > How would you approach this?

    >
    > Using which tool?
    >
    > In XPath, including XSLT, use ancestor::item to find the enclosing item
    > element.
    >
    > If you're operating on the DOM APIs, simply iterate your way up the
    > parents looking for that item element... or use the filtered traversal
    > mechanisms, if your DOM supports them.
    >
    > If you're working in SAX... SAX can't run backward, so it's up to you to
    > do some sort of buffering so you can re-scan once you recognize the item
    > as being one you're interested in.
     
    , Jun 30, 2006
    #3
  4. Re: Search for string, then extract entire XML element where it appears. How?

    wrote:
    > I was hoping to just write a text parsing script using perl, for
    > example...


    Can't help; I'm not a perl user, and I tend not to reinvent wheels
    unless necessary.
     
    Joe Kesselman, Jun 30, 2006
    #4
  5. Re: Search for string, then extract entire XML element where it appears. How?

    wrote:

    > I was hoping to just write a text parsing script using perl, for
    > example...
    >
    > But I'm open to suggestions as to how most effectively to extract data
    > from this large file.



    I think Joe Kesselman summarized your set of
    options really comprehensively. Look at the
    data and decide which kind of output you need.
    You mentioned that (in case of a match), you
    need the whole element. Do you need the element
    exactly, with all possible sub-elements to
    arbitrary depth ?

    If the tree hierarchy is rather flat, then you
    could use a SAX-like parser, as described by Joe.
    SAX-like parsers are available for most languages,
    even Perl, bash, and gawk (which I prefer).
     
    Jürgen Kahrs, Jun 30, 2006
    #5
  6. Re: Search for string, then extract entire XML element where it appears. How?

    If it's a particularly huge file, I'd go with the buffered-SAX
    semi-streaming solution. (Or, possibly, StAX -- which is a sort of cross
    between SAX and DOM intended for this sort of chunk-at-a-time processing.)

    Iterate through the document. For each item element, build an in-memory
    subtree, check its <code>, output it if it's one you want, and then
    discard it. This way you don't have to keep the whole source document in
    memory at once. As a refinement, for even better efficiency, discard the
    partly-built subtree (and ignore events until it ends) as soon as you see
    that the <code> isn't one you're looking for.
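
A rough Python illustration of this chunk-at-a-time idea, using the standard library's xml.etree.ElementTree.iterparse as a pull-style stand-in for StAX (a substitution, not StAX itself). The path property2/code follows the sample document earlier in the thread; the function name extract_items and the sample data are assumptions. The early-discard refinement is not implemented here: each item's subtree is built completely before it is checked.

```python
import io
import xml.etree.ElementTree as ET


def extract_items(source, wanted):
    """Yield serialized <item> elements whose property2/code is in wanted.

    source is a filename or a binary file object. Each <item> subtree is
    built, checked, optionally emitted, and then cleared, so the complete
    document is never held in memory at once.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "item":
            continue
        code = elem.findtext("property2/code", default="")
        if code.strip() in wanted:
            yield ET.tostring(elem, encoding="unicode")
        # Release the children and text of this item. An empty <item>
        # shell stays attached to the root, which is small but not free.
        elem.clear()


if __name__ == "__main__":
    # Hypothetical data with an assumed <items> root element.
    data = io.BytesIO(
        b"<items>"
        b"<item><property2><id>0</id><code>ABC</code></property2></item>"
        b"<item><property2><id>1</id><code>XYZ</code></property2></item>"
        b"</items>"
    )
    for xml in extract_items(data, {"XYZ"}):
        print(xml)
```

Passing the wanted codes as a set keeps each lookup cheap even when the list of search strings is long.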

    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
     
    Joe Kesselman, Jul 1, 2006
    #6
  7. Peter Flynn Guest

    Re: Search for string, then extract entire XML element where it appears. How?

    wrote:
    > I was hoping to just write a text parsing script using perl, for
    > example...


    Don't. There are subtleties about the way in which XML is formed
    which will conspire to bite you in the ass if you use a non-XML
    language.

    Using Perl with one of the several XML APIs is fine, of course.
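
To make the kind of subtlety concrete, here is a small Python sketch, not taken from the original post. The same code value R&D is written three legal ways (entity reference, CDATA section, and character reference with a line break inside the start tag). A literal text search misses all of them, while a parser reports the same text for each. The sample document and the function names text_hits and parser_hits are made up for the illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical data: three well-formed spellings of the code "R&D".
DOC = """<items>
<item><property2><code>R&amp;D</code></property2></item>
<item><property2><code><![CDATA[R&D]]></code></property2></item>
<item><property2><code
>R&#38;D</code></property2></item>
</items>"""


def text_hits(doc, code):
    """Naive text search for a literal <code>VALUE</code> string."""
    pattern = "<code>" + re.escape(code) + "</code>"
    return len(re.findall(pattern, doc))


def parser_hits(doc, code):
    """Count <code> elements whose parsed text equals code."""
    root = ET.fromstring(doc)
    return sum(1 for elem in root.iter("code") if elem.text == code)


if __name__ == "__main__":
    print("text search:", text_hits(DOC, "R&D"))
    print("XML parser: ", parser_hits(DOC, "R&D"))
```

Comments, namespaces, and attribute quoting styles raise similar problems for line-oriented text tools.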

    > But I'm open to suggestions as to how most effectively to extract data
    > from this large file.


    How large is large? XSLT runs pretty fast on a modern system, and what
    you want to do isn't exactly rocket science (or if it is, I know any
    number of unemployed rocket scientists who can do it for you :)

    This seems to do the job:

    <?xml version="1.0" encoding="iso-8859-1"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">

    <xsl:output method="xml"/>

    <xsl:template match="items">
    <items>
    <xsl:apply-templates/>
    </items>
    </xsl:template>

    <xsl:template match="item">
    <xsl:if test="contains(property2/code,'Match')">
    <xsl:copy-of select="."/>
    </xsl:if>
    </xsl:template>

    </xsl:stylesheet>

    ///Peter
    --
    XML FAQ: http://xml.silmaril.ie/
     
    Peter Flynn, Jul 3, 2006
    #7
