Searching XML

Discussion in 'XML' started by Nash Kabbara, Oct 26, 2004.

  1. Nash Kabbara

    Nash Kabbara Guest

    Hi all,

    I just finished writing a log reader that reads XML logs (about 1 to 2 MB
    each). At the command line you specify the file name, the name of the
    element, and its value, like so: logreader log.txt MyElement myvalue

    In retrospect, I've noticed that it takes a long time to process. The time
    is spent comparing the value of every tag named MyElement to myvalue.
    Namely:

    // getTextNode merely returns the text value of the node
    NodeList nodeList = m_document.getElementsByTagName(MyElement);
    for (int index = 0, arrIndex = 0; index < nodeList.getLength(); index++) {
        if (getTextNode(nodeList.item(index)).trim().equals(myvalue)) {
            counter++;
            tempIndex[arrIndex++] = index;
        }
    }
     
    This takes around 20 seconds to complete. So my question is: is there some
    way to extract XML elements based on the element value? For example, XPath
    lets you choose elements based on an attribute value, so I'm wondering
    whether there is a similar mechanism that lets you grab elements based on
    their value.


    Thanks.
     
    Nash Kabbara, Oct 26, 2004
    #1

  2. Jeff Kish

    Jeff Kish Guest

    Here is a query that selects data based on element values.

    This XQuery (taken from a tutorial on the internet; I don't recall the exact doc/URL):

    for $b in document("books.xml")//book
    where some $a in $b/author
    satisfies ($a/last="Stevens" and $a/first="W.")
    return $b/title

    returns these results:

    <title>TCP/IP Illustrated</title>,
    <title>Advanced Programming in the UNIX Environment</title>


    Using this data:

    <bib>
    <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
    </book>

    <book year="1992">
    <title>Advanced Programming in the UNIX Environment</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
    </book>

    <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <price>65.95</price>
    </book>

    <book year="1999">
    <title>The Economics of Technology and Content for Digital TV</title>
    <editor><last>Gerbarg</last>
    <first>Darcy</first>
    <affiliation>CITI</affiliation>
    </editor>
    <publisher>Kluwer Academic Publishers</publisher>
    <price>129.95</price>
    </book>

    </bib>
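
    For readers without an XQuery processor, the same selection can be
    sketched with plain XPath from Java. This assumes Java 8 or later and the
    standard javax.xml.xpath API; the class name TitlesByAuthor and the method
    titles are hypothetical names chosen for illustration, not part of the
    tutorial quoted above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class TitlesByAuthor {
    // Same condition as the XQuery: some author has last="Stevens" and first="W."
    static final String QUERY =
        "//book[author[last = 'Stevens' and first = 'W.']]/title";

    /** Returns the text of every title selected by QUERY in the given XML. */
    static List<String> titles(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                QUERY, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0, n = hits.getLength(); i < n; i++) {
                result.add(hits.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<bib>"
            + "<book year=\"1994\"><title>TCP/IP Illustrated</title>"
            + "<author><last>Stevens</last><first>W.</first></author></book>"
            + "<book year=\"2000\"><title>Data on the Web</title>"
            + "<author><last>Abiteboul</last><first>Serge</first></author></book>"
            + "</bib>";
        System.out.println(titles(xml));
    }
}
```

    The nested predicate plays the role of the "some ... satisfies" clause: a
    book is kept when at least one of its author children meets both
    conditions.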

    HTH
     
    Jeff Kish, Oct 26, 2004
    #2

  3. Andy Dingley

    Andy Dingley Guest

    On Tue, 26 Oct 2004 03:47:50 -0500, Nash Kabbara <>
    wrote:

    >This takes around 20 seconds to complete processing.


    I'm not surprised! getElementsByTagName is always slow, but it's
    also inefficient here because it has to look everywhere in the
    structure to find elements and test their names. If you can narrow
    the search by looking for elements as children or grandchildren,
    rather than searching everywhere for them, then this can be a good
    tweak.
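
    A minimal Java sketch of that tweak, under the assumption (not stated in
    the thread) that the log has the shape root / entry / field, so the
    wanted element is always a grandchild of the root. The class and method
    names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class ChildScan {
    /** Counts grandchildren of the root named elementName whose trimmed text equals value. */
    static int countMatches(Document doc, String elementName, String value) {
        int count = 0;
        Node root = doc.getDocumentElement();
        // Walk sibling links directly instead of searching the whole tree.
        for (Node entry = root.getFirstChild(); entry != null; entry = entry.getNextSibling()) {
            for (Node field = entry.getFirstChild(); field != null; field = field.getNextSibling()) {
                if (field.getNodeType() == Node.ELEMENT_NODE
                        && field.getNodeName().equals(elementName)
                        && field.getTextContent().trim().equals(value)) {
                    count++;
                }
            }
        }
        return count;
    }

    /** Parses an XML string, wrapping checked exceptions for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(
            "<log><entry><MyElement> myvalue </MyElement></entry>"
            + "<entry><MyElement>other</MyElement></entry></log>");
        System.out.println(countMatches(doc, "MyElement", "myvalue"));
    }
}
```

    Whether this is actually faster depends on the parser and on the shape of
    the data, so it is worth timing against the getElementsByTagName version.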

    XML is often incredibly powerful, but this excess power can lead to
    inefficiencies if it's being used "by default" when you didn't really
    need it.

    > So my question is, is
    >there some way where I can extract xml elements based on the element value.


    Yes, XPath! Just use "//MyElementName"

    Or if MyElementName is supplied by the user, then use a [...]
    predicate and the local-name() function to get the name of the
    element, then compare it to the value of an element name supplied as a
    parameter.

    <xsl:param name="elmName" >MyElementName</xsl:param>
    ...
    //*[local-name() = string($elmName)]
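
    Outside XSLT, the same parameterised lookup can be sketched from Java by
    binding $elmName through an XPathVariableResolver. This assumes Java 8 or
    later (for the lambda); NameSearch and countByName are hypothetical names
    used only for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class NameSearch {
    /** Returns how many elements in the XML have the given local name. */
    static int countByName(String xml, String elmName) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $elmName; any other variable name is left unresolved (null).
        xpath.setXPathVariableResolver(
            name -> "elmName".equals(name.getLocalPart()) ? elmName : null);
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                "//*[local-name() = $elmName]",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            return hits.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countByName("<a><b/><c><b/></c></a>", "b"));
    }
}
```

    Passing the name as a variable, rather than pasting it into the
    expression string, avoids quoting problems when the user-supplied name
    contains unexpected characters.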


    XQuery (and various other incarnations) will do it too, and with
    better performance. It's sometimes hard to find XQuery support in an
    environment, though, whereas most will have XSLT and XPath
    available.
     
    Andy Dingley, Oct 26, 2004
    #3
  4. Jeff Kish

    Jeff Kish Guest

    I like Andy's answer better.
    Jeff Kish
     
    Jeff Kish, Oct 26, 2004
    #4
  5. Nash Kabbara

    Nash Kabbara Guest

    Hi Andy,

    Thanks for the response. Actually, the lag is not in getElementsByTagName
    but in the loop that compares the values of the tags with what the user is
    looking for (myvalue). So I was wondering if there's a built-in mechanism
    that pulls elements based on their value. When I say "value" I mean their
    content, not their name, i.e. <Element>value</Element>. Sorry for not being
    clear. It seems your XPath examples get elements based on their name, not
    their value.


    Nash
     
    Nash Kabbara, Oct 26, 2004
    #5
  6. Andy Dingley

    Andy Dingley Guest

    On Tue, 26 Oct 2004 10:09:27 -0500, Nash Kabbara <>
    wrote:

    > Thanks for the response. Actually the lag is not in getElementsByTagName,
    >but by the loop I have that compares the values of the tags with what the
    >user is looking for (myvalue).


    I don't recognise the coding platform - what is it?

    There's a lot you can do to improve that loop.
    - Use an iterator, not an array index.
    - Be suspicious of that .getLength() method, especially in a loop
      bound. Is that a per-iteration overhead you've given yourself?
    - Never trim() when you can rtrim().
    - Never trim() when you can use a space-ignoring comparison instead.
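
    If the platform is Java with the W3C DOM, some of these points could be
    sketched as below. This is an assumption-laden rewrite of the original
    loop, not the poster's code: getTextContent() (DOM Level 3) stands in for
    the getTextNode helper, and MatchLoop, matchingIndexes and parse are
    hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class MatchLoop {
    /** Positions (within the tag-name list) whose trimmed text equals value. */
    static List<Integer> matchingIndexes(Document doc, String tagName, String value) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        int length = nodes.getLength();             // read the bound once, not per iteration
        List<Integer> matches = new ArrayList<>();  // grows as needed, unlike a fixed array
        for (int i = 0; i < length; i++) {
            // getTextContent() replaces the original getTextNode helper (assumption)
            if (nodes.item(i).getTextContent().trim().equals(value)) {
                matches.add(i);
            }
        }
        return matches;
    }

    /** Parses an XML string, wrapping checked exceptions for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<log><e> a </e><e>b</e><x><e>a</e></x></log>");
        System.out.println(matchingIndexes(doc, "e", "a"));
    }
}
```

    The DOM NodeList interface has no iterator, so the index stays; hoisting
    getLength() out of the loop condition is the part that carries over
    directly. How much it helps depends on the DOM implementation.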

    The trouble with much XML optimisation is that it becomes sensitive to
    the data you feed it. Do you have a lot of matching elements to walk
    through, or is finding the set of elements the main problem?


    > So I was wondering if there's a built in
    >mechanism that pulls elements based on their Value. When I say "Value" I
    >mean their content, not their name. i.e <Element>value</Element>.


    Yes, XPath!

    Use a similar predicate, "//*[string(.) = $elmContents]"

    string() is optional (because in this context it's the default
    behaviour) but it's good practice to use it in situations like this,
    because it makes reading your code a lot clearer in the future.
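
    A minimal sketch of evaluating that predicate from Java, assuming Java 8
    or later and the standard javax.xml.xpath API. ValueSearch and
    namesWithValue are hypothetical names; the method returns the names of
    the matching elements so the result is easy to inspect.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class ValueSearch {
    /** Names of all elements whose string value equals elmContents exactly. */
    static List<String> namesWithValue(String xml, String elmContents) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $elmContents; any other variable name is left unresolved (null).
        xpath.setXPathVariableResolver(
            name -> "elmContents".equals(name.getLocalPart()) ? elmContents : null);
        try {
            NodeList hits = (NodeList) xpath.evaluate(
                "//*[string(.) = $elmContents]",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0, n = hits.getLength(); i < n; i++) {
                names.add(hits.item(i).getNodeName());
            }
            return names;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
            namesWithValue("<log><a>x</a><b>y</b><c> x </c></log>", "x"));
    }
}
```

    Two things to watch: the comparison is exact, so " x " does not match "x"
    unless the predicate uses normalize-space(.) instead of string(.), and an
    ancestor whose only text is the matching value will match as well, since
    an element's string value is the concatenation of all its descendant
    text. Restricting the path to a known element name, e.g.
    //MyElement[...], avoids the second effect.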

    --
    Smert' spamionam
     
    Andy Dingley, Oct 26, 2004
    #6
  7. Tjerk Wolterink

    I think you're coding in Java.

    It is better to use SAX, the Simple API for XML.
    You then don't have to load the entire DOM,
    and you can do some optimizations.

    SAX is a good choice if what you want to do is not too complex.
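
    A minimal SAX sketch of the log search, assuming the target element is
    never nested inside another element of the same name and that names are
    compared without namespace handling. SaxValueCounter and its count method
    are hypothetical names.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, for illustration only.
final class SaxValueCounter extends DefaultHandler {
    private final String elementName;
    private final String value;
    private final StringBuilder text = new StringBuilder();
    private boolean inside;   // true while between the start and end tag of a target element
    private int count;

    SaxValueCounter(String elementName, String value) {
        this.elementName = elementName;
        this.value = value;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(elementName)) {
            inside = true;
            text.setLength(0);   // start collecting text for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one text run in several chunks, so append.
        if (inside) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (inside && qName.equals(elementName)) {
            if (text.toString().trim().equals(value)) {
                count++;
            }
            inside = false;
        }
    }

    /** Streams the source once and counts elements whose trimmed text equals value. */
    static int count(InputSource source, String elementName, String value) {
        SaxValueCounter handler = new SaxValueCounter(elementName, value);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
        return handler.count;
    }

    public static void main(String[] args) {
        String xml = "<log><MyElement> myvalue </MyElement><MyElement>other</MyElement></log>";
        System.out.println(count(new InputSource(new StringReader(xml)), "MyElement", "myvalue"));
    }
}
```

    For a log file on disk, the InputSource would wrap a file stream or
    system ID instead of a StringReader. Nothing but the current element's
    text is held in memory at any time.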

    Greetz
    Tjerk

     
    Tjerk Wolterink, Oct 26, 2004
    #7
  8. Jeff Kish

    Jeff Kish Guest

    <snip>
    >Yes, XPath !
    >
    >Use a similar predicate, "//*[string (.) = $elmContents]"
    >
    >string() is optional (because in this context it's the default
    >behaviour) but it's good practice to use it in situations like this,
    >because it makes reading your code a lot clearer in the future.

    <snip>
    Lots of good info in this thread!
    Yes, SAX if you don't need to load your entire document into memory.

    Oh, regarding XQuery:

    for $b in document("books.xml")//*[.="TCP/IP Illustrated"]
    return
    <temp>{string($b/.), name($b/.)}</temp>

    {-- results in this output
    <temp>TCP/IP Illustrated title</temp>
    --}

    Jeff Kish
     
    Jeff Kish, Oct 26, 2004
    #8
