Extracting data from xml file

Discussion in 'XML' started by Mag Gam, Mar 3, 2007.

  1. Mag Gam

    Mag Gam Guest

    Hi All,
    I am new to XML, and trying to extract some data from a file.

    The file looks like this:
    <CATALOG>
    <CD>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>10.90</PRICE>
    <YEAR>1985</YEAR>
    </CD>
    <TAPE>
    <TITLE>Empire Burlesque</TITLE>
    <ARTIST>Bob Dylan</ARTIST>
    <COUNTRY>USA</COUNTRY>
    <COMPANY>Columbia</COMPANY>
    <PRICE>6.99</PRICE>
    <YEAR>1985</YEAR>
    <TAPE>
    <CATALOG>

    I am trying to get
    Artist: Bob Dylan
    Company: Columbia
    CD Price: 10.90
    Tape Price: 6.99


    What is the best method to do this? Is there a tool or utility you can
    recommend for Windows?
    Mag Gam, Mar 3, 2007
    #1

  2. Joe Kesselman

    > What is the best method to do this?

    Lots of tutorials exist on the web. My standard recommended starting
    point: http://www.ibm.com/xml

    (I'd probably hardcode it using DOM or SAX. But it might be easier for a
    novice to write an XSLT stylesheet. There are other tools which might be
    easier again, but they're less well standardized and I hesitate to
    recommend that a novice invest in learning them.)
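
    As a rough illustration of the DOM route, here is a minimal Python sketch using the standard xml.dom.minidom module. Python is not one of the tools named in this thread, and the sketch assumes the sample has first been corrected so that TAPE and CATALOG are properly closed. The names CATALOG_XML, child_text and summarize are made up for the example.

```python
from xml.dom import minidom

# Assumed correction of the posted sample: TAPE and CATALOG now have closing tags.
CATALOG_XML = """<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
</TAPE>
</CATALOG>"""


def child_text(parent, tag):
    """Return the text of the first <tag> element found under parent."""
    node = parent.getElementsByTagName(tag)[0]
    return node.firstChild.data


def summarize(xml_text):
    """Build the four requested lines from the first CD and the first TAPE."""
    doc = minidom.parseString(xml_text)
    cd = doc.getElementsByTagName("CD")[0]
    tape = doc.getElementsByTagName("TAPE")[0]
    return [
        "Artist: " + child_text(cd, "ARTIST"),
        "Company: " + child_text(cd, "COMPANY"),
        "CD Price: " + child_text(cd, "PRICE"),
        "Tape Price: " + child_text(tape, "PRICE"),
    ]


print("\n".join(summarize(CATALOG_XML)))
```

    This only looks at the first CD and the first TAPE, so it would need a loop (and some way of pairing entries) for a catalog with more than one title.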


    --
    () ASCII Ribbon Campaign | Joe Kesselman
    /\ Stamp out HTML e-mail! | System architexture and kinetic poetry
    Joe Kesselman, Mar 3, 2007
    #2

  3. roy axenov

    roy axenov Guest

    On Mar 3, 7:57 pm, "Mag Gam" <> wrote:
    > <CATALOG>
    > <CD>
    > <TITLE>Empire Burlesque</TITLE>
    > <ARTIST>Bob Dylan</ARTIST>
    > <COUNTRY>USA</COUNTRY>
    > <COMPANY>Columbia</COMPANY>
    > <PRICE>10.90</PRICE>
    > <YEAR>1985</YEAR>
    > </CD>
    > <TAPE>
    > <TITLE>Empire Burlesque</TITLE>
    > <ARTIST>Bob Dylan</ARTIST>
    > <COUNTRY>USA</COUNTRY>
    > <COMPANY>Columbia</COMPANY>
    > <PRICE>6.99</PRICE>
    > <YEAR>1985</YEAR>
    > <TAPE>
    > <CATALOG>


    This is not well-formed and therefore not XML. If that's
    your real data, XML tools are quite unlikely to help you.
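
    To see what "not well-formed" means in practice, here is a small Python sketch (an illustration only, not a tool suggested in this thread) that feeds a shortened version of the posted markup to the standard xml.etree.ElementTree parser. The names AS_POSTED, CORRECTED and is_well_formed are hypothetical, and CORRECTED is an assumed fix.

```python
import xml.etree.ElementTree as ET

# Shortened form of the data as posted: the last two tags open new
# elements instead of closing TAPE and CATALOG.
AS_POSTED = "<CATALOG><TAPE><PRICE>6.99</PRICE><TAPE><CATALOG>"

# The same fragment with proper closing tags (an assumed correction).
CORRECTED = "<CATALOG><TAPE><PRICE>6.99</PRICE></TAPE></CATALOG>"


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises ParseError."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


print("as posted:", is_well_formed(AS_POSTED))
print("corrected:", is_well_formed(CORRECTED))
```

    A conforming parser has to reject the first string, which is why XML tools cannot be expected to cope with it.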

    Assuming it's just another case of 'oh, for some reason I
    just typed that in instead of using copy-paste'...

    > I am trying to get
    > Artist: Bob Dylan
    > Company: Columbia
    > CD Price: 10.90
    > Tape Price: 6.99


    Another day, another grouping problem...

    > What is the best method to do this? Is there a tool or
    > utility you can recommend for Windows?


    Define 'best'. Define 'utility'. I don't believe there's a
    DWIM-type tool that would automagically, well, do what you
    mean at a click of a button. Therefore, it's a programming
    problem. You could use a DOM or SAX parser in your language
    of choice, as Joseph proposed. Or you could use XSLT. Or
    maybe XQuery or xmlgawk. In case it's XSLT/XQuery, I
    believe there are many GUI tools that might make working
    with the code easier for you; I'm not sure if there are any
    good open source ones, though. If you'd be happy with
    Unix-style small tools, there's a number of open source
    XSLT processors, including Saxon (it's written in Java, so
    it shouldn't be a problem running it on a Windows box),
    xsltproc and xalan (if there are no native ports, Cygwin or
    MinGW will probably save the day). In short, you should
    determine what you want then google for it. Come back with
    specific questions.

    Here's a transformation that does more or less what you
    want with your sample data (after it's been fixed, of
    course):

    <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:key name="id" match="CD|TAPE"
    use="concat(TITLE,ARTIST,COMPANY)"/>
    <xsl:key name="first" match="CD|TAPE"
    use=
    "
    generate-id()=
    generate-id
    (
    key('id',concat(TITLE,ARTIST,COMPANY))[1]
    )
    "/>
    <xsl:output method="text"/>
    <xsl:template match="@*|node()"/>
    <xsl:template match="/">
    <xsl:apply-templates select="key('first',true())"/>
    </xsl:template>
    <xsl:template match="CD|TAPE">
    <xsl:text>
    </xsl:text>
    <xsl:apply-templates/>
    <xsl:apply-templates
    select="key('id',concat(TITLE,ARTIST,COMPANY))"
    mode="prices"/>
    </xsl:template>
    <xsl:template match="TITLE">
    <xsl:text>Title: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>
    </xsl:text>
    </xsl:template>
    <xsl:template match="ARTIST">
    <xsl:text>Artist: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>
    </xsl:text>
    </xsl:template>
    <xsl:template match="COMPANY">
    <xsl:text>Company: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>
    </xsl:text>
    </xsl:template>
    <xsl:template match="@*|node()" mode="prices"/>
    <xsl:template match="CD|TAPE" mode="prices">
    <xsl:apply-templates mode="prices"/>
    </xsl:template>
    <xsl:template match="CD/PRICE" mode="prices">
    <xsl:text>CD Price: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>
    </xsl:text>
    </xsl:template>
    <xsl:template match="TAPE/PRICE" mode="prices">
    <xsl:text>Tape Price: </xsl:text>
    <xsl:value-of select="."/>
    <xsl:text>
    </xsl:text>
    </xsl:template>
    </xsl:stylesheet>
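
    For readers who find the key-based grouping hard to follow, the same idea can be sketched in Python: collect the CD and TAPE entries that share a title, artist and company, print the shared fields once, then print one price line per entry. This is only an illustration of the grouping logic, not a translation of the stylesheet; CATALOG_XML is an assumed corrected version of the sample, and LABELS and grouped_report are made-up names.

```python
import xml.etree.ElementTree as ET

# Assumed correction of the posted sample (closing tags fixed).
CATALOG_XML = """<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
</TAPE>
</CATALOG>"""

# Element name -> label used on the price line.
LABELS = {"CD": "CD Price", "TAPE": "Tape Price"}


def grouped_report(xml_text):
    """Group CD/TAPE entries by (title, artist, company) and list their prices."""
    root = ET.fromstring(xml_text)
    groups = {}  # (title, artist, company) -> [(element name, price), ...]
    for item in root:
        if item.tag not in LABELS:
            continue
        key = (
            item.findtext("TITLE"),
            item.findtext("ARTIST"),
            item.findtext("COMPANY"),
        )
        groups.setdefault(key, []).append((item.tag, item.findtext("PRICE")))

    lines = []
    for (title, artist, company), prices in groups.items():
        lines += ["Title: " + title, "Artist: " + artist, "Company: " + company]
        lines += [LABELS[tag] + ": " + price for tag, price in prices]
    return lines


print("\n".join(grouped_report(CATALOG_XML)))
```

    The dictionary plays the role of the 'id' key in the stylesheet, and iterating over its entries corresponds to visiting only the first element of each group.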

    --
    roy axenov
    roy axenov, Mar 3, 2007
    #3
  4. Jürgen Kahrs

    Mag Gam wrote:
    > Hi All,
    > I am new to XML, and trying to extract some data from a file.
    >
    > The file looks like this:
    > <CATALOG>
    > <CD>
    > <TITLE>Empire Burlesque</TITLE>
    > <ARTIST>Bob Dylan</ARTIST>
    > <COUNTRY>USA</COUNTRY>
    > <COMPANY>Columbia</COMPANY>
    > <PRICE>10.90</PRICE>
    > <YEAR>1985</YEAR>
    > </CD>
    > <TAPE>
    > <TITLE>Empire Burlesque</TITLE>
    > <ARTIST>Bob Dylan</ARTIST>
    > <COUNTRY>USA</COUNTRY>
    > <COMPANY>Columbia</COMPANY>
    > <PRICE>6.99</PRICE>
    > <YEAR>1985</YEAR>
    > <TAPE>
    > <CATALOG>


    The last two tags are not correct (closing tags should begin with /).

    > I am trying to get
    > Artist: Bob Dylan
    > Company: Columbia
    > CD Price: 10.90
    > Tape Price: 6.99
    >
    >
    > What is the best method to do this? Is there a tool or utility you can
    > recommend for Windows?


    One of the many tools that can solve the problem is XMLgawk:

    http://home.vrweb.de/~juergen.kahrs/gawk/XML/


    The following script solves your problem.

    @load xml
    XMLCHARDATA { data = $0 }
    XMLENDELEM == "ARTIST" && index(XMLPATH, "CD") { print "Artist:", data}
    XMLENDELEM == "COMPANY" && index(XMLPATH, "CD") { print "Company:", data}
    XMLENDELEM == "PRICE" && index(XMLPATH, "CD") { print "CD Price:", data}
    XMLENDELEM == "PRICE" && index(XMLPATH, "TAPE") { print "Tape Price:", data}

    Invoke the script like this and it will produce the
    following output:

    xgawk -f catalog.awk catalog.xml
    Artist: Bob Dylan
    Company: Columbia
    CD Price: 10.90
    Tape Price: 6.99
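
    For comparison, the same event-driven pattern (remember the character data, act when the element ends) can be sketched with Python's standard xml.sax module. This is not XMLgawk and not part of the original answer; it assumes the sample has been corrected, and the names CATALOG_XML, WANTED, CatalogHandler and extract are invented for the example.

```python
import xml.sax

# Assumed correction of the posted sample (closing tags fixed).
CATALOG_XML = """<CATALOG>
<CD>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>10.90</PRICE>
<YEAR>1985</YEAR>
</CD>
<TAPE>
<TITLE>Empire Burlesque</TITLE>
<ARTIST>Bob Dylan</ARTIST>
<COUNTRY>USA</COUNTRY>
<COMPANY>Columbia</COMPANY>
<PRICE>6.99</PRICE>
<YEAR>1985</YEAR>
</TAPE>
</CATALOG>"""

# (parent element, element) -> label to print.
WANTED = {
    ("CD", "ARTIST"): "Artist",
    ("CD", "COMPANY"): "Company",
    ("CD", "PRICE"): "CD Price",
    ("TAPE", "PRICE"): "Tape Price",
}


class CatalogHandler(xml.sax.ContentHandler):
    """Collects output lines as the parser reports start/text/end events."""

    def __init__(self):
        super().__init__()
        self.path = []   # names of the currently open elements
        self.data = []   # text chunks seen since the last start or end tag
        self.lines = []  # finished output lines

    def startElement(self, name, attrs):
        self.path.append(name)
        self.data = []

    def characters(self, content):
        self.data.append(content)

    def endElement(self, name):
        parent = self.path[-2] if len(self.path) > 1 else None
        label = WANTED.get((parent, name))
        if label:
            self.lines.append(label + ": " + "".join(self.data).strip())
        self.path.pop()
        self.data = []


def extract(xml_text):
    """Parse the text with SAX and return the collected lines."""
    handler = CatalogHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines


print("\n".join(extract(CATALOG_XML)))
```

    Unlike the index(XMLPATH, ...) test in the script, this sketch checks only the immediate parent element, which is a slightly stricter match.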
    Jürgen Kahrs, Mar 3, 2007
    #4
  5. Mag Gam

    Mag Gam Guest

    On Mar 3, 2:51 pm, Jürgen Kahrs <> wrote:
    > Mag Gam wrote:
    > > Hi All,
    > > I am new to XML, and trying to extract some data from a file.

    >
    > > The file looks like this:
    > > <CATALOG>
    > > <CD>
    > > <TITLE>Empire Burlesque</TITLE>
    > > <ARTIST>Bob Dylan</ARTIST>
    > > <COUNTRY>USA</COUNTRY>
    > > <COMPANY>Columbia</COMPANY>
    > > <PRICE>10.90</PRICE>
    > > <YEAR>1985</YEAR>
    > > </CD>
    > > <TAPE>
    > > <TITLE>Empire Burlesque</TITLE>
    > > <ARTIST>Bob Dylan</ARTIST>
    > > <COUNTRY>USA</COUNTRY>
    > > <COMPANY>Columbia</COMPANY>
    > > <PRICE>6.99</PRICE>
    > > <YEAR>1985</YEAR>
    > > <TAPE>
    > > <CATALOG>

    >
    > The last two tags are not correct (closing tags should begin with /).
    >
    > > I am trying to get
    > > Artist: Bob Dylan
    > > Company: Columbia
    > > CD Price: 10.90
    > > Tape Price: 6.99

    >
    > > What is the best method to do this? Is there a tool or utility you can
    > > recommend for Windows?

    >
    > One of the many tools that can solve the problem is XMLgawk:
    >
    > http://home.vrweb.de/~juergen.kahrs/gawk/XML/
    >
    > The following script solves your problem.
    >
    > @load xml
    > XMLCHARDATA { data = $0 }
    > XMLENDELEM == "ARTIST" && index(XMLPATH, "CD") { print "Artist:", data}
    > XMLENDELEM == "COMPANY" && index(XMLPATH, "CD") { print "Company:", data}
    > XMLENDELEM == "PRICE" && index(XMLPATH, "CD") { print "CD Price:", data}
    > XMLENDELEM == "PRICE" && index(XMLPATH, "TAPE") { print "Tape Price:", data}
    >
    > Invoke the script like this and it will produce the
    > following output:
    >
    > xgawk -f catalog.awk catalog.xml
    > Artist: Bob Dylan
    > Company: Columbia
    > CD Price: 10.90
    > Tape Price: 6.99



    Thanks everyone!
    I am very new to XML and trying to learn the ropes.

    Roy:
    I have yet to try your XSL solution, but I will. I know the XML code
    was not well-formed; I only typed it in as an example.
    Let's assume this is my new .xml file: http://msdn2.microsoft.com/en-us/library/ms762271.aspx
    (I made some slight modifications, such as adding a second author)

    <?xml version="1.0"?>
    <catalog>
    <book id="bk101">
    <author>Gambardella, Matthew</author>
    <author>II Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications
    with XML.</description>
    </book>
    <book id="bk102">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies,
    an evil sorceress, and her own childhood to become queen
    of the world.</description>
    </book>
    <book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology
    society in England, the young survivors lay the
    foundation for a new society.</description>
    </book>
    <book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
    <description>In post-apocalypse England, the mysterious
    agent known only as Oberon helps to create a new life
    for the inhabitants of London. Sequel to Maeve
    Ascendant.</description>
    </book>
    <book id="bk105">
    <author>Corets, Eva</author>
    <title>The Sundered Grail</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-09-10</publish_date>
    <description>The two daughters of Maeve, half-sisters,
    battle one another for control of England. Sequel to
    Oberon's Legacy.</description>
    </book>
    <book id="bk106">
    <author>Randall, Cynthia</author>
    <title>Lover Birds</title>
    <genre>Romance</genre>
    <price>4.95</price>
    <publish_date>2000-09-02</publish_date>
    <description>When Carla meets Paul at an ornithology
    conference, tempers fly as feathers get ruffled.</description>
    </book>
    <book id="bk107">
    <author>Thurman, Paula</author>
    <title>Splish Splash</title>
    <genre>Romance</genre>
    <price>4.95</price>
    <publish_date>2000-11-02</publish_date>
    <description>A deep sea diver finds true love twenty
    thousand leagues beneath the sea.</description>
    </book>
    <book id="bk108">
    <author>Knorr, Stefan</author>
    <title>Creepy Crawlies</title>
    <genre>Horror</genre>
    <price>4.95</price>
    <publish_date>2000-12-06</publish_date>
    <description>An anthology of horror stories about roaches,
    centipedes, scorpions and other insects.</description>
    </book>
    <book id="bk109">
    <author>Kress, Peter</author>
    <title>Paradox Lost</title>
    <genre>Science Fiction</genre>
    <price>6.95</price>
    <publish_date>2000-11-02</publish_date>
    <description>After an inadvertant trip through a Heisenberg
    Uncertainty Device, James Salway discovers the problems
    of being quantum.</description>
    </book>
    <book id="bk110">
    <author>O'Brien, Tim</author>
    <title>Microsoft .NET: The Programming Bible</title>
    <genre>Computer</genre>
    <price>36.95</price>
    <publish_date>2000-12-09</publish_date>
    <description>Microsoft's .NET initiative is explored in
    detail in this deep programmer's reference.</description>
    </book>
    <book id="bk111">
    <author>O'Brien, Tim</author>
    <title>MSXML3: A Comprehensive Guide</title>
    <genre>Computer</genre>
    <price>36.95</price>
    <publish_date>2000-12-01</publish_date>
    <description>The Microsoft MSXML3 parser is covered in
    detail, with attention to XML DOM interfaces, XSLT processing,
    SAX and more.</description>
    </book>
    <book id="bk112">
    <author>Galos, Mike</author>
    <title>Visual Studio 7: A Comprehensive Guide</title>
    <genre>Computer</genre>
    <price>49.95</price>
    <publish_date>2001-04-16</publish_date>
    <description>Microsoft Visual Studio 7 is explored in depth,
    looking at how Visual Basic, Visual C++, C#, and ASP+ are
    integrated into a comprehensive development
    environment.</description>
    </book>
    </catalog>

    How would I get 'Book Title' and 'Book Author'?

    TIA
    Mag Gam, Mar 4, 2007
    #5
  6. git

    git Guest

    On Sat, 03 Mar 2007 09:57:38 -0800, Mag Gam wrote:

    > Hi All,
    > I am new to XML, and trying to extract some data from a file.
    >
    > The file looks like this:
    > <CATALOG>
    > <CD>
    > <TITLE>Empire Burlesque</TITLE>
    > <ARTIST>Bob Dylan</ARTIST>
    > <COUNTRY>USA</COUNTRY>
    > <COMPANY>Columbia</COMPANY>
    > <PRICE>10.90</PRICE>
    > <YEAR>1985</YEAR>
    > </CD>
    > <TAPE>
    > <TITLE>Empire Burlesque</TITLE>
    > <ARTIST>Bob Dylan</ARTIST>
    > <COUNTRY>USA</COUNTRY>
    > <COMPANY>Columbia</COMPANY>
    > <PRICE>6.99</PRICE>
    > <YEAR>1985</YEAR>
    > <TAPE>
    > <CATALOG>
    >
    > I am trying to get
    > Artist: Bob Dylan
    > Company: Columbia
    > CD Price: 10.90
    > Tape Price: 6.99
    >
    >
    > What is the best method to do this? Is there a tool or utility you can
    > recommend for Windows?


    On Windows, for someone who just wants to get on with the job rather than
    learn XSLT or XPath, I would recommend coding it all in JScript (or
    VBScript). Use the MS XML parser that comes with Windows and walk over
    the DOM to find the data you want.
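
    The "walk over the DOM" idea is the same in any language with a DOM binding. Here is a minimal sketch in Python's xml.dom.minidom rather than JScript with the MS XML parser, so the API names differ from what you would type on Windows Script Host. SAMPLE is a shortened, assumed-corrected version of the data, and walk is a made-up helper.

```python
from xml.dom import minidom

# Shortened, corrected sample (an assumption; the posted data was not well-formed).
SAMPLE = (
    "<CATALOG>"
    "<CD><ARTIST>Bob Dylan</ARTIST><PRICE>10.90</PRICE></CD>"
    "<TAPE><PRICE>6.99</PRICE></TAPE>"
    "</CATALOG>"
)


def walk(node, path=""):
    """Yield (path, text) for every element that directly contains non-blank text."""
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue
        child_path = path + "/" + child.tagName
        # Join only the text nodes that are direct children of this element.
        text = "".join(
            n.data for n in child.childNodes if n.nodeType == n.TEXT_NODE
        ).strip()
        if text:
            yield child_path, text
        # Recurse into the element to visit its descendants.
        yield from walk(child, child_path)


for element_path, text in walk(minidom.parseString(SAMPLE)):
    print(element_path, "=", text)
```

    Once every value comes with its path, picking out the CD price versus the tape price is a matter of comparing strings.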

    I am working on examples of this technique on my blog/site:

    http://nerds-central.blogspot.com/2007/01/creating-xml-viewer-with-jscript-exsead.html

    http://nerds-central.blogspot.com/2007/01/nerds-central-gets-ajax-atom-feed.html
    (I promise that I will write the follow-up to that second article real
    soon! And I am working on VBScript examples as well.)

    Feel free to join the Nerds-Central email group to ask more questions if
    you like the method:
    http://tech.groups.yahoo.com/group/nerds-central/

    Cheers

    AJ


    --
    Cubical Land:
    www.cubicalland.com
    Nerds-Central:
    nerds-central.blogspot.com
    git, Mar 4, 2007
    #6
  7. Jürgen Kahrs

    Mag Gam wrote:

    > How would I get 'Book Title' and 'Book Author' ?


    Use this XMLgawk script:

    @load xml
    XMLCHARDATA { data = $0 }
    XMLENDELEM == "author" { author = data }
    XMLENDELEM == "title" { title = data }
    XMLENDELEM == "book" { print author, title}


    And you will get the following output from the XML
    data that you posted:

    xgawk -f catalog2.awk catalog2.xml

    II Gambardella, Matthew XML Developer's Guide
    Ralls, Kim Midnight Rain
    Corets, Eva Maeve Ascendant
    Corets, Eva Oberon's Legacy
    Corets, Eva The Sundered Grail
    Randall, Cynthia Lover Birds
    Thurman, Paula Splish Splash
    Knorr, Stefan Creepy Crawlies
    Kress, Peter Paradox Lost
    O'Brien, Tim Microsoft .NET: The Programming Bible
    O'Brien, Tim MSXML3: A Comprehensive Guide
    Galos, Mike Visual Studio 7: A Comprehensive Guide
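
    Note that the first line of the output shows only the second author of bk101, because each author element overwrites the previous one. If every author is wanted, one possible approach is to collect them in a list per book. The sketch below does that in Python with the standard xml.etree.ElementTree module (not XMLgawk); BOOKS_XML is a cut-down copy of the posted file, and titles_and_authors is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the posted catalog: two books, the first with two authors.
BOOKS_XML = """<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<author>II Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
</book>
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
</book>
</catalog>"""


def titles_and_authors(xml_text):
    """Return a list of (title, [authors]) tuples, one per book element."""
    root = ET.fromstring(xml_text)
    rows = []
    for book in root.iter("book"):
        authors = [author.text for author in book.findall("author")]
        rows.append((book.findtext("title"), authors))
    return rows


for title, authors in titles_and_authors(BOOKS_XML):
    print("Book Title:", title)
    print("Book Author:", "; ".join(authors))
```

    Authors are joined with "; " here because the names themselves contain commas.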
    Jürgen Kahrs, Mar 4, 2007
    #7
