A Unique XML Parsing Problem

Devon

I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.
Info:

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
An example from the beginning of one of these files looks like:

<analysis decoder="Quicktime" version="0x7608000">
<track duration="29.12331" endOfFadeIn="0.00000"
startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
tempoConfidence="0.386" timeSignature="4"
timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
mode="0" modeConfidence="1.000">
<sections>
<section start="0.00000" duration="7.35887"/>
<section start="7.35887" duration="13.03414"/>
<section start="20.39301" duration="8.73030"/>
</sections>
<segments>
<segment start="0.00000" duration="0.56000">
<loudness>
<dB time="0">-60.000</dB>
<dB time="0.45279" type="max">-59.897</dB>
</loudness>
<pitches>
<pitch class="0">0.589</pitch>
<pitch class="1">0.446</pitch>
<pitch class="2">0.518</pitch>
<pitch class="3">1.000</pitch>
<pitch class="4">0.850</pitch>
<pitch class="5">0.414</pitch>
<pitch class="6">0.326</pitch>
<pitch class="7">0.304</pitch>
<pitch class="8">0.415</pitch>
<pitch class="9">0.566</pitch>
<pitch class="10">0.353</pitch>
<pitch class="11">0.350</pitch>

I am a statistician and therefore used to data being stored in CSV-
like files, with each row being a datapoint, and each column being a
feature. I would like to parse the data out of these XML files and
write them out into a CSV file. Any help would be greatly appreciated.
Mostly I am looking for a point in the right direction. I have heard
about Beautiful Soup but never used it. I am currently reading Dive
Into Python's chapters on HTML and XML parsing. And I am also more
concerned about how to use the tags in the XML files to build feature
names so I do not have to hard code them. For example, the first
feature given by the above code would be "track duration" with a value
of 29.12331

Thanks,

-Devon
 
Chris Rebert

I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.
Info:

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
An example from the beginning of one of these files looks like:

<analysis decoder="Quicktime" version="0x7608000">
   <track duration="29.12331" endOfFadeIn="0.00000"
startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
tempoConfidence="0.386" timeSignature="4"
timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
mode="0" modeConfidence="1.000">
       <sections>
           <section start="0.00000" duration="7.35887"/>
           <section start="7.35887" duration="13.03414"/>
           <section start="20.39301" duration="8.73030"/>
       </sections>
       <segments>
           <segment start="0.00000" duration="0.56000">
               <loudness>
                   <dB time="0">-60.000</dB>
                   <dB time="0.45279" type="max">-59.897</dB>
               </loudness>
               <pitches>
                   <pitch class="0">0.589</pitch>
                   <pitch class="1">0.446</pitch>
                   <pitch class="2">0.518</pitch>
                   <pitch class="3">1.000</pitch>
                   <pitch class="4">0.850</pitch>
                   <pitch class="5">0.414</pitch>
                   <pitch class="6">0.326</pitch>
                   <pitch class="7">0.304</pitch>
                   <pitch class="8">0.415</pitch>
                   <pitch class="9">0.566</pitch>
                   <pitch class="10">0.353</pitch>
                   <pitch class="11">0.350</pitch>

I am a statistician and therefore used to data being stored in CSV-
like files, with each row being a datapoint, and each column being a
feature. I would like to parse the data out of these XML files and
write them out into a CSV file. Any help would be greatly appreciated.
Mostly I am looking for a point in the right direction.

ElementTree is a good way to go for XML parsing:
http://docs.python.org/library/xml.etree.elementtree.html
http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/

And for CSV writing there's obviously:
http://docs.python.org/library/csv.html

And I am also more
concerned about how to use the tags in the XML files to build feature
names so I do not have to hard code them. For example, the first
feature given by the above code would be "track duration" with a value
of 29.12331

You'll probably want to look at namedtuple
(http://docs.python.org/library/collections.html#collections.namedtuple)
or the "bunch" recipe (google for "Python bunch").
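
A minimal sketch of how these pieces could fit together, assuming every file has a single <track> element whose attribute names are valid Python identifiers (true for the sample above). The SAMPLE document is a shortened, hypothetical stand-in for a real file, and track_record and write_csv are illustrative helper names, not part of any library. The namedtuple fields are taken from the XML attributes, so no feature names are hard-coded.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import namedtuple

# Shortened stand-in for one song file (assumption: real files carry more attributes).
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031" timeSignature="4"/>
</analysis>"""


def track_record(xml_text):
    """Return the <track> attributes as a namedtuple whose fields come from the XML."""
    track = ET.fromstring(xml_text).find("track")
    Track = namedtuple("Track", list(track.attrib))
    return Track(**track.attrib)


def write_csv(records, out):
    """Write a header row from the field names, then one row per record."""
    writer = csv.writer(out)
    writer.writerow(records[0]._fields)
    writer.writerows(records)


record = track_record(SAMPLE)
buffer = io.StringIO()
write_csv([record], buffer)
print(buffer.getvalue())
```

This assumes all records share the same fields in the same order; if files differ, csv.DictWriter with an explicit column list would probably be the safer choice.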

Cheers,
Chris
 
Lawrence D'Oliveiro

Devon said:
I have heard about Beautiful Soup but never used it.

BeautifulSoup is intended for HTML parsing. It is, or was, particularly good
at dealing with badly-formed HTML, as commonly found on lots of websites. I
think more recently some libraries changed out from under it, so I’m not
sure if this is still true.

XML is (officially) much more anal in its compliance, so you shouldn’t need
to bend over backwards to parse it.
 
Stefan Behnel

Devon, 24.10.2010 01:40:
I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
> [...]
I am a statistician and therefore used to data being stored in CSV-
like files, with each row being a datapoint, and each column being a
feature. I would like to parse the data out of these XML files and
write them out into a CSV file. Any help would be greatly appreciated.
Mostly I am looking for a point in the right direction. I have heard
about Beautiful Soup but never used it. I am currently reading Dive
Into Python's chapters on HTML and XML parsing.

That chapter is mostly out of date, and BeautifulSoup is certainly not the
right tool for dealing with XML, both for performance and compliance
reasons. If you need performance, as you stated above, look at cElementTree
in the stdlib.
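
A note for current Python versions: the separate cElementTree module was deprecated in Python 3.3 and removed in 3.9, and plain xml.etree.ElementTree uses the C accelerator automatically when it is available. Below is a minimal sketch of streaming one file with iterparse, which may help if the per-segment data is large. The SAMPLE document is a shortened, hypothetical stand-in for a real file, segment_pitches is an illustrative name, and the pitch values are assumed to appear in class order.

```python
import io
import xml.etree.ElementTree as ET

# Shortened stand-in for one song file (assumption: real files have 12 pitches per segment).
SAMPLE = b"""<analysis><track duration="29.12331"><segments>
<segment start="0.00000" duration="0.56000">
<pitches>
<pitch class="0">0.589</pitch>
<pitch class="1">0.446</pitch>
<pitch class="2">0.518</pitch>
</pitches>
</segment>
</segments></track></analysis>"""


def segment_pitches(source):
    """Yield (segment start, [pitch values]) one segment at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "segment":
            pitches = [float(p.text) for p in elem.iter("pitch")]
            yield float(elem.get("start")), pitches
            elem.clear()  # discard the finished segment's children to limit memory use


rows = list(segment_pitches(io.BytesIO(SAMPLE)))
print(rows)
```

iterparse also accepts a filename, so segment_pitches("song.xml") would work the same way on a file on disk.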

And I am also more
concerned about how to use the tags in the XML files to build feature
names so I do not have to hard code them. For example, the first
feature given by the above code would be "track duration" with a value
of 29.12331

If the rules are as simple as that (i.e. tag name + attribute name), it'll
be easy going with ElementTree. Don't put too much effort into separating
the data from the XML format, though. XML parsing is fast and has the clear
advantage over CSV files that the data is safely stored in a well defined,
expressive format, including character encoding and named data fields.
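
A minimal sketch of the "tag name + attribute name" rule, assuming that is all the naming the data needs. The SAMPLE document is a shortened, hypothetical stand-in for a real file and flatten_features is an illustrative name. Tags that occur more than once (such as section) get an index so the generated names stay unique; that indexing scheme is an assumption, not something the original post specifies.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Shortened stand-in for one song file.
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031">
    <sections>
      <section start="0.00000" duration="7.35887"/>
      <section start="7.35887" duration="13.03414"/>
    </sections>
  </track>
</analysis>"""


def flatten_features(root):
    """Map every attribute in the tree to a "tag attribute" feature name."""
    totals = Counter(el.tag for el in root.iter())
    seen = Counter()
    features = {}
    for el in root.iter():
        if totals[el.tag] == 1:
            prefix = el.tag
        else:
            # Repeated tag: append a zero-based index to keep names unique.
            prefix = "%s[%d]" % (el.tag, seen[el.tag])
        seen[el.tag] += 1
        for name, value in el.attrib.items():
            features[prefix + " " + name] = value  # values stay strings here
    return features


features = flatten_features(ET.fromstring(SAMPLE))
print(features["track duration"])
```

Songs with different numbers of sections or segments will produce different sets of names, so the columns would still need to be aligned before writing a single CSV.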

Stefan
 
Piet van Oostrum

Devon said:
I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.
Info:

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
An example from the beginning of one of these files looks like:

<analysis decoder="Quicktime" version="0x7608000">
<track duration="29.12331" endOfFadeIn="0.00000"
startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
tempoConfidence="0.386" timeSignature="4"
timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
mode="0" modeConfidence="1.000">
<sections>
<section start="0.00000" duration="7.35887"/>
<section start="7.35887" duration="13.03414"/>
<section start="20.39301" duration="8.73030"/>
</sections>
<segments>
<segment start="0.00000" duration="0.56000">
<loudness>
<dB time="0">-60.000</dB>
<dB time="0.45279" type="max">-59.897</dB>
</loudness>
<pitches>
<pitch class="0">0.589</pitch>
<pitch class="1">0.446</pitch>
<pitch class="2">0.518</pitch>
<pitch class="3">1.000</pitch>
<pitch class="4">0.850</pitch>
<pitch class="5">0.414</pitch>
<pitch class="6">0.326</pitch>
<pitch class="7">0.304</pitch>
<pitch class="8">0.415</pitch>
<pitch class="9">0.566</pitch>
<pitch class="10">0.353</pitch>
<pitch class="11">0.350</pitch>

You could use XSLT to get the data. For example, this XSLT script extracts duration, tempo and time signature into a comma-separated list.

<xsl:stylesheet version="1.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/analysis/track">
<xsl:value-of select="concat(@duration, ',', @tempo, ',',
@timeSignature)" /><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>

With xsltproc song.xsl song*.xml you would get your output.
No Python necessary. Or if you would like to use it inside a Python program, use lxml to call the XSLT processor, or just XPath to extract the values and format them with Python.
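
A minimal sketch of the XPath route. It swaps in the standard library's xml.etree.ElementTree, which supports only a limited XPath subset, in place of lxml, so that it runs without extra packages. The SAMPLE document is a shortened, hypothetical stand-in for a real file and track_line is an illustrative name. It selects the same three attributes as the stylesheet above.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for one song file.
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" loudness="-12.097" tempo="71.031" timeSignature="4"/>
</analysis>"""

COLUMNS = ("duration", "tempo", "timeSignature")


def track_line(xml_text):
    """Return the chosen <track> attributes as one comma-separated line."""
    root = ET.fromstring(xml_text)  # root is the <analysis> element
    track = root.find("./track")    # same node as /analysis/track in the stylesheet
    return ",".join(track.get(name, "") for name in COLUMNS)


line = track_line(SAMPLE)
print(line)
```

For files on disk, ET.parse(path).getroot() could replace ET.fromstring, with a glob over song*.xml supplying the paths.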
 
Lawrence D'Oliveiro

Piet van Oostrum said:
With xsltproc song.xsl song*.xml you would get your output.
No python necessary.

Is that supposed to be some kind of advantage?
 
