A Unique XML Parsing Problem

Devon · Oct 24, 2010

I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.
Info:

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
An example from the beginning of one of these files looks like:

<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" endOfFadeIn="0.00000"
         startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
         tempoConfidence="0.386" timeSignature="4"
         timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
         mode="0" modeConfidence="1.000">
    <sections>
      <section start="0.00000" duration="7.35887"/>
      <section start="7.35887" duration="13.03414"/>
      <section start="20.39301" duration="8.73030"/>
    </sections>
    <segments>
      <segment start="0.00000" duration="0.56000">
        <loudness>
          <dB time="0">-60.000</dB>
          <dB time="0.45279" type="max">-59.897</dB>
        </loudness>
        <pitches>
          <pitch class="0">0.589</pitch>
          <pitch class="1">0.446</pitch>
          <pitch class="2">0.518</pitch>
          <pitch class="3">1.000</pitch>
          <pitch class="4">0.850</pitch>
          <pitch class="5">0.414</pitch>
          <pitch class="6">0.326</pitch>
          <pitch class="7">0.304</pitch>
          <pitch class="8">0.415</pitch>
          <pitch class="9">0.566</pitch>
          <pitch class="10">0.353</pitch>
          <pitch class="11">0.350</pitch>

I am a statistician and therefore used to data being stored in CSV-
like files, with each row being a datapoint, and each column being a
feature. I would like to parse the data out of these XML files and
write them out into a CSV file. Any help would be greatly appreciated.
Mostly I am looking for a point in the right direction. I have heard
about Beautiful Soup but never used it. I am currently reading Dive
Into Python's chapters on HTML and XML parsing. And I am also more
concerned about how to use the tags in the XML files to build feature
names so I do not have to hard code them. For example, the first
feature given by the above code would be "track duration" with a value
of 29.12331

Thanks,

-Devon

Chris Rebert · Oct 24, 2010

I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.
Info:

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
An example from the beginning of one of these files looks like:

<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" endOfFadeIn="0.00000"
         startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
         tempoConfidence="0.386" timeSignature="4"
         timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
         mode="0" modeConfidence="1.000">
    <sections>
      <section start="0.00000" duration="7.35887"/>
      <section start="7.35887" duration="13.03414"/>
      <section start="20.39301" duration="8.73030"/>
    </sections>
    <segments>
      <segment start="0.00000" duration="0.56000">
        <loudness>
          <dB time="0">-60.000</dB>
          <dB time="0.45279" type="max">-59.897</dB>
        </loudness>
        <pitches>
          <pitch class="0">0.589</pitch>
          <pitch class="1">0.446</pitch>
          <pitch class="2">0.518</pitch>
          <pitch class="3">1.000</pitch>
          <pitch class="4">0.850</pitch>
          <pitch class="5">0.414</pitch>
          <pitch class="6">0.326</pitch>
          <pitch class="7">0.304</pitch>
          <pitch class="8">0.415</pitch>
          <pitch class="9">0.566</pitch>
          <pitch class="10">0.353</pitch>
          <pitch class="11">0.350</pitch>

I am a statistician and therefore used to data being stored in CSV-
like files, with each row being a datapoint, and each column being a
feature. I would like to parse the data out of these XML files and
write them out into a CSV file. Any help would be greatly appreciated.
Mostly I am looking for a point in the right direction.

ElementTree is a good way to go for XML parsing:
http://docs.python.org/library/xml.etree.elementtree.html
http://effbot.org/zone/element-index.htm
http://codespeak.net/lxml/

And for CSV writing there's obviously:
http://docs.python.org/library/csv.html
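
A minimal sketch of how those two modules could fit together, assuming each file has a single track element directly under the root, as in the sample above. The names track_row and write_csv are made up for illustration, and SAMPLE is a trimmed stand-in for one real file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed stand-in for one song file (assumption: real files have more attributes).
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031" timeSignature="4" key="11" mode="0"/>
</analysis>"""


def track_row(xml_text):
    """Return the attributes of the <track> element as a dict of strings."""
    root = ET.fromstring(xml_text)
    track = root.find("track")  # direct child of <analysis>
    return dict(track.attrib)


def write_csv(rows, out):
    """Write a list of dicts to `out`, one row per song."""
    # Column names come from the data (sorted union of all keys), not hard-coded.
    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


rows = [track_row(SAMPLE)]
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

For real files, ET.parse(path).getroot() would take the place of ET.fromstring, and an open file object would take the place of the StringIO buffer.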

And I am also more
concerned about how to use the tags in the XML files to build feature
names so I do not have to hard code them. For example, the first
feature given by the above code would be "track duration" with a value
of 29.12331

You'll probably want to look at namedtuple
(http://docs.python.org/library/collections.html#collections.namedtuple
) or the "bunch" recipe (google for "Python bunch").
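
As a rough illustration of the namedtuple idea, the field names can be taken straight from the attribute names in the file. This assumes every attribute name is a valid Python identifier and every value is numeric, which holds for the track attributes in the sample; Track and song are illustrative names only.

```python
from collections import namedtuple
import xml.etree.ElementTree as ET

# Trimmed stand-in for one song file.
SAMPLE = """<analysis>
  <track duration="29.12331" tempo="71.031" timeSignature="4"/>
</analysis>"""

track = ET.fromstring(SAMPLE).find("track")

# Field names are taken from the attribute names found in the file.
Track = namedtuple("Track", sorted(track.attrib))
song = Track(**{name: float(value) for name, value in track.attrib.items()})

print(song._fields)
print(song.tempo)
```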

Cheers,
Chris

Lawrence D'Oliveiro · Oct 24, 2010

Devon said:
I have heard about Beautiful Soup but never used it.

BeautifulSoup is intended for HTML parsing. It is, or was, particularly good
at dealing with badly-formed HTML, as commonly found on lots of websites. I
think more recently some libraries changed out from under it, so I'm not
sure if this is still true.

XML is (officially) much more anal in its compliance, so you shouldn't need
to bend over backwards to parse it.

Stefan Behnel · Oct 24, 2010

Devon, 24.10.2010 01:40:

I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
> [...]
I am a statistician and therefore used to data being stored in CSV-
like files, with each row being a datapoint, and each column being a
feature. I would like to parse the data out of these XML files and
write them out into a CSV file. Any help would be greatly appreciated.
Mostly I am looking for a point in the right direction. I have heard
about Beautiful Soup but never used it. I am currently reading Dive
Into Python's chapters on HTML and XML parsing.

That chapter is mostly out of date, and BeautifulSoup is certainly not the
right tool for dealing with XML, both for performance and compliance
reasons. If you need performance, as you stated above, look at cElementTree
in the stdlib.

And I am also more
concerned about how to use the tags in the XML files to build feature
names so I do not have to hard code them. For example, the first
feature given by the above code would be "track duration" with a value
of 29.12331

If the rules are as simple as that (i.e. tag name + attribute name), it'll
be easy going with ElementTree. Don't put too much effort into separating
the data from the XML format, though. XML parsing is fast and has the clear
advantage over CSV files that the data is safely stored in a well defined,
expressive format, including character encoding and named data fields.
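
A sketch of the "tag name + attribute name" rule, under a few assumptions of mine: names are built from the path of tags below the root, repeated sibling tags get a running number so the names stay unique, and element text (such as the pitch values) is ignored. The function name features is hypothetical. In current Python 3 the plain xml.etree.ElementTree module picks up the C accelerator on its own, and cElementTree was removed in 3.9.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for one song file.
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031">
    <sections>
      <section start="0.00000" duration="7.35887"/>
      <section start="7.35887" duration="13.03414"/>
    </sections>
  </track>
</analysis>"""


def features(element, prefix=""):
    """Yield (name, value) pairs named '<tag path> <attribute>'."""
    # Count sibling tags first so that only repeated ones get numbered.
    counts = {}
    for child in element:
        counts[child.tag] = counts.get(child.tag, 0) + 1
    seen = {}
    for child in element:
        tag = child.tag
        if counts[tag] > 1:
            seen[tag] = seen.get(tag, 0) + 1
            tag = f"{tag}{seen[tag]}"
        path = f"{prefix} {tag}".strip()
        for attr, value in child.attrib.items():
            yield f"{path} {attr}", value
        # Recurse into the children of this element.
        yield from features(child, path)


root = ET.fromstring(SAMPLE)
row = dict(features(root))
for name, value in row.items():
    print(name, "=", value)
```

The attributes of the root element itself (decoder, version) are skipped here, since only children of the element passed in are visited.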

Stefan

Piet van Oostrum · Oct 24, 2010

Devon said:
I must quickly and efficiently parse some data contained in multiple
XML files in order to perform some learning algorithms on the data.
Info:

I have thousands of files, each file corresponds to a single song.
Each XML file contains information extracted from the song (called
features). Examples include tempo, time signature, pitch classes, etc.
An example from the beginning of one of these files looks like:

<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" endOfFadeIn="0.00000"
         startOfFadeOut="29.12331" loudness="-12.097" tempo="71.031"
         tempoConfidence="0.386" timeSignature="4"
         timeSignatureConfidence="0.974" key="11" keyConfidence="1.000"
         mode="0" modeConfidence="1.000">
    <sections>
      <section start="0.00000" duration="7.35887"/>
      <section start="7.35887" duration="13.03414"/>
      <section start="20.39301" duration="8.73030"/>
    </sections>
    <segments>
      <segment start="0.00000" duration="0.56000">
        <loudness>
          <dB time="0">-60.000</dB>
          <dB time="0.45279" type="max">-59.897</dB>
        </loudness>
        <pitches>
          <pitch class="0">0.589</pitch>
          <pitch class="1">0.446</pitch>
          <pitch class="2">0.518</pitch>
          <pitch class="3">1.000</pitch>
          <pitch class="4">0.850</pitch>
          <pitch class="5">0.414</pitch>
          <pitch class="6">0.326</pitch>
          <pitch class="7">0.304</pitch>
          <pitch class="8">0.415</pitch>
          <pitch class="9">0.566</pitch>
          <pitch class="10">0.353</pitch>
          <pitch class="11">0.350</pitch>

You could use XSLT to get the data. For example this xslt script extracts duration, tempo and time signature into a comma separated list.

<xsl:stylesheet version="1.0"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:strip-space elements="*"/>
<xsl:template match="/analysis/track">
<xsl:value-of select="concat(@duration, ',', @tempo, ',',
@timeSignature)" /><xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>

With xsltproc song.xsl song*.xml you would get your output.
No python necessary. Or if you would like to use it inside a Python program, use lxml to call the xslt processor, or just XPath to extract the values and format them with Python.
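
For the "just XPath" route, here is a rough Python sketch. It uses the limited path syntax of the standard library's ElementTree rather than lxml, so it is a stand-in for the lxml approach and not the same thing; SAMPLE is a trimmed stand-in for one song file.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for one song file.
SAMPLE = """<analysis decoder="Quicktime" version="0x7608000">
  <track duration="29.12331" tempo="71.031" timeSignature="4">
    <segments>
      <segment start="0.00000" duration="0.56000">
        <pitches>
          <pitch class="0">0.589</pitch>
          <pitch class="1">0.446</pitch>
        </pitches>
      </segment>
    </segments>
  </track>
</analysis>"""

root = ET.fromstring(SAMPLE)

# Same three fields as the stylesheet. The root is <analysis>, so the
# relative path "track" plays the role of /analysis/track.
track = root.find("track")
line = ",".join(track.get(name) for name in ("duration", "tempo", "timeSignature"))
print(line)

# Element text can be reached the same way, here the pitch values of
# every segment.
pitches = [float(p.text) for p in root.findall("./track/segments/segment/pitches/pitch")]
print(pitches)
```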

Lawrence D'Oliveiro · Oct 25, 2010

Piet van Oostrum said:
With xsltproc song.xsl song*.xml you would get your output.
No python necessary.

Is that supposed to be some kind of advantage?
