xsltproc question - I am clueless and a newbie, so don't be too rough on me!

Glen Millard

Okay, I have an XML file that I get from a provider:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="basic.xsl"?>
<content>
<status>ok</status>
<faxes>
<faxid>21404974</faxid>
<date>2012-01-05 07:34:10</date>
<source>5194852368</source>
<destination>8885216725</destination>
<status>read</status>
<faxid>23223059</faxid>
<date>2012-03-01 07:27:52</date>
<source>5194211862</source>
<destination>8885216725</destination>
<status>new</status>
<faxid>23223164</faxid>
<date>2012-03-01 07:29:45</date>
<source>5194211862</source>
<destination>8885210692</destination>
<status>new</status>
<faxid>23224287</faxid>
<date>2012-03-01 07:51:07</date>
<source>8885216725</source>
<destination>8885210692</destination>
<status>new</status>
</faxes></content>

I want to be able to import/parse this into a MySQL database. However, I need to reformat it so that it is 'database-centric'. I was going to use a simple script to use find/replace with a sed and regular expressions - which works.

However, I am going to need to do this multiple times per day and figured that an xslt processor would be more efficient.

So, can someone get me started? I think that the root element being <database></database> would be a start.

I am just kind of clueless on this - I am not looking for someone to do it for me, just a little hand-holding on how to create an xsl stylesheet to rename the elements.

This is the type of format that I want to achieve:


<database>
<status>ok</status>
<faxes>
<row>
<faxid>21404974</faxid>
<date>2012-01-05 07:34:10</date>
<source>5194852368</source>
<destination>8885216725</destination>
<status>read</status>
</row>
<row>
<faxid>23223059</faxid>
<date>2012-03-01 07:27:52</date>
<source>5194211862</source>
<destination>8885216725</destination>
<status>new</status>
</row>
<row>
<faxid>23223164</faxid>
<date>2012-03-01 07:29:45</date>
<source>5194211862</source>
<destination>8885210692</destination>
<status>new</status>
</row>
<row>
<faxid>23224287</faxid>
<date>2012-03-01 07:51:07</date>
<source>8885216725</source>
<destination>8885210692</destination>
<status>new</status>
</row>
</faxes>
</database>

This way, I can use a parser (XML::Parser in Perl) to bring it into a MySQL database.

So, I guess substituting/replacing tags is what I need to do.

I guess I just don't understand the syntax of xsl stylesheets.

Thanks - Glen
 
Simon Wright

Glen Millard said:
Okay, I have an XML file that I get from a provider:
I want to be able to import/parse this into a MySQL database. However,
I need to reformat it so that it is 'database-centric'. I was going to
use a simple script to use find/replace with a sed and regular
expressions - which works.
So, can someone get me started? I think that the root element being
<database></database> would be a start.

This should be a start - it doesn't do the <status> element (so your
first 'ok' comes out in the wrong place), and I've only done a couple of
the elements. Also, it relies on your provider (who should really know
better and put in the <row> (or <fax>) elements in the first place!) not
to omit any elements; if one of the <source> elements was missing, all
the <source>s would be one row out from then on.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

<xsl:output method="xml" encoding="iso-8859-1" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/content/faxes">
<xsl:element name="database">
<xsl:element name="faxes">
<xsl:for-each select="faxid">
<xsl:variable name="i" select="position()"/>
<xsl:element name="row">
<xsl:element name="faxid">
<xsl:value-of select="."/>
</xsl:element>
<xsl:element name="date">
<xsl:value-of select="../date[$i]"/>
</xsl:element>
</xsl:element>
</xsl:for-each>
</xsl:element>
</xsl:element>
</xsl:template>

</xsl:stylesheet>
 
Joe Kesselman

Haven't checked the logic, but I'd note that by using literal result
elements you can simplify that slightly. Also, I'd leave out the
encoding and let it stay in UTF8 unless you have some specific reason
for doing otherwise. Finally, you might want to explicitly process only
the <faxes> elements, and make sure anything else doesn't contribute, by
adding a root template which selects only those for processing.

But, yeah, whoever created that XML in the first place should be ashamed
of themselves for not structuring it better.


<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

<xsl:output method="xml" indent="yes"/>
<xsl:strip-space elements="*"/>

<xsl:template match="/">
<xsl:apply-templates select="/content/faxes"/>
</xsl:template>

<xsl:template match="/content/faxes">
<database>
<faxes>
<xsl:for-each select="faxid">
<xsl:variable name="i" select="position()"/>
<row>
<faxid>
<xsl:value-of select="."/>
</faxid>
<date>
<xsl:value-of select="../date[$i]"/>
</date>
</row>
</xsl:for-each>
</faxes>
</database>
</xsl:template>

</xsl:stylesheet>


--
Joe Kesselman,
http://www.love-song-productions.com/people/keshlam/index.html

{} ASCII Ribbon Campaign | "may'ron DaroQbe'chugh vaj bIrIQbej" --
/\ Stamp out HTML mail! | "Put down the squeezebox & nobody gets hurt."
 
Simon Wright

Joe Kesselman said:
Haven't checked the logic, but I'd note that by using literal result
elements you can simplify that slightly. Also, I'd leave out the
encoding and let it stay in UTF8 unless you have some specific reason
for doing otherwise. Finally, you might want to explicitly process
only the <faxes> elements, and make sure anything else doesn't
contribute, by adding a root template which selects only those for
processing.

All good points; the encoding change was a copy-and-paste thing.

OP wanted the content/status to be copied too, so I tried

....
<xsl:template match="/">
<database>
<xsl:apply-templates select="/content/status"/>
<xsl:apply-templates select="/content/faxes"/>
</database>
</xsl:template>

<xsl:template match="/content/status">
<status><xsl:value-of select="."/></status>
</xsl:template>

<xsl:template match="/content/faxes">
<faxes>
....
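
With the elided parts filled in, the positional approach discussed so far might look like the complete stylesheet below. This is a sketch rather than code anyone posted: the three extra fields (source, destination, status) are added by analogy with date, and it assumes every fax record in the input carries all five child elements, so that the Nth date, source, destination and status belong to the Nth faxid.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Root template: build <database> and pull in only what is wanted. -->
  <xsl:template match="/">
    <database>
      <xsl:apply-templates select="/content/status"/>
      <xsl:apply-templates select="/content/faxes"/>
    </database>
  </xsl:template>

  <!-- The top-level status ("ok") is written out unchanged. -->
  <xsl:template match="/content/status">
    <status><xsl:value-of select="."/></status>
  </xsl:template>

  <!-- One <row> per <faxid>; $i picks the sibling at the same position. -->
  <xsl:template match="/content/faxes">
    <faxes>
      <xsl:for-each select="faxid">
        <xsl:variable name="i" select="position()"/>
        <row>
          <faxid><xsl:value-of select="."/></faxid>
          <date><xsl:value-of select="../date[$i]"/></date>
          <source><xsl:value-of select="../source[$i]"/></source>
          <destination><xsl:value-of select="../destination[$i]"/></destination>
          <status><xsl:value-of select="../status[$i]"/></status>
        </row>
      </xsl:for-each>
    </faxes>
  </xsl:template>

</xsl:stylesheet>
```

It could be run with something like "xsltproc faxes.xsl input.xml > output.xml", where both file names are placeholders.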
 
Alain Ketterlin

Glen Millard said:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/css" href="basic.xsl"?>
<content>
<status>ok</status>
<faxes>
<faxid>21404974</faxid>
<date>2012-01-05 07:34:10</date>
<source>5194852368</source>
<destination>8885216725</destination>
<status>read</status>
<faxid>23223059</faxid>
<date>2012-03-01 07:27:52</date>
<source>5194211862</source>
<destination>8885216725</destination>
<status>new</status> [...]
</faxes></content>
[to]

<database>
<status>ok</status>
<faxes>
<row>
<faxid>21404974</faxid>
<date>2012-01-05 07:34:10</date>
<source>5194852368</source>
<destination>8885216725</destination>
<status>read</status>
</row>
<row>
<faxid>23223059</faxid>
<date>2012-03-01 07:27:52</date>
<source>5194211862</source>
<destination>8885216725</destination>
<status>new</status>
</row> [...]
</faxes>
</database>

XPath also has the nice concept of "axis", which lets you traverse the
tree along various, well, axes. In your case, it means that whenever
you've found a <faxid>, you can find the following <date> etc. by
moving along the "following-sibling" axis. In your case, you can write:

<xsl:template match="faxid">
<xsl:variable name="date" select="./following-sibling::date[1]"/>
<xsl:variable name="source" select="./following-sibling::source[1]"/>

<row>
<faxid><xsl:value-of select="."/></faxid>
<date><xsl:value-of select="$date"/></date>
<source><xsl:value-of select="$source"/></source>
...
</row>
</xsl:template>

The stylesheet then only needs to apply this template to all <faxid>
elements.

As others have said, the format of your input document should be
changed. All solutions in this thread would stop working if some <faxid>
"qualifiers" (like <date>) become optional.

-- Alain.
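
The following-sibling idea can be pushed a little further so that a record with a missing element no longer shifts the later rows. What follows is a sketch, not code posted in this thread. It assumes every record still begins with a <faxid>, and it copies whatever siblings sit between one <faxid> and the next, so a record that lacks (say) a <source> simply produces a <row> without one.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Root template: wrap everything in <database> and visit each faxid. -->
  <xsl:template match="/">
    <database>
      <xsl:copy-of select="/content/status"/>
      <faxes>
        <xsl:apply-templates select="/content/faxes/faxid"/>
      </faxes>
    </database>
  </xsl:template>

  <!-- Each faxid opens a record. A following sibling belongs to this
       record if it is not itself a faxid and its nearest preceding
       faxid is the current one (compared via generate-id). -->
  <xsl:template match="faxid">
    <xsl:variable name="me" select="generate-id(.)"/>
    <row>
      <xsl:copy-of select="."/>
      <xsl:copy-of select="following-sibling::*[not(self::faxid)][generate-id(preceding-sibling::faxid[1]) = $me]"/>
    </row>
  </xsl:template>

</xsl:stylesheet>
```

The trade-off is that the rows are no longer guaranteed to have the same set of columns, so whatever loads them into the database has to cope with absent elements.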
 
Glen Millard

Sheldon;

That is a HUGE help!

Now that I see how the syntax works, I think I can take it from there. I don't know if the status tag is relevant, but I will check with my client.

I think I get the idea - thank you much again.

Glen

 
Glen Millard

Okay, now for my next dumb question - ha.

When I get the XML document delivered, this is EXACTLY what it looks like.

<?xml version="1.0" encoding="UTF-8"?>
<content>
<status>ok</status><faxes><faxid>21404974</faxid><date>2012-01-05 07:34:10</date><source>5194852368</source><destination>8885216725</destination><status>read</status><faxid>23223059</faxid><date>2012-03-01 07:27:52</date><source>5194211862</source><destination>8885216725</destination><status>new</status><faxid>23223164</faxid><date>2012-03-01 07:29:45</date><source>5194211862</source><destination>8885210692</destination><status>new</status><faxid>23224287</faxid><date>2012-03-01 07:51:07</date><source>8885216725</source><destination>8885210692</destination><status>new</status></faxes></content>

Now, again, I am not asking for anyone to do my work for me, but what would be the best way to handle this (besides using an editor or fancy shell script)?

Looking at the provider's API calls, there does not seem to be any way to format it.

So, I pose this question also - what would be the best way to format this XML document so that it is something workable?

Thanks again, everyone.

Glen
 
Simon Wright

Glen said:
Okay, now for my next dumb question - ha.

When I get the XML document delivered, this is EXACTLY what it looks
like.

<?xml version="1.0" encoding="UTF-8"?>
<content>
<status>ok</status><faxes><faxid>21404974</faxid><date>2012-01-05 07:34:10</date><source>5194852368</source><destination>8885216725</destination><status>read</status><faxid>23223059</faxid><date>2012-03-01 07:27:52</date><source>5194211862</source><destination>8885216725</destination><status>new</status><faxid>23223164</faxid><date>2012-03-01 07:29:45</date><source>5194211862</source><destination>8885210692</destination><status>new</status><faxid>23224287</faxid><date>2012-03-01 07:51:07</date><source>8885216725</source><destination>8885210692</destination><status>new</status></faxes></content>

Now, again, I am not asking for anyone to do my work for me, but what
would be the best way (besides using an editor or fancy shell script).

Looking at the provider's API calls, there does not seem to be any way
to format it.

So, I pose this question also - what would be the best way to format
this XML document so that it is something workable?

The XSLT processor shouldn't care about line breaks or other white space
in the input; the collapsed input above produces the same output as the
version in your original post.
 
Martin Honnen

Glen said:
So, I pose this question also - what would be the best way to format this XML document so that it is something workable?

Well if you load such a document into a browser like IE or Firefox or
Opera or Chrome then they show it in a formatted way (they show the
document tree where you can collapse or expand levels).

And XML editors or plugins for editors usually have an option to indent
a document.

If you want to do it yourself with XSLT then running the document
through a stylesheet doing

<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="/">
<xsl:copy-of select="."/>
</xsl:template>

</xsl:stylesheet>

is also a way.
 
Glen Millard

Hey all - I discovered an easier way around this. A utility called tidy - it is even in the Linux distro.

It takes my XML file and formats it quite nicely upon download.

This is one of the things I was looking for.

Thanks again

Glen
 
Peter Flynn

Joe Kesselman said:
Haven't checked the logic, but I'd note that by using literal result
elements you can simplify that slightly. Also, I'd leave out the
encoding and let it stay in UTF8 unless you have some specific reason
for doing otherwise. Finally, you might want to explicitly process only
the <faxes> elements, and make sure anything else doesn't contribute, by
adding a root template which selects only those for processing.

But, yeah, whoever created that XML in the first place should be ashamed
of themselves for not structuring it better.

Amen. Unfortunately many people don't think before creating XML.

[OP]
You are not the clueless one: that is your client, unfortunately.

When I get a large quantity (or many repeat instances) of files like
this, where the markup itself is regular enough to be SGML, but the
design defective, I resort to invoking omissibility. This SGML document:

<!doctype content [
<!element content - - (status,faxes)>
<!element status - - (#pcdata)>
<!element faxes - - (fax)+>
<!element fax o o (faxid,date,source,destination,status)>
<!element faxid - - (#pcdata)>
<!element date - - (#pcdata)>
<!element source - - (#pcdata)>
<!element destination - - (#pcdata)>
]>
<content>
<status>ok</status>
<faxes>
<faxid>21404974</faxid>
<date>2012-01-05 07:34:10</date>
<source>5194852368</source>
<destination>8885216725</destination>
<status>read</status>
<faxid>23223059</faxid>
<date>2012-03-01 07:27:52</date>
<source>5194211862</source>
<destination>8885216725</destination>
<status>new</status>
<faxid>23223164</faxid>
<date>2012-03-01 07:29:45</date>
<source>5194211862</source>
<destination>8885210692</destination>
<status>new</status>
<faxid>23224287</faxid>
<date>2012-03-01 07:51:07</date>
<source>8885216725</source>
<destination>8885210692</destination>
<status>new</status>
</faxes></content>

can be processed with sgmlnorm to normalize it so that the missing <fax>
and </fax> tags are inserted. Usually this is only worth doing for a
workflow, so that it will create fully-normalized SGML which an XML
processor will accept.

That's fine. XML doesn't need to be pretty-printed unless you want to
show it to a human. As Martin indicated, there are ways to pretty-print
it if you need (and from a later post you discovered Tidy). But it's
usually more effective to concentrate on making the markup processable
rather than on making it look attractive.

///Peter
 
