XSL for removing words less than 4 letters in a sitemap

Olagato

I need to transform this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index./Paths-for-the-extreme-player</loc>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php/Games/The-edge-of-the-wall</loc>
  </url>
</urlset>

into this:

<urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index./Books/Paths-for-the-extreme-player</loc>
    <news:news>
      <news:keywords>Paths, extreme, player</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php/Games/The-edge-of-the-wall</loc>
    <news:news>
      <news:keywords>edge, wall</news:keywords>
    </news:news>
  </url>
</urlset>

I mean, I need a template that creates a <news:keywords> tag containing
all the words from the <loc> tag that have more than 3 letters.
 
Martin Honnen

Do you want to use XSLT 2.0 or 1.0?
What about words like 'localhost' or 'index'? How do you decide that
those are not taken?

Here is an XSLT 2.0 stylesheet that shows one approach, using the
tokenize function:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://example.com/2008/news"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    exclude-result-prefixes="sm"
    version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:strip-space elements="*"/>

  <!-- Identity template: copy everything unchanged by default. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Copy each url and append the keywords taken from its loc. -->
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:value-of
              select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                      return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
              separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

Result with Saxon 9 when run against your posted input sample (with a
'root' element added and a namespace chosen for the 'news' prefix) is:

<root>
  <urlset xmlns="http://www.google.com/schemas/sitemap/0.84">
    <url>
      <loc>http://localhost/index.php/index./Paths-for-the-extreme-player</loc>
      <news:news xmlns:news="http://example.com/2008/news">
        <news:keywords>Paths, extreme, player</news:keywords>
      </news:news>
    </url>
    <url>
      <loc>http://localhost/index.php/index.php/Games/The-edge-of-the-wall</loc>
      <news:news xmlns:news="http://example.com/2008/news">
        <news:keywords>Games, edge, wall</news:keywords>
      </news:news>
    </url>
  </urlset>
</root>
 
Olagato

Thank you for your help, Martin.

Martin Honnen said:
Do you want to use XSLT 2.0 or 1.0?

I'm using XSLT 1.0.

Martin Honnen said:
What about words like 'localhost' or 'index', how do you decide that those are not taken?

That's not a problem for now. Maybe an expression like this would do:
translate(translate(substring-after(sm:loc, 'http://localhost/index.php/index.php/'), '-', ','), '/', ',')

I'm trying your XSL from PHP without success:
<?php
header('Content-Type: application/xhtml+xml; charset=utf-8');

// Load the source sitemap.
$xml = new DOMDocument;
$xml->load('original_news.xml');

// Load the stylesheet.
$xsl = new DOMDocument('1.0', 'UTF-8');
$xsl->load('news_to_google_markup.xsl');

try {
    $proc = new XSLTProcessor();
    $proc->importStylesheet($xsl);
    $newXml = $proc->transformToXML($xml);
    echo $newXml;
} catch (Exception $pEx) {
    // Print the message; a top-level return would discard it silently.
    echo $pEx->getMessage();
}
?>

1- original_news.xml is:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-transform type="text/xsl" href="news_to_google_markup.xsl"?>
<urlset xmlns:sm="http://www.google.com/schemas/sitemap/0.84">
  <url>
    <loc>http://localhost/index.php/index/Paths-for-the-extreme-player</loc>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php/Games/The-edge-of-the-wall</loc>
  </url>
</urlset>

2- and your XSL, which I've renamed to "news_to_google_markup.xsl", is:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://example.com/2008/news"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    exclude-result-prefixes="sm"
    version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:strip-space elements="*"/>

  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:value-of
              select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                      return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
              separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

3- Error reported in PHP is:
Warning: XSLTProcessor::importStylesheet() [function.XSLTProcessor-importStylesheet]: Invalid expression in C:\Webs\...\htdocs\sitemap\index.php on line 18

4- line 18 is:
$proc->importStylesheet($xsl);

Maybe the XSL version or a namespace in the header is invalid, but I
don't know how to resolve this.
Any ideas will be appreciated.
 
Martin Honnen

PHP only supports XSLT 1.0 so my posted stylesheet using XSLT and XPath
2.0 functionality does not work with PHP's XSLT processor.
 
Olagato

Implementing your posted stylesheet with only XSLT 1.0 functionality
seems quite difficult because of the lack of advanced functions (at
least for an XSL newbie like me), so my only alternative would be to use
an external XSLT processor. I'll try Xalan on the server: http://xalan.apache.org/
Any other ideas using XSLT 1.0 will be appreciated.
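
For readers limited to XSLT 1.0: the usual substitute for tokenize there is a recursive named template. The following is only a minimal sketch under stated assumptions, not something posted in this thread. The element names news:news and news:keywords are assumed from the News Sitemap protocol, the template name long-words and its parameters are made up for illustration, and the URL layout is assumed to be scheme, host and two path segments before the interesting part (roughly what position() > 5 selects in the 2.0 stylesheet).

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    xmlns:news="http://example.com/2008/news"
    exclude-result-prefixes="sm">

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity template: copy everything that has no more specific rule. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <!-- Assumed layout: drop "scheme://host/" and then two path segments. -->
    <xsl:variable name="after-host"
        select="substring-after(substring-after(sm:loc, '://'), '/')"/>
    <xsl:variable name="path"
        select="substring-after(substring-after($after-host, '/'), '/')"/>
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <xsl:call-template name="long-words">
            <!-- Turn '/' into '-' so a single delimiter is enough. -->
            <xsl:with-param name="text" select="translate($path, '/', '-')"/>
          </xsl:call-template>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical helper: emits the hyphen-separated words longer than
       3 characters, joined with ", ". -->
  <xsl:template name="long-words">
    <xsl:param name="text"/>
    <xsl:param name="emitted" select="false()"/>
    <xsl:if test="string-length($text) &gt; 0">
      <!-- First word of the remaining text (the concat guards the last word). -->
      <xsl:variable name="word"
          select="substring-before(concat($text, '-'), '-')"/>
      <xsl:variable name="keep" select="string-length($word) &gt; 3"/>
      <xsl:if test="$keep">
        <xsl:if test="$emitted">
          <xsl:text>, </xsl:text>
        </xsl:if>
        <xsl:value-of select="$word"/>
      </xsl:if>
      <!-- Recurse on whatever follows the first hyphen. -->
      <xsl:call-template name="long-words">
        <xsl:with-param name="text" select="substring-after($text, '-')"/>
        <xsl:with-param name="emitted" select="$emitted or $keep"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

The sm:url template only matches when the input declares the sitemap namespace as its default namespace (xmlns="..."). An input that merely declares an unused sm prefix leaves its elements in no namespace, so they would be copied through unchanged.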
 
Martin Honnen

Xalan does not do XSLT 2.0 so if you want to use XSLT 2.0 then try Saxon
(http://saxon.sourceforge.net/) or Gestalt
(http://gestalt.sourceforge.net/) or AltovaXML
(http://www.altova.com/altovaxml.html).

If you want to use PHP then I think PHP supports EXSLT so you could try
to use http://www.exslt.org/str/functions/tokenize/index.html
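
As a rough illustration of the EXSLT route, here is a sketch that assumes the processor exposes the EXSLT str:tokenize extension function. libxslt, which PHP's XSLTProcessor is built on, generally does, and in PHP this can be checked with XSLTProcessor::hasExsltSupport(). The element names news:news and news:keywords are assumptions taken from the News Sitemap protocol, and the variable names and the assumed URL layout (scheme, host and two path segments skipped) are illustrative only.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:str="http://exslt.org/strings"
    xmlns:sm="http://www.google.com/schemas/sitemap/0.84"
    xmlns:news="http://example.com/2008/news"
    exclude-result-prefixes="sm str">

  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity template: copy everything that has no more specific rule. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="sm:url">
    <!-- Assumed layout: drop "scheme://host/" and then two path segments. -->
    <xsl:variable name="after-host"
        select="substring-after(substring-after(sm:loc, '://'), '/')"/>
    <xsl:variable name="path"
        select="substring-after(substring-after($after-host, '/'), '/')"/>
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:keywords>
          <!-- str:tokenize treats each character of '-/' as a delimiter;
               the predicate keeps only tokens longer than 3 characters. -->
          <xsl:for-each
              select="str:tokenize($path, '-/')[string-length(.) &gt; 3]">
            <xsl:if test="position() &gt; 1">
              <xsl:text>, </xsl:text>
            </xsl:if>
            <xsl:value-of select="."/>
          </xsl:for-each>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Compared with a hand-written recursive template, this keeps the stylesheet short, at the cost of depending on an extension that not every XSLT 1.0 processor provides.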
 
Olagato

Thank you very much, Martin.
It's now working fine with Altova XMLSpy and Saxon 9 as an external XSLT
processor:
http://216.239.59.104/search?q=cach...spy&hl=es&ct=clnk&cd=6&gl=es&client=firefox-a

There are only 2 little issues left:

My XML input is:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/index.php/index.php/ezwebin_site/Rutas-de-verano-en-España</loc>
    <lastmod>2008-03-13</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php/ezwebin_site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <lastmod>2008-02-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
  </url>
</urlset>

Your XSLT 2.0 is:
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
    exclude-result-prefixes="sm"
    version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
        <news:keywords>
          <xsl:value-of
              select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                      return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
              separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

The output is:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://localhost/index.php/index.php/ezwebin_site/Rutas-de-verano-en-España</loc>
    <lastmod>2008-03-13</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
    <news:news xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <news:publication_date>2008-03-13</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, verano, España</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php/ezwebin_site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <lastmod>2008-02-12</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <news:news xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
      <news:publication_date>2008-02-12</news:publication_date>
      <news:keywords>ezwebin_site, Rutas, Camino, Santiago, Sobrarbe</news:keywords>
    </news:news>
  </url>
</urlset>

But I need output as defined by the News Sitemap protocol:
http://www.google.com/support/webmasters/bin/answer.py?answer=42738

So there are 2 things left:
1- <lastmod> tags should disappear from the <url> output because a
<news:publication_date> tag has already been defined.
2- The xmlns:news namespace declaration should disappear from the
<news:news> tags and move to the
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> tag in the header.

A good output file would be:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9">
  <url>
    <loc>http://localhost/index.php/index.php/ezwebin_site/Rutas-de-verano-en-España</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
    <news:news>
      <news:publication_date>2008-03-13</news:publication_date>
      <news:keywords>Rutas, verano, España</news:keywords>
    </news:news>
  </url>
  <url>
    <loc>http://localhost/index.php/index.php/ezwebin_site/Rutas/El-Camino-de-Santiago-en-el-Sobrarbe</loc>
    <changefreq>weekly</changefreq>
    <priority>0.7</priority>
    <news:news>
      <news:publication_date>2008-02-12</news:publication_date>
      <news:keywords>Rutas, Camino, Santiago, Sobrarbe</news:keywords>
    </news:news>
  </url>
</urlset>

Any ideas?
 
Martin Honnen

Olagato said:
So there are 2 things left:
1- <lastmod> tags should disappear from the <url> output because a
<news:publication_date> tag has already been defined.
2- The xmlns:news namespace declaration should disappear from the
<news:news> tags and move to the
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"> tag in the header.

Both are easy adaptations: use a predicate [not(self::sm:lastmod)] to
skip the lastmod elements, and use xsl:namespace to make sure the
namespace declaration is created on the root element:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:news="http://www.google.com/schemas/sitemap-news/0.9"
    xmlns:sm="http://www.sitemaps.org/schemas/sitemap/0.9"
    exclude-result-prefixes="sm"
    version="2.0">
  <xsl:output method="xml" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Declare the news namespace once, on the root element. -->
  <xsl:template match="sm:urlset">
    <xsl:copy>
      <xsl:namespace name="news"
          select="'http://www.google.com/schemas/sitemap-news/0.9'"/>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Copy each url without its lastmod, then append the news element. -->
  <xsl:template match="sm:url">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()[not(self::sm:lastmod)]"/>
      <news:news>
        <news:publication_date>
          <xsl:value-of select="sm:lastmod"/>
        </news:publication_date>
        <news:keywords>
          <xsl:value-of
              select="for $s in tokenize(sm:loc, '/')[position() &gt; 5]
                      return tokenize($s, '[\-/]')[string-length(.) &gt; 3]"
              separator=", "/>
        </news:keywords>
      </news:news>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
 
Olagato

Solved! It works fine now.
I changed this line:
for $s in tokenize(sm:loc, '/')[position() &gt; 5]
to this one:
for $s in tokenize(sm:loc, '/')[position() &gt; 6]
in order to drop the word "ezwebin_site" from the <news:keywords> tag.
Thank you very much, Martin. People like you make the "world wide web"
a better place.
 
