Issue with XML::XSLT and html encoded characters (á etc)

Rwag

Hello,

I am migrating a Perl production system from Solaris to Linux. My
knowledge of Perl and of XML processing is limited.

The issue we have is that we need to process data files. Those
files contain text such as "Ch&#225;vez", where the acute-a is HTML
encoded.

After processing with a stylesheet and XML::XSLT the text "Ch&#225;vez"
becomes "Ch㡶z".

Is there a way to fix this with XML::XSLT? We would like to have the
result stay as "Ch&#225;vez" or become "Ch&aacute;vez".


This is the stylesheet.
===========================================
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns="http://www.w3.org/TR/REC-html40">

<xsl:output method="html" encoding="UTF-16"
cdata-section-elements="parag"/>

<xsl:template match="/">
<xsl:apply-templates select="//body" />
<xsl:apply-templates select="//infobox" />
</xsl:template>

<xsl:template match="body">
<xsl:apply-templates/>
</xsl:template>

<xsl:template match="divvy"><P><SPAN
CLASS="divvy"><xsl:apply-templates/></SPAN></P></xsl:template>

<xsl:template
match="infobox"><P><xsl:apply-templates/></P></xsl:template>

</xsl:stylesheet>
===========================================


Our test script is as follows.
===========================================
use XML::XSLT;
($xmlfile, $xslfile) = @ARGV;
my $parser = XML::XSLT->new ($xslfile, warnings => 1);
$parser->transform($xmlfile);
print $parser->toString;
$parser->dispose ();
===========================================

Any assistance is very much appreciated.
 
robic0

Rwag said:
[...]
Is there a way to fix this with XML::XSLT? We would like to have the
result stay as "Ch&#225;vez" or become "Ch&aacute;vez".
^^^^^

$RxENTITY = qr/\s+($Name)|(?:%\s+($Name))\s+(.*?)/s;
#                 1     1 (      2     2)   3   3
$Entities = "(?:amp)|(?:gt)|(?:lt)|(?:apos)|(?:quot)|(?:#(?:([0-9]+)|(x[0-9a-fA-F]+)))"; # cat more entities
#                                                    (  #(  4      4|5             5))
$RxEntConv = qr/(.*?)(&|%)($Entities);/s;
#

Are you saying it can't be parsed? Writing seems to be your problem.
 
robic0

In the parser I wrote, the modified equivalent is passed to the callback,
but the unmodified original string can be obtained from an object method as well.
This is in case you want to reparse, or just write out new raw XHTML.
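
The idea of handing a callback both the converted text and the original can be sketched roughly as follows. This is not RXParse's API; convert_numeric_refs and its callback signature are hypothetical names made up for illustration, and the sketch only handles numeric character references.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of RXParse or XML::XSLT): find numeric
# character references, pass the callback both the decoded character and
# the raw reference, and substitute whatever the callback returns.
sub convert_numeric_refs {
    my ($text, $callback) = @_;
    $text =~ s{(&\#(?:x([0-9a-fA-F]+)|([0-9]+));)}{
        my $raw     = $1;
        my $decoded = chr(defined $2 ? hex($2) : $3);
        $callback->($decoded, $raw);
    }ge;
    return $text;
}

my $input = 'Ch&#225;vez and Ch&#xE1;vez';

# Callback that keeps the decoded character
my $decoded = convert_numeric_refs($input, sub { my ($char, $raw) = @_; return $char });

# Callback that keeps the raw reference, so the text round-trips unchanged
my $untouched = convert_numeric_refs($input, sub { my ($char, $raw) = @_; return $raw });

binmode(STDOUT, ':encoding(UTF-8)');
print "$decoded\n$untouched\n";
```

Returning the raw reference from the callback leaves the input untouched, which is the "write out the original" case.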
 
robic0

You can always declare '&aacute;' in an !ENTITY statement; then the substitution
will happen. Entities must be declared in the DTD at the top of the file, before the body.
Your problem is your writer. Like I said, I haven't done XSLT yet.
In your writer you must use the APIs to declare entities as well.
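
As a rough sketch of what such a declaration looks like, the Perl below builds a DOCTYPE with an internal subset that maps entity names to numeric character references. The root element name 'doc' and the helper doctype_with_entities are assumptions for illustration only; whether XML::XSLT honours the declarations is not confirmed in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build a DOCTYPE whose internal subset declares each
# named entity as a numeric character reference.
sub doctype_with_entities {
    my ($root, %code_point_for) = @_;
    my $declarations = join '',
        map { sprintf qq{  <!ENTITY %s "&#%d;">\n}, $_, $code_point_for{$_} }
        sort keys %code_point_for;
    return "<!DOCTYPE $root [\n$declarations]>\n";
}

# 'doc' is an assumed root element name; 225 and 233 are the code points
# of a-acute and e-acute.
my $doctype = doctype_with_entities('doc', aacute => 225, eacute => 233);

# The declaration sits after the XML declaration and before the root element.
my $xml = qq{<?xml version="1.0"?>\n$doctype<doc><body>Ch&aacute;vez</body></doc>\n};
print $xml;
```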

Use these links and search around for XSLT standards and appropriate software.
The kind of question you ask will not be answered on this forum.
If the problem gets too big, you can email me for consultation.
I think my email is here somewhere, but if you can't find it, it's (e-mail address removed).
Make sure it is not spam or viruses, since filters are on.

http://www.w3.org/TR/xml11
http://www.w3schools.com/tags/tag_meta.asp
http://www.w3.org/TR/html4/strict.dtd
 
Matt Garrish

Rwag said:
[...]
After processing with a stylesheet and XML::XSLT the text "Ch&#225;vez"
becomes "Ch㡶z".

[...]

Our test script is as follows.
===========================================
use XML::XSLT;
($xmlfile, $xslfile) = @ARGV;
my $parser = XML::XSLT->new ($xslfile, warnings => 1);
$parser->transform($xmlfile);
print $parser->toString;
$parser->dispose ();
===========================================

It's great that you gave enough info to run the transformation (and I mean
that in all honesty), but nowhere in the above do you show how you're
re-encoding the output of the transform. When I run your code on a file with
"&#225;" in it, I get an acute accented a. And just for fun I ran the string
through the HTML::Entities encode_entities function and got back &aacute; so
I'm not sure how you're getting your output.
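
One way to see what the re-encoding step might be doing is to look at the bytes the same string produces under different encodings. The sketch below is only an illustration under the assumption (not confirmed anywhere in this thread) that the garbling comes from bytes written in one encoding being read as another; hex_bytes is a made-up helper name, and UTF-16BE is included because the stylesheet asks for UTF-16.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: encode $text as $encoding and return the resulting
# bytes as space-separated hex pairs.
sub hex_bytes {
    my ($encoding, $text) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode($encoding, $text);
}

my $text = "Ch\x{E1}vez";    # "Chávez", with U+00E1 for the acute-a

for my $encoding ('ISO-8859-1', 'UTF-8', 'UTF-16BE') {
    printf "%-10s %s\n", $encoding, hex_bytes($encoding, $text);
}
```

The acute-a is the single byte E1 in ISO-8859-1 but the two bytes C3 A1 in UTF-8, so a reader that assumes the wrong encoding can swallow or mangle the characters that follow it.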

Matt
 
robic0

Matt Garrish said:
[...] And just for fun I ran the string
through the HTML::Entities encode_entities function and got back &aacute; so
I'm not sure how you're getting your output.

Ahh, for elegance like yours I would kill. You are the 'man'...
 
Bart Van der Donck

Rwag said:
The issue we have is that we need to process data files. Those
files contain text such as "Ch&#225;vez", where the acute-a is HTML
encoded.

After processing with a stylesheet and XML::XSLT the text "Ch&#225;vez"
becomes "Ch㡶z".

Is there a way to fix this with XML::XSLT? We would like to have the
result stay as "Ch&#225;vez" or become "Ch&aacute;vez".


This is the stylesheet.
[...]

Our test script is as follows.
===========================================
use XML::XSLT;
($xmlfile, $xslfile) = @ARGV;
my $parser = XML::XSLT->new ($xslfile, warnings => 1);
$parser->transform($xmlfile);
print $parser->toString;
$parser->dispose ();

A route from

Ch&#225;vez

to

Ch㡶z

is beyond my understanding. The &#....; notation can only define a
single character. Are you sure the missing 've' in the second
string is not a typo?

One possible workaround is to prepare your XML file before processing.
That way, you're not passing the numeric character references (&#225;) but
plain characters (á) or HTML entities (&aacute;). XML::XSLT might behave
better with those.

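#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;

# Input with a numeric character reference for the acute-a
my $orig = 'Ch&#225;vez';
# Turn references into plain characters, then into named HTML entities
my $chars = decode_entities($orig);
my $htmlent = encode_entities($chars);

print "Orig string: $orig\n";
print "Decoded to chars: $chars\n";
print "Encoded to html entities: $htmlent\n";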

Hope this helps,
 
Rwag

Hi Bart,

The output I sent was from some tests we had run.

There are two options from what I have seen. The first is declaring
named entities and using those for the encoding, the &aacute; for
example. This is a definite option for us.

The one we are going with right now is to modify the XSL. The
following is an example.

====
<xsl:template match="parag"><P><xsl:value-of select="."
disable-output-escaping="yes"/></P></xsl:template>
====

Disabling output escaping for these characters probably helped.
We also HAD to make sure our output is enclosed in HTML <P> tags.
Otherwise we got the same escaping issues as before when trying to output
HTML.

I hope this is of use to anyone trying XML::XSLT parsing.
 
robic0

Rwag said:
There are two options from what I have seen. The first is declaring
named entities and using those for the encoding, the &aacute; for
example. This is a definite option for us.

I don't see anything special here. Declaring an !ENTITY is a valid declaration.

Rwag said:
The one we are going with right now is to modify the xsl. The
following is an example.

====
<xsl:template match="parag"><P><xsl:value-of select="."
disable-output-escaping="yes"/></P></xsl:template>
====

Nothing special here,

xsl:template
P
xsl:value-of select="."
disable-output-escaping="yes"/
/P
/xsl:template

What's your point?
Van der Donck doesn't know anything I don't know.
But you don't respond to me. I don't really care.
It looks, though, as if you are in desperate need of help!
I don't know all of it yet, but I know 95% of it.
And I've developed a parser that will read and parse your source, and
posted it here: RXParse. I'm writing a follow-up that will write mods
while parsing input.
"xsl:any" is a simple mod. XSLT is actually extremely simple.
There are many "hard" things to do in XML, both in the reading and the writing.
Don't think for a minute that one is mutually exclusive of the other.
I'm in a position to write many extrapolations inclusively.
If you had noticed, I posted RXParse on this site. That represents
high-performance, Perl-only control of XML, et al., that does not otherwise exist.
The master has typed many times now. Yet you do not address him!
 
