LibXML UTF8 - Input is not proper UTF-8, indicate encoding !


Vlajko Knezic

I'm not sure what is going on here, but it has something to do with the way
UTF-8 is handled in Perl and/or LibXML.



The script below:

- accepts a value from a form text field,

- builds an XML document around it,

- serializes the document to a string using toString(),

- parses the string back into an XML document using parse_string(),

- transforms the XML document into an HTML document using an XSL
transformation.



Everything works well until a UTF8 character is entered in the text field
(for example é). In that case the code crashes when it runs parse_string(),
with the message:

=====================================================================

:2: parser error : Input is not proper UTF-8, indicate encoding !
<test><test_text>abcé</test_text></test>
^
:2: error: Bytes: 0xE9 0x3C 0x2F 0x74
<test><test_text>abcé</test_text></test>
^
 at C:/_work/vsurvey/site/test1.cgi line 24

=====================================================================



I know that the code below does not make much sense, but it is an
abstraction of much more complex code. The environment is Perl 5.8, Apache
and Windows XP.



Hints and/or explanations of what was coded wrong and how it should be fixed
are very much appreciated.



Vlajko Knezic,

Toronto, Ontario



---------------------------------------------------------------------------------------------------------------------

test.cgi



#!c:/Perl/bin/Perl.exe

use strict;
use warnings;
use CGI;
use CGI::Carp qw( fatalsToBrowser );
use XML::LibXML;
use XML::LibXSLT;
use Encode;

my $mDocument = XML::LibXML::Document->new();
my $parser    = XML::LibXML->new();
$mDocument->setEncoding("UTF8");

# Read the submitted form value.
my $mCGI = CGI->new;
print $mCGI->header;
my $mTest_text = $mCGI->param('test');

# Build <test><test_text>...</test_text></test> around the value.
my $mTest     = $mDocument->createElement("test");
my $mTestText = $mDocument->createElement("test_text");
$mTestText->appendTextNode($mTest_text);
$mTest->appendChild($mTestText);
$mDocument->setDocumentElement($mTest);

# Serialize the document, then parse the string back in.
# parse_string() is the call that fails when the field contains é.
my $mTestXML       = $mDocument->toString();
my $mParsedTestXML = $parser->parse_string($mTestXML);

# Transform the parsed document into HTML using test.xsl.
my $mParsedXMLXSL  = $parser->parse_file('test.xsl');
my $mParserXSL     = XML::LibXSLT->new();
my $mParsedXSL     = $mParserXSL->parse_stylesheet($mParsedXMLXSL);
my $mPageHTML      = $mParsedXSL->transform($mParsedTestXML);
my $mPrintPageHTML = $mParsedXSL->output_string($mPageHTML);
print $mPrintPageHTML;



test.xsl



<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
  <xsl:output method="html" encoding="UTF-8" indent="yes"
  omit-xml-declaration="yes"/>
  <xsl:template match="//test">
    <head>
      <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
    </head>
    <html>
      <body>
        <xsl:value-of select="test_text"/>
        <form name="test" type="post" target="_self">
          <input type="text" name="test" /><input type="submit" name="button"/>
        </form>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>
 

Brian McCauley

Vlajko said:
Not so sure what is going on here but is something to do with the way UTF8
is handled in Perl and/or LibXML

I've seen something similar - I fixed it by performing an explicit
utf8::upgrade() on every string before passing it to any of the LibXML
library methods.
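
For illustration, here is a minimal sketch of what utf8::upgrade() does to a string like the one in the error message. It assumes the form value arrives as the byte string "abc" followed by the single byte 0xE9 (Latin-1 é); the variable names are made up for this example. The call changes only Perl's internal representation, not the characters themselves, so it can help only when the string already holds the intended characters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumed input: "abc" plus byte 0xE9, as a Latin-1 form value might arrive.
my $text = "abc\xE9";

# Record the internal UTF8 flag before and after the upgrade.
my $flag_before = utf8::is_utf8($text) ? 1 : 0;
utf8::upgrade($text);
my $flag_after  = utf8::is_utf8($text) ? 1 : 0;

# The string still holds the same four characters.
my $char_count = length $text;

# Encoding to UTF-8 octets turns é into two bytes (0xC3 0xA9).
my $octets     = encode('UTF-8', $text);
my $byte_count = length $octets;

printf "flag before=%d after=%d chars=%d utf8 bytes=%d\n",
    $flag_before, $flag_after, $char_count, $byte_count;
```

If the value had really been sent as UTF-8 bytes, upgrading would not be the right tool; the bytes would need decoding instead.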

If that doesn't help then it looks to me like this could be an issue
with CGI.pm and/or your web browser.
Everything works well until UTF8 character is entered in the text field (for
example é) .

I suspect this is not what is happening. It appears that the browser is
sending the form submission data using some other encoding (e.g. Latin1)
and Perl's CGI.pm is assuming it's UTF8, thus generating an invalid UTF-8
string.
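
One way to guard against this, sketched under assumptions: try a strict UTF-8 decode of the raw parameter bytes and fall back to Latin-1 when that fails. The helper name decode_form_value is hypothetical, and ISO-8859-1 as the fallback is only a guess at what the browser sent; only Encode's decode() and its FB_CROAK and LEAVE_SRC flags are real API here.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn raw form bytes into a Perl character string.
sub decode_form_value {
    my ($bytes) = @_;
    return unless defined $bytes;

    # Strict decode: croaks on malformed UTF-8; LEAVE_SRC keeps $bytes intact.
    my $chars = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $chars if defined $chars;

    # Assumption: anything that is not valid UTF-8 was sent as Latin-1.
    return decode('ISO-8859-1', $bytes);
}

# Both spellings of "abcé" come out as the same four-character string.
my $from_latin1 = decode_form_value("abc\xE9");
my $from_utf8   = decode_form_value("abc\xC3\xA9");
print "same string\n" if $from_latin1 eq $from_utf8;
```

In the script from the question, something like this would be applied to the value returned by $mCGI->param('test') before it is passed to appendTextNode().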
<input type="text" name="test" />

Since your text field does not specify an encoding the browser is free
to choose any it likes but the recommendation is that it should choose
the one used to encode the document containing the form. I notice your
document contains:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

I would not expect Content-Type to be settable via <meta> but then again
I may be wrong.
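
Whatever <meta> does, the charset can also be declared in the real HTTP header, which removes the doubt. A minimal sketch follows, using a hypothetical helper content_type_header that builds the header line by hand. With CGI.pm, passing -charset => 'utf-8' to header() should give the same result.

```perl
use strict;
use warnings;

# Hypothetical helper: build a Content-Type header that names the charset.
sub content_type_header {
    my ($charset) = @_;
    # The blank line (second CRLF) ends the HTTP header section.
    return "Content-Type: text/html; charset=$charset\r\n\r\n";
}

my $header = content_type_header('utf-8');
print $header;

# CGI.pm equivalent (not run here):
#   print $mCGI->header(-type => 'text/html', -charset => 'utf-8');
```

A page served this way tells the browser it is UTF-8, and browsers normally submit form data in the encoding of the page that contains the form.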

For more informed discussion about this I suggest you
go to a group where discussion of how browsers handle HTML forms is
on-topic.
 

Wes Groleau

Brian said:
document contains:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>

I would not expect Content-Type to be settable via <meta> but then again
I may be wrong.

That tag works on all my web pages, on numerous Mac and windoze browsers.

--
Wes Groleau

Truth often suffers more from the heat of its defenders
than from the arguments of its opposers.
-- William Penn
 
