UTF-8 problem

Todor Vachkov

Hello all,

I'm trying to convert an exported XML file into a Perl data structure with the XML::LibXML module.
I get this error message:
Entity: line 315442: parser error : Input is not proper UTF-8, indicate
encoding !
Bytes: 0xE2 0x26 0x6C 0x74

I thought the solution would be:
open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');
my $parser = XML::LibXML->new();
my $dom = $parser->parse_fh($fh);
my $root = $dom->getDocumentElement;

but this produces a very long list of error messages (maybe one for each character parsed in the XML file):
utf8 "\xE2" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
..
..
utf8 "\xE4" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
Segmentation fault

The segmentation fault always occurs at the same \xE4 character, but that's a secondary problem.
I just want the module to parse the XML file, which is really large (over 20 MB)
and was exported from another piece of software, so I have no influence over what goes into it.

I hope you can help me! Thanks in advance!

Greetings Todor
 
A. Sinan Unur

Todor said:
Hello all,

I'm trying to convert an exported XML file into a Perl data structure
with the XML::LibXML module. I get this error message:

I thought the solution would be:

The file contents are not UTF-8. Specify the real encoding.

Sinan

PS: Avoid RxParse
 
Ted Zlatanov

TV> Hello all,
TV> I'm trying to convert an exported XML file into a Perl data structure with the XML::LibXML module.
TV> I get this error message:

TV> I thought the solution would be:

TV> but this produces a very long list of error messages (maybe one for each character parsed in the XML file):
TV> .
TV> .
TV> .
TV> The segmentation fault always occurs at the same \xE4 character, but that's a secondary problem.
TV> I just want the module to parse the XML file, which is really large (over 20 MB)
TV> and was exported from another piece of software, so I have no influence over what goes into it.

Can you post the first 50 lines of the file, or put a smaller
complete version of it online somewhere so we can examine it? Your post
doesn't give us enough to find the problem (we can only guess that
your input file is not valid).

Ted
 
Todor Vachkov

Thanks for your replies!

The XML file is really huge: it has 666,025 lines and is the result of an export from another piece of software.

It contains:
- the meta description of the software itself (I am pretty sure that it conforms to UTF-8)
- form inputs made by users, who fill in information about several databases. The goal is
to have a distributed search engine. (Again, I assume that the software also saves the
inputs in UTF-8.)
- Perl scripts for each database, written by various programmers. The scripts are
the interfaces between the databases and the software (the UTF-8 encoding of the scripts is not guaranteed).
All of this is contained in the huge XML file.

Parsing the file with XML::LibXML gives:
Entity: line 315442: parser error : Input is not proper UTF-8, indicate                                
encoding !
Bytes: 0xE2 0x26 0x6C 0x74

I've figured out that these are the characters:

* U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
â (Â)

* U+0026 AMPERSAND
&

* U+006C LATIN SMALL LETTER L
l (L)

* U+0074 LATIN SMALL LETTER T
t (T)

Line 315442 looks like this (the problem character is the â inside the sup element):
<line>&lt;refpt id=&quot;bafn1&quot;/&gt;&lt;lk refid=&quot;afn1&quot;&gt;&lt;sup&gt;â&lt;/sup&gt;&lt;/lk&gt;</line>

The <line></line> element contains a single line from a Perl script, as mentioned above. The byte 0xE2 is where
the parser stopped, at line 315442; it got quite far, almost halfway through the file.

It seems that the embedded Perl scripts are my problem. I'm wondering why the parser treats this single character
as invalid UTF-8. Could I somehow tell the parser to ignore it?

Thanks for your help!

Greetings, Todor
 
Martijn Lievaart

Todor said:
Parsing the file with XML::LibXML gives:

I've figured out that these are the characters:

* U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
â (Â)

U+00E2 is a Unicode code point. In UTF-8 encoding it would be a two-byte
sequence (0xC3 0xA2), not a single 0xE2 byte. So your input is not proper UTF-8.

HTH,
M4
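
To make Martijn's point concrete, here is a minimal sketch using the core Encode module. The helper name is_valid_utf8 is made up for this illustration. It prints the UTF-8 bytes for U+00E2 and then checks the four bytes from the parser error against a strict UTF-8 decoder.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # Work on a copy, because decode() with a CHECK value may modify its argument.
    my $copy = $bytes;
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# U+00E2 encoded as UTF-8 is two bytes.
my $utf8_hex = join ' ', map { sprintf '%02X', ord($_) }
               split //, encode('UTF-8', "\x{E2}");
print "U+00E2 in UTF-8: $utf8_hex\n";                      # C3 A2

# The bytes reported by the parser: 0xE2 announces a three-byte sequence,
# but the next byte (0x26, '&') is not a continuation byte.
my $from_error = "\xE2\x26\x6C\x74";
my $proper     = "\xC3\xA2\x26\x6C\x74";

printf "bytes from the error message: %s\n",
    is_valid_utf8($from_error) ? 'valid' : 'not valid';    # not valid
printf "same text, properly encoded:  %s\n",
    is_valid_utf8($proper) ? 'valid' : 'not valid';        # valid
```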
 
Todor Vachkov

Martijn said:
U+00E2 is a Unicode code point. In UTF-8 encoding it would be a two-byte
sequence (0xC3 0xA2), not a single 0xE2 byte. So your input is not proper UTF-8.

Thanks for your posting!

The parser says:
Bytes: 0xE2 0x26 0x6C 0x74
So 0xE2 is meant to be the problematic character.

U+00E2 was not in the error message; I just pasted the output of my check on Linux with:
user@timemashine:~$ unicode 0xe2
U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
UTF-8: c3 a2 UTF-16BE: 00e2 Decimal: &#226;
â (Â)
Uppercase: U+00C2
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)
Decomposition: 0061 0302

Greetings Todor
 
Martijn Lievaart

Todor said:
Thanks for your posting!

The parser says:
So 0xE2 is meant to be the problematic character.

U+00E2 was not in the error message; I just pasted the output of my
check on Linux with:
user@timemashine:~$ unicode 0xe2
U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
UTF-8: c3 a2 UTF-16BE: 00e2 Decimal: &#226;
â (Â)
Uppercase: U+00C2
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)
Decomposition: 0061 0302

But the byte 0xE2 is the problematic character. On its own it is not UTF-8! Your
input file is most probably encoded in ISO-8859-1 (Latin-1) or ISO-8859-15, not
UTF-8.

M4
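
If the Latin-1 guess is right, the offending bytes can be transcoded rather than ignored. The following is only a sketch using the core Encode module, and it assumes the source encoding really is ISO-8859-1, which nothing in the error message proves.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The four bytes from the parser error message.
my $bytes = "\xE2\x26\x6C\x74";

# Assumption: the bytes are ISO-8859-1. In that encoding every byte maps
# directly to the code point with the same number, so 0xE2 becomes U+00E2.
my $text = decode('ISO-8859-1', $bytes);
printf "first character: U+%04X\n", ord(substr($text, 0, 1));   # U+00E2

# Re-encode the character string as UTF-8.
my $utf8 = encode('UTF-8', $text);
my $hex  = join ' ', map { sprintf '%02X', ord($_) } split //, $utf8;
print "as UTF-8 bytes: $hex\n";                                  # C3 A2 26 6C 74
```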
 
Peter J. Holzer

Todor said:
Hello all,

I'm trying to convert an exported XML file into a Perl data structure with the XML::LibXML module.
I get this error message:

I thought the solution would be:

Don't do this. XML files contain an indication of their encoding, so you
should treat them as binary files:

open(my $fh, "< :raw" ,'/foodir/export.xml');

and let the XML parser do the rest.

If that doesn't work, the encoding stored in the file is probably
wrong, either because the generating software was buggy or because
someone already incorrectly converted the file. You may have luck by
fixing the encoding (it should be in the first line, which looks like
this:

<?xml version="1.0" encoding="UTF-8" ?>

If the encoding is missing, UTF-8 is assumed).
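
As a rough illustration of where the parser gets its encoding from, the sketch below pulls the declared encoding out of the first bytes of a file. The function name declared_encoding is hypothetical, and the regex is a simplification: it assumes an ASCII-compatible encoding with no byte order mark, so it will not recognise UTF-16 files.

```perl
use strict;
use warnings;

# Hypothetical helper: return the encoding named in the XML declaration,
# or 'UTF-8' when the declaration or its encoding attribute is absent.
# $head is the start of the file, read in :raw mode, for example:
#   open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
#   read($fh, my $head, 200);
sub declared_encoding {
    my ($head) = @_;
    if ($head =~ /\A<\?xml\b[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/) {
        return $2;
    }
    return 'UTF-8';
}

print declared_encoding('<?xml version="1.0" encoding="UTF-8" ?>'), "\n";      # UTF-8
print declared_encoding('<?xml version="1.0" encoding="ISO-8859-1"?>'), "\n";  # ISO-8859-1
print declared_encoding('<?xml version="1.0"?>'), "\n";                        # UTF-8
```

If the declared name does not match the bytes actually in the file, the parser fails in the way shown at the top of this thread.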
 
Todor Vachkov

Peter said:
Don't do this. XML-files contain an indication of their encoding, you
should treat them as binary files

open(my $fh, "< :raw" ,'/foodir/export.xml');

and let the XML parser do the rest.

If that doesn't work, the encoding stored in the file is probably
wrong, either because the generating software was buggy or because
someone already incorrectly converted the file. You may have luck by
fixing the encoding (it should be in the first line which looks like
this:

<?xml version="1.0" encoding="UTF-8" ?>

If the encoding is missing, UTF-8 is assumed).

Thanks for your reply, Peter!

I'm now using XML::Smart, so I don't have the UTF-8 problem anymore.
The file has the declaration
<?xml version="1.0" encoding="UTF-8" ?>
As I already mentioned, it contains source code from Perl scripts, and I
found out that some of them are ISO-8859-1 encoded. The German umlauts in particular caused trouble, as you know ;)

Greetings,
Todor
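
For a file like this one, declared as UTF-8 but with ISO-8859-1 bytes mixed in, one possible workaround is to clean the bytes before parsing. The sketch below is a heuristic, not a method anyone in this thread used: repair_mixed is a made-up name, and it assumes every byte that is not valid UTF-8 is ISO-8859-1. Latin-1 byte pairs that happen to form valid UTF-8 are left untouched.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: keep well-formed UTF-8 as it is and reinterpret
# every other byte as ISO-8859-1, returning a clean UTF-8 byte string.
sub repair_mixed {
    my ($bytes) = @_;
    my $out = '';
    while (length $bytes) {
        # FB_QUIET decodes the longest valid prefix and leaves the
        # unprocessed remainder in $bytes.
        $out .= decode('UTF-8', $bytes, Encode::FB_QUIET);
        last unless length $bytes;
        # The first remaining byte is not valid UTF-8 here:
        # remove it from $bytes and treat it as ISO-8859-1 (an assumption).
        $out .= decode('ISO-8859-1', substr($bytes, 0, 1, ''));
    }
    return encode('UTF-8', $out);
}

# The lone 0xE2 from line 315442 becomes the two bytes C3 A2.
my $fixed = repair_mixed("\xE2&lt;/sup&gt;");
print join(' ', map { sprintf '%02X', ord($_) } split //, substr($fixed, 0, 2)), "\n";  # C3 A2
```

The repaired bytes could then be written to a new file and handed to the XML parser in :raw mode.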