UTF-8 problem

Discussion in 'Perl Misc' started by Todor Vachkov, Aug 21, 2007.

  1. Hello all,

    I'm trying to convert an exported XML file into a Perl data structure with the XML::LibXML module.
    I get this error message:

    >Entity: line 315442: parser error : Input is not proper UTF-8, indicate
    >encoding !
    >Bytes: 0xE2 0x26 0x6C 0x74


    I thought the solution would be:

    >open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');
    >my $parser = XML::LibXML->new();
    >my $dom = $parser->parse_fh($fh);
    >my $root = $dom->getDocumentElement;


    but this produces a very long list of error messages (perhaps one for each parsed character in the XML file):

    >utf8 "\xE2" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.

    ..
    ..
    ..
    >utf8 "\xE4" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
    >Segmentation fault


    The segmentation fault always occurs at the same \xE4 character, but that is a secondary problem.
    I just want the module to parse the XML file, which is really large (over 20 MB)
    and was exported from another piece of software, so I have no influence over what goes into it.

    I hope you can help me! Thanks in advance!

    Greetings Todor
    Todor Vachkov, Aug 21, 2007
    #1

  2. Todor Vachkov <-berlin.de> wrote:

    > Hello all,
    >
    > I'm trying to convert an exported XML file into a Perl data structure
    > with the XML::LibXML module. I get this error message:
    >
    >>Entity: line 315442: parser error : Input is not proper UTF-8,
    >>indicate encoding !
    >>Bytes: 0xE2 0x26 0x6C 0x74

    >
    > I thought the solution would be:
    >
    >>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');


    The file contents are not UTF-8. Specify the real encoding.

    Sinan

    PS: Avoid RxParse

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)
    clpmisc guidelines: <URL:http://www.augustmail.com/~tadmc/clpmisc.shtml>
    A. Sinan Unur, Aug 22, 2007
    #2

  3. On Wed, 22 Aug 2007 00:23:17 +0200 Todor Vachkov <-berlin.de> wrote:

    TV> Hello all,
    TV> I'm trying to convert an exported XML file into a Perl data structure with the XML::LibXML module.
    TV> I get this error message:

    >> Entity: line 315442: parser error : Input is not proper UTF-8, indicate
    >> encoding !
    >> Bytes: 0xE2 0x26 0x6C 0x74


    TV> I thought the solution would be:

    >> open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');
    >> my $parser = XML::LibXML->new();
    >> my $dom = $parser->parse_fh($fh);
    >> my $root = $dom->getDocumentElement;


    TV> but this produces a very long list of error messages (perhaps one for each parsed character in the XML file):

    >> utf8 "\xE2" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.

    TV> .
    TV> .
    TV> .
    >> utf8 "\xE4" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
    >> Segmentation fault


    TV> The segmentation fault always occurs at the same \xE4 character, but that is a secondary problem.
    TV> I just want the module to parse the XML file, which is really large (over 20 MB)
    TV> and was exported from another piece of software, so I have no influence over what goes into it.

    Can you post the first 50 lines of the file, or put a smaller
    complete version of it online somewhere so we can examine it? Your post
    doesn't help at all with finding the problem (we can only guess that
    your input file is not valid).

    Ted
    Ted Zlatanov, Aug 22, 2007
    #3
  4. Thanks for your replies!

    The XML file is really huge: it has 666,025 lines and is the result of an export from a software package.

    It contains:
    - the meta description of the software itself (I am pretty sure that it conforms to UTF-8)
    - form inputs made by users, who fill in information about several databases.
    The goal is to have a distributed search engine. (Again, I assume that the software
    also saves the inputs in UTF-8.)
    - Perl scripts for each database, written by various programmers. The scripts are
    the interfaces between the databases and the software (the UTF-8 encoding of the scripts is not guaranteed)
    All of this is contained in the huge XML file.

    Parsing the file with XML::LibXML gives:

    >Entity: line 315442: parser error : Input is not proper UTF-8, indicate                                
    >encoding !
    >Bytes: 0xE2 0x26 0x6C 0x74


    I've figured out that these are the characters:

    * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    â (Â)

    * U+0026 AMPERSAND
    &

    * U+006C LATIN SMALL LETTER L
    l (L)

    * U+0074 LATIN SMALL LETTER T
    t (T)

    Line 315442 looks like this (the â between &lt;sup&gt; and &lt;/sup&gt; is the byte 0xE2):
    ><line>&lt;refpt id=&quot;bafn1&quot;/&gt;&lt;lk refid=&quot;afn1&quot;&gt;&lt;sup&gt;â&lt;/sup&gt;&lt;/lk&gt;</line>

    The element <line></line> contains a single line from a Perl script, as mentioned above. The byte 0xE2 is the point
    where the parser stopped, at line 315442; it got quite far, almost halfway through the file.

    It seems that the embedded Perl scripts are my problem. I'm wondering why the parser treats this single character
    as invalid UTF-8. Could I somehow tell the parser to ignore it?

    Thanks for your help!

    Greetings, Todor
    Todor Vachkov, Aug 22, 2007
    #4
  5. On Wed, 22 Aug 2007 17:55:50 +0200, Todor Vachkov wrote:

    > Parsing the file with XML::LibXML gives:
    >
    > >Entity: line 315442: parser error : Input is not proper UTF-8,
    > >indicate encoding !
    > >Bytes: 0xE2 0x26 0x6C 0x74

    >
    > I've figured out that these are the characters:
    >
    > * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    > â (Â)


    U+00E2 is the Unicode code point. In UTF-8 encoding this would be a
    two-byte sequence (0xC3 0xA2), so your input is not proper UTF-8.

    HTH,
    M4
    Martijn Lievaart, Aug 22, 2007
    #5
  6. Martijn Lievaart wrote:

    >> Parsing the file with XML::LibXML gives:
    >>
    >> >Entity: line 315442: parser error : Input is not proper UTF-8,
    >> >indicate encoding !
    >> >Bytes: 0xE2 0x26 0x6C 0x74

    >>
    >> I've figured out that these are the characters:
    >>
    >> * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    >> â (Â)

    >
    > U+00E2 is the Unicode code point. In UTF-8 encoding this would be a
    > two-byte sequence (0xC3 0xA2), so your input is not proper UTF-8.


    Thanks for your posting!

    The parser says:
    >Bytes: 0xE2 0x26 0x6C 0x74

    So 0xE2 is meant to be the problematic character.

    U+00E2 was not in the error message; I just pasted the output of my check on Linux with:
    user@timemashine:~$ unicode 0xe2
    U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    UTF-8: c3 a2 UTF-16BE: 00e2 Decimal: &#226;
    â (Â)
    Uppercase: U+00C2
    Category: Ll (Letter, Lowercase)
    Bidi: L (Left-to-Right)
    Decomposition: 0061 0302

    Greetings Todor
    Todor Vachkov, Aug 22, 2007
    #6
  7. On Wed, 22 Aug 2007 21:52:16 +0200, Todor Vachkov wrote:

    > Martijn Lievaart wrote:
    >
    >>> Parsing the file with XML::LibXML gives:
    >>>
    >>> >Entity: line 315442: parser error : Input is not proper
    >>> >UTF-8, indicate encoding !
    >>> >Bytes: 0xE2 0x26 0x6C 0x74
    >>>
    >>> I've figured out that these are the characters:
    >>>
    >>> * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    >>> â (Â)

    >>
    >> U+00E2 is the Unicode code point. In UTF-8 encoding this would be a
    >> two-byte sequence (0xC3 0xA2), so your input is not proper UTF-8.

    >
    > Thanks for your posting!
    >
    > The parser says:
    > >Bytes: 0xE2 0x26 0x6C 0x74

    > So 0xE2 is meant to be the problematic character.
    >
    > U+00E2 was not in the error message; I just pasted the output of my
    > check on Linux with:
    > user@timemashine:~$ unicode 0xe2
    > U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
    > UTF-8: c3 a2 UTF-16BE: 00e2 Decimal: &#226;
    > â (Â)
    > Uppercase: U+00C2
    > Category: Ll (Letter, Lowercase)
    > Bidi: L (Left-to-Right)
    > Decomposition: 0061 0302


    But the single byte 0xE2 is the problematic character. On its own it is
    not valid UTF-8! Your input file is most probably encoded in Latin-1
    (ISO-8859-1) or Latin-9 (ISO-8859-15), not UTF-8.

    M4
    Martijn Lievaart, Aug 22, 2007
    #7
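
A minimal sketch of the point made above, assuming only the core Encode module: it checks whether a byte string decodes as strict UTF-8. The helper name is_valid_utf8 is made up for this illustration. The first sample is the byte sequence from the parser error, the second is the same text with U+00E2 encoded as c3 a2.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return 1 if the byte string decodes as strict UTF-8, 0 otherwise.
# (is_valid_utf8 is a hypothetical helper, not part of any module.)
sub is_valid_utf8 {
    my ($bytes) = @_;
    # decode() may modify its input when a CHECK value is given, so use a copy.
    my $copy = $bytes;
    return eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my $reported = "\xE2\x26\x6C\x74";   # bytes from the error message: 0xE2 & l t
my $reencoded = "\xC3\xA2&lt";       # U+00E2 as the UTF-8 pair c3 a2, then "&lt"

printf "reported bytes valid UTF-8? %s\n", is_valid_utf8($reported)  ? 'yes' : 'no';
printf "c3 a2 form valid UTF-8?     %s\n", is_valid_utf8($reencoded) ? 'yes' : 'no';
```

In UTF-8 the byte 0xE2 announces a three-byte sequence, so it must be followed by two continuation bytes in the range 0x80 to 0xBF. Here it is followed by 0x26 (the ampersand), which is why the parser rejects it.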
  8. On 2007-08-21 22:23, Todor Vachkov <-berlin.de> wrote:
    > Hello all,
    >
    > I'm trying to convert an exported XML file into a Perl data structure with the XML::LibXML module.
    > I get this error message:
    >
    >>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
    >>encoding !
    >>Bytes: 0xE2 0x26 0x6C 0x74

    >
    > I thought the solution would be:
    >
    >>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');


    Don't do this. XML files contain an indication of their encoding, so you
    should treat them as binary files:

    open(my $fh, "< :raw" ,'/foodir/export.xml');

    and let the XML parser do the rest.

    If that doesn't work, the encoding stored in the file is probably
    wrong, either because the generating software was buggy or because
    someone already incorrectly converted the file. You may have luck
    fixing the encoding declaration (it should be in the first line, which
    looks like this:

    <?xml version="1.0" encoding="UTF-8" ?>

    If the encoding is missing, UTF-8 is assumed).

    --
    _ | Peter J. Holzer | I know I'd be respectful of a pirate
    |_|_) | Sysadmin WSR | with an emu on his shoulder.
    | | | |
    __/ | http://www.hjp.at/ | -- Sam in "Freefall"
    Peter J. Holzer, Aug 25, 2007
    #8
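
If the declared encoding and the actual bytes disagree, it can help to find out which lines are affected before touching the declaration. The following is only a sketch under stated assumptions: it uses the core Encode module, the function name invalid_utf8_lines is invented here, and an in-memory sample stands in for the real export file. The handle is opened with :raw, as recommended above, so Perl does no decoding of its own.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Return the 1-based numbers of all lines that are not strict UTF-8.
# The handle must be opened with :raw. (Hypothetical helper for illustration.)
sub invalid_utf8_lines {
    my ($fh) = @_;
    my @bad;
    while (my $line = <$fh>) {
        my $copy = $line;    # decode() may modify its argument
        push @bad, $. unless eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    }
    return @bad;
}

# In-memory sample: line 2 holds a lone 0xE2 byte, line 3 is proper UTF-8.
my $sample = "<line>ok</line>\n<line>\xE2&lt;</line>\n<line>\xC3\xA4</line>\n";
open(my $fh, '<:raw', \$sample) or die "cannot open in-memory sample: $!";
my @bad = invalid_utf8_lines($fh);
close $fh;
print "lines with invalid UTF-8: @bad\n";
```

Reading line by line is safe here because the newline byte 0x0A never occurs inside a multi-byte UTF-8 sequence. For the real file, the open call would take the path instead of the scalar reference.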
  9. Peter J. Holzer wrote:

    > On 2007-08-21 22:23, Todor Vachkov <-berlin.de> wrote:
    >> Hello all,
    >>
    >> I'm trying to convert an exported XML file into a Perl data structure with
    >> the XML::LibXML module. I get this error message:
    >>
    >>>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
    >>>encoding !
    >>>Bytes: 0xE2 0x26 0x6C 0x74

    >>
    >> I thought the solution would be:
    >>
    >>>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');

    >
    > Don't do this. XML files contain an indication of their encoding, so you
    > should treat them as binary files:
    >
    > open(my $fh, "< :raw" ,'/foodir/export.xml');
    >
    > and let the XML parser do the rest.
    >
    > If that doesn't work, the encoding stored in the file is probably
    > wrong, either because the generating software was buggy or because
    > someone already incorrectly converted the file. You may have luck
    > fixing the encoding declaration (it should be in the first line, which
    > looks like this:
    >
    > <?xml version="1.0" encoding="UTF-8" ?>
    >
    > If the encoding is missing, UTF-8 is assumed).
    >

    Thanks for your reply Peter!

    I'm now using XML::Smart, so I don't have the UTF-8 problem anymore.
    The file has the declaration
    <?xml version="1.0" encoding="UTF-8" ?>
    As I already mentioned, it contains source code from Perl scripts, and I
    found out that some of them are ISO-8859-1 encoded. The German umlauts in particular caused trouble, as you know ;)

    Greetings,
    Todor
    Todor Vachkov, Aug 25, 2007
    #9
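
For a file that declares UTF-8 but contains some ISO-8859-1 lines, one possible repair is to normalise it line by line before parsing. This is a sketch, not what was actually done in this thread: it assumes the only other encoding present is ISO-8859-1, uses the core Encode module, and the name decode_mixed_line is invented for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode one raw line: try strict UTF-8 first, fall back to ISO-8859-1.
# The fallback encoding is an assumption; ISO-8859-1 accepts every byte,
# so the fallback itself never fails. (Hypothetical helper.)
sub decode_mixed_line {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() may modify its argument
    my $text = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

my $latin1_line = "M\x{E4}dchen\n";          # "ä" as one ISO-8859-1 byte
my $utf8_line   = "M\x{C3}\x{A4}dchen\n";    # "ä" as the UTF-8 pair c3 a4

for my $raw ($latin1_line, $utf8_line) {
    my $fixed = encode('UTF-8', decode_mixed_line($raw));
    # Print the resulting bytes in hex so both results can be compared.
    print join(' ', map { sprintf '%02x', ord } split //, $fixed), "\n";
}
```

Both sample lines should come out as the same UTF-8 byte sequence. The approach can misjudge an ISO-8859-1 line whose bytes happen to form valid UTF-8, so the result is worth spot-checking on real data.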