Intermittent Character Encoding Issues

Discussion in 'Perl Misc' started by David Murray-Rust, Nov 4, 2003.

  1. Hi all,

    Please excuse the long post, but this seems to be a subtle bug, which
    I have been attacking for a while.

    I'm having a problem with character encodings in perl 5.8. The overall
    effect is that certain characters, in particular the UK pound symbol,
    are turned into two characters, generally a capital A circumflex
    followed by the correct character. This would appear to be a simple
    character encoding issue, but there are a few caveats:

    - It only happens on one machine. Taking a disk image of the OS and
    running it on different hardware results in a system without the
    problem.

    - It can be intermittent. Two separate instances of (apparently) the
    same problem have been found. The first happened about 1% of the
    time the code was run. The second happened every time the code was
    run.


    More Detail:

    The application is a web based content management system, running
    under Apache/mod_perl with a MySQL back end. The machine in question
    is running Slackware 9, Perl 5.8.1 and kernel 2.4.20.

    The precise nature of the bug is that a character represented by \243
    (163 decimal) in the iso-8859-1 character set is replaced by two
    octets, \302\243, in some places. It appears that perl is converting
    the data to a unicode representation and forgetting that it has done
    this.
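
    As a minimal sketch of the symptom (not the application's actual
    code), the following shows how the single ISO-8859-1 octet \243
    becomes \302\243 once the pound sign is encoded as UTF-8, and how
    those two octets read back as capital A circumflex plus pound sign if
    they are later interpreted as ISO-8859-1. The variable names and the
    octal_dump helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render each octet/character of a string as \ooo octal.
sub octal_dump {
    my ($string) = @_;
    return join '', map { sprintf('\\%03o', ord($_)) } split(//, $string);
}

# The pound sign is code point U+00A3.
my $pound = "\x{A3}";

my $latin1_octets = encode("iso-8859-1", $pound);   # one octet
my $utf8_octets   = encode("UTF-8", $pound);        # two octets

# Misreading the UTF-8 octets as ISO-8859-1 yields two characters:
# U+00C2 (capital A circumflex) followed by U+00A3 (pound sign).
my $misread = decode("iso-8859-1", $utf8_octets);

print "ISO-8859-1: ", octal_dump($latin1_octets), "\n";   # \243
print "UTF-8:      ", octal_dump($utf8_octets), "\n";     # \302\243
print "misread as ", length($misread), " characters\n";   # 2
```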

    The first version of the bug was that after the line:

    $contentList = [ join '', @$contentList ] unless $separate;

    certain characters in the entries in @$contentList would be changed to
    two-byte versions. This only happened about 1% of the time this code
    was run. Changing the above line to be:

    unless( $separate )
    {
        my $tmp = "";
        foreach my $contentBit ( @$contentList )
        {
            $tmp .= $contentBit;
        }
        $contentList = [ $tmp ];
    }

    made the problem go away. In this case, the data comes directly from
    the mysql database. It has been verified that the string is encoded
    correctly up until that line, and wrongly afterwards.


    In the second version of the bug, the line:

    return $return . $parent;

    resulted in a string being returned where all the pound signs in
    $return had been altered. If a different string to $parent is
    appended, there is no problem. The current solution is:

    my $tmpParent = encode( "iso-8859-1", $parent );
    return $return . $tmpParent;

    NOTE: the characters which are altered are those in $return, while the
    string whose encoding I am playing with is in $parent.

    In this case, there is data in $parent which comes via CGI, so I would
    be able to believe an explanation along the lines of "$parent is
    magically recognised as utf8, so when it is added to $return, $return
    is converted to utf8 octets before they are joined", but I would find
    this quite counterintuitive, since as I understand things perl uses
    its own internal representation for strings, and should only need to
    convert on the way in or out.
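
    One plausible reading of this symptom, sketched below under stated
    assumptions: when a string without the UTF8 flag is concatenated with
    one that has it, perl upgrades the unflagged operand, so the pound
    sign that came from $return is stored internally as \302\243. At the
    Perl level the result is still the same characters; it only looks
    wrong to code that reads the internal buffer without checking the
    flag. Whether that is what happens in the mod_perl or database layer
    here is an assumption, not something confirmed in this thread. The
    sample strings are made up.

```perl
use strict;
use warnings;

# Stand-in for $return: 8 characters stored as 8 octets, UTF8 flag off.
my $return = "cost: \x{A3}5";

# Stand-in for $parent: plain ASCII text whose UTF8 flag happens to be on.
my $parent = "<parent/>";
utf8::upgrade($parent);

# Concatenation upgrades the unflagged operand, so the result is flagged.
my $joined = $return . $parent;

# Count the octets in perl's internal buffer (for inspection only).
my $internal_octets = do { use bytes; length($joined) };

print "return flagged: ", (utf8::is_utf8($return) ? "yes" : "no"), "\n";   # no
print "joined flagged: ", (utf8::is_utf8($joined) ? "yes" : "no"), "\n";   # yes
print "characters:      ", length($joined), "\n";                          # 17
print "internal octets: ", $internal_octets, "\n";                         # 18
```

    The one extra internal octet is the \302 in front of the pound sign.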

    With respect to machine dependence, it happens on only one machine
    which is running our software. To create a test platform, we took a
    disk image of the system partition, loaded it onto a new machine and
    compiled a new kernel which differed only in network card support.
    This new machine did not exhibit the problem. As we were originally
    running perl 5.8.0, we tried upgrading to 5.8.1, but this had no
    effect.

    So, to sum up,

    - Can anyone explain what is going on here: the intermittent
    occurrences, the machine dependence and the general behaviour?

    - Can anyone suggest a way to avoid these problems?

    (For the record, I've read the perldoc on perlunicode and utf8, lurked
    for a while, read the Google archives and read a fair amount on character
    encodings)

    Thanks to anyone who's made it this far for your time,
    Dave Murray-Rust
     
    David Murray-Rust, Nov 4, 2003
    #1

  2. David Murray-Rust

    Ben Morrow Guest

    David Murray-Rust <> wrote:
    > The first version of the bug was that after the line:
    >
    > $contentList = [ join '', @$contentList ] unless $separate;
    >
    > certain characters in the entries in @$contentList would be changed to
    > two-byte versions. This only happened about 1% of the time this code
    > was run. Changing the above line to be:
    >
    > unless( $separate )
    > {
    > my $tmp = "";
    > foreach my $contentBit ( @$contentList )
    > {
    > $tmp .= $contentBit;
    > }
    > $contentList = [ $tmp ];
    > }
    >
    > made the problem go away. In this case, the data comes directly from
    > the mysql database. It has been verified that the string is encoded
    > correctly up until that line, and wrongly afterwards.


    How perl stores the data internally should be considered none of your
    business. (It is in fact either iso8859-1 or utf8 on ASCII machines,
    with a flag set on each scalar to say which. It is easier, however, to
    regard a text string as being a set of Unicode characters, and not
    worry about how they are represented.) However, it may be that how it
    is stored in your mysql database is confusing perl, if the code you
    are using to interface to the database doesn't correctly decode the
    data into perl's own encoding. In particular, if you use iso8859-1 you
    may get bitten far more irregularly than if you use other encodings.

    Decide on how you are going to encode text in the database: I
    shall assume you wish to use iso8859-1. Now, every piece of textual
    (as opposed to binary) data you write into the database should first
    be converted from a sequence of characters into a sequence of octets,
    using Encode::encode; and every piece of textual data you read back
    from the database should be converted from octets into character data
    using Encode::decode. So, in the example above, you would write:

    use Encode qw/decode/;

    my $tmp = "";
    foreach my $contentBit (@$contentList) {
        # decode each chunk of database octets into characters
        $tmp .= decode "iso8859-1", $contentBit;
    }
    $contentList = [ $tmp ];

    (assuming you didn't decode it closer to where it was read from the
    database).
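
    The decode-on-read, encode-on-write discipline described above might
    look like the following sketch. The helper names text_from_db and
    text_to_db, the $DB_ENCODING variable and the sample data are all
    hypothetical, and the choice of ISO-8859-1 is the assumption stated
    above; nothing here talks to a real database.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: text is stored in the database as ISO-8859-1 octets.
my $DB_ENCODING = "iso-8859-1";

# Hypothetical helper: octets read from the database -> character string.
sub text_from_db {
    my ($octets) = @_;
    return decode($DB_ENCODING, $octets);
}

# Hypothetical helper: character string -> octets to write to the database.
sub text_to_db {
    my ($text) = @_;
    return encode($DB_ENCODING, $text);
}

# Stand-in for a value fetched from MySQL: pound sign as the octet \243.
my $stored = "\xA3" . "10 off";

my $text = text_from_db($stored);   # work with characters from here on
$text .= " today";

my $to_store = text_to_db($text);   # back to octets at the border

print length($text), " characters, ", length($to_store), " octets\n";
```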

    > In the second version of the bug, the line:
    >
    > return $return . $parent;
    >
    > resulted in a string being returned where all the pound signs in
    > $return had been altered. If a different string to $parent is
    > appended, there is no problem.


    So what does $parent contain, which causes this problem? And what is
    the result of
    use Encode qw/is_utf8/;
    warn is_utf8($parent) ?
    "\$parent is chars internally" :
    "\$parent is bytes internally";

    ?

    > The current solution is:
    >
    > my $tmpParent = encode( "iso-8859-1", $parent );
    > return $return . $tmpParent;


    This is almost certainly Wrong, as $tmpParent will here be considered
    to be a string of octets rather than a sequence of characters. The
    Right Answer is to make sure $return is considered to be a sequence of
    characters as well.
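
    A minimal sketch of that approach, assuming $return holds ISO-8859-1
    octets and $parent already holds characters. The sample strings and
    the $chars and $output names are invented for illustration, and where
    the final encode belongs in the real application is not something
    this thread establishes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for $return: ISO-8859-1 octets, e.g. built from database data.
my $return = "<price>\xA3" . "5</price>";

# Stand-in for $parent: character data with the UTF8 flag on.
my $parent = "<parent/>";
utf8::upgrade($parent);

# Turn the octets into characters first, then join like with like.
my $chars = decode("iso-8859-1", $return) . $parent;

# Encode exactly once, at the point where the text leaves the program.
my $output = encode("iso-8859-1", $chars);

print length($chars), " characters, ", length($output), " octets\n";
```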

    > In this case, there is data in $parent which comes via CGI, so I would
    > be able to believe an explanation along the lines of "$parent is
    > magically recognised as utf8, so when it is added to $return, $return
    > is converted to utf8 octets before they are joined", but I would find
    > this quite counterintuitive, since as I understand things perl uses
    > its own internal representation for strings, and should only need to
    > convert on the way in or out.


    Yup. However, if the module you are using to talk to the database
    and/or Apache hasn't been upgraded to 5.8 yet you will have to do
    those conversions 'at the borders' by hand. Pushing an :encoding layer
    onto your filehandles, perhaps with the 'open' pragma, may help
    automate this; although you are using mod_perl, which relies on tied
    filehandles: I don't know how well these play with PerlIO layers as
    yet. You may want to write a custom 'print', 'readline' &c. that runs
    all input through 'decode' and all output through 'encode'.
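
    As a rough illustration of the :encoding layer idea, here is a sketch
    that uses an in-memory filehandle in place of a real socket or file.
    It says nothing about how mod_perl's tied handles behave; the
    variable names are made up.

```perl
use strict;
use warnings;

# Write characters through an ISO-8859-1 encoding layer into a scalar.
my $buffer = '';
open(my $out, '>:encoding(iso-8859-1)', \$buffer)
    or die "Cannot open in-memory output handle: $!";

my $text = "\x{A3}5";
utf8::upgrade($text);       # simulate character data with the UTF8 flag on
print {$out} $text;
close($out) or die "Cannot close output handle: $!";

# $buffer now holds the encoded octets: \243 followed by "5".
print "octets written: ", length($buffer), "\n";

# Read the octets back through the same layer to get characters again.
open(my $in, '<:encoding(iso-8859-1)', \$buffer)
    or die "Cannot open in-memory input handle: $!";
my $line = <$in>;
close($in);

print "round trip ok\n" if $line eq $text;
```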

    Another thing to watch out for is that if any of your locale variables
    (LANG, LC_ALL, etc.) match /utf-?8/i then perl will assume all IO will
    be in UTF8 until you disillusion it. This feature has been removed in
    5.8.1, though, so it shouldn't be affecting your problem.

    An alternative solution, if you can afford to treat all data as
    'binary' rather than 'textual', is simply to put

    use bytes;

    at the top of every file :).

    Ben

    --
    Although few may originate a policy, we are all able to judge it.
    - Pericles of Athens, c.430 B.C.
     
    Ben Morrow, Nov 4, 2003
    #2

  3. [ this is a repost of a response which I accidentally emailed to Ben.
    Sorry Ben! ]

    In comp.lang.perl.misc, Ben Morrow wrote:

    > How perl stores the data internally should be considered none of your
    > business.

    Amen to that. I would far rather not need to know ;)

    > However, it may be that how it
    > is stored in your mysql database is confusing perl, if the code you
    > are using to interface to the database doesn't correctly decode the
    > data into perl's own encoding. In particular, if you use iso8859-1 you
    > may get bitten far more irregularly than if you use other encodings.


    I agree with this, except that the data from the database has been
    concatenated, regexed etc. with no problem, before a seemingly
    innocent line causes problems.

    [ snip good advice about dealing with encodings on the way into and
    out of the database ]

    >> In the second version of the bug, the line:
    >>
    >> return $return . $parent;
    >>
    >> resulted in a string being returned where all the pound signs in
    >> $return had been altered. If a different string to $parent is
    >> appended, there is no problem.

    >
    > So what does $parent contain, which causes this problem? And what is
    > the result of
    > use Encode qw/is_utf8/;
    > warn is_utf8($parent) ?
    > "\$parent is chars internally" :
    > "\$parent is bytes internally";


    Ah. Here I find that $parent is characters, while $return is bytes.
    This sort of explains things, except that it would mean that:

    - perl sees a sequence of characters and a sequence of bytes being
    concatenated
    - it converts the bytes to characters
    - it then concatenates two character sequences
    - it then forgets that this is now a character sequence, and treats
    the result as bytes.

    This does not seem like good behaviour - I'd be tempted to suggest
    it's a bug.

    Further, what would lead perl to treat one set of data as characters
    and one set as bytes? both strings are valid XML fragments (built up
    using data from CGI). The $parent string which is treated as
    characters contains only [A-Za-z0-9<>-='"/? ], so I can't see any
    reason for perl to suddenly decide it needs to be character data.

    (As a side note, your snippet implies that the is_utf8 flag indicates
    whether the data is to be treated as characters or as bytes, rather
    than indicating whether or not it is characters in the utf8 character
    set - could you clarify? )
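
    On the side note above, a small sketch of what the flag does and does
    not tell you, using only the built-in utf8:: functions: the flag
    records how perl stores the string internally, and two strings that
    differ only in the flag still compare equal and have the same length.
    The variable names are illustrative.

```perl
use strict;
use warnings;

my $unflagged = "\x{A3}";     # pound sign stored as one octet
my $flagged   = $unflagged;
utf8::upgrade($flagged);      # same character, stored as UTF-8 internally

my $same_text   = ($unflagged eq $flagged) ? 1 : 0;
my $same_length = (length($unflagged) == length($flagged)) ? 1 : 0;

print "unflagged is_utf8: ", (utf8::is_utf8($unflagged) ? 1 : 0), "\n";   # 0
print "flagged is_utf8:   ", (utf8::is_utf8($flagged) ? 1 : 0), "\n";     # 1
print "equal as strings:  $same_text\n";                                   # 1
print "same length:       $same_length\n";                                 # 1

# Downgrading restores the one-octet storage without changing the text.
my $restored = $flagged;
utf8::downgrade($restored);
print "restored is_utf8:  ", (utf8::is_utf8($restored) ? 1 : 0), "\n";    # 0
```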

    >> The current solution is:
    >>
    >> my $tmpParent = encode( "iso-8859-1", $parent );
    >> return $return . $tmpParent;

    >
    > This is almost certainly Wrong, as $tmpParent will here be considered
    > to be a string of octets rather than a sequence of characters. The
    > Right Answer is to make sure $return is considered to be a sequence of
    > characters as well.


    Yes, that makes sense ;)

    > Yup. However, if the module you are using to talk to the database
    > and/or Apache hasn't been upgraded to 5.8 yet you will have to do
    > those conversions 'at the borders' by hand.


    I will have a look into the modules we're using.

    > Another thing to watch out for is that if any of your locale variables
    > (LANG, LC_ALL, etc.) match /utf-?8/i then perl will assume all IO will
    > be in UTF8 until you disillusion it. This feature has been removed in
    > 5.8.1, though, so it shouldn't be affecting your problem.


    Yup, that's why I upgraded :)

    > An alternative solution, if you can afford to treat all data as
    > 'binary' rather than 'textual', is simply to put
    >
    > use bytes;
    >
    > at the top of every file :).


    Except that as soon as I did that, someone would decide we needed
    Unicode support ;)

    Thanks for your help,
    dave
     
    David Murray-Rust, Nov 7, 2003
    #3