Malformed utf8; where's the null byte coming from?

Discussion in 'Perl Misc' started by bill_mckinnon@interloper.net, Jun 28, 2006.

  1. Guest

    I've spent some time trying to understand Perl's Unicode support and
    its nuances, and I think I actually understand some amount of it. But
    the behavior of this snippet of code is puzzling me at the moment:

    --
    #!/usr/local/bin/perl -w

    use Encode qw(decode);

    $s = decode('utf8', "Version"); # String w/utf8 flag set
    $s =~ s/v\xc3\x83//i;
    --

    Running this with Perl 5.8.6 on Linux (and Windows) produces this
    warning:

    $ ./test.pl
    Malformed UTF-8 character (unexpected non-continuation byte 0x00,
    immediately after start byte 0xc3) in substitution (s///) at ./test.pl
    line 7.
    $

    Granted, what I'm trying to do is match the literal utf8 bytes
    for a Unicode character against a Unicode string, which may not be a
    reasonable thing to do. But the way this fails doesn't make any sense
    to me; I don't have a null byte after (or before) the \xc3 byte in my
    regex. Also, if the regex string were being upgraded to Unicode
    (presumably from iso-latin-1), I can see it not doing what I intended,
    but that shouldn't cause this warning; it should just fail to match the
    way I want. And if the \x sequences were taken to be code points instead
    of literal bytes, that's fine too...it may not do what I want, but it
    still shouldn't cause this warning.

    Does anyone know why this warning is coming up? It makes me think
    there's more going on under the surface than just an extra iso-latin-1
    -> utf8 conversion. Thanks in advance for any insight. :)

    - Bill

    P.S. - I can do the match I want by using the results of
    encode('utf8', $s) to do the match; since it's a byte
    string everything works fine. But I want to understand
    what the issue was with the warning. :)
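
The workaround in the P.S. can be sketched roughly as follows. This is only an illustration: the input octets and the variable names are assumptions, not taken from the original script. It shows that the same data is 8 characters once decoded but 9 octets once encoded, and that the \xc3\x83 pattern lines up with the octets only in the byte string.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative input (an assumption): raw UTF-8 octets for
# "V", U+00C3, then "ersion".
my $octets = "V\xc3\x83ersion";

# Decoding yields a character string; U+00C3 is now ONE character.
my $chars = decode('UTF-8', $octets);

# Encoding goes back to a byte string (no utf8 flag).
my $bytes = encode('UTF-8', $chars);

# On the byte string, \xc3\x83 in the pattern matches the two octets.
(my $stripped = $bytes) =~ s/v\xc3\x83//i;

printf "chars: %d, bytes: %d, result: %s\n",
    length($chars), length($bytes), $stripped;
```

The trade-off is that the result is again a byte string, so it would need another decode before being treated as text.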
     
    , Jun 28, 2006
    #1

  2. Ben Morrow Guest

    Quoth bill_mckinnon@interloper.net:
    > I've spent some time trying to understand Perl's Unicode support and
    > its nuances, and I think I actually understand some amount of it. But
    > the behavior of this snippet of code is puzzling me at the moment:
    >
    > --
    > #!/usr/local/bin/perl -w
    >
    > use Encode qw(decode);
    >
    > $s = decode('utf8', "Version"); # String w/utf8 flag set
    > $s =~ s/v\xc3\x83//i;
    > --
    >
    > Running this with Perl 5.8.6 on Linux (and Windows) produces this
    > warning:
    >
    > $ ./test.pl
    > Malformed UTF-8 character (unexpected non-continuation byte 0x00,
    > immediately after start byte 0xc3) in substitution (s///) at ./test.pl
    > line 7.
    > $


    Some more data points: 5.8.7 i686-linux

    1. There is no need for Encode.

    my $s = "foo";
    utf8::upgrade($s);

    works fine (in the sense that it fails).

    2. It only fails if the first character matches. This makes sense...

    3. It only fails if there are zero-or-one characters after the \xc3.
    Putting a second stops the warning.

    4. It still fails if the \xc3 is the first character (and the string is
    modified to match, obviously).

    5. The match does not have to be at the start of the string.

    6. \xf3 behaves the same way (the number of expected continuation bytes
    doesn't matter).

    I believe this is a bug: anyone else?
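
A minimal reproduction along the lines of the data points above might look like the sketch below. The warning handler and the variable names are additions for illustration. Whether any warning is collected depends on the Perl version; the thread reports it on 5.8.6 and 5.8.7, and this sketch makes no claim about other releases.

```perl
use strict;
use warnings;

# Collect warnings instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $s = "foo";
utf8::upgrade($s);    # same characters, but the utf8 flag is now set

# First character matches, one character follows \xc3, and /i is used.
my $matched = ($s =~ s/f\xc3\x83//i) ? 1 : 0;

print "matched: $matched\n";
print "warnings collected: ", scalar(@warnings), "\n";
print "  $_" for @warnings;
```

Dropping the /i flag or adding a second character after \xc3 makes it easy to check the remaining data points one at a time.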

    Of course, what you are trying to do is completely wrong :). /v\xc3\x83/
    is a regex which matches three characters, not two. The fact that those
    three, if expressed as bytes in iso8859-1, happen to look like the utf8
    for two characters is irrelevant. It seems perl is having something of
    the same confusion you are :)
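
To make the three-versus-two distinction concrete, here is a small sketch. The sample string is an assumption chosen for illustration, and the results described in the comments are what a Perl without the problem discussed in this thread should give.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Character string: V, U+00C3, e, r, s, i, o, n (illustrative input).
my $s = decode('UTF-8', "V\xc3\x83ersion");

# Three-character pattern: v, U+00C3, U+0083.
# The string has "e" where U+0083 would have to be, so this should fail.
my $three = ($s =~ /v\xc3\x83/i) ? 1 : 0;

# Two-character pattern: v followed by the single character U+00C3.
my $two = ($s =~ /v\x{c3}/i) ? 1 : 0;

print "three-character pattern matched: $three\n";
print "two-character pattern matched: $two\n";
```

Against a decoded string, the pattern has to be written in terms of characters, so U+00C3 is spelled \x{c3} rather than as its two UTF-8 octets.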

    Ben

    --
    I must not fear. Fear is the mind-killer. I will face my fear and
    I will let it pass through me. When the fear is gone there will be
    nothing. Only I will remain.
    Frank Herbert, 'Dune'
     
    Ben Morrow, Jun 28, 2006
    #2

  3. Guest

    Ben Morrow wrote:
    > Of course, what you are trying to do is completely wrong :). /v\xc3\x83/
    > is a regex which matches three characters, not two. The fact that those
    > three, if expressed as bytes in iso8859-1, happen to look like the utf8
    > for two characters is irrelevant. It seems perl is having something of
    > the same confusion you are :)


    Yep, agreed...I was initially feeding the s/// data that DIDN'T have
    the utf8 flag set even though it was real utf8 data, and this of course
    works ok. At some point the regex got string data that did have the
    utf8 flag set, and then it didn't work right and got this warning...and
    I wondered what was up with the warning. : )
    Also, interestingly enough the regex was trying to match a utf8 byte
    stream that had been incorrectly interpreted as iso-8859-1 and then
    re-encoded as utf8. : ) Funny how these things happen...
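
The kind of double encoding described here can be reproduced in a few lines. This is a sketch with a made-up sample character (U+00E9), not the actual data from the original program. It suggests where a \xc3\x83 pair can come from: the first octet of a UTF-8 sequence, \xc3, is misread as the character U+00C3 and then encoded again.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: one character, U+00E9 (e with acute accent).
my $original = "\x{e9}";

# Step 1: correct UTF-8 encoding gives the two octets C3 A9.
my $utf8_octets = encode('UTF-8', $original);

# Step 2: the mistake. Reading those octets as ISO-8859-1 turns each
# octet into its own character: U+00C3 and U+00A9.
my $misread = decode('ISO-8859-1', $utf8_octets);

# Step 3: encoding the misread text as UTF-8 again gives four octets.
my $double = encode('UTF-8', $misread);

# Repair by undoing steps 3 and 2 in reverse order.
my $repaired = decode('UTF-8', encode('ISO-8859-1', decode('UTF-8', $double)));

printf "double-encoded octets: %s\n",
    join ' ', map { sprintf('%02X', ord($_)) } split(//, $double);
printf "repaired code point: U+%04X\n", ord($repaired);
```

This prints the octets C3 83 C2 A9 and then U+00E9. The repair only works when every step of the damage is known; it is not a general fix for mixed or partially converted data.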

    - Bill
     
    , Jun 28, 2006
    #3
  4. Mumia W. Guest

    bill_mckinnon@interloper.net wrote:
    > I've spent some time trying to understand Perl's Unicode support and
    > its nuances, and I think I actually understand some amount of it. But
    > the behavior of this snippet of code is puzzling me at the moment:
    >
    > --
    > #!/usr/local/bin/perl -w
    >
    > use Encode qw(decode);
    >
    > $s = decode('utf8', "Version"); # String w/utf8 flag set
    > $s =~ s/v\xc3\x83//i;
    > --
    > [...]


    I was able to eliminate the warning with "use encoding 'utf8'", but
    then there is a problem with the substitution.

    use Encode qw(decode);
    use encoding 'utf8';
    my $s;

    # rx is "vÃ"
    my $rx = qq{"v\xc3\x83"};
    $s = decode('utf8', "V\x{c3}\x{83}ersion"); # String w/utf8 flag set
    print 'rx : ', $rx, "\n";
    print 'before: ', $s, "\n";
    $s =~ s/v\xc3\x83//i;
    print 'after : ', $s, "\n";

    __END__

    This prints this:
    rx : "vÃ"
    before: V�ersion
    after : �ersion


    Notice that the "�" wasn't substituted even though the 'V' was. Why?
     
    Mumia W., Jun 29, 2006
    #4
  5. Ben Morrow Guest

    Quoth Ben Morrow <>:
    >
    > Quoth :
    > > I've spent some time trying to understand Perl's Unicode support and
    > > its nuances, and I think I actually understand some amount of it. But
    > > the behavior of this snippet of code is puzzling me at the moment:
    > >
    > > --
    > > #!/usr/local/bin/perl -w
    > >
    > > use Encode qw(decode);
    > >
    > > $s = decode('utf8', "Version"); # String w/utf8 flag set
    > > $s =~ s/v\xc3\x83//i;
    > > --
    > >
    > > Running this with Perl 5.8.6 on Linux (and Windows) produces this
    > > warning:
    > >
    > > $ ./test.pl
    > > Malformed UTF-8 character (unexpected non-continuation byte 0x00,
    > > immediately after start byte 0xc3) in substitution (s///) at ./test.pl
    > > line 7.
    > > $

    >
    > Some more data points: 5.8.7 i686-linux
    >
    > 1. There is no need for Encode.
    >
    > my $s = "foo";
    > utf8::upgrade($s);
    >
    > works fine (in the sense that it fails).
    >
    > 2. It only fails if the first character matches. This makes sense...
    >
    > 3. It only fails if there are zero-or-one characters after the \xc3.
    > Putting a second stops the warning.
    >
    > 4. It still fails if the \xc3 is the first character (and the string is
    > modified to match, obviously).
    >
    > 5. The match does not have to be at the start of the string.
    >
    > 6. \xf3 behaves the same way (the number of expected continuation bytes
    > doesn't matter).


    Sorry, one more:

    7. The warning only occurs when the /i flag is used.

    > I believe this is a bug: anyone else?


    Ben

    --
    And if you wanna make sense / Whatcha looking at me for? (Fiona Apple)
     
    Ben Morrow, Jun 29, 2006
    #5
  6. Ben Morrow Guest

    Quoth "Mumia W." <>:
    > wrote:
    > > I've spent some time trying to understand Perl's Unicode support and
    > > its nuances, and I think I actually understand some amount of it. But
    > > the behavior of this snippet of code is puzzling me at the moment:
    > >
    > > --
    > > #!/usr/local/bin/perl -w
    > >
    > > use Encode qw(decode);
    > >
    > > $s = decode('utf8', "Version"); # String w/utf8 flag set
    > > $s =~ s/v\xc3\x83//i;
    > > --
    > > [...]

    >
    > I was able to eliminate the warning by using "use encoding 'utf8'," but
    > there is a problem with the substitution.
    >
    > use Encode qw(decode);
    > use encoding 'utf8';
    > my $s;
    >
    > # rx is "vÃ"
    > my $rx = qq{"v\xc3\x83"};
    > $s = decode('utf8', "V\x{c3}\x{83}ersion"); # String w/utf8 flag set


    These two do not match. The regex matches a 3-char string; $s (after
    decoding) has only one char between the V and the e.
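
One way to see the mismatch is to list the code points on both sides. The sketch below assumes plain decoding without the `encoding` pragma (under the pragma the string literal itself is treated differently), and `code_points` is a hypothetical helper written for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: render a string as a list of U+XXXX code points.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $str);
}

# The text of the pattern, taken as characters: v, U+00C3, U+0083.
my $pattern_text = "v\xc3\x83";

# The decoded subject string: the two octets become one character.
my $s = decode('UTF-8', "V\xc3\x83ersion");

print "pattern: ", code_points($pattern_text), "\n";
print "string : ", code_points($s), "\n";
```

The pattern side comes out as U+0076 U+00C3 U+0083, while the string starts with U+0056 U+00C3 U+0065, so the third pattern character has nothing to match.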

    > print 'rx : ', $rx, "\n";
    > print 'before: ', $s, "\n";
    > $s =~ s/v\xc3\x83//i;
    > print 'after : ', $s, "\n";
    >
    > __END__
    >
    > This prints this:
    > rx : "vÃ"
    > before: V�ersion
    > after : �ersion
    >
    > Notice that the "�" wasn't substituted even though the 'V' was. Why?


    Again, I think it's a bug. No substitution should have occurred, as the
    regex didn't match.

    Ben

    --
    I touch the fire and it freezes me,
    I look into it and it's black.
    Why can't I feel? My skin should crack and peel---
    I want the fire back... Buffy, 'Once More With Feeling'
     
    Ben Morrow, Jun 29, 2006
    #6
  7. Guest

    Ben Morrow wrote:

    > Again, I think it's a bug. No substitution should have occurred, as the
    > regex didn't match.


    Lacking any reasonable explanation to the contrary, this is my
    theory too. : ) It looks like "perlbug" is the recommended way of
    reporting bugs in Perl...I'll try to run through this at some point (I
    should probably confirm it happens on the latest and greatest Perl,
    etc). Thanks for the responses...

    - Bill
     
    , Jun 29, 2006
    #7
