Malformed utf8; where's the null byte coming from?

bill_mckinnon · Jun 28, 2006

I've spent some time trying to understand Perl's Unicode support and
its nuances, and I think I actually understand some amount of it. But
the behavior of this snippet of code is puzzling me at the moment:

--
#!/usr/local/bin/perl -w

use Encode qw(decode);

$s = decode('utf8', "Version"); # String w/utf8 flag set
$s =~ s/v\xc3\x83//i;
--

Running this with Perl 5.8.6 on Linux (and Windows) produces this
warning:

$ ./test.pl
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xc3) in substitution (s///) at ./test.pl
line 7.
$

Granted, what I'm trying to do is to match the literal utf8 bytes
for a Unicode character against a Unicode string, which may not be a
reasonable thing to do. But the way this fails doesn't make any sense
to me; I don't have a null byte after (or before) the \xc3 byte in my
regex. Also, if the regex string was being upgraded to Unicode
(presumably from iso-latin-1) I can see it not doing what I intended,
but this shouldn't cause this error; it should just not match the way I
want. And then if the \x sequences were taken to be code points instead
of literal bytes then that's fine...it may not do what I want, but it
still shouldn't cause this warning.

Does anyone know why this warning is coming up? It makes me think
there's more going on under the surface than just an extra iso-latin-1
-> utf8 conversion. Thanks in advance for any insight.

- Bill

P.S. - I can do the match I want by using the results of
encode('utf8', $s) to do the match; since it's a byte
string everything works fine. But I want to understand
what the issue was with the warning.
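
For readers following along, the workaround in the P.S. might look roughly like the sketch below. It is not the original script: the variable names and the sample input (which contains the bytes C3 83 so that the pattern has something to match) are assumptions made for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample input (assumed for illustration): the octets "V", C3, 83, "ersion".
my $chars = decode('UTF-8', "V\xc3\x83ersion");   # character string, utf8 flag set
my $bytes = encode('UTF-8', $chars);              # octet string, no utf8 flag

# Byte pattern applied to a byte string: \xc3 and \x83 are compared as raw octets.
(my $stripped = $bytes) =~ s/V\xc3\x83//;

print length($chars), " ", length($bytes), " ", $stripped, "\n";   # 8 9 ersion
```

The point of the design is that the pattern and the target are both plain octets, so no upgrade to characters happens on either side of the match.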

Ben Morrow · Jun 28, 2006

Quoth (e-mail address removed):

I've spent some time trying to understand Perl's Unicode support and
its nuances, and I think I actually understand some amount of it. But
the behavior of this snippet of code is puzzling me at the moment:

--
#!/usr/local/bin/perl -w

use Encode qw(decode);

$s = decode('utf8', "Version"); # String w/utf8 flag set
$s =~ s/v\xc3\x83//i;
--

Running this with Perl 5.8.6 on Linux (and Windows) produces this
warning:

$ ./test.pl
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xc3) in substitution (s///) at ./test.pl
line 7.
$

Some more data points: 5.8.7 i686-linux

1. There is no need for Encode.

my $s = "foo";
utf8::upgrade($s);

works fine (in the sense that it fails).

2. It only fails if the first character matches. This makes sense...

3. It only fails if there are zero-or-one characters after the \xc3.
Putting a second stops the warning.

4. It still fails if the \xc3 is the first character (and the string is
modified to match, obviously).

5. The match does not have to be at the start of the string.

6. \xf3 behaves the same way (the number of expected continuation bytes
doesn't matter).

I believe this is a bug: anyone else?

Of course, what you are trying to do is completely wrong. /v\xf3\x83/
is a regex which matches three characters, not two. The fact that those
three, if expressed as bytes in iso8859-1, happen to look like the utf8
for two characters is irrelevant. It seems perl is having something of
the same confusion you are.

Ben
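
The "three characters, not two" point can be checked with a small sketch like the one below. It uses the \xc3 pattern from the original post, avoids /i so that it does not depend on the warning under discussion, and the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $as_written = "v\xc3\x83";                   # three characters: v, U+00C3, U+0083
my $as_utf8    = decode('UTF-8', "v\xc3\x83");  # two characters: v, U+00C3

print length($as_written), " ", length($as_utf8), "\n";   # 3 2

# Setting the internal utf8 flag does not change the character count.
my $upgraded = $as_written;
utf8::upgrade($upgraded);

# The regex asks for three characters, so only the three-character string matches.
print( ($upgraded =~ /v\xc3\x83/ ? "match" : "no match"), "\n" );   # match
print( ($as_utf8  =~ /v\xc3\x83/ ? "match" : "no match"), "\n" );   # no match
```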

bill_mckinnon · Jun 28, 2006

Ben said:
Of course, what you are trying to do is completely wrong. /v\xf3\x83/
is a regex which matches three characters, not two. The fact that those
three, if expressed as bytes in iso8859-1, happen to look like the utf8
for two characters is irrelevant. It seems perl is having something of
the same confusion you are

Yep, agreed...I was initially feeding the s/// data that DIDN'T have
the utf8 flag set even though it was real utf8 data, and this of course
works ok. At some point the regex got string data that did have the
utf8 flag set, and then it didn't work right and got this warning...and
I wondered what was up with the warning. : )

Also, interestingly enough the regex was trying to match a utf8 byte
stream that had been incorrectly interpreted as iso-8859-1 and then
re-encoded as utf8. : ) Funny how these things happen...
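
A minimal sketch of that kind of double encoding, assuming the mis-decoded text started out as a single "é" (U+00E9); the sample character and the variable names are illustrative only and are not taken from the data described above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $original = "\x{e9}";                        # one character, U+00E9
my $utf8     = encode('UTF-8', $original);      # correct encoding: bytes c3 a9
my $misread  = decode('ISO-8859-1', $utf8);     # wrongly read as two characters
my $doubled  = encode('UTF-8', $misread);       # re-encoded: bytes c3 83 c2 a9

print unpack('H*', $utf8), " ", unpack('H*', $doubled), "\n";   # c3a9 c383c2a9

# Undo the damage by reversing the steps: decode, re-encode as Latin-1, decode again.
my $repaired = decode('UTF-8', encode('ISO-8859-1', decode('UTF-8', $doubled)));
print( ($repaired eq $original ? "round trip ok" : "round trip failed"), "\n" );
```

Note that the doubled form begins with the bytes C3 83, which is the sequence the regex in this thread was written to find.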

- Bill

Mumia W. · Jun 29, 2006

bill_mckinnon said:
I've spent some time trying to understand Perl's Unicode support and
its nuances, and I think I actually understand some amount of it. But
the behavior of this snippet of code is puzzling me at the moment:

--
#!/usr/local/bin/perl -w

use Encode qw(decode);

$s = decode('utf8', "Version"); # String w/utf8 flag set
$s =~ s/v\xc3\x83//i;

I was able to eliminate the warning by adding "use encoding 'utf8';", but
there is a problem with the substitution.

use Encode qw(decode);
use encoding 'utf8';
my $s;

# rx is "vÃƒ"
my $rx = qq{"v\xc3\x83"};
$s = decode('utf8', "V\x{c3}\x{83}ersion"); # String w/utf8 flag set
print 'rx : ', $rx, "\n";
print 'before: ', $s, "\n";
$s =~ s/v\xc3\x83//i;
print 'after : ', $s, "\n";

__END__

This prints this:
rx : "vÃƒ"
before: Vï¿½ersion
after : ï¿½ersion

Notice that the "ï¿½" wasn't substituted even though the 'V' was. Why?

Ben Morrow · Jun 29, 2006

Quoth Ben Morrow:
Quoth (e-mail address removed):

Some more data points: 5.8.7 i686-linux

1. There is no need for Encode.

my $s = "foo";
utf8::upgrade($s);

works fine (in the sense that it fails).

2. It only fails if the first character matches. This makes sense...

3. It only fails if there are zero-or-one characters after the \xc3.
Putting a second stops the warning.

4. It still fails if the \xc3 is the first character (and the string is
modified to match, obviously).

5. The match does not have to be at the start of the string.

6. \xf3 behaves the same way (the number of expected continuation bytes
doesn't matter).

Sorry, one more:

7. The warning only occurs when the /i flag is used.

I believe this is a bug: anyone else?
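
For anyone who wants to try this on another perl, data points 1 and 7 can be combined into a small harness like the sketch below. The variable names and the warning-capture approach are assumptions for illustration; no expected warning count is shown because that depends on the perl version being tested.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be counted afterwards.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

# Data point 1: utf8::upgrade is enough, Encode is not needed.
my $plain = "Version";
utf8::upgrade($plain);
my $fold = "Version";
utf8::upgrade($fold);

# Data point 7: the same substitution without and with /i.
$plain =~ s/V\xc3\x83//;
$fold  =~ s/v\xc3\x83//i;

print "perl version: $]\n";
print "warnings seen: ", scalar(@warnings), "\n";
print "  $_" for @warnings;

# Neither pattern can match "Version", so both strings should be unchanged.
print "plain: $plain\n";
print "fold : $fold\n";
```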

Ben

Ben Morrow · Jun 29, 2006

Quoth "Mumia W.":
I was able to eliminate the warning by adding "use encoding 'utf8';", but
there is a problem with the substitution.

use Encode qw(decode);
use encoding 'utf8';
my $s;

# rx is "vÃƒ"
my $rx = qq{"v\xc3\x83"};
$s = decode('utf8', "V\x{c3}\x{83}ersion"); # String w/utf8 flag set

These two do not match. The regex matches a 3-char string; $s (after
decoding) has only one char between the V and the e.

print 'rx : ', $rx, "\n";
print 'before: ', $s, "\n";
$s =~ s/v\xc3\x83//i;
print 'after : ', $s, "\n";

__END__

This prints this:
rx : "vÃƒ"
before: Vï¿½ersion
after : ï¿½ersion

Notice that the "ï¿½" wasn't substituted even though the 'V' was. Why?

Again, I think it's a bug. No substitution should have occurred, as the
regex didn't match.

Ben

bill_mckinnon · Jun 29, 2006

Ben said:
Again, I think it's a bug. No substitution should have occurred, as the
regex didn't match.

Lacking any reasonable explanation to the contrary, this is my
theory too. : ) It looks like "perlbug" is the recommended way of
reporting bugs in Perl...I'll try to run through this at some point (I
should probably confirm it happens on the latest and greatest Perl,
etc). Thanks for the responses...

- Bill
