Malformed utf8; where's the null byte coming from?

bill_mckinnon

I've spent some time trying to understand Perl's Unicode support and
its nuances, and I think I actually understand some amount of it. But
the behavior of this snippet of code is puzzling me at the moment:

--
#!/usr/local/bin/perl -w

use Encode qw(decode);

$s = decode('utf8', "Version"); # String w/utf8 flag set
$s =~ s/v\xc3\x83//i;
--

Running this with Perl 5.8.6 on Linux (and Windows) produces this
warning:

$ ./test.pl
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xc3) in substitution (s///) at ./test.pl
line 7.
$

Granted, what I'm trying to do is to match the literal utf8 bytes
for a Unicode character against a Unicode string, which may not be a
reasonable thing to do. But the way this fails doesn't make any sense
to me; I don't have a null byte after (or before) the \xc3 byte in my
regex. Also, if the regex string was being upgraded to Unicode
(presumably from iso-latin-1) I can see it not doing what I intended,
but this shouldn't cause this error; it should just not match the way I
want. And then if the \x sequences were taken to be code points instead
of literal bytes then that's fine...it may not do what I want, but it
still shouldn't cause this warning.

Does anyone know why this warning is coming up? It makes me think
there's more going on under the surface than just an extra iso-latin-1
-> utf8 conversion. Thanks in advance for any insight. :)

- Bill

P.S. - I can do the match I want by using the results of
encode('utf8', $s) to do the match; since it's a byte
string everything works fine. But I want to understand
what the issue was with the warning. :)
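
For anyone who wants to try the byte-level workaround from the P.S., a minimal sketch might look like the following. The sample string "V\xc3\x83ersion" and the variable names are assumptions chosen so that the pattern has something to match; they are not the data from the script above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: a character string holding U+00C3 after the "V".
my $chars = decode('UTF-8', "V\xc3\x83ersion");

# Encode back to UTF-8 bytes, so that \xc3\x83 in the pattern is
# compared byte for byte against the string.
my $bytes = encode('UTF-8', $chars);

(my $stripped = $bytes) =~ s/v\xc3\x83//i;
print "$stripped\n";    # ersion
```

The result is still a byte string, so it would need a decode() again before being treated as characters.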
 
Ben Morrow

Quoth (e-mail address removed):
> I've spent some time trying to understand Perl's Unicode support and
> its nuances, and I think I actually understand some amount of it. But
> the behavior of this snippet of code is puzzling me at the moment:
>
> --
> #!/usr/local/bin/perl -w
>
> use Encode qw(decode);
>
> $s = decode('utf8', "Version"); # String w/utf8 flag set
> $s =~ s/v\xc3\x83//i;
> --
>
> Running this with Perl 5.8.6 on Linux (and Windows) produces this
> warning:
>
> $ ./test.pl
> Malformed UTF-8 character (unexpected non-continuation byte 0x00,
> immediately after start byte 0xc3) in substitution (s///) at ./test.pl
> line 7.
> $

Some more data points, from 5.8.7 on i686-linux:

1. There is no need for Encode.

my $s = "foo";
utf8::upgrade($s);

works fine (in the sense that it fails).

2. It only fails if the first character matches. This makes sense...

3. It only fails if there are zero-or-one characters after the \xc3.
Putting a second stops the warning.

4. It still fails if the \xc3 is the first character (and the string is
modified to match, obviously).

5. The match does not have to be at the start of the string.

6. \xf3 behaves the same way (the number of expected continuation bytes
doesn't matter).

I believe this is a bug: anyone else?

Of course, what you are trying to do is completely wrong :). /v\xc3\x83/
is a regex which matches three characters, not two. The fact that those
three, if expressed as bytes in iso8859-1, happen to look like the utf8
for two characters is irrelevant. It seems perl is having something of
the same confusion you are :)
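
A small sketch of the point above, comparing character counts. The variable names are made up for illustration, and the printed match results depend on running a Perl that does not show the problem discussed in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decoding the UTF-8 bytes C3 83 yields a single character, U+00C3.
my $decoded = decode('UTF-8', "V\xc3\x83");
printf "decoded: %d characters\n", length $decoded;    # 2

# Written as escapes, \xc3 and \x83 are two separate code points,
# so the text the pattern describes is three characters long.
my $literal = "v\xc3\x83";
printf "literal: %d characters\n", length $literal;    # 3

# A three-character target (V, U+00C3, U+0083) stored with the utf8 flag on.
my $three = "V\x{c3}\x{83}ersion";
utf8::upgrade($three);

my $decoded_matches = $decoded =~ /v\xc3\x83/i ? 1 : 0;
my $three_matches   = $three   =~ /v\xc3\x83/i ? 1 : 0;
print "decoded string matches: $decoded_matches\n";
print "three-char string matches: $three_matches\n";
```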

Ben
 
bill_mckinnon

Ben said:
> Of course, what you are trying to do is completely wrong :). /v\xc3\x83/
> is a regex which matches three characters, not two. The fact that those
> three, if expressed as bytes in iso8859-1, happen to look like the utf8
> for two characters is irrelevant. It seems perl is having something of
> the same confusion you are :)

Yep, agreed...I was initially feeding the s/// data that DIDN'T have
the utf8 flag set even though it was real utf8 data, and this of course
works ok. At some point the regex got string data that did have the
utf8 flag set, and then it didn't work right and got this warning...and
I wondered what was up with the warning. : )
Also, interestingly enough the regex was trying to match a utf8 byte
stream that had been incorrectly interpreted as iso-8859-1 and then
re-encoded as utf8. : ) Funny how these things happen...
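
To illustrate how bytes like \xc3\x83 can arise from that kind of double encoding, here is a rough sketch. The starting character (U+00E9) is purely an assumption for the example; the actual data is not shown in this thread.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical original data: the UTF-8 bytes for U+00E9.
my $utf8_bytes = "\xc3\xa9";

# Misreading those bytes as ISO-8859-1 gives two characters: U+00C3 U+00A9.
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Re-encoding the misread text as UTF-8 doubles up the lead byte.
my $double = encode('UTF-8', $misread);

my $hex = join ' ', map { sprintf('%02X', ord($_)) } split(//, $double);
print "$hex\n";    # C3 83 C2 A9
```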

- Bill
 
Mumia W.

> I've spent some time trying to understand Perl's Unicode support and
> its nuances, and I think I actually understand some amount of it. But
> the behavior of this snippet of code is puzzling me at the moment:
>
> --
> #!/usr/local/bin/perl -w
>
> use Encode qw(decode);
>
> $s = decode('utf8', "Version"); # String w/utf8 flag set
> $s =~ s/v\xc3\x83//i;

I was able to eliminate the warning by adding "use encoding 'utf8';", but
then there is a problem with the substitution.

use Encode qw(decode);
use encoding 'utf8';
my $s;

# rx is "vÃ"
my $rx = qq{"v\xc3\x83"};
$s = decode('utf8', "V\x{c3}\x{83}ersion"); # String w/utf8 flag set
print 'rx : ', $rx, "\n";
print 'before: ', $s, "\n";
$s =~ s/v\xc3\x83//i;
print 'after : ', $s, "\n";

__END__

This prints this:
rx : "vÃ"
before: V�ersion
after : �ersion


Notice that the "�" wasn't substituted even though the 'V' was. Why?
 
Ben Morrow

Quoth Ben Morrow:
> Quoth (e-mail address removed):
>
> Some more data points: 5.8.7 i686-linux
>
> 1. There is no need for Encode.
>
> my $s = "foo";
> utf8::upgrade($s);
>
> works fine (in the sense that it fails).
>
> 2. It only fails if the first character matches. This makes sense...
>
> 3. It only fails if there are zero-or-one characters after the \xc3.
> Putting a second stops the warning.
>
> 4. It still fails if the \xc3 is the first character (and the string is
> modified to match, obviously).
>
> 5. The match does not have to be at the start of the string.
>
> 6. \xf3 behaves the same way (the number of expected continuation bytes
> doesn't matter).

Sorry, one more:

7. The warning only occurs when the /i flag is used.

> I believe this is a bug: anyone else?
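
Putting the data points together, a reduced test case could be sketched roughly as below. The warning handler and variable names are my own additions for illustration; whether a warning is actually recorded depends on the Perl version, and this thread only reports it for 5.8.6 and 5.8.7.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so the result can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $s = "Version";
utf8::upgrade($s);    # no Encode needed, only the utf8 flag

# Case-insensitive match whose first character matches the string.
(my $copy = $s) =~ s/v\xc3\x83//i;

print @warnings ? "warned: $warnings[0]" : "no warning\n";
print "result: $copy\n";
```

If the pattern is handled correctly, nothing matches and the string comes back unchanged.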

Ben
 
Ben Morrow

Quoth "Mumia W.":
> I was able to eliminate the warning by using "use encoding 'utf8'," but
> there is a problem with the substitution.
>
> use Encode qw(decode);
> use encoding 'utf8';
> my $s;
>
> # rx is "vÃ"
> my $rx = qq{"v\xc3\x83"};
> $s = decode('utf8', "V\x{c3}\x{83}ersion"); # String w/utf8 flag set

These two do not match. The regex matches a 3-char string; $s (after
decoding) has only one char between the V and the e.

> print 'rx : ', $rx, "\n";
> print 'before: ', $s, "\n";
> $s =~ s/v\xc3\x83//i;
> print 'after : ', $s, "\n";
>
> __END__
>
> This prints this:
> rx : "vÃ"
> before: V�ersion
> after : �ersion
>
> Notice that the "�" wasn't substituted even though the 'V' was. Why?

Again, I think it's a bug. No substitution should have occurred, as the
regex didn't match.

Ben
 
bill_mckinnon

Ben said:
> Again, I think it's a bug. No substitution should have occurred, as the
> regex didn't match.

Lacking any reasonable explanation to the contrary, this is my
theory too. : ) It looks like "perlbug" is the recommended way of
reporting bugs in Perl...I'll try to run through this at some point (I
should probably confirm it happens on the latest and greatest Perl,
etc). Thanks for the responses...

- Bill
 
