utf8 in regexp (perl 5.8.1)

Wes Groleau · Apr 11, 2005

I have a file containing thousands of Spanish words, encoded (AFAIK)
in UTF-8. I also have a perl script in UTF-8, which says (hope
pasting works):

#!/usr/bin/perl -w -CSD
#
# NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
use warnings;
use strict;
use utf8;

while (<>)
{
    print if ( /ñ/ )
}

What is in the regexp is supposed to be "small n with tilde"
and I verified with od -xc that it is hex C3 B1 as is every
place in the file where that letter appears.
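
For anyone who wants to repeat the byte check from inside perl rather than with od, here is a minimal sketch. The helper name utf8_hex is made up for illustration. It writes the letter as the escape \x{F1} so the result does not depend on how the script file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: return the UTF-8 bytes of a character string
# as space-separated uppercase hex.
sub utf8_hex {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# U+00F1 is "small n with tilde".
print utf8_hex("\x{F1}"), "\n";   # prints "C3 B1"
```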

The script is intended to find all words containing that
letter. But it finds nothing. After wading through gallons
of text (man encoding, man utf8, man perlunicode, etc.),
I still had no reason to think it was wrong. But I added

use encoding "utf8";

and ran it again, getting only:

Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde. In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.

Have I found a bug in perl or is my ignorance just getting
the best of me?

Oh, yeah, I also tried a few things with 'binmode' that didn't
work either.
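
For reference, one way to make the decoding explicit instead of relying on the -C switch is to put an :encoding(UTF-8) layer on the input handle. The following is only a sketch under assumptions: the helper name lines_with_enye and the three sample words are invented, and an in-memory filehandle stands in for the real word list.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of $fh that contain U+00F1.
sub lines_with_enye {
    my ($fh) = @_;
    my @hits;
    while (my $line = <$fh>) {
        push @hits, $line if $line =~ /\x{F1}/;
    }
    return @hits;
}

# Stand-in for the word list: raw UTF-8 bytes held in memory.
my $bytes = "ano\nma\xC3\xB1ana\nni\xC3\xB1o\n";

# The :encoding(UTF-8) layer decodes bytes into characters on read.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open: $!";
my @hits = lines_with_enye($fh);
close $fh;

# Encode characters back to UTF-8 on the way out.
binmode STDOUT, ':encoding(UTF-8)';
print @hits;
```

For standard input the analogous call would be binmode(STDIN, ':encoding(UTF-8)'). Whether this behaves the same way on 5.8.1 with files read through <> was not tested here.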

WWG

Wes Groleau · Apr 12, 2005

Wes said:
[problems with]
use utf8;

while (<>)
{
    print if ( /ñ/ )
}

I removed "use utf8" and it worked. So I think it's
a bug, especially since

use encoding "utf8";
caused

Malformed UTF-8 character (unexpected non-continuation byte 0x00,
immediately after start byte 0xde) at
/Volumes/Parents/wgroleau/bin/char-find line 12.

?!? According to 'od -xc' the script does NOT contain any
byte that is 0xde. In fact, the ONLY bytes in the script that
are not ASCII are the bytes for the "enye" which are on line
twelve, but neither of them is a DE and NO bytes are 00.
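
One reading that is consistent with the behavior reported above (not a confirmed diagnosis) is that the input lines were reaching the regexp as undecoded bytes. Under "use utf8" the pattern means the single character U+00F1, which never appears in a byte string. Without it the pattern is the two bytes C3 B1, which do appear. The sketch below illustrates that difference with escapes, so it does not depend on how the script file is saved. The variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as read with no decoding layer: "manana" with enye, in UTF-8.
my $raw = "ma\xC3\xB1ana";

# Character string obtained by decoding those bytes.
my $text = decode('UTF-8', $raw);

# What the literal means under "use utf8": one character, U+00F1.
my $char_pattern = qr/\x{F1}/;

# What the literal means without "use utf8": two bytes, C3 B1.
my $byte_pattern = qr/\xC3\xB1/;

for my $case (
    [ 'raw bytes vs char pattern', $raw,  $char_pattern ],
    [ 'raw bytes vs byte pattern', $raw,  $byte_pattern ],
    [ 'decoded vs char pattern',   $text, $char_pattern ],
    [ 'decoded vs byte pattern',   $text, $byte_pattern ],
) {
    my ($label, $string, $pattern) = @$case;
    printf "%-26s %s\n", $label, ($string =~ $pattern ? 'match' : 'no match');
}
```

Only the second and third cases match: bytes against a byte pattern, or decoded text against a character pattern. Mixing the two fails silently.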

--
Wes Groleau

A pessimist says the glass is half empty.

An optimist says the glass is half full.

An engineer says somebody made the glass
twice as big as it needed to be.
