utf8 in regexp (perl 5.8.1)

Discussion in 'Perl' started by Wes Groleau, Apr 11, 2005.

  1. Wes Groleau

    Wes Groleau Guest

    I have a file containing thousands of Spanish words, encoded (AFAIK)
    in UTF-8. I also have a perl script in UTF-8, which says (hope
    pasting works):

    #!/usr/bin/perl -w -CSD
    #
    # NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
    use warnings;
    use strict;
    use utf8;

    while (<>)
    {
    print if ( /ñ/ )
    }

    What is in the regexp is supposed to be "small n with tilde"
    and I verified with od -xc that it is hex C3 B1 as is every
    place in the file where that letter appears.
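
A minimal sketch of the byte-versus-character distinction that seems to be in play here. It assumes (this is not confirmed anywhere in the thread) that the input lines reach the regexp as undecoded bytes despite the -CSD switch. Under "use utf8" the literal in the pattern is the single character U+00F1, while an undecoded line holds the two bytes C3 B1, so the two cannot match. The variable names and the sample word are hypothetical.

```perl
use strict;
use warnings;
use utf8;                  # literals in this source file are decoded from UTF-8
use Encode qw(decode);

# Hypothetical sample: the word "año" as raw UTF-8 bytes, the way an
# undecoded filehandle would deliver it (C3 B1 is the pair seen in od -xc).
my $raw = "a\xC3\xB1o";

# Assumption: an explicit decode stands in for a working UTF-8 input layer.
my $decoded = decode('UTF-8', $raw);

# Under "use utf8" the pattern below is one character, U+00F1.
my $raw_matches     = ($raw     =~ /ñ/) ? 1 : 0;
my $decoded_matches = ($decoded =~ /ñ/) ? 1 : 0;

printf "raw:     length=%d match=%d\n", length($raw),     $raw_matches;
printf "decoded: length=%d match=%d\n", length($decoded), $decoded_matches;
```

If the assumption holds, the raw string should report one more unit of length than the decoded one and should fail to match, which is the symptom described above.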

    The script is intended to find all words containing that
    letter. But it finds nothing. After wading through gallons
    of text (man encoding, man utf8, man perlunicode, etc.),
    I still had no reason to think it was wrong. But I added

    use encoding "utf8";

    and ran it again, getting only:

    Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
    Malformed UTF-8 character (unexpected non-continuation byte 0x00,
    immediately after start byte 0xde) at
    /Volumes/Parents/wgroleau/bin/char-find line 12.

    ?!? According to 'od -xc' the script does NOT contain any
    byte that is 0xde. In fact, the ONLY bytes in the script that
    are not ASCII are the bytes for the "enye" which are on line
    twelve, but neither of them is a DE and NO bytes are 00.

    Have I found a bug in perl or is my ignorance just getting
    the best of me?

    Oh, yeah, I also tried a few things with 'binmode' that didn't
    work either.

    WWG
    Wes Groleau, Apr 11, 2005
    #1

  2. Wes Groleau

    Wes Groleau Guest

    Wes Groleau wrote:
    > [problems with]
    > use utf8;
    >
    > while (<>)
    > {
    > print if ( /ñ/ )
    > }


    I removed "use utf8" and it worked. So I think it's
    a bug, especially since

    > use encoding "utf8";


    caused

    > Malformed UTF-8 character (unexpected non-continuation byte 0x00,
    > immediately after start byte 0xde) at
    > /Volumes/Parents/wgroleau/bin/char-find line 12.
    >
    > ?!? According to 'od -xc' the script does NOT contain any
    > byte that is 0xde. In fact, the ONLY bytes in the script that
    > are not ASCII are the bytes for the "enye" which are on line
    > twelve, but neither of them is a DE and NO bytes are 00.
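
One possible reading, offered as an assumption rather than a confirmed diagnosis: without "use utf8" the pattern is itself the two bytes C3 B1, so it matches input that is still undecoded. A hedged alternative is to keep "use utf8" and decode the input explicitly with an I/O layer instead of relying on -C in the #! line. The sketch below writes a small temporary file so that it is self-contained; the three sample words are hypothetical stand-ins for palabras.utf8.

```perl
use strict;
use warnings;
use utf8;                          # the pattern literal is the character U+00F1
use File::Temp qw(tempfile);

# Hypothetical sample data standing in for the real word list.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)') or die "binmode: $!";
print {$out} "año\n", "casa\n", "niño\n";
close($out) or die "close: $!";

# Decode while reading, so each line holds characters rather than bytes.
open(my $in, '<:encoding(UTF-8)', $path) or die "open $path: $!";
my @hits = grep { /ñ/ } <$in>;
close($in);

# Encode on the way out, so the matches are written as UTF-8 again.
binmode(STDOUT, ':encoding(UTF-8)');
print @hits;
```

Opening the file with an explicit layer keeps the decoding step visible in the script itself, at the cost of no longer using the bare while (<>) loop.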


    --
    Wes Groleau

    A pessimist says the glass is half empty.

    An optimist says the glass is half full.

    An engineer says somebody made the glass
    twice as big as it needed to be.
    Wes Groleau, Apr 12, 2005
    #2