UTF-8 in regexp with 5.8.1

Discussion in 'Perl Misc' started by Wes Groleau, Apr 11, 2005.

  1. Wes Groleau

    Wes Groleau Guest

    I have a file containing thousands of Spanish words, encoded (AFAIK)
    in UTF-8. I also have a perl script in UTF-8, which says (hope
    pasting works):
    #!/usr/bin/perl -w -CSD
    #
    # NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)
    use warnings;
    use strict;
    use utf8;

    while (<>)
    {
    print if ( /ñ/ )
    }

    What is in the regexp is supposed to be "small n with tilde"
    and I verified with od -xc that it is hex C3 B1 as is every
    place in the file where that letter appears.
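
    For anyone wanting to repeat that check from inside Perl rather than
    with od, a minimal sketch follows. It assumes nothing about the actual
    word list; the character is written as an escape so the result does
    not depend on how the snippet itself is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{F1}" is LATIN SMALL LETTER N WITH TILDE, written as an escape so
# this sketch does not depend on the encoding of its own source file.
my $octets = encode('UTF-8', "\x{F1}");

# Render each octet of the encoded form as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $octets;

print "$hex\n";    # prints "C3 B1"
```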

    The script is intended to find all words containing that
    letter. But it finds nothing. After wading through gallons
    of text (man encoding, man utf8, man perlunicode, etc.),
    I still had no reason to think it was wrong. But I added

    use encoding "utf8";

    and ran it again, getting only:
    Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
    Malformed UTF-8 character (unexpected non-continuation byte 0x00,
    immediately after start byte 0xde) at
    /Volumes/Parents/wgroleau/bin/char-find line 12.

    ?!? According to 'od -xc' the script does NOT contain any
    byte that is 0xde. In fact, the ONLY bytes in the script that
    are not ASCII are the bytes for the "enye" which are on line
    twelve, but neither of them is a DE and NO bytes are 00.

    Have I found a bug in perl or is my ignorance just getting
    the best of me?

    Oh, yeah, I also tried a few things with 'binmode' that didn't
    work either.

    WWG
    Wes Groleau, Apr 11, 2005
    #1

  2. On Sun, 10 Apr 2005, Wes Groleau wrote:

    > I have a file containing thousands of Spanish words, encoded (AFAIK)
    > in UTF-8.


    Well, your whole report stands or falls by that "AFAIK", so it might
    be useful to have a test case, including data, which we could run for
    ourselves (preferably on a web page, to exclude any possibility of
    lossage in usenet postings) to help pin down your problem.

    > I also have a perl script in UTF-8,


    Noted, although I don't see any compelling reason to code the script
    itself in utf-8. Sure, you /can/ do, but it seems to me to be a
    potential additional complication that one could do well to avoid
    when feasible.

    > #!/usr/bin/perl -w -CSD
    > #
    > # NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)


    Do you have a cite on that? My knowledge of this area is admittedly
    somewhat limited, but I hadn't met this before.

    > What is in the regexp is supposed to be "small n with tilde"
    > and I verified with od -xc that it is hex C3 B1 as is every
    > place in the file where that letter appears.


    Sounds good. That even seems to have worked in your usenet posting,
    as far as I can see.

    > use encoding "utf8";
    >
    > and ran it again, getting only:
    > Ze-Admins-Computer:~/Desktop wgroleau$ char-find palabras.utf8
    > Malformed UTF-8 character (unexpected non-continuation byte 0x00,
    > immediately after start byte 0xde) at
    > /Volumes/Parents/wgroleau/bin/char-find line 12.
    >
    > ?!?


    Bizarre.

    > According to 'od -xc' the script does NOT contain any
    > byte that is 0xde. In fact, the ONLY bytes in the script that
    > are not ASCII are the bytes for the "enye" which are on line
    > twelve, but neither of them is a DE and NO bytes are 00.


    I've successfully processed utf-8 and utf-16 data without the use of
    the -C flag(s), by using explicit binmode() on the relevant files.
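
    A minimal sketch of that approach, assuming a UTF-8 word list. The
    temporary file and its three words are hypothetical stand-ins for the
    real data, and this is not necessarily the code referred to above:

```perl
use strict;
use warnings;
use utf8;                      # the "ñ" literal below is the character U+00F1
use File::Temp qw(tempfile);

# Hypothetical stand-in for the word list: three words written as UTF-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} "año\n", "casa\n", "niño\n";
close $out or die "close: $!";

open my $in, '<', $path or die "open $path: $!";
binmode($in, ':encoding(UTF-8)');      # decode octets into characters on read

# Collect every line that contains the character U+00F1.
my @hits;
while (my $line = <$in>) {
    push @hits, $line if $line =~ /ñ/;
}
close $in;

binmode(STDOUT, ':encoding(UTF-8)');   # encode characters again on output
print @hits;
```

    With the layer in place the handle yields characters rather than
    octets, so a "ñ" literal under "use utf8" is compared against the
    same kind of thing.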

    If you could at least get one working variant of your script, you
    could then at least move forward from there.

    Sorry, this is a bit inconclusive, as yet.
    Alan J. Flavell, Apr 11, 2005
    #2

  3. Wes Groleau

    Wes Groleau Guest

    Alan J. Flavell wrote:
    > On Sun, 10 Apr 2005, Wes Groleau wrote:
    >>I have a file containing thousands of Spanish words, encoded (AFAIK)
    >>in UTF-8.

    >
    > Well, your whole report stands or falls by that "AFAIK", so it might


    Well, I told my editor to save it as UTF-8, and I think it works.
    (When I save web pages that way, and specify UTF-8 in a META tag,
    Spanish, French, Polish, and Japanese characters are correctly
    rendered by most browsers.)

    >>I also have a perl script in UTF-8,

    >
    > Noted, although I don't see any compelling reason to code the script
    > itself in utf-8. Sure, you /can/ do, but it seems to me to be a
    > potential additional complication that one could do well to avoid


    Well, in this case, I am trying to regexp a non-ASCII character.
    Since I am an easily-distracted (A.D.D.) type, and I work with
    several different character sets, I am attempting to standardize
    on UTF-8 rather than constantly be debugging places where I forgot
    to make a switch. :)

    >>#!/usr/bin/perl -w -CSD
    >>#
    >># NOTE: The extra space in the bang line is mandatory (bug in perl 5.8)

    >
    > Do you have a cite on that? My knowledge of this area is admittedly
    > somewhat limited, but I hadn't met this before.


    Oh, I reported that a while back. If I take the space out
    on Mac OS X, I get frequent segment violations. If I remove
    the space on NetBSD/Alpha, I get consistent nasty-grams about
    the wrong method of invoking the debugger.

    > I've successfully processed utf-8 and utf-16 data without the use of
    > the -C flag(s), by using explicit binmode() on the relevant files.


    I tried a couple of things with binmode that also didn't work,
    but I don't remember exactly what happened.

    > If you could at least get one working variant of your script, you
    > could then at least move forward from there.
    >
    > Sorry, this is a bit inconclusive, as yet.


    Well, a post in another thread made me try removing the
    "use utf8" and it worked. So, I really think this is
    a bug:

    - A regexp containing a non-ASCII character in
    correct UTF-8 encoding works.

    - Add "use utf8" and it silently stops working.

    - Add 'use encoding "utf8"' and you get chewed out
    for having invalid UTF-8, in a message that bitches
    about the presence of bytes that don't exist.

    I'll send it in .....
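
    The first two observations are what one would expect if the input
    were reaching the regexp as undecoded octets, i.e. if the -CSD on the
    bang line were not taking effect. That is an assumption, not something
    confirmed in this thread. A small sketch of the mismatch, using a
    hypothetical sample string:

```perl
use strict;
use warnings;
use utf8;                      # source literals are decoded: "ñ" is one character, U+00F1
use Encode qw(decode);

# Hypothetical sample: the raw octets of "año\n" as they would come from
# a UTF-8 file read through a handle with no decoding layer.
my $raw = "a\xC3\xB1o\n";

# Two octets (C3 B1) are never equal to the single character U+00F1.
my $raw_matches = ($raw =~ /ñ/) ? 1 : 0;

# Decoding first turns C3 B1 into U+00F1, so the same regexp matches.
my $text = decode('UTF-8', $raw);
my $decoded_matches = ($text =~ /ñ/) ? 1 : 0;

printf "raw: %d, decoded: %d\n", $raw_matches, $decoded_matches;
```

    Without "use utf8" the literal in the regexp is itself the two octets
    C3 B1, which is why it matches undecoded input.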

    --
    Wes Groleau

    He that is good for making excuses, is seldom good for anything else.
    -- Benjamin Franklin
    Wes Groleau, Apr 12, 2005
    #3
