Apparent bug in Perl 5.10 regexes w. UTF-8 expression

Discussion in 'Perl Misc' started by Ben Bullock, Jul 13, 2008.

  1. Ben Bullock

    Ben Bullock Guest

    I've found a place where Perl seems to behave differently depending on
    whether something is marked as UTF-8 or not, regardless of the fact that
    it is just ASCII.

    In the following code snippet,

    #!/usr/local/bin/perl -lw
    use strict;
    use Encode 'decode';
    use Lingua::JA::FindDates 'subsjdate';
    binmode STDERR,"utf8";
    binmode STDOUT,"utf8";
    print STDERR "first try\n";
    my $test = "ABCDEFG";
    print subsjdate($test);
    print STDERR "now try again\n";
    $test = decode ('utf8', $test);
    print subsjdate($test);

    the output is like this:

    ben ~ 541 $ ./test2.pl
    first try

    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.
    ABCDEFG
    now try again

    ABCDEFG
    ben ~ 542 $

    But, if I

    use utf8;

    and call the routine with a non-ASCII string, like 平成, I don't get the
    error messages.

    What's more, after about one hour of exhaustive checking, I'm fairly sure
    that there is no uninitialized value in the pattern match in question. In
    fact I can remove the error message by removing a variable which is
    initialized, called $kanjidigits, from the pattern match, but that seems
    even more weird.

    I think the above-described behaviour, regardless of any errors in the
    module, indicates an error in Perl. Also, I think there is nothing wrong
    with the module. Does anybody have any other opinions?
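
    A minimal sketch of the premise above, not taken from the original post: an
    ASCII-only string can carry Perl's internal UTF-8 flag while its contents
    stay the same. It uses utf8::upgrade instead of decode, on the assumption
    that the flag is what differs between the two calls, because utf8::upgrade
    is documented to turn the flag on. flag_state is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: report whether the internal UTF-8 flag is set.
sub flag_state {
    my ($string) = @_;
    return utf8::is_utf8($string) ? "flagged" : "unflagged";
}

my $plain    = "ABCDEFG";
my $upgraded = "ABCDEFG";
utf8::upgrade($upgraded);    # same characters, different internal encoding

print "plain:    ", flag_state($plain),    "\n";
print "upgraded: ", flag_state($upgraded), "\n";
print "contents are equal\n" if $plain eq $upgraded;
```

    The two strings compare equal with eq, so any difference in how a regex
    treats them comes from the internal representation, not from the data.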
     
    Ben Bullock, Jul 13, 2008
    #1

  2. On 2008-07-13 14:14, Ben Bullock <> wrote:
    > I've found a place where Perl seems to behave differently depending on
    > whether something is marked as UTF-8 or not, regardless of the fact that
    > it is just ASCII.
    >
    > In the following code snippet,
    >
    > #!/usr/local/bin/perl -lw
    > use strict;
    > use Encode 'decode';
    > use Lingua::JA::FindDates 'subsjdate';
    > binmode STDERR,"utf8";
    > binmode STDOUT,"utf8";
    > print STDERR "first try\n";
    > my $test = "ABCDEFG";
    > print subsjdate($test);
    > print STDERR "now try again\n";
    > $test = decode ('utf8', $test);
    > print subsjdate($test);
    >
    > the output is like this:
    >
    > ben ~ 541 $ ./test2.pl
    > first try
    >
    > Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    > site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.

    [...]
    > What's more, after about one hour of exhaustive checking, I'm fairly sure
    > that there is no uninitialized value in the pattern match in question.


    Right. Your problem can be reproduced with this script:

    #!/usr/bin/perl
    use warnings;
    use strict;

    my $regex =
    "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
    my $test = "ABCDEFG";
    if ($test =~ /($regex)/) {
    print "m:<$1>\n";
    }
    __END__

    If the last character ("\x{5e74}") is removed from the regexp, the
    warning vanishes. But if the capturing group is removed (leaving just
    "\\s*\x{5e74}"), the warning vanishes, too - so it's not \x{5e74} alone
    which triggers the warning, only that combined with something else.
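
    One way to compare such variants without reading STDERR by eye is to
    collect the warnings each match raises. This is a sketch only:
    warnings_from_match is a hypothetical helper, and using utf8::upgrade to
    get a flagged subject is an assumption standing in for the decode call in
    the original script. Whether a given variant warns depends on the Perl
    version it runs under.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: run one match and return every warning it raised.
sub warnings_from_match {
    my ($subject, $pattern) = @_;
    my @warnings;
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };
    my $matched = $subject =~ /($pattern)/;
    return @warnings;
}

# The reduced pattern from the text above: optional whitespace, then U+5E74.
my $tail = "\\s*\x{5e74}";

my $plain   = "ABCDEFG";
my $flagged = "ABCDEFG";
utf8::upgrade($flagged);    # same characters, internal UTF-8 flag on

for my $case ([plain => $plain], [flagged => $flagged]) {
    my ($label, $subject) = @$case;
    my $count = () = warnings_from_match($subject, $tail);
    print "$label subject: $count warning(s)\n";
}
```

    Swapping $tail for the full pattern, or for the pattern without its last
    character, gives a count per variant that can be compared directly.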

    hp
     
    Peter J. Holzer, Jul 13, 2008
    #2

  3. Ben Morrow

    Ben Morrow Guest

    Quoth "Peter J. Holzer" <>:
    > On 2008-07-13 14:14, Ben Bullock <> wrote:
    > > I've found a place where Perl seems to behave differently depending on
    > > whether something is marked as UTF-8 or not, regardless of the fact that
    > > it is just ASCII.

    >
    > Right. Your problem can be reproduced with this script:
    >
    > #!/usr/bin/perl
    > use warnings;
    > use strict;
    >
    > my $regex =
    > "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";


    Using utf8 in regexen is not well-supported in 5.8; in particular, the
    regex engine is not consistent about when to apply utf8 semantics and
    when to apply byte semantics. Some of the bugs have been fixed in 5.10;
    I don't know if they all have.

    Ben

    --
    For far more marvellous is the truth than any artists of the past imagined!
    Why do the poets of the present not speak of it? What men are poets who can
    speak of Jupiter if he were like a man, but if he is an immense spinning
    sphere of methane and ammonia must be silent? [Feynman]
     
    Ben Morrow, Jul 13, 2008
    #3
  4. Ben Bullock

    Ben Bullock Guest

    On Sun, 13 Jul 2008 19:46:14 +0100, Ben Morrow wrote:

    > Quoth "Peter J. Holzer" <>:
    >> On 2008-07-13 14:14, Ben Bullock <> wrote:
    >> > I've found a place where Perl seems to behave differently depending on
    >> > whether something is marked as UTF-8 or not, regardless of the fact that
    >> > it is just ASCII.

    >>
    >> Right. Your problem can be reproduced with this script:
    >>
    >> #!/usr/bin/perl
    >> use warnings;
    >> use strict;
    >>
    >> my $regex =
    >> "([\x{ff10}-\x{ff19}0-9]{4}|[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]?\x{5343}[\x{5341}\x{516d}\x{4e03}\x{4e5d}\x{4e94}\x{56db}\x{5343}\x{767e}\x{4e8c}\x{4e00}\x{516b}\x{4e09}]*)\\s*\x{5e74}";
    >
    > Using utf8 in regexen is not well-supported in 5.8; in particular, the
    > regex engine is not consistent about when to apply utf8 semantics and
    > when to apply byte semantics. Some of the bugs have been fixed in 5.10;
    > I don't know if they all have.


    The problem I described is the behaviour of Perl 5.10:

    ben ~ 501 $ perl --version

    This is perl, v5.10.0 built for i686-linux

    Copyright 1987-2007, Larry Wall

    Perl may be copied only under the terms of either the Artistic License or the
    GNU General Public License, which may be found in the Perl 5 source kit.

    Complete documentation for Perl, including FAQ lists, should be found on
    this system using "man perl" or "perldoc perl". If you have access to the
    Internet, point your browser at http://www.perl.org/, the Perl Home Page.

    ben ~ 502 $ ./test2.pl
    first try

    Use of uninitialized value in pattern match (m//) at /usr/local/lib/perl5/
    site_perl/5.10.0/Lingua/JA/FindDates.pm line 531.

    etc.

    Should I report this as a bug?
     
    Ben Bullock, Jul 13, 2008
    #4
  5. Ben Bullock

    Ben Bullock Guest

    On Sun, 13 Jul 2008 22:18:43 +0000, Ben Bullock wrote:

    > Should I report this as a bug?


    Never mind, I reported it anyway.
     
    Ben Bullock, Jul 13, 2008
    #5