regexp problem in perl 5.6.1 and 5.8.4

Discussion in 'Perl Misc' started by Thomas Stauffer, Jun 4, 2004.

  1. I have done some Perl programming in the past but I am by no means an
    expert. I am currently working on changing some code written some time
    ago by an employee no longer with the company. The code is currently
    running under 5.005.02. I am making changes and adding some ucs2 ->
    utf8 conversion. I want to run the code under Perl 5.8.4 to take
    advantage of Perl's internal Unicode support. At any rate, there is a
    regular expression in the code that works fine under 5.005.02 but loops
    under 5.6.1 and above. The following code illustrates the problem:

    $orig_string = 'JKXXAF';

    $regex = qr {\G
        # Match as many characters as possible
        # that can be passed thru as-is
        ([^\x00-\xFF]+)

        # Then try to match $A1 and next two bytes
        | (@..)

        # Otherwise just get the next byte
        | (.)
    }sx;

    print "regex = $regex\n";

    while ($orig_string =~ /$regex/g) {
        print "\$1=$1\n";
        print "\$2=$2\n";
        print "\$3=$3\n";
    }

    The problem seems to be with the use of the \G assertion. If I take it
    out, the regular expression works the same in all versions of Perl.
    However, since I did not write the code and the programmer who did was
    considerably more experienced using Perl than I am, I am hesitant just
    to remove it. Anyhow, I have been looking at this for several days
    without success. My Perl expert suggested I post it to this forum. Any
    help would be greatly appreciated.

    Following are the details of the version of Perl I'm using:

    Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
    Platform:
    osname=solaris, osvers=2.8, archname=sun4-solaris
    uname='sunos cwu21awu 5.8 generic_108528-29 sun4u sparc
    sunw,sun-blade-100 '
    config_args=''
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=undef use5005threads=undef useithreads=undef
    usemultiplicity=undef
    useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=undef uselongdouble=undef
    usemymalloc=n, bincompat5005=undef
    Compiler:
    cc='/opt/SUNWspro/bin/cc', ccflags =' -D_LARGEFILE_SOURCE
    -D_FILE_OFFSET_BITS=64',
    optimize='-O',
    cppflags=''
    ccversion='Sun WorkShop 6 update 2 C 5.3 Patch 111679-08
    2002/05/09', gccversion='', gccosandvers=''
    intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
    ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
    lseeksize=8
    alignbytes=8, prototype=define
    Linker and Libraries:
    ld='/opt/SUNWspro/bin/cc', ldflags =' -L/usr/lib -L/usr/ccs/lib
    -L/opt/SUNWspro/WS6U2/lib -L/usr/local/lib '
    libpth=/usr/lib /usr/ccs/lib /opt/SUNWspro/WS6U2/lib /usr/local/lib
    libs=-lsocket -lnsl -ldl -lm -lc
    perllibs=-lsocket -lnsl -ldl -lm -lc
    libc=/lib/libc.so, so=so, useshrplib=false, libperl=libperl.a
    gnulibc_version=''
    Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
    cccdlflags='-KPIC', lddlflags='-G -L/usr/lib -L/usr/ccs/lib
    -L/opt/SUNWspro/WS6U2/lib -L/usr/local/lib'


    Characteristics of this binary (from libperl):
    Compile-time options: USE_LARGE_FILES
    Built under solaris
    Compiled at Apr 22 2004 16:07:19
    @INC:
    /usr/local/perl5/lib/5.8.4/sun4-solaris
    /usr/local/perl5/lib/5.8.4
    /usr/local/perl5/lib/site_perl/5.8.4/sun4-solaris
    /usr/local/perl5/lib/site_perl/5.8.4
    /usr/local/perl5/lib/site_perl
     
    Thomas Stauffer, Jun 4, 2004

  2. Anno Siegel (Guest)

    The \G is really not needed for the function of the loop. //g in scalar
    context makes sure \G is implicitly matched before each match is attempted.

    Note that adding \G anchors only the first alternative explicitly;
    the second and third are free to match anywhere. One could argue
    that scalar //g should still anchor the whole match, so the current
    behavior would be a bug. In any case, the behavior in the presence of
    both \G and //g appears to have changed.

    Adding non-capturing parentheses around the alternative fixes the
    behavior:

    my $regex = qr { \G
        (?:
            # Match as many characters as possible
            # that can be passed thru as-is
            ([^\x00-\xFF]+)

            # Then try to match $A1 and next two bytes
            | (@..)

            # Otherwise just get the next byte
            | (.)
        )
    }sx;
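
    As a small illustration of the precedence point (a sketch that is
    not part of the original exchange): alternation binds more loosely
    than concatenation, so in an ungrouped pattern \G belongs to the
    first branch only. The helper match_offset and the two toy patterns
    are made up for this example. No //g is used, so pos() is undefined
    and \G simply stands for offset 0.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the offset where $re first matches $str,
# or -1 if it does not match at all.
sub match_offset {
    my ($re, $str) = @_;
    return $str =~ $re ? $-[0] : -1;
}

# Parsed as (?: \G a ) | (?: b ): only the first branch is anchored.
my $ungrouped = qr/\G a | b/x;

# Here \G applies to both alternatives.
my $grouped = qr/\G (?: a | b )/x;

print "ungrouped: ", match_offset($ungrouped, 'xb'), "\n";
print "grouped:   ", match_offset($grouped,   'xb'), "\n";
```

    With the ungrouped pattern the second branch can still find the
    "b" past the anchor position; with the grouping the match has to
    start where \G points, so the same string is rejected.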

    I'd say you can safely leave \G off. If you want to keep it, add
    the grouping; otherwise it doesn't make much sense.
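
    For anyone who wants to try the grouped version, here is a minimal
    sketch of the loop wrapped in a function. The name tokenize, the
    token labels (wide, escape, single) and the sample input 'JK@XAF'
    are assumptions made for this example; the sample differs from the
    original 'JKXXAF' only so that the second alternative gets
    exercised. The at-sign is written as \@ to keep it clear of array
    interpolation.

```perl
use strict;
use warnings;

# Split a string into tokens with the grouped pattern from this thread.
# Each token is returned as [ label, text ]; the labels are illustrative.
sub tokenize {
    my ($str) = @_;

    my $regex = qr{ \G
        (?:
            ([^\x00-\xFF]+)   # run of characters above 0xFF
          | (\@..)            # an at-sign plus the next two characters
          | (.)               # otherwise any single character
        )
    }sx;

    my @tokens;
    while ($str =~ /$regex/g) {
        # Exactly one of the three capture groups is defined per match.
        if    (defined $1) { push @tokens, [ wide   => $1 ] }
        elsif (defined $2) { push @tokens, [ escape => $2 ] }
        else               { push @tokens, [ single => $3 ] }
    }
    return @tokens;
}

for my $token (tokenize('JK@XAF')) {
    printf "%-6s %s\n", @$token;
}
```

    Testing each group with defined() also avoids the "uninitialized
    value" warnings that printing $1, $2 and $3 unconditionally would
    produce under warnings.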

    Anno
     
    Anno Siegel, Jun 5, 2004
