regexp problem in perl 5.6.1 and 5.8.4

Thomas Stauffer

I have done some Perl programming in the past, but I am by no means an
expert. I am currently changing some code written some time ago by an
employee who is no longer with the company. The code currently runs
under 5.005_02. I am making changes and adding some UCS-2 to UTF-8
conversion, and I want to run the code under Perl 5.8.4 to take
advantage of Perl's internal Unicode support. There is a regular
expression in the code that works fine under 5.005_02 but loops
endlessly under 5.6.1 and above. The following code illustrates the problem:

$orig_string = 'JKXXAF';

$regex = qr {\G
    # Match as many characters as possible
    # that can be passed thru as-is
    ([^\x00-\xFF]+)

    # Then try to match $A1 and next two bytes
    | (@..)

    # Otherwise just get the next byte
    | (.)
}sx;

print "regex = $regex\n";

while ($orig_string =~ /$regex/g) {
    print "\$1=$1\n";
    print "\$2=$2\n";
    print "\$3=$3\n";
}

The problem seems to be with the use of the \G assertion. If I take it
out, the regular expression works the same in all versions of Perl.
However, since I did not write the code, and the programmer who did was
considerably more experienced with Perl than I am, I am hesitant to
simply remove it. I have been looking at this for several days
without success. My Perl expert suggested I post it to this forum. Any
help would be greatly appreciated.

Following are the details of the version of Perl I'm using:

Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
Platform:
osname=solaris, osvers=2.8, archname=sun4-solaris
uname='sunos cwu21awu 5.8 generic_108528-29 sun4u sparc
sunw,sun-blade-100 '
config_args=''
hint=recommended, useposix=true, d_sigaction=define
usethreads=undef use5005threads=undef useithreads=undef
usemultiplicity=undef
useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=undef uselongdouble=undef
usemymalloc=n, bincompat5005=undef
Compiler:
cc='/opt/SUNWspro/bin/cc', ccflags =' -D_LARGEFILE_SOURCE
-D_FILE_OFFSET_BITS=64',
optimize='-O',
cppflags=''
ccversion='Sun WorkShop 6 update 2 C 5.3 Patch 111679-08
2002/05/09', gccversion='', gccosandvers=''
intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=16
ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t',
lseeksize=8
alignbytes=8, prototype=define
Linker and Libraries:
ld='/opt/SUNWspro/bin/cc', ldflags =' -L/usr/lib -L/usr/ccs/lib
-L/opt/SUNWspro/WS6U2/lib -L/usr/local/lib '
libpth=/usr/lib /usr/ccs/lib /opt/SUNWspro/WS6U2/lib /usr/local/lib
libs=-lsocket -lnsl -ldl -lm -lc
perllibs=-lsocket -lnsl -ldl -lm -lc
libc=/lib/libc.so, so=so, useshrplib=false, libperl=libperl.a
gnulibc_version=''
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so, d_dlsymun=undef, ccdlflags=' '
cccdlflags='-KPIC', lddlflags='-G -L/usr/lib -L/usr/ccs/lib
-L/opt/SUNWspro/WS6U2/lib -L/usr/local/lib'


Characteristics of this binary (from libperl):
Compile-time options: USE_LARGE_FILES
Built under solaris
Compiled at Apr 22 2004 16:07:19
@INC:
/usr/local/perl5/lib/5.8.4/sun4-solaris
/usr/local/perl5/lib/5.8.4
/usr/local/perl5/lib/site_perl/5.8.4/sun4-solaris
/usr/local/perl5/lib/site_perl/5.8.4
/usr/local/perl5/lib/site_perl
 
Anno Siegel

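The \G is not really needed for this loop to work. In scalar context,
//g starts each match attempt at pos(), where the previous match ended.
Because the last alternative, (.) with /s, matches any character, every
match begins exactly there, just as if \G had been written.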
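To see why the explicit \G is redundant here, it can help to watch where each match starts. The following is a minimal sketch, not code from the original program; the variable names are made up for illustration. It records the start offset of every scalar-context //g match on the sample string.

```perl
use strict;
use warnings;

my $string = 'JKXXAF';
my @starts;

# Scalar-context //g resumes at pos($string), i.e. where the previous
# successful match ended. With /s, (.) matches any character, so each
# match begins exactly at that position.
while ($string =~ /(.)/sg) {
    push @starts, $-[0];    # offset where this match started
}

print "@starts\n";    # prints: 0 1 2 3 4 5
```

No offset is skipped, which is the guarantee \G would otherwise provide.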

Note that adding \G at the front only anchors the first alternative
explicitly; the second and third are free to match anywhere. One could
argue that scalar //g should still anchor the whole match, so the current
behavior would be a bug. In any case, the behavior in the presence of both
\G and //g appears to have changed.
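The precedence point is easy to check with an anchor whose behavior has not changed between versions. This sketch uses ^ in place of \G purely as an analogy (it is not the regex from the question): alternation binds more loosely than concatenation, so an anchor written before the first alternative belongs to that alternative only.

```perl
use strict;
use warnings;

my $text = 'xxb';

# Parsed as (^a) | (b): only the "a" branch is anchored,
# so "b" is free to match at offset 2.
my $ungrouped = $text =~ /^a|b/ ? 1 : 0;

# Parsed as ^ followed by (a or b): both branches are anchored,
# and neither "a" nor "b" occurs at offset 0.
my $grouped = $text =~ /^(?:a|b)/ ? 1 : 0;

print "ungrouped: $ungrouped, grouped: $grouped\n";
# prints: ungrouped: 1, grouped: 0
```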

Adding non-capturing parentheses around the alternatives fixes the
behavior:

my $regex = qr { \G
    (?:
        # Match as many characters as possible
        # that can be passed thru as-is
        ([^\x00-\xFF]+)

        # Then try to match $A1 and next two bytes
        | (@..)

        # Otherwise just get the next byte
        | (.)
    )
}sx;

I'd say you can safely leave the \G off. If you want to keep it, add
the grouping; otherwise it doesn't make much sense.
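For completeness, here is how the grouped version behaves in a loop. This is a sketch under assumptions: the input string is invented (the original 'JKXXAF' only ever reaches the last alternative), the @ is escaped as \@ to rule out array interpolation, and the loop records the group number and match length instead of printing the captures.

```perl
use strict;
use warnings;

# Hypothetical input: two characters above 0xFF, an at-sign sequence,
# and plain bytes around them.
my $string = "JK\x{263A}\x{263B}" . '@XXAF';

my $regex = qr{ \G
    (?:
        ([^\x00-\xFF]+)   # run of characters above 0xFF
      | (\@..)            # at-sign plus the next two characters
      | (.)               # otherwise a single character
    )
}sx;

my @tokens;
while ($string =~ /$regex/g) {
    # Exactly one of the three groups is defined after each match.
    my $group = defined $1 ? 1 : defined $2 ? 2 : 3;
    push @tokens, $group . ':' . ($+[0] - $-[0]);   # group:length
}

print "@tokens\n";    # prints: 3:1 3:1 1:2 2:3 3:1 3:1
```

Testing with defined() also avoids "uninitialized value" warnings for the two groups that did not take part in a given match.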

Anno
 
