perl + regex bug?

nitroamos

hello -- i've been spending quite a bit of time trying to figure out
this issue, and now that i've found a workaround, i'm wondering if the
problem was a bug. basically, i have some code that looks like this:

$orbitals[$i] =~ /\s+(\d)\s+Orbital Energy/;
$index = $1;
$index--; $index++;

if($element =~ /_${index}_/ or $element =~ m/_$index$/){
print "$element matches $& with index = $index\n";


}

this is just a small part of my program, so i hope i've shown enough to
isolate the bug. Basically, I'm trying to grab an integer out of a
string, and then look in another string to see if i have a match. i've
pasted what the output looks like at the bottom. as you can see, the
incorrect results are producing a subset of the matches I expect --
only the first pattern on the if line is matching.

for some reason, without the $index--; $index++; business, my second
pattern is not matching. after fooling around, it seems that somehow
it's entirely related to the $ anchor. although not what I want, if I
add a ^ anchor i can get matches with or without the $ anchor. so
somehow the $ anchor all by itself is not working unless i do the --/++
business.

what kind of weirdness is this? i've pasted perl -V at the bottom. if
this is some version specific bug, then that's all i need to know; i'm
ok with the workaround. but if there's something that i'm not
understanding, then i want to know what i'm missing.

thanks!



results where everything is working as expected (correctly):
_1 matches _1 with index = 1
_2 matches _2 with index = 2
_3 matches _3 with index = 3
_4 matches _4 with index = 4
_5 matches _5 with index = 5
_1_2 matches _1_ with index = 1
_1_2 matches _2 with index = 2
_1_3 matches _1_ with index = 1
_1_3 matches _3 with index = 3
_2_3 matches _2_ with index = 2
_2_3 matches _3 with index = 3
_1_4 matches _1_ with index = 1
_1_4 matches _4 with index = 4
_2_4 matches _2_ with index = 2
_2_4 matches _4 with index = 4
_3_4 matches _3_ with index = 3
_3_4 matches _4 with index = 4
_1_5 matches _1_ with index = 1
_1_5 matches _5 with index = 5
_2_5 matches _2_ with index = 2
_2_5 matches _5 with index = 5
_3_5 matches _3_ with index = 3
_3_5 matches _5 with index = 5
_4_5 matches _4_ with index = 4
_4_5 matches _5 with index = 5

results where the "$index--; $index++;" line has been commented out
(incorrect results):
_1_2 matches _1_ with index = 1
_1_3 matches _1_ with index = 1
_2_3 matches _2_ with index = 2
_1_4 matches _1_ with index = 1
_2_4 matches _2_ with index = 2
_3_4 matches _3_ with index = 3
_1_5 matches _1_ with index = 1
_2_5 matches _2_ with index = 2
_3_5 matches _3_ with index = 3
_4_5 matches _4_ with index = 4


here is what perl -V says:
Summary of my perl5 (revision 5.0 version 8 subversion 0)
configuration:
Platform:
osname=linux, osvers=2.4.21-1.1931.2.382.entsmp,
archname=i386-linux-thread-multi
uname='linux str'
config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686
-Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red
Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux
-Dvendorprefix=/usr -Dsiteprefix=/usr
-Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads
-Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db
-Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio
-Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less
-isr'
hint=recommended, useposix=true, d_sigaction=define
usethreads=define use5005threads=undef'
useithreads=define usemultiplicity=
useperlio= d_sfio=undef uselargefiles=define usesocks=undef
use64bitint=undef use64bitall=un uselongdouble=
usemymalloc=, bincompat5005=undef
Compiler:
cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS
-DDEBUGGING -fno-strict-aliasing -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
optimize='',
cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS
-DDEBUGGING -fno-strict-aliasing -I/usr/local/include
-I/usr/include/gdbm'
ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)',
gccosandvers=''
gccversion='3.2.2 200302'
intsize=r, longsize=r, ptrsize=5, doublesize=8, byteorder=1234
d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
ivtype='long'
k', ivsize=4'
ivtype='l, nvtype='double'
o_nonbl', nvsize=, Off_t='', lseeksize=8
alignbytes=4, prototype=define
Linker and Libraries:
ld='gcc'
l', ldflags =' -L/u'
libpth=/usr/local/lib /lib /usr/lib
libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
perllibs=
libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libper
gnulibc_version='2.3.2'
Dynamic Linking:
dlsrc=dl_dlopen.xs, dlext=so', d_dlsymun=undef,
ccdlflags='-rdynamic
-Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
cccdlflags='-fPIC'
ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5', lddlflags='s
Unicode/Normalize XS/A'


Characteristics of this binary (from libperl):
Compile-time options: DEBUGGING MULTIPLICITY USE_ITHREADS
USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
Locally applied patches:
MAINT18379
Built under linux
Compiled at Aug 13 2003 11:47:58
@INC:
/usr/lib/perl5/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/5.8.0
/usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/site_perl/5.8.0
/usr/lib/perl5/site_perl
/usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/vendor_perl/5.8.0
/usr/lib/perl5/vendor_perl
/usr/lib/perl5/5.8.0/i386-linux-thread-multi
/usr/lib/perl5/5.8.0
 
Paul Lalli

> hello -- i've been spending quite a bit of time trying to figure out
> this issue, and now that i've found a workaround, i'm wondering if the
> problem was a bug.

So there's two possibilities: You did something wrong, or Perl has a
bug that has never been detected or repaired. It is exceedingly
arrogant to guess that the latter has a greater chance than the former,
IMHO.

> basically, i have some code that looks like this:

Please reduce your *real* code to the smallest possible script that
demonstrates the error, and yet is still a complete script we can run.

> $orbitals[$i] =~ /\s+(\d)\s+Orbital Energy/;

And what is the value of $i? What are the values in @orbitals?

> $index = $1;

NEVER use $1, $2, $3 etc without first assuring that the pattern match
succeeded. If the above pattern did not match, $1 will be whatever it
was after the last successful pattern match. If either there was no
prior successful match, or that match did not contain any capturing
parentheses, $1 will be undef.
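
A minimal sketch of that advice. The contents of @orbitals below are made up for illustration (the real data was not posted), and orbital_index is a hypothetical helper name. Assigning the match in list context means $1 is never read after a failed match:

```perl
use strict;
use warnings;

# Hypothetical sample data; the real @orbitals was not shown in the post.
my @orbitals = (
    "    1 Orbital Energy  -32.554095 Occupation 1.000000",
    "  no header on this line",
);

# Return the captured index, or undef if the pattern did not match.
sub orbital_index {
    my ($text) = @_;
    # In list context a match returns its captures on success and an
    # empty list on failure, so $index stays undef when nothing matched.
    my ($index) = $text =~ /\s+(\d+)\s+Orbital Energy/;
    return $index;
}

for my $i (0 .. $#orbitals) {
    my $index = orbital_index($orbitals[$i]);
    if (defined $index) {
        print "orbital $i has index $index\n";
    }
    else {
        print "orbital $i: no match, skipping\n";
    }
}
```

Note that this sketch uses \d+ rather than the single \d in the quoted code, so multi-digit indices are captured too.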

> $index--; $index++;

This is a sure sign that you're doing something wrong, but don't
understand what, so you're throwing random code at it until it almost
works. Find and fix the *real* problem.

> if($element =~ /_${index}_/ or $element =~ m/_$index$/){

And now what is in $element? How can you expect us to interpret your
results without telling us what these three critical pieces of
information are?

> print "$element matches $& with index = $index\n";
>
>
> }
>
> this is just a small part of my program, so i hope i've shown enough to
> isolate the bug.

Again the assumption that Perl must be wrong instead of you. And no,
there isn't enough code above for us to see what you've done wrong.

> Basically, I'm trying to grab an integer out of a
> string,

No, you're trying to grab a single digit out of a string. Integers
include numbers such as "100", "42", and "8749312". Yours will grab
only 0 through 9. Perhaps that's your mistake? We can't possibly know
without seeing the original strings you were trying to match.

> and then look in another string to see if i have a match. i've
> pasted what the output looks like at the bottom. as you can see, the
> incorrect results are producing a subset of the matches I expect --
> only the first pattern on the if line is matching.
>
> for some reason, without the $index--; $index++; business, my second
> pattern is not matching.

Here's my guess - about half the time, your pattern doesn't match
because you have an integer greater than 9 in your string. Thus,
$index becomes undefined when you assign it to $1. You're not using
warnings, so Perl doesn't bother telling you that you've included an
undef inside your pattern match, and instead treats it as the empty
string. When you do the $index--; $index++; idiocy, Perl is forced to
treat the undef as the integer 0 instead.

As I said, this is a complete guess. Without seeing your actual data,
there is no way of *knowing* what is happening.
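
The mechanism behind that guess can be sketched in a few lines. This only illustrates how an undefined $index would behave; it is not a confirmed diagnosis of the posted program:

```perl
use strict;
use warnings;

my $index;                      # undef, as if an earlier match had failed

# Interpolating undef behaves like interpolating an empty string
# (with warnings enabled Perl would report the uninitialized value).
my $pattern_text;
{
    no warnings 'uninitialized';    # silenced here only to keep the demo output clean
    $pattern_text = "_${index}_";
}
print "pattern text without --/++: '$pattern_text'\n";   # '__'

# Decrementing undef yields -1; incrementing again yields 0.
$index--; $index++;
print "index after --/++: $index\n";                      # 0
print "pattern text with --/++: '_${index}_'\n";          # '_0_'
```

Running the real program under "use warnings" would show directly whether $index is ever undefined at this point.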

> after fooling around, it seems that somehow
> it's entirely related to the $ anchor. although not what I want, if I
> add a ^ anchor i can get matches with or without the $ anchor. so
> somehow the $ anchor all by itself is not working unless i do the --/++
> business.

I find that remarkably unlikely, and am far more willing to bet you
have a bug in your diagnostic process.

> what kind of weirdness is this?

The weirdness that happens when you don't program with warnings, don't
check the return values of your pattern matches, and don't show
complete data when asking for help.

> i've pasted perl -V at the bottom. if
> this is some version specific bug, then that's all i need to know; i'm
> ok with the workaround. but if there's something that i'm not
> understanding, then i want to know what i'm missing.

We can't tell you that, because you haven't given enough information.

Please read the Posting Guidelines for this group. They will give you
all sorts of hints on how to best ask questions in this and other
technical forums.

Paul Lalli
 
Klaus

> hello -- i've been spending quite a bit of time trying to figure out
> this issue, and now that i've found a workaround, i'm wondering if the
> problem was a bug. basically, i have some code that looks like this:
>
> $orbitals[$i] =~ /\s+(\d)\s+Orbital Energy/;
> $index = $1;
> $index--; $index++;
>
> if($element =~ /_${index}_/ or $element =~ m/_$index$/){
> print "$element matches $& with index = $index\n";

I suspect that something might be wrong with $index not being numeric
(although from your regex /...(\d).../ that seems impossible, or do you
have some unexpected locale settings / character encoding / UTF-8 /
UTF-16, etc... that have numeric characters other than '0'...'9' ? )

Anyway, try printing the content of "$index" in hex before and after
the "$index--; $index++;", like so...

print "before: \$index = x'", unpack('H*', $index), "'\n";
$index--; $index++;
print "after : \$index = x'", unpack('H*', $index), "'\n";

....and tell us what you get.
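
The hex dump shows only the string bytes. A numeric round trip can leave those bytes unchanged while changing how Perl stores the scalar internally. The sketch below uses the core B module to look at two of the internal flags; flags_of is a hypothetical helper name, and the literal "1" stands in for a value copied from $1. Whether this internal difference has anything to do with the behaviour reported in this thread is an assumption, not something the thread establishes:

```perl
use strict;
use warnings;
use B ();

# Name the public integer/string flags currently set on a scalar.
sub flags_of {
    my ($ref) = @_;
    my $flags = B::svref_2object($ref)->FLAGS;
    my @names;
    push @names, 'IOK' if $flags & B::SVf_IOK();   # holds a valid integer
    push @names, 'POK' if $flags & B::SVf_POK();   # holds a valid string
    return join ',', @names;
}

my $index = "1";                       # a string, standing in for a copy of $1
my $before_hex   = unpack 'H*', $index;
my $before_flags = flags_of(\$index);

$index--; $index++;                    # numeric round trip

my $after_flags = flags_of(\$index);   # read the flags before stringifying again
my $after_hex   = unpack 'H*', $index;

print "hex:   $before_hex -> $after_hex\n";
print "flags: $before_flags -> $after_flags\n";
```

Devel::Peek's Dump function gives a fuller view of the same information if more detail is wanted.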
 
Dr.Ruud

(e-mail address removed) wrote:
> this is just a small part of my program, so i hope i've shown enough
> to isolate the bug.

You haven't. You could try again.
 
nitroamos

Hello -- I totally agree that it is unlikely that there is a bug in
perl itself, and so that's why I'm posting -- because I'm relatively
inexperienced with perl and I think there is something important to
learn here. On the other hand, I have enough experience with cross
platform development that I know that in general compiler bugs can't be
ruled out. Further, my perl debugging skills are a bit weak... I don't
even know what kinds of problems to look for. Also, I almost never post
online because I'm almost always able to figure out the problems on my
own or with google's help, so I'm sorry if my protocol is wrong.
However, this one has eluded me.

Also, sorry for not providing enough information. I had hoped that the
sample output would be enough, but since it's not, I went ahead and put
my whole program online (see line 106). I debated whether I should try
to extract a new program; I hope this is ok:
http://www.wag.caltech.edu/QMcBeaver/jag_recorr_all
which requires this input file:
http://www.wag.caltech.edu/QMcBeaver/ne_pw91.01.in
(I won't keep those files up there forever...)

A short description about what has happened by the time line 106 is
reached: Basically the point of the program is to read in some of those
orbitals and then spit some of them back out combinatorially into new
files. By line 106, the first 5 orbitals (for example the 20 lines of
numbers including a header line from the sample input) are stored in
$orbitals[$i], so this pattern:

$orbitals[$i] =~ /\s+(\d+)\s+Orbital Energy/;

should grab the first integer from this part of that string (i've
replaced the other 19 lines of numbers with "..."):
1 Orbital Energy -32.554095 Occupation 1.000000
....

And then because I want to preserve that integer, I save it with $index =
$1; so that I can use it in later pattern matches. I have a set of
combinatorially produced strings that look like "_1_3_4_5" and I want
to find out if the integer that I matched from the orbital string above
is contained in the combinatorial strings. The combinatorial strings
are a sequence of "_Integer", so once I have an integer from an
orbital, i have to check for "_Integer_" and "_Integer$" since I don't
want orbital "2" to match both "_1_2_3" and "_1_3_24". Now I realize
that I could change the pattern in my combinatorial string, but I've
already invested enough time in this particular bug (in my code) that I
want to know what the problem is.
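
For reference, the two checks described above ("_Integer_" and "_Integer$") can be folded into one pattern. This is a minimal sketch, not the code from the posted program; has_index is a hypothetical helper name, and it uses \z (absolute end of string) where the original uses $:

```perl
use strict;
use warnings;

# True if "_$index" appears in $element as a whole field, i.e. it is
# followed either by another "_" or by the end of the string.
sub has_index {
    my ($element, $index) = @_;
    return $element =~ /_\Q$index\E(?:_|\z)/ ? 1 : 0;
}

for my $element ('_1_2_3', '_1_3_24') {
    printf "%-8s contains 2: %s\n", $element,
        has_index($element, 2) ? 'yes' : 'no';
}
```

With these two sample strings, orbital 2 is found in "_1_2_3" but not in "_1_3_24".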

I would be happy to run any recommended test with my code. I fixed the
integer vs single digit bug that Paul Lalli pointed out (thanks!), but
that was not the source of the problem. Also, regarding my
understanding of regex and regarding those couple of lines, I have
played around extensively with printing out different combinations of
$i, $index, $1, etc and all of them are printing exactly as I would
expect. The only deviation from what I expect is that anchored pattern
matching.

Here is what Klaus recommended I try:
print "before: \$index = x'", unpack('H*', $index), "'\n";
$index--; $index++;
print "after : \$index = x'", unpack('H*', $index), "'\n";

I very slightly modified it to save space, and I've only pasted a
sample of the output at the bottom. As you can see, by commenting out
that one line, I'm getting different matching results even though
unpacking shows the same thing before and after. Regarding locale
settings, I don't know how to check... but I don't think there are any
strange characters in my strings because I tried looking for them using
pattern matching. Also, if I print out length "$index", I get a
length of 1 in all cases -- before and after.

I think the most telling piece of evidence is that both this pattern:
if($element =~ m/^_$index$/){
and this pattern:
if($element =~ m/^_$index/){
correctly produce these kinds of matches:
_1 matches _1 with index = 1
_2 matches _2 with index = 2
_3 matches _3 with index = 3
printed with this:
print "$element matches $& with index = $index\n";

whereas this pattern:
if($element =~ m/_$index$/){
will not produce those matches.
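
The three pattern variants above can be isolated in a standalone script, which is the kind of reduced test case requested earlier in the thread. This is a sketch under assumptions: the header lines in @orbitals and the strings in @elements are made up, and the index is taken from a capture so that it starts life as a string, as in the original program:

```perl
use strict;
use warnings;

my @elements = qw(_1 _2 _3 _1_2 _1_3 _2_3);
my @orbitals = map { "    $_ Orbital Energy" } 1 .. 3;   # made-up header lines

# For each pattern variant, collect the "element/index" pairs that match.
my %hits = map { $_ => [] } ('^_$index$', '^_$index', '_$index$');

for my $orbital (@orbitals) {
    my ($index) = $orbital =~ /\s+(\d+)\s+Orbital Energy/ or next;
    for my $element (@elements) {
        push @{ $hits{'^_$index$'} }, "$element/$index" if $element =~ m/^_$index$/;
        push @{ $hits{'^_$index'}  }, "$element/$index" if $element =~ m/^_$index/;
        push @{ $hits{'_$index$'}  }, "$element/$index" if $element =~ m/_$index$/;
    }
}

for my $variant (sort keys %hits) {
    print "$variant: @{ $hits{$variant} }\n";
}
```

If the last variant reports fewer matches on one machine than on another, the script is small enough to post in full.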


Thanks!

Amos.




****** Results from Klaus' test ******

Here are partial results from:
print "index=$index before: \$index = x'", unpack('H*', $index), "' ";
$index--; $index++;
print "after : \$index = x'", unpack('H*', $index), "'\n";



index=1 before: $index = x'31' after : $index = x'31'
index=2 before: $index = x'32' after : $index = x'32'
_2_3_4_5 matches _2_ with index = 2
index=3 before: $index = x'33' after : $index = x'33'
_2_3_4_5 matches _3_ with index = 3
index=4 before: $index = x'34' after : $index = x'34'
_2_3_4_5 matches _4_ with index = 4
index=5 before: $index = x'35' after : $index = x'35'
_2_3_4_5 matches _5 with index = 5
index=1 before: $index = x'31' after : $index = x'31'
_1_2_3_4_5 matches _1_ with index = 1
index=2 before: $index = x'32' after : $index = x'32'
_1_2_3_4_5 matches _2_ with index = 2
index=3 before: $index = x'33' after : $index = x'33'
_1_2_3_4_5 matches _3_ with index = 3
index=4 before: $index = x'34' after : $index = x'34'
_1_2_3_4_5 matches _4_ with index = 4
index=5 before: $index = x'35' after : $index = x'35'
_1_2_3_4_5 matches _5 with index = 5

and with the idiocy line commented out:
print "index=$index before: \$index = x'", unpack('H*', $index), "' ";
#$index--; $index++;
print "after : \$index = x'", unpack('H*', $index), "'\n";
I get (note that none of the $& printed have the form "_Integer"):
index=1 before: $index = x'31' after : $index = x'31'
index=2 before: $index = x'32' after : $index = x'32'
_2_3_4_5 matches _2_ with index = 2
index=3 before: $index = x'33' after : $index = x'33'
_2_3_4_5 matches _3_ with index = 3
index=4 before: $index = x'34' after : $index = x'34'
_2_3_4_5 matches _4_ with index = 4
index=5 before: $index = x'35' after : $index = x'35'
index=1 before: $index = x'31' after : $index = x'31'
_1_2_3_4_5 matches _1_ with index = 1
index=2 before: $index = x'32' after : $index = x'32'
_1_2_3_4_5 matches _2_ with index = 2
index=3 before: $index = x'33' after : $index = x'33'
_1_2_3_4_5 matches _3_ with index = 3
index=4 before: $index = x'34' after : $index = x'34'
_1_2_3_4_5 matches _4_ with index = 4
index=5 before: $index = x'35' after : $index = x'35'
 
Tad McClellan

> And then because I want to preserve that integer, I save it $index =
> $1; so that I can use it in later pattern matches. I have a set of
> combinatorially produced strings that look like "_1_3_4_5" and I want
> to find out if the integer that I matched from the orbital string above
> is contained in the combinatorial strings.


print "$index is contained in $orbital\n"
if grep $_ == $index, split /_/, $orbital;
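
A runnable variant of the same split-and-compare idea, as a sketch. The name contains_index is hypothetical, and the combinatorial string is called $element here to match the original post. Because the strings start with "_", split returns an empty leading field, which is skipped so that the numeric comparison stays quiet under warnings:

```perl
use strict;
use warnings;

# Count the underscore-separated fields of $element that equal $index.
sub contains_index {
    my ($element, $index) = @_;
    # "_1_3" splits into ("", "1", "3"); ignore the empty leading field.
    return scalar grep { length($_) && $_ == $index } split /_/, $element;
}

print "2 is contained in _1_2_3\n"  if contains_index('_1_2_3', 2);
print "2 is contained in _1_3_24\n" if contains_index('_1_3_24', 2);   # not printed
```

This avoids anchors entirely, since each field is compared as a number.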
 
axel

> Also, sorry for not providing enough information. I had hoped that the
> sample output would be enough, but since it's not, I went ahead and put

How could it be, since we did not know the input?

> my whole program online (see line 106). I debated whether I should try
> to extract a new program; I hope this is ok:

No it's not ok... this is Usenet, you should not expect us to fire
up a browser in order to look at and give advice on your problems...
the common courtesy would be to follow the guidelines of the group.

> http://www.wag.caltech.edu/QMcBeaver/jag_recorr_all
> which requires this input file:
> http://www.wag.caltech.edu/QMcBeaver/ne_pw91.01.in
> (I won't keep those files up there forever...)

Oh... a deadline for us to look at the problem then.

'nuff said.

Axel
 
CsB

You should use caution when capturing into variables from regular
expressions when it is not necessary.

There were several instances in your script where you're using parentheses
to capture ($1, $2, etc.) but you are not actually using (or even needing)
that data.

I removed all unnecessary variable assignments from your regular
expressions, commented out the "$index--; $index++;" line and all seems
to work well. At least it matches your posting for what good results
should look like.

Since it seems to work well now, the extraneous variable assignments
must have been trampling your data. How the "$index--; $index++;"
controlled that problem, I haven't a clue...

Here are the changes I made to the script you made available via the
web:

30c30
< if ($line =~ /(\s+)(\d+) Orbital Energy\s+([\-0-9.]+)\s+Occupation\s+([\-0-9.]+)/){
---
> if ( $line =~ /\s+(\d+)\sOrbital\sEnergy\s+([\-0-9.]+)\s+Occupation\s+([\-0-9.]+)/ ) {

32,34c32,34
< if($4 > 0.0){
< $num_protons += 2.0*$4;
< print "Orbital $2 has occupation $4 and energy $3\n";
---
> if($3 > 0.0){
> $num_protons += 2.0*$3;
> print "Orbital $1 has occupation $3 and energy $2\n";

43c43
< if ($line =~ /(\s+)(\d+) Orbital Energy\s+([\-0-9.]+)\s+Occupation\s+([\-0-9.]+)/){
---
> if ( $line =~ /\s+\d+\sOrbital\sEnergy\s+[\-0-9.]+\s+Occupation\s+[\-0-9.]+/ ) {

53c53
< if ($line =~ /(\s+)(\d+) Orbital Energy\s+([\-0-9.]+)\s+Occupation\s+([\-0-9.]+)/){
---
> if ( $line =~ /\s+\d+\sOrbital\sEnergy\s+[\-0-9.]+\s+Occupation\s+[\-0-9.]+/ ) {

105,107c105,107
< print "index=$index before: \$index = x'", unpack('H*', $index), "' ";
< $index--; $index++;
< print "after : \$index = x'", unpack('H*', $index), "'\n";
---
> #print "index=$index before: \$index = x'", unpack('H*', $index), "' ";
> #$index--; $index++;
> #print "after : \$index = x'", unpack('H*', $index), "'\n";

If you should have any questions or comments, let me know.
 
nitroamos

I tried your recommended fixes on the machine where I was having the
problem, and they don't seem to be making a difference. Based on the
suspicion that something is system dependent, I just tried the original
script on an OSX machine, where:

"This is perl, v5.8.6 built for darwin-thread-multi-2level"

And it seems that my original script works there. That is, whatever I
was seeing appears to have been a bug in perl that was fixed at some
point. Thus, it now seems to me a waste of time to attempt further
diagnosis.

My conclusion is that (ignoring TIMTOWTDI) sometimes even perl can have
strange bugs -- especially older versions. Fortunately, perl is diverse
enough that there are always ways around the problem.
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to

> $index--; $index++;
>
> if($element =~ /_${index}_/ or $element =~ m/_$index$/){
>
> for some reason, without the $index--; $index++; business, my second
> pattern is not matching.
>
> Summary of my perl5 (revision 5.0 version 8 subversion 0)
> configuration:

So this is 5.8.0? Its REx engine is very buggy. For best results,
upgrade.
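
A script can check which interpreter it is running under before relying on a workaround. In this sketch the 5.8.6 threshold is an assumption: it is used only because that is the version reported in this thread to run the script correctly, and the exact release where the behaviour changed is not known. The name older_than_known_good is hypothetical:

```perl
use strict;
use warnings;

# $] holds the interpreter version as a number, e.g. 5.008006 for 5.8.6.
# Assumption: 5.8.6 is chosen only because the thread reports it working.
my $known_good = 5.008006;

sub older_than_known_good {
    my ($version) = @_;
    return $version < $known_good ? 1 : 0;
}

printf "running perl %s\n", $];
print "older than the version reported to work in this thread\n"
    if older_than_known_good($]);
```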

Hope this helps,
Ilya
 
