perl + regex bug?

Discussion in 'Perl Misc' started by nitroamos@gmail.com, Jul 21, 2006.

  1. Guest

    hello -- i've been spending quite a bit of time trying to figure out
    this issue, and now that i've found a workaround, i'm wondering if the
    problem was a bug. basically, i have some code that looks like this:

    $orbitals[$i] =~ /\s+(\d)\s+Orbital Energy/;
    $index = $1;
    $index--; $index++;

    if($element =~ /_${index}_/ or $element =~ m/_$index$/){
    print "$element matches $& with index = $index\n";


    }

    this is just a small part of my program, so i hope i've shown enough to
    isolate the bug. Basically, I'm trying to grab an integer out of a
    string, and then look in another string to see if i have a match. i've
    pasted what the output looks like at the bottom. as you can see, the
    incorrect results are producing a subset of the matches I expect --
    only the first pattern on the if line is matching.

    for some reason, without the $index--; $index++; business, my second
    pattern is not matching. after fooling around, it seems that somehow
    it's entirely related to the $ anchor. although not what I want, if I
    add a ^ anchor i can get matches with or without the $ anchor. so
    somehow the $ anchor all by itself is not working unless i do the --/++
    business.

    what kind of weirdness is this? i've pasted perl -V at the bottom. if
    this is some version specific bug, then that's all i need to know; i'm
    ok with the workaround. but if there's something that i'm not
    understanding, then i want to know what i'm missing.

    thanks!



    results where everything is working as expected (correctly):
    _1 matches _1 with index = 1
    _2 matches _2 with index = 2
    _3 matches _3 with index = 3
    _4 matches _4 with index = 4
    _5 matches _5 with index = 5
    _1_2 matches _1_ with index = 1
    _1_2 matches _2 with index = 2
    _1_3 matches _1_ with index = 1
    _1_3 matches _3 with index = 3
    _2_3 matches _2_ with index = 2
    _2_3 matches _3 with index = 3
    _1_4 matches _1_ with index = 1
    _1_4 matches _4 with index = 4
    _2_4 matches _2_ with index = 2
    _2_4 matches _4 with index = 4
    _3_4 matches _3_ with index = 3
    _3_4 matches _4 with index = 4
    _1_5 matches _1_ with index = 1
    _1_5 matches _5 with index = 5
    _2_5 matches _2_ with index = 2
    _2_5 matches _5 with index = 5
    _3_5 matches _3_ with index = 3
    _3_5 matches _5 with index = 5
    _4_5 matches _4_ with index = 4
    _4_5 matches _5 with index = 5

    results where the "$index--; $index++;" line has been commented out
    (incorrect results):
    _1_2 matches _1_ with index = 1
    _1_3 matches _1_ with index = 1
    _2_3 matches _2_ with index = 2
    _1_4 matches _1_ with index = 1
    _2_4 matches _2_ with index = 2
    _3_4 matches _3_ with index = 3
    _1_5 matches _1_ with index = 1
    _2_5 matches _2_ with index = 2
    _3_5 matches _3_ with index = 3
    _4_5 matches _4_ with index = 4


    here is what perl -V says:
    Summary of my perl5 (revision 5.0 version 8 subversion 0)
    configuration:
    Platform:
    osname=linux, osvers=2.4.21-1.1931.2.382.entsmp,
    archname=i386-linux-thread-multi
    uname='linux str'
    config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686
    -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red
    Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux
    -Dvendorprefix=/usr -Dsiteprefix=/usr
    -Dotherlibdirs=/usr/lib/perl5/5.8.0 -Duseshrplib -Dusethreads
    -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db
    -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio
    -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less
    -isr'
    hint=recommended, useposix=true, d_sigaction=define
    usethreads=define use5005threads=undef'
    useithreads=define usemultiplicity=
    useperlio= d_sfio=undef uselargefiles=define usesocks=undef
    use64bitint=undef use64bitall=un uselongdouble=
    usemymalloc=, bincompat5005=undef
    Compiler:
    cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS
    -DDEBUGGING -fno-strict-aliasing -I/usr/local/include
    -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
    optimize='',
    cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS
    -DDEBUGGING -fno-strict-aliasing -I/usr/local/include
    -I/usr/include/gdbm'
    ccversion='', gccversion='3.2.2 20030222 (Red Hat Linux 3.2.2-5)',
    gccosandvers=''
    gccversion='3.2.2 200302'
    intsize=r, longsize=r, ptrsize=5, doublesize=8, byteorder=1234
    d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
    ivtype='long'
    k', ivsize=4'
    ivtype='l, nvtype='double'
    o_nonbl', nvsize=, Off_t='', lseeksize=8
    alignbytes=4, prototype=define
    Linker and Libraries:
    ld='gcc'
    l', ldflags =' -L/u'
    libpth=/usr/local/lib /lib /usr/lib
    libs=-lnsl -lgdbm -ldb -ldl -lm -lpthread -lc -lcrypt -lutil
    perllibs=
    libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libper
    gnulibc_version='2.3.2'
    Dynamic Linking:
    dlsrc=dl_dlopen.xs, dlext=so', d_dlsymun=undef,
    ccdlflags='-rdynamic
    -Wl,-rpath,/usr/lib/perl5/5.8.0/i386-linux-thread-multi/CORE'
    cccdlflags='-fPIC'
    ccdlflags='-rdynamic -Wl,-rpath,/usr/lib/perl5', lddlflags='s
    Unicode/Normalize XS/A'


    Characteristics of this binary (from libperl):
    Compile-time options: DEBUGGING MULTIPLICITY USE_ITHREADS
    USE_LARGE_FILES PERL_IMPLICIT_CONTEXT
    Locally applied patches:
    MAINT18379
    Built under linux
    Compiled at Aug 13 2003 11:47:58
    @INC:
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
    /usr/lib/perl5/site_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/site_perl/5.8.0
    /usr/lib/perl5/site_perl
    /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/vendor_perl/5.8.0
    /usr/lib/perl5/vendor_perl
    /usr/lib/perl5/5.8.0/i386-linux-thread-multi
    /usr/lib/perl5/5.8.0
     
    , Jul 21, 2006
    #1

  2. Paul Lalli Guest

    wrote:
    > hello -- i've been spending quite a bit of time trying to figure out
    > this issue, and now that i've found a workaround, i'm wondering if the
    > problem was a bug.


    So there's two possibilities: You did something wrong, or Perl has a
    bug that has never been detected or repaired. It is exceedingly
    arrogant to guess that the latter has a greater chance than the former,
    IMHO.

    > basically, i have some code that looks like this:


    Please reduce your *real* code to the smallest possible script that
    demonstrates the error, and yet is still a complete script we can run.

    >
    > $orbitals[$i] =~ /\s+(\d)\s+Orbital Energy/;


    And what is the value of $i? What are the values in @orbitals?

    > $index = $1;


    NEVER use $1, $2, $3 etc without first assuring that the pattern match
    succeeded. If the above pattern did not match, $1 will be whatever it
    was after the last successful pattern match. If either there was no
    prior successful match, or that match did not contain any capturing
    parentheses, $1 will be undef.
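
The advice about guarding $1 can be sketched as follows. This is a minimal illustration, not the poster's actual code: the helper name extract_index and the sample strings are assumptions made for the example, and the pattern uses \d+ so that multi-digit indices are captured.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captured orbital index, or undef
# when the header pattern does not match.
sub extract_index {
    my ($text) = @_;
    # In list context a match returns its captures, so $index is only
    # set when the match succeeds; a stale $1 is never read.
    if (my ($index) = $text =~ /\s+(\d+)\s+Orbital Energy/) {
        return $index;
    }
    return;
}

# Sample strings are made up for illustration.
for my $text ("   12 Orbital Energy -32.554095", "no header here") {
    my $index = extract_index($text);
    print defined $index ? "index = $index\n" : "no match, skipping\n";
}
```

Taking the capture from the match's return list keeps the success test and the assignment in one place, so there is no window in which an old $1 can leak through.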

    > $index--; $index++;


    This is a sure sign that you're doing something wrong, but don't
    understand what, so you're throwing random code at it until it almost
    works. Find and fix the *real* problem.

    >
    > if($element =~ /_${index}_/ or $element =~ m/_$index$/){


    And now what is in $element? How can you expect us to interpret your
    results without telling us what these three critical pieces of
    information are?

    > print "$element matches $& with index = $index\n";
    >
    >
    > }
    >
    > this is just a small part of my program, so i hope i've shown enough to
    > isolate the bug.


    Again the assumption that Perl must be wrong instead of you. And no,
    there isn't enough code above for us to see what you've done wrong.

    > Basically, I'm trying to grab an integer out of a
    > string,


    No, you're trying to grab a single digit out of a string. Integers
    include numbers such as "100", "42", and "8749312". Yours will grab
    only 0 through 9. Perhaps that's your mistake? We can't possibly know
    without seeing the original strings you were trying to match.

    > and then look in another string to see if i have a match. i've
    > pasted what the output looks like at the bottom. as you can see, the
    > incorrect results are producing a subset of the matches I expect --
    > only the first pattern on the if line is matching.
    >
    > for some reason, without the $index--; $index++; business, my second
    > pattern is not matching.


    Here's my guess - about half the time, your pattern doesn't match
    because you have an integer greater than 9 in your string. Thus,
    $index becomes undefined when you assign $1 to it. You're not using
    warnings, so Perl doesn't bother telling you that you've included an
    undef inside your pattern match, and instead treats it as the empty
    string. When you do the $index--; $index++; idiocy, Perl is forced to
    treat the undef as the integer 0 instead.

    As I said, this is a complete guess. Without seeing your actual data,
    there is no way of *knowing* what is happening.
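
The scenario guessed at here can be reproduced in isolation. The sketch below assumes a capture that was never set (all values are made up) and collects warnings through a $SIG{__WARN__} handler so they can be inspected. It illustrates the guess only; the original poster later reports that $index printed as expected, so this was not the cause of the problem in the thread.

```perl
use strict;
use warnings;

my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };  # collect instead of printing

my $element = "_1_2";   # made-up test string
my $index;              # undef, as if the capture had never been set

# Under "use warnings", interpolating undef into a pattern warns; the
# pattern itself becomes /__/, which does not match "_1_2".
my $matched = ($element =~ /_${index}_/) ? 1 : 0;

# Decrement and increment do not warn on undef; they leave the number 0.
$index--;
$index++;

print "matched: $matched\n";
print "index after --/++: $index\n";
print "warnings seen: ", scalar(@warnings), "\n";
```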

    > after fooling around, it seems that somehow
    > it's entirely related to the $ anchor. although not what I want, if I
    > add a ^ anchor i can get matches with or without the $ anchor. so
    > somehow the $ anchor all by itself is not working unless i do the --/++
    > business.


    I find that remarkably unlikely, and am far more willing to bet you
    have a bug in your diagnostic process.

    > what kind of weirdness is this?


    The weirdness that happens when you don't program with warnings, don't
    check the return values of your pattern matches, and don't show
    complete data when asking for help.

    > i've pasted perl -V at the bottom. if
    > this is some version specific bug, then that's all i need to know; i'm
    > ok with the workaround. but if there's something that i'm not
    > understanding, then i want to know what i'm missing.


    We can't tell you that, because you haven't given enough information.

    Please read the Posting Guidelines for this group. They will give you
    all sorts of hints on how to best ask questions in this and other
    technical forums.

    Paul Lalli
     
    Paul Lalli, Jul 21, 2006
    #2

  3. Klaus Guest

    wrote:
    > hello -- i've been spending quite a bit of time trying to figure out
    > this issue, and now that i've found a workaround, i'm wondering if the
    > problem was a bug. basically, i have some code that looks like this:
    >
    > $orbitals[$i] =~ /\s+(\d)\s+Orbital Energy/;
    > $index = $1;
    > $index--; $index++;
    >
    > if($element =~ /_${index}_/ or $element =~ m/_$index$/){
    > print "$element matches $& with index = $index\n";


    I suspect that something might be wrong with $index not being numeric
    (although from your regex /...(\d).../ that seems impossible, or do you
    have some unexpected locale settings / character encoding / UTF-8 /
    UTF-16, etc... that have numeric characters other than '0'...'9' ? )

    Anyway, try printing the content of "$index" in hex before and after
    the "$index--; $index++;", like so...

    print "before: \$index = x'", unpack('H*', $index), "'\n";
    $index--; $index++;
    print "after : \$index = x'", unpack('H*', $index), "'\n";

    ...and tell us what you get.
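
The hex dump above shows only the bytes of $index. Two scalars can hold the same bytes yet differ in Perl's internal UTF-8 flag, so a slightly extended check might report both. This is a sketch under assumptions: describe_scalar is a hypothetical helper name, and nothing in the thread establishes that the flag is involved in the reported behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of a scalar's string value plus its
# internal UTF-8 flag (utf8::is_utf8 is a core function).
sub describe_scalar {
    my ($value) = @_;
    my $hex  = unpack 'H*', $value;
    my $flag = utf8::is_utf8($value) ? 1 : 0;
    return "x'$hex' utf8=$flag";
}

my $index = "1";                       # stand-in for the captured value
print "before: ", describe_scalar($index), "\n";
$index--; $index++;
print "after : ", describe_scalar($index), "\n";
```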
     
    Klaus, Jul 22, 2006
    #3
  4. Dr.Ruud Guest

    schreef:

    > this is just a small part of my program, so i hope i've shown enough
    > to isolate the bug.


    You haven't. You could try again.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jul 22, 2006
    #4
  5. Guest

    Hello -- I totally agree that it is unlikely that there is a bug in
    perl itself, and so that's why I'm posting -- because I'm relatively
    inexperienced with perl and I think there is something important to
    learn here. On the other hand, I have enough experience with cross
    platform development that I know that in general compiler bugs can't be
    ruled out. Further, my perl debugging skills are a bit weak... I don't
    even know what kinds of problems to look for. Also, I almost never post
    online because I'm almost always able to figure out the problems on my
    own or with google's help, so I'm sorry if my protocol is wrong.
    However, this one has eluded me.

    Also, sorry for not providing enough information. I had hoped that the
    sample output would be enough, but since it's not, I went ahead and put
    my whole program online (see line 106). I debated whether I should try
    to extract a new program; I hope this is ok:
    http://www.wag.caltech.edu/QMcBeaver/jag_recorr_all
    which requires this input file:
    http://www.wag.caltech.edu/QMcBeaver/ne_pw91.01.in
    (I won't keep those files up there forever...)

    A short description about what has happened by the time line 106 is
    reached: Basically the point of the program is to read in some of those
    orbitals and then spit some of them back out combinatorially into new
    files. By line 106, the first 5 orbitals (for example the 20 lines of
    numbers including a header line from the sample input) are stored in
    $orbitals[$i], so this pattern:

    $orbitals[$i] =~ /\s+(\d+)\s+Orbital Energy/;

    should grab the first integer from this part of that string (i've
    replaced the other 19 lines of numbers with "..."):
    1 Orbital Energy -32.554095 Occupation 1.000000
    ...

    And then because I want to preserve that integer, I save it $index =
    $1; so that I can use it in later pattern matches. I have a set of
    combinatorially produced strings that look like "_1_3_4_5" and I want
    to find out if the integer that I matched from the orbital string above
    is contained in the combinatorial strings. The combinatorial strings
    are a sequence of "_Integer", so once I have an integer from an
    orbital, i have to check for "_Integer_" and "_Integer$" since I don't
    want orbital "2" to match both "_1_2_3" and "_1_3_24". Now I realize
    that I could change the pattern in my combinatorial string, but I've
    already invested enough time in this particular bug (in my code) that I
    want to know what the problem is.
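
The two checks described above ("_Integer_" and "_Integer$") can also be folded into one pattern. A minimal sketch, assuming the combinatorial strings contain only "_" and digits; contains_index is a hypothetical name, not a function from the poster's script.

```perl
use strict;
use warnings;

sub contains_index {
    my ($element, $index) = @_;
    # \Q...\E quotes the interpolated value; (?:_|\z) requires the number
    # to be followed by another "_" or by the absolute end of the string.
    return $element =~ /_\Q$index\E(?:_|\z)/ ? 1 : 0;
}

# Made-up strings in the "_Integer" format described above.
for my $element ("_1_2_3", "_1_3_24") {
    printf "%-8s contains 2: %s\n", $element,
        contains_index($element, 2) ? "yes" : "no";
}
```

With a single pattern there is only one match to inspect, which also makes any later debugging of the anchor behaviour simpler.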

    I would be happy to run any recommended test with my code. I fixed the
    integer vs single digit bug that Paul Lalli pointed out (thanks!), but
    that was not the source of the problem. Also, regarding my
    understanding of regex and regarding those couple of lines, I have
    played around extensively with printing out different combinations of
    $i, $index, $1, etc and all of them are printing exactly as I would
    expect. The only deviation from what I expect is that anchored pattern
    matching.

    Here is what Klaus recommended I try:
    >
    > print "before: \$index = x'", unpack('H*', $index), "'\n";
    > $index--; $index++;
    > print "after : \$index = x'", unpack('H*', $index), "'\n";


    I very slightly modified it to save space, and I've only pasted a
    sample of the output at the bottom. As you can see, by commenting out
    that one line, I'm getting different matching results even though
    unpacking shows the same thing before and after. Regarding locale
    settings, I don't know how to check... but I don't think there are any
    strange characters in my strings because I tried looking for them using
    pattern matching. Also, if I additionally print length "$index", I get
    a length of 1 in all cases -- before and after.

    I think the most telling piece of evidence is that both this pattern:
    if($element =~ m/^_$index$/){
    and this pattern:
    if($element =~ m/^_$index/){
    correctly produce these kinds of matches:
    _1 matches _1 with index = 1
    _2 matches _2 with index = 2
    _3 matches _3 with index = 3
    printed with this:
    print "$element matches $& with index = $index\n";

    whereas this pattern:
    if($element =~ m/_$index$/){
    will not produce those matches.


    Thanks!

    Amos.




    ****** Results from Klaus' test ******

    Here are partial results from:
    print "index=$index before: \$index = x'", unpack('H*',
    $index), "' ";

    $index--; $index++;


    print "after : \$index = x'", unpack('H*', $index), "'\n";



    index=1 before: $index = x'31' after : $index = x'31'
    index=2 before: $index = x'32' after : $index = x'32'
    _2_3_4_5 matches _2_ with index = 2
    index=3 before: $index = x'33' after : $index = x'33'
    _2_3_4_5 matches _3_ with index = 3
    index=4 before: $index = x'34' after : $index = x'34'
    _2_3_4_5 matches _4_ with index = 4
    index=5 before: $index = x'35' after : $index = x'35'
    _2_3_4_5 matches _5 with index = 5
    index=1 before: $index = x'31' after : $index = x'31'
    _1_2_3_4_5 matches _1_ with index = 1
    index=2 before: $index = x'32' after : $index = x'32'
    _1_2_3_4_5 matches _2_ with index = 2
    index=3 before: $index = x'33' after : $index = x'33'
    _1_2_3_4_5 matches _3_ with index = 3
    index=4 before: $index = x'34' after : $index = x'34'
    _1_2_3_4_5 matches _4_ with index = 4
    index=5 before: $index = x'35' after : $index = x'35'
    _1_2_3_4_5 matches _5 with index = 5

    and with the idiocy line commented out:
    print "index=$index before: \$index = x'", unpack('H*',
    $index), "' ";

    #$index--; $index++;


    print "after : \$index = x'", unpack('H*', $index), "'\n";
    I get (note that none of the $& printed have the form "_Integer"):
    index=1 before: $index = x'31' after : $index = x'31'
    index=2 before: $index = x'32' after : $index = x'32'
    _2_3_4_5 matches _2_ with index = 2
    index=3 before: $index = x'33' after : $index = x'33'
    _2_3_4_5 matches _3_ with index = 3
    index=4 before: $index = x'34' after : $index = x'34'
    _2_3_4_5 matches _4_ with index = 4
    index=5 before: $index = x'35' after : $index = x'35'
    index=1 before: $index = x'31' after : $index = x'31'
    _1_2_3_4_5 matches _1_ with index = 1
    index=2 before: $index = x'32' after : $index = x'32'
    _1_2_3_4_5 matches _2_ with index = 2
    index=3 before: $index = x'33' after : $index = x'33'
    _1_2_3_4_5 matches _3_ with index = 3
    index=4 before: $index = x'34' after : $index = x'34'
    _1_2_3_4_5 matches _4_ with index = 4
    index=5 before: $index = x'35' after : $index = x'35'
     
    , Jul 22, 2006
    #5
  6. <> wrote:

    > And then because I want to preserve that integer, I save it $index =
    > $1; so that I can use it in later pattern matches. I have a set of
    > combinatorially produced strings that look like "_1_3_4_5" and I want
    > to find out if the integer that I matched from the orbital string above
    > is contained in the combinatorial strings.



    print "$index is contained in $orbital\n"
    if grep $_ == $index, split /_/, $orbital;
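
A runnable variant of the split/grep idea above, as a sketch: has_index is a hypothetical helper name and the test strings are made up. Because the strings start with "_", split returns a leading empty field, which is filtered out here so the numeric comparison does not warn under "use warnings".

```perl
use strict;
use warnings;

sub has_index {
    my ($element, $index) = @_;
    # Drop the empty field that split produces before the first "_".
    my @fields = grep { $_ ne '' } split /_/, $element;
    # In scalar context grep returns the number of matching fields.
    return scalar grep { $_ == $index } @fields;
}

for my $element ("_1_3_4_5", "_1_3_24") {
    print "$element contains 3\n" if has_index($element, 3);
    print "$element contains 2\n" if has_index($element, 2);
}
```

Comparing whole fields numerically sidesteps the anchoring question entirely, since "2" is never compared against part of "24".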


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jul 22, 2006
    #6
  7. Guest

    wrote:
    > Also, sorry for not providing enough information. I had hoped that the
    > sample output would be enough, but since it's not, I went ahead and put


    How could it be, since we did not know the input?

    > my whole program online (see line 106). I debated whether I should try
    > to extract a new program; I hope this is ok:


    No, it's not ok... this is Usenet, you should not expect us to fire
    up a browser in order to look at and give advice on your problems...
    the common courtesy would be to follow the guidelines of the group.

    > http://www.wag.caltech.edu/QMcBeaver/jag_recorr_all
    > which requires this input file:
    > http://www.wag.caltech.edu/QMcBeaver/ne_pw91.01.in
    > (I won't keep those files up there forever...)


    Oh... a deadline for us to look at the problem then.

    'nuff said.

    Axel
     
    , Jul 24, 2006
    #7
  8. CsB Guest

    You should use caution when assigning variables from a regular
    expression when it is not necessary.

    There were several instances in your script where you're using
    parentheses to capture ($1, $2, etc.) but you are not actually using
    (or even needing) that data.

    I removed all unnecessary variable assignments from your regular
    expressions, commented out the "$index--; $index++;" line and all seems
    to work well. At least it matches your posting for what good results
    should look like.

    Since it seems to work well now, the extraneous variable assignments
    must have been trampling your data. How the "$index--; $index++;"
    controlled that problem, I haven't a clue...

    Here are the changes I made to the script you made available via the
    web:

    30c30
    < if ($line =~ /(\s+)(\d+) Orbital Energy\s+([\-0-9.]+)\s+Occupation\s+([\-0-9.]+)/){
    ---
    > if ( $line =~ /\s+(\d+)\sOrbital\sEnergy\s+([\-0-9.]+)\s+Occupation\s+([\-0-9.]+)/ ) {


    32,34c32,34
    < if($4 > 0.0){
    < $num_protons += 2.0*$4;
    < print "Orbital $2 has occupation $4 and energy $3\n";
    ---
    > if($3 > 0.0){
    > $num_protons += 2.0*$3;
    > print "Orbital $1 has occupation $3 and energy $2\n";


    43c43
    < if ($line =~ /(\s+)(\d+) Orbital Energy\s+([\-0-9.]+)\s+Occupation\s+([\-0-9.]+)/){
    ---
    > if ( $line =~ /\s+\d+\sOrbital\sEnergy\s+[\-0-9.]+\s+Occupation\s+[\-0-9.]+/ ) {


    53c53
    < if ($line =~ /(\s+)(\d+) Orbital Energy\s+([\-0-9.]+)\s+Occupation\s+([\-0-9.]+)/){
    ---
    > if ( $line =~ /\s+\d+\sOrbital\sEnergy\s+[\-0-9.]+\s+Occupation\s+[\-0-9.]+/ ) {


    105,107c105,107
    < print "index=$index before: \$index = x'", unpack('H*', $index), "' ";
    < $index--; $index++;
    < print "after : \$index = x'", unpack('H*', $index), "'\n";
    ---
    > #print "index=$index before: \$index = x'", unpack('H*', $index), "' ";
    > #$index--; $index++;
    > #print "after : \$index = x'", unpack('H*', $index), "'\n";


    If you should have any questions or comments, let me know.
     
    CsB, Jul 25, 2006
    #8
  9. Guest

    I tried your recommended fixes on the machine where I was having the
    problem, and they don't seem to be making a difference. Based on the
    suspicion that something is system dependent, I just tried the original
    script on an OSX machine, where:

    "This is perl, v5.8.6 built for darwin-thread-multi-2level"

    And it seems that my original script works there. That is, whatever I
    was seeing was a bug in perl that was fixed at some point. Thus, it
    now seems to me a waste of time to attempt further diagnosis.

    My conclusion is that (ignoring TIMTOWTDI) sometimes even perl can have
    strange bugs -- especially older versions. fortunately, perl is diverse
    enough that there are always ways around the problem.
     
    , Jul 25, 2006
    #9
  10. [A complimentary Cc of this posting was sent to

    <>], who wrote in article <>:
    > $index--; $index++;
    >
    > if($element =~ /_${index}_/ or $element =~ m/_$index$/){


    > for some reason, without the $index--; $index++; business, my second
    > pattern is not matching.



    > Summary of my perl5 (revision 5.0 version 8 subversion 0)
    > configuration:


    So this is 5.8.0? Its REx engine is very buggy. For best results,
    upgrade.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, Aug 3, 2006
    #10
