Regex: Why is overreaching necessary?

Shannon Jacobs

Dealing with an array of fixed length strings. Goal is to select based
on certain columns. After rather lengthy study of the camel book and
searching on the web for various examples, I thought this should work:

X @foo2 = grep(/^.{50}(1121|1217|1256|2033).{6}$/,@foo1);

It did not. I consulted with a heavy Perler, and after a few minutes
of wrestling with the problem, he suggested something like this (as I
tinkered it into working):

@foo2 = grep(/^.{50,62}($1121|1217|1256|2033).{6,18}$/,@foo1);

My idea in the broken example was to ignore the first 50 and last 6
characters in each line, which was supposed to leave only the 12
characters in the middle to search against. My fuzzy understanding of
the working version is that I first had to match the entire thing, and
then let Perl fish for candidate matches by truncating down towards
50?

The examples above are slightly simplified for purposes of
explanation. Here is the actual code, just in case I did something
wrong in the tweaking:

@foo2 = grep(/^.{50,62}($form_values{'a_SEARCH_VALUE'}).{6,18}$/,@foo1);
 
Jürgen Exner

Shannon said:
Dealing with an array of fixed length strings. Goal is to select based
on certain columns. After rather lengthy study of the camel book and
searching on the web for various examples, I thought this should work:

X @foo2 = grep(/^.{50}(1121|1217|1256|2033).{6}$/,@foo1);

It did not. I consulted with a heavy Perler, and after a few minutes
of wrestling with the problem, he suggested something like this (as I
tinkered it into working):

@foo2 = grep(/^.{50,62}($1121|1217|1256|2033).{6,18}$/,@foo1);

Ouch! That hurts!
When dealing with fixed length formats then REs are certainly not the tool
of choice.
One much better alternative: substr()
The other commonly used alternative: pack()/unpack()

jue
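
A minimal sketch of the two alternatives named above, assuming a hypothetical 68-character record (50 leading characters, a 12-character field, a 6-character tail). The record contents and variable names are made up for illustration and are not taken from the original data file:

```perl
use strict;
use warnings;

# Hypothetical record: 50 leading characters, a 12-character field holding
# up to three 4-digit numbers, and a 6-character tail.
my $record = sprintf('%-50s%-12s%-6s', 'Some Title 1997021117', '22362243', 'Fa');

# substr() takes the 12 characters starting at offset 50.
my $field_substr = substr($record, 50, 12);

# unpack() does the same with a template: skip 50 bytes (x50), then take
# 12 bytes without stripping trailing spaces (a12).
my ($field_unpack) = unpack('x50 a12', $record);

print "substr: [$field_substr]\n";
print "unpack: [$field_unpack]\n";
```

Both calls return the same 12-character slice, so a plain pattern can then be applied to that slice without any column counting inside the regex.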
 
Shannon Jacobs

Jürgen Exner said:
Ouch! That hurts!
When dealing with fixed length formats then REs are certainly not the tool
of choice.
One much better alternative: substr()
The other commonly used alternative: pack()/unpack()

It was not my intention to cause you any pain, but that's not the
question I asked, though I suppose it's good to rethink problems in
terms of the objectives. Actually, in another part of the program I do
use substrings to massage things where a more linear approach seemed
more suitable. I vaguely remember considering unpack() long ago (the
code has evolved over a period of about 10 years), but decided against
it for some reason. I didn't need pack() since this is actually a
backend query program, and there are limitations in the programs that
are exporting the data. (And yes, the Perler with whom I discussed the
problem did suggest alternatives including substrings.)

I'd still like to understand why this regular expression works as it
does. Or perhaps you should clarify your intended sense of "painful"?
As it is, I'm content with how well the code works. It seems like an
adequate amount of search bang for the small regex buck.
 
Uri Guttman

SJ> Dealing with an array of fixed length strings. Goal is to select based
SJ> on certain columns. After rather lengthy study of the camel book and
SJ> searching on the web for various examples, I thought this should work:

SJ> X @foo2 = grep(/^.{50}(1121|1217|1256|2033).{6}$/,@foo1);

SJ> It did not. I consulted with a heavy Perler, and after a few minutes
SJ> of wrestling with the problem, he suggested something like this (as I
SJ> tinkered it into working):

SJ> @foo2 = grep(/^.{50,62}($1121|1217|1256|2033).{6,18}$/,@foo1);

you should show some sample data as well so we can see what you are
matching. as jurgen said that is painful to read. even good perl hackers
will have trouble deciphering it quickly and that means it is not good
perl IMO.

also this line has $1121 and the previous one didn't have the $ so i am
not sure which is correct.


SJ> My idea in the broken example was to ignore the first 50 and last 6
SJ> characters in each line, which was supposed to leave only the 12
SJ> characters in the middle to search against. My fuzzy understanding of
SJ> the working version is that I first had to match the entire thing, and
SJ> then let Perl fish for candidate matches by truncating down towards
SJ> 50?

no need to ignore the last 6 chars as that won't affect the match unless
some lines were of different lengths.


SJ> The examples above are slightly simplified for purposes of
SJ> explanation. Here is the actual code, just in case I did something
SJ> wrong in the tweaking:

SJ> @foo2 = grep(/^.{50,62}($form_values{'a_SEARCH_VALUE'}).{6,18}$/,@foo1);

that doesn't seem to be a fixed offset value. the initial skip is from
50-62 chars. if the search value can't appear in that, why not just
grep for that? is the search value something with alternation as the
above lines suggest? then a faster thing might be to grab the part you
want and look it up in a hash of wanted values. alternation can be very
slow especially with many choices (due to backtracking).

in fact as you have been told, substr and a hash lookup might be the
perfect thing for this (but i am not sure since the leading skip can
vary in size). again, showing some real data would help as we could see
what variants there are, what the searched for parts look like (and if
they are not found earlier in the string), etc.

uri
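
A rough sketch of the substr-plus-hash idea, assuming the 12-character field is made of three 4-character slots. The records, the `%wanted` hash and the slot layout are hypothetical stand-ins, not the poster's actual data:

```perl
use strict;
use warnings;

# Hypothetical wanted values; in the thread these come from a form field.
my %wanted = map { $_ => 1 } qw(1121 1217 1256 2033);

# Hypothetical records: 50 leading characters, a 12-character field made of
# three 4-character slots, and a 6-character tail.
my @foo1 = map { sprintf('%-50s%-12s%-6s', @$_) } (
    [ 'Title A', '11211300',     'SF' ],   # slot 1 is wanted
    [ 'Title B', '01121700',     'Fi' ],   # "1121" only straddles two slots
    [ 'Title C', '030004002033', 'CS' ],   # slot 3 is wanted
);

# Split the field into 4-character slots and look each one up in the hash.
my @foo2 = grep {
    my @slots = unpack('(a4)3', substr($_, 50, 12));
    grep { $wanted{$_} } @slots;
} @foo1;

print "$_\n" for @foo2;
```

Because each slot is compared as a whole, a value that only appears across a slot boundary (Title B here) is not selected.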
 
John W. Krahn

Shannon said:
Dealing with an array of fixed length strings. Goal is to select based
on certain columns. After rather lengthy study of the camel book and
searching on the web for various examples, I thought this should work:

X @foo2 = grep(/^.{50}(1121|1217|1256|2033).{6}$/,@foo1);

It did not. I consulted with a heavy Perler, and after a few minutes
of wrestling with the problem, he suggested something like this (as I
tinkered it into working):

@foo2 = grep(/^.{50,62}($1121|1217|1256|2033).{6,18}$/,@foo1);

My idea in the broken example was to ignore the first 50 and last 6
characters in each line,

@foo2 = grep substr( $_, 50, -6 ) =~ /1121|1217|1256|2033/, @foo1;

Shannon said:
which was supposed to leave only the 12
characters in the middle to search against.

@foo2 = grep substr( $_, 50, 12 ) =~ /1121|1217|1256|2033/, @foo1;




John
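
A small sketch of how the two substr() calls above differ, assuming a hypothetical 68-character record (50 + 12 + 6 columns); the record itself is invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical record: 50 leading characters, 12-character field, 6-character tail.
my $line = sprintf('%-50s%-12s%-6s', 'Title', '11211300', 'SF');

# A negative length means "stop that many characters before the end".
my $by_tail = substr($line, 50, -6);

# A positive length takes exactly that many characters.
my $by_count = substr($line, 50, 12);

print "[$by_tail] [$by_count]\n";
```

The two forms agree only when the line is exactly 68 characters long; on longer lines the negative-length form returns a wider slice.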
 
Shannon Jacobs

@foo2 = grep(/^.{50,62}($1121|1217|1256|2033).{6,18}$/,@foo1);

Yes, the first $ there was a typo left over from the actual code
sample that was included later in the OP (where that part of the
search target was stored in a variable). My apologies for not
including a sample of the data, but I had attempted to constrain the
question in a way that I hoped limited the need for reference to the
actual data. Here is a short sample from the file:

Irrational Numbers 1976770514 392 0 0SF

Maske: Thaery 1976770514 393 0 0SF

The Turning Place 1976770515 394 0 0SF

The October Circle 1975770516 395 0 0Fi

Our Invaded Universities 1974830410 671 EdPSHi
Space Mail 1980840607 8 564 565SF

There are spaces at the ends of the apparently short lines, so all of
them are actually of the same length. The embedded 0s are actually an
anachronism of the source programs, but they 'seem' harmless, so I've
always ignored them. I think it's irrelevant, but for the sake of
sizing, this data file is only around 200 Kb in total.

Since substring operations have come up again, let me clarify that in
this part of the program it seemed easier and better to use a simple
regex at each stage of refinement. I didn't want to break the lines
apart and just put them together again (though later on I did break
the final filtered results apart for output). (In addition, I know
that the current approach allows me to usefully input search targets
consisting of regular expressions, such as ^.{44}07 to pull the
current year's entries.)

At this point I am most interested in the operation and probable
performance advantages of John Krahn's

@foo2 = grep substr( $_, 50, 12 ) =~ /1121|1217|1256|2033/, @foo1;

versus my (corrected) version of

@foo2 = grep(/^.{50,62}(1121|1217|1256|2033).{6,18}$/,@foo1);

I have the (fuzzy) intuition that his code is more directly
performing the operation that I described in the OP. If so, I'd like
to share it with the Perler who led me to my probably awkward solution
(but I'm not sure yet which is better nor why).

I am also still interested in understanding why this version failed:

@foo2 = grep(/^.{50}(1121|1217|1256|2033).{6}$/,@foo1);

Minor point is wondering whether or not it is necessary to worry about
the end of the string (as mentioned by Uri Guttman). It seems to me
that there would still be a general risk of false positives in the
tail of the string unless they are explicitly ignored. Or is he really
saying that my version is still subject to that risk?
 
Joe Smith

Shannon said:
@foo2 = grep(/^.{50,62}(1121|1217|1256|2033).{6,18}$/,@foo1);

If all of your input strings are 50+4+18 = 72 = 62+4+6 characters long,
then forcing the tail end to be 6 to 18 characters long is redundant.
Once the pattern has skipped 50 to 62 characters of the 72-character
string and matched 4 characters, the remainder will always be 6 to 18
characters.

Shannon said:
I have the (fuzzy) intuition that his code is more directly
performing the operation that I described in the OP. If so, I'd like
to share it with the Perler who led me to my probably awkward solution
(but I'm not sure yet which is better nor why).

I am also still interested in understanding why this version failed:

@foo2 = grep(/^.{50}(1121|1217|1256|2033).{6}$/,@foo1);

If your input is 72 characters long, that will never match.
You will have 18 characters at the end, not 6.

Shannon said:
Minor point is wondering whether or not it is necessary to worry about
the end of the string (as mentioned by Uri Guttman). It seems to me
that there would still be a general risk of false positives in the
tail of the string unless they are explicitly ignored. Or is he really
saying that my version is still subject to that risk?

If all your input is fixed-length strings, there will never be
false positives. Are you planning on rejecting all lines that are
not exactly 72 characters long?

-Joe
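
The arithmetic above can be checked with a short sketch. It assumes a hypothetical 72-character record (the length used in this reply), with the field contents invented for illustration:

```perl
use strict;
use warnings;

# Hypothetical 72-character record: 50 leading characters, a 12-character
# field, and a 10-character tail.
my $line = sprintf('%-50s%-12s%-10s', 'Title', '03001121', 'SF');

# Exact counts only fit a 60-character line (50 + 4 + 6).
my $exact = $line =~ /^.{50}(1121|1217|1256|2033).{6}$/ ? 1 : 0;

# Ranges let the engine slide the 4-digit target across the 12 columns.
my $ranged = $line =~ /^.{50,62}(1121|1217|1256|2033).{6,18}$/ ? 1 : 0;

printf "length=%d exact=%d ranged=%d\n", length($line), $exact, $ranged;
```

With exact counts, the pattern demands 50 + 4 + 6 = 60 characters in total, so it cannot match a 72-character line wherever the target sits.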
 
Shannon Jacobs

General thanks to the people who provided hints, and now I can finally
answer the original Subject: question. The fundamental problem I was
wrestling with was between generalized grepping for existence of the search
target and more selective grepping. The answer: <drum roll>

If you only want to deal with part of a string, then you *MUST* account for
*ALL* of the string in your regex.

Now it seems obvious, but it took me a while to get the point. I think I was
led astray by the convenience of using little regular expressions as the
search target. Yes, it's convenient, but it's fundamentally sloppy. Anyway,
now I have three working solutions, and I even sort of think I understand
how they work (except for relative performance). The first one is mostly
mine, and the other two are mostly from real Perlers.

@foo2 = grep(/^.{50}.*($form_values{'a_SEARCH_VALUE'}).*.{6}$/,@foo1);

@foo2 = grep substr( $_, 50, 12 ) =~ /$form_values{'a_SEARCH_VALUE'}/,
@foo1;

@foo2 = grep(/^.{50,62}($form_values{'a_SEARCH_VALUE'}).{6,18}$/,@foo1);

There's still one more little wrinkle that's bugging me--but there always
is. Trivial enough to ignore, but I'm wondering if there's an elegant
solution... I'll try to restate that wrinkled problem in terms that are more
consistent with the posting guidelines:

An example of the search target in $form_values.... could be
1224|1357|2239|2243|2468 (intended to match anywhere in the 12 unmasked
digits), which are actually (up to) three numbers. What I want to do is
insert something like (.{4}){0,2} before and after the search target (where
I currently have .* in my first version above) so that it only considers 4
digits at a time. Here is some sample data from the file.

The Brethren 20010210282239 Fa
Gorilla, My Love 19810211042240 HF
KeitaiDenwaNoHimitsu 200102110722412242 JaChCS
Harry Potter and the Philosopher's Stone199702111722362243 Fa

In this example the first and fourth lines are proper matches against 2239
and 2243, respectively, but the third line is an undesired match against
1224. The problem as I see it is that the two things I'm thinking about
inserting should communicate with each other so that they always consume a
total of 8 characters, thereby forcing the target to consider only four
characters at a time.

 
anno4000

[...]
@foo2 = grep substr( $_, 50, 12 ) =~ /$form_values{'a_SEARCH_VALUE'}/,
@foo1;
[...]

There's still one more little wrinkle that's bugging me--but there always
is. Trivial enough to ignore, but I'm wondering if there's an elegant
solution... I'll try to restate that wrinkled problem in terms that are more
consistent with the posting guidelines:

An example of the search target in $form_values.... could be
1224|1357|2239|2243|2468 (intended to match anywhere in the 12 unmasked
digits), which are actually (up to) three numbers. What I want to do is
insert something like (.{4}){0,2} before and after the search target (where
I currently have .* in my first version above) so that it only considers 4
digits at a time. Here is some sample data from the file.

The Brethren 20010210282239 Fa
Gorilla, My Love 19810211042240 HF
KeitaiDenwaNoHimitsu 200102110722412242 JaChCS
Harry Potter and the Philosopher's Stone199702111722362243 Fa

In this example the first and fourth lines are proper matches against 2239
and 2243, respectively, but the third line is an undesired match against
1224. The problem as I see it is that the two things I'm thinking about
inserting should communicate with each other so that they always consume a
total of 8 characters, thereby forcing the target to consider only four
characters at a time.

Try this variant:

@foo2 = grep substr( $_, 50, 12 ) =~
/^(?:\d{4}){0,2}$form_values{'a_SEARCH_VALUE'}/,
@foo1;

Essentially that ties the pattern to the beginning of the substring,
then allows zero to two groups of four digits before a match.

Anno
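
A sketch of the difference between the loose and the anchored pattern. The 12-character fields are rebuilt from the sample lines quoted above (the padding is assumed), and the `(?:...)` wrapper around the search target is an addition made here so the alternation stays one unit after interpolation:

```perl
use strict;
use warnings;

# Search target from the post, wrapped in a non-capturing group (an
# assumption of this sketch, not part of the posted value).
my $target = '(?:1224|1357|2239|2243|2468)';

# 12-character fields rebuilt from the sample lines (padding assumed).
my %field = (
    'The Brethren'         => '2239        ',
    'Gorilla, My Love'     => '2240        ',
    'KeitaiDenwaNoHimitsu' => '22412242    ',
    'Harry Potter'         => '22362243    ',
);

my (%loose, %aligned);
for my $title (sort keys %field) {
    # Loose: the target may start at any offset in the field.
    $loose{$title} = $field{$title} =~ /$target/ ? 1 : 0;
    # Aligned: the target must start at offset 0, 4 or 8.
    $aligned{$title} = $field{$title} =~ /^(?:\d{4}){0,2}$target/ ? 1 : 0;
    printf "%-22s loose=%d aligned=%d\n", $title, $loose{$title}, $aligned{$title};
}
```

The loose pattern finds 1224 across the boundary of 2241 and 2242, while the anchored one only tests whole 4-digit slots.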
 
Shannon Jacobs

Sorry, but that doesn't work. I think it's because it picks up the
false matches when it has no groups of four digits before the match.
Somehow it needs to be limited to considering only four source digits
at a time, or to think that there is a non-digit boundary between the
two groups of four digits.

(I don't think it matters, and I tested it both ways, but I think it
should be

@foo2 = grep substr( $_, 50, 12 ) =~
/^(?:.{4}){0,2}$form_values{'a_SEARCH_VALUE'}/,
@foo1;

rather than your version. The data file may have spaces, and I think
that \d wouldn't count them at that point.)

I sort of like Perl, but it seems to be the case that I like it as a
mind-bending experience...
 
anno4000

Shannon Jacobs said:
Sorry, but that doesn't work. I think it's because it picks up the
false matches when it has no groups of four digits before the match.

It doesn't pick up false matches from the sample you supplied.

Shannon Jacobs said:
Somehow it needs to be limited to considering only four source digits
at a time, or to think that there is a non-digit boundary between the
two groups of four digits.

(I don't think it matters, and I tested it both ways, but I think it
should be

@foo2 = grep substr( $_, 50, 12 ) =~
/^(?:.{4}){0,2}$form_values{'a_SEARCH_VALUE'}/,
@foo1;

rather than your version. The data file may have spaces,

Then your sample data should have included such a case.

Shannon Jacobs said:
and I think
that \d wouldn't count them at that point.)

You are more permissive than the data requires. If you want to allow
blanks, allow blanks:

/^(?:[\d ]{4}){0,2}$form_values{'a_SEARCH_VALUE'}/


Anno
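
A brief sketch of why the character class matters, assuming a hypothetical field whose first two slots are blank-padded numbers (modelled on entries such as " 392" in the earlier sample); the field and target values are invented:

```perl
use strict;
use warnings;

# Hypothetical 12-character field: slots " 392", "   0" and "1101".
my $field  = ' 392   01101';
my $target = '(?:1101)';

# \d{4} cannot step over a slot that contains blanks ...
my $digits_only = $field =~ /^(?:\d{4}){0,2}$target/ ? 1 : 0;

# ... while [\d ]{4} accepts digits or spaces and reaches the third slot.
my $with_blanks = $field =~ /^(?:[\d ]{4}){0,2}$target/ ? 1 : 0;

print "digits only: $digits_only, digits or blanks: $with_blanks\n";
```

With `\d{4}` the third slot is never reached, so a wanted value there is silently missed.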
 
Ilya Zakharevich

anno4000 said:
Try this variant:

@foo2 = grep substr( $_, 50, 12 ) =~
/^(?:\d{4}){0,2}$form_values{'a_SEARCH_VALUE'}/,
@foo1;

May be better, may be worse than the "obvious":

@foo2 = grep /^(.50) # Start at pos=50
(?:\d{4}){0,2}$form_values{'a_SEARCH_VALUE'}
(?!.{63}) # Do not allow pos>62
/sx, @foo1;

Hope this helps,
Ilya
 
Shannon Jacobs

anno4000 said:
It doesn't pick up false matches from the sample you supplied.
[...]
You are more permissive than the data requires. If you want to allow
blanks, allow blanks:

/^(?:[\d ]{4}){0,2}$form_values{'a_SEARCH_VALUE'}/

You are correct, but the problem is apparently in the particular data
sample which I provided. When tested against the full data file it
still has the problem of the false matches. I was in a hurry to
acknowledge my error, but I don't have time this morning to do more
diagnostics.

Perhaps it is something about the presence of the third number in some
of the real data that is causing it to fail? I see that the sample I
included did not have any cases with 12 digits, but only 8.

(I did test Ilya Zakharevich's proposed suggestion in the next post,
and it worked more poorly, producing additional false matches. I'm
eager to study the differences there, though his approach seems more
complicated than yours.)
 
Ilya Zakharevich

Shannon Jacobs said:
(I did test Ilya Zakharevich's proposed suggestion in the next post,
and it worked more poorly, producing additional false matches. I'm
eager to study the differences there, though his approach seems more
complicated than yours.)

The only difference *in semantic* is //x which I used (for clarity);
the possible difference I meant is in performance.

If your hash contains spaces etc, you may need to modify it
accordingly (either remove //x and formatting, or wrap the hash in (?-x:)).

Hope this helps,
Ilya
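
A minimal sketch of the //x point above: whitespace inside an interpolated value is ignored under //x unless it is protected. The search value and the test string are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical search value whose first alternative starts with a space.
my $value = ' 909|1101';
my $slot  = '2909';    # contains "909" but not " 909"

# Without /x the space is literal, so " 909" is not found in "2909".
my $plain = $slot =~ /(?:$value)/ ? 1 : 0;

# With /x the interpolated space is ignored and "909" matches.
my $with_x = $slot =~ / (?: $value ) /x ? 1 : 0;

# (?-x:...) switches /x off again for the interpolated part only.
my $protected = $slot =~ / (?-x:$value) /x ? 1 : 0;

print "plain=$plain x=$with_x protected=$protected\n";
```

So a value such as " 909" behaves like "909" when the surrounding pattern uses //x without the `(?-x:)` wrapper.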
 
Shannon Jacobs

Ilya Zakharevich said:
The only difference *in semantic* is //x which I used (for clarity);
the possible difference I meant is in performance.

If your hash contains spaces etc, you may need to modify it
accordingly (either remove //x and formatting, or wrap the hash in (?-x:)).

Looking at your version again, the problem is the same one that I had
before with the first part being included in the match. For example,
if the search target includes 1101, and the first fifty characters
include 1101, then it produces a false match for that line. I tested
it both with and without the x parameter, and it made no difference.
(My ancient version of the camel book doesn't even 'translate' that
"x" word in that context. From that perspective, my limited Japanese
is superior to my fluency in Perl.) I couldn't understand your .50
versus my .{50}, so I tried that, too. No joy.

I did some more testing with Anno's version, and I feel like it is
correct and that there is some other problem going on there. In
particular, when I 'manually' forced a search on 1101 in isolation
using Anno's version, it works correctly, but running in the normal
context with other values, it produces a false match on the second
line of the following data:

Algorithms + Data Structures = Programs 19760010021657 CS

Waking Up Just in Time 1995001003210921101612PsPh

Boukyouhen: Hinotori No.6 19870010042111 SFJa


So in my next round of experiments, I manually trimmed the search
target to see what it was really matching on. This broke my mind...
However, I think I found the minimal example of the problem. If I
limit the search target to this:

909|1101

It produces the false match of that second line. But as

1101| 909

there is no problem. The problem is somehow apparently due to the
leading space, since

908|1101

also produces the false match (though the rest of the results are
correct). Lisp and Japanese make more sense to me... If this was a
contest, I would say that the laurels rest with Anno, and I think I'll
just try to live with this slight peculiarity.
 
Ilya Zakharevich

Shannon Jacobs said:
So in my next round of experiments, I manually trimmed the search
target to see what it was really matching on. This broke my mind...
However, I think I found the minimal example of the problem. If I
limit the search target to this:

909|1101

It produces the false match of that second line. But as

1101| 909

there is no problem. The problem is somehow apparently due to the
leading space, since

908|1101

also produces the false match (though the rest of the results are
correct). Lisp and Japanese make more sense to me... If this was a
contest, I would say that the laurels rest with Anno, and I think I'll
just try to live with this slight peculiarity.

If we could see the actual code you used, it would be much easier to
understand what problem you see.

$_ = ' 908|1101';
print "Matched!\n" if / /; # <--- Fill this line

Thanks,
Ilya
 
Shannon Jacobs

If we could see the actual code you used, it would be much easier to
understand what problem you see.

$_ = ' 908|1101';
print "Matched!\n" if / /; # <--- Fill this line

Thanks,
Ilya

The actual code is a pretty large piece and has a very mottled and
almost ancient history, too. Kind of a cancerous spare-time project of
essentially low priority. The result is embarrassing at best, but that
it works at all is something of a testimonial to the robustness of
Perl. However, because of its size, complexity, and almost complete
lack of documentation I've tried to focus any questions on key
snippets. If you really want to see the whole mess, I guess you can,
but I think it would be quite an effort for anyone else to figure all
of it out...

However, by looking at the following two URLs, I think I can see what
is going on, though I still don't understand exactly why it produces
those results. The leading space has been removed in the case that
produces the error. For a reason which I still don't understand, that
means that 909 as an or option of the target somehow matches the 12
digits 210921101612. Actually, since I now trust Anno-san's approach,
it means the false match is apparently against 2109, 2110, or 1612...

http://shanenj.tripod.com/cgi-bin/p...se&sorttype=none&datetype=comp&numorname=both

http://shanenj.tripod.com/cgi-bin/p...se&sorttype=none&datetype=comp&numorname=both

Anyway, it seems like the natural solution if I can't live with it
would be to go back to the HTML and prevent it from removing the
leading space there...
 
anno4000

Shannon Jacobs said:
Looking at your version again, the problem is the same one that I had
before with the first part being included in the match. For example,
if the search target includes 1101, and the first fifty characters
include 1101, then it produces a false match for that line. I tested
it both with and without the x parameter, and it made no difference.
(My ancient version of the camel book doesn't even 'translate' that
"x" word in that context. From that perspective, my limited Japanese
is superior to my fluency in Perl.) I couldn't understand your .50
versus my .{50}, so I tried that, too. No joy.

/.50/ was a typo by Ilya. (Apologies for presuming to speak for you,
Ilya.) It matches any character followed by a literal "50", which
can't be what he meant.

Shannon Jacobs said:
I did some more testing with Anno's version, and I feel like it is
correct and that there is some other problem going on there. In
particular, when I 'manually' forced a search on 1101 in isolation
using Anno's version, it works correctly, but running in the normal
context with other values, it produces a false match on the second
line of the following data:

Algorithms + Data Structures = Programs 19760010021657 CS

Waking Up Just in Time 1995001003210921101612PsPh

Boukyouhen: Hinotori No.6 19870010042111 SFJa


So in my next round of experiments, I manually trimmed the search
target to see what it was really matching on. This broke my mind...
However, I think I found the minimal example of the problem. If I
limit the search target to this:

909|1101

It produces the false match of that second line. But as

1101| 909

there is no problem. The problem is somehow apparently due to the
leading space, since

908|1101

also produces the false match (though the rest of the results are
correct). Lisp and Japanese make more sense to me... If this was a
contest, I would say that the laurels rest with Anno, and I think I'll
just try to live with this slight peculiarity.

You should show us actual code that actually demonstrates the behavior.

Jumping back and forth between your postings, I tried to reconstruct
what you might have done from the bits of code you *did* post. I
arrived at this:

my %form_values = (
a_SEARCH_VALUE => '(1121|1217|1256|2033)',
another => '( 909|1101)',
);
my @foo1 = <DATA>;

my @foo2 = grep substr( $_, 50, 12 ) =~
/^(?:\d{4}){0,2}$form_values{ another}/,
@foo1;

print for @foo2;

__DATA__
Algorithms + Data Structures = Programs 19760010021657 CS
Waking Up Just in Time 1995001003210921101612PsPh
Boukyouhen: Hinotori No.6 19870010042111 SFJa

That doesn't show the spurious match you reported. So, reduce your
program and data to a small set that demonstrates the problem.
There's no other way we could help you any further.

Anno
 
Ilya Zakharevich

Shannon Jacobs said:
The actual code is a pretty large piece and has a very mottled and
almost ancient history, too.

This is absolutely irrelevant. You apply an m// operator at some
moment. Just show us what is the regexp, and what is the value you
match against.

Thanks,
Ilya
 
Shannon Jacobs

On Feb 21, 11:58 pm, (e-mail address removed)-berlin.de wrote:
That doesn't show the spurious match you reported. So, reduce your
program and data to a small set that demonstrates the problem.
There's no other way we could help you any further.

Anno

I'm already amazed and very grateful for the amount of help I have
received, and I am glad to say so again if I haven't made it clear.
The most important thing was finally understanding the answer to the
original question, as awkwardly as I worded it.

In spite of that satisfaction, right now I feel like I should take
another break from the Perling... It is certainly not your fault that
my code is so peculiar and annoying. Your solution does in fact seem
to be perfect or very close, but...

Having fixed one problem, my latest testing discovered yet another
peculiarity which could easily consume much more time than it's
worth... I'm only going to mention it as an example of the peculiarity
of my code... I have discovered that using the search target 2471|2396
returns different results from the search target 2396|2471. I don't
think this can really be Perl's fault. However and fortuitously, every
problem that I've discovered (so far) is in the direction of false
positives, and that is not very troublesome for this application...

My current belief is that this newly discovered flaw is somewhere on
the HTML side, possibly in my JavaScript. However if I can't find it
there, and if it seems to be in the Perl, I may be back. Thanks again.
 
