Thanks to everyone who provided hints; I can now finally answer the
original Subject: question. The fundamental problem I was wrestling with
was the difference between general grepping for the existence of the search
target anywhere in a line and more selective grepping within one part of
it. The answer: <drum roll>
If you only want to deal with part of a string, then you *MUST* account for
*ALL* of the string in your regex.
Now it seems obvious, but it took me a while to get the point. I think I was
led astray by the convenience of using little regular expressions as the
search target. Yes, it's convenient, but it's fundamentally sloppy. Anyway,
now I have three working solutions, and I even sort of think I understand
how they work (except for relative performance). The first one is mostly
mine, and the other two are mostly from real Perlers.

@foo2 = grep(/^.{50}.*($form_values{'a_SEARCH_VALUE'}).*.{6}$/,@foo1);

@foo2 = grep substr( $_, 50, 12 ) =~ /$form_values{'a_SEARCH_VALUE'}/, @foo1;

@foo2 = grep(/^.{50,62}($form_values{'a_SEARCH_VALUE'}).{6,18}$/,@foo1);

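
For anyone who wants to try the three side by side, here is a minimal sketch. The record layout (a 40-character title, 10 characters of dates, 12 characters of numbers, 6 characters of codes) is only inferred from the patterns above, the helper make_record and the third "decoy" record are made up for illustration, and $target merely stands in for $form_values{'a_SEARCH_VALUE'}.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a 68-character record in the ASSUMED layout
# (40 title + 10 dates + 12 numbers + 6 codes).
sub make_record {
    my ($title, $dates, $numbers, $codes) = @_;
    return sprintf '%-40s%-10s%-12s%-6s', $title, $dates, $numbers, $codes;
}

my @foo1 = (
    make_record('The Brethren',       '2001021028', '2239', 'Fa'),
    make_record('Gorilla, My Love',   '1981021104', '2240', 'HF'),
    # Made-up decoy: 2239 appears everywhere EXCEPT the 12-character window.
    make_record('Made-up Title 2239', '2239021028', '2250', 'Fa2239'),
);

# Stand-in for $form_values{'a_SEARCH_VALUE'}.
my $target = '1224|1357|2239|2243|2468';

# 1. Skip 50 characters, allow anything, and keep the last 6 out of reach.
my @by_dotstar = grep { /^.{50}.*($target).*.{6}$/ } @foo1;

# 2. Cut out the 12-character window first, then match inside it.
my @by_substr = grep { substr($_, 50, 12) =~ /$target/ } @foo1;

# 3. Bounded counts before and after the target.
my @by_ranges = grep { /^.{50,62}($target).{6,18}$/ } @foo1;

for my $result (\@by_dotstar, \@by_substr, \@by_ranges) {
    print join(', ', map { substr($_, 0, 20) } @$result), "\n";
}
```

Under these assumptions each of the three should select only the first record; an unanchored /$target/ would also pick up the decoy.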
There's still one more little wrinkle that's bugging me--but there always
is. Trivial enough to ignore, but I'm wondering if there's an elegant
solution... I'll try to restate that wrinkled problem in terms that are more
consistent with the posting guidelines:
An example of the search target in $form_values.... could be
1224|1357|2239|2243|2468 (intended to match anywhere in the 12 unmasked
digits). Those 12 digits are actually (up to) three 4-digit numbers. What I
want to do is insert something like (.{4}){0,2} before and after the search
target (where I currently have .* in my first version above) so that it
only considers 4 digits at a time. Here is some sample data from the file.
The Brethren 20010210282239 Fa
Gorilla, My Love 19810211042240 HF
KeitaiDenwaNoHimitsu 200102110722412242 JaChCS
Harry Potter and the Philosopher's Stone199702111722362243 Fa
In this example the first and fourth lines are proper matches against 2239
and 2243, respectively, but the third line is an undesired match against
1224, which straddles the boundary between the adjacent numbers 2241 and
2242. The problem as I see it is that the two things I'm thinking about
inserting should communicate with each other so that they always consume a
total of 8 characters, thereby forcing the target to consider only four
characters at a time.
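
One possible way around that, offered as a sketch rather than a definitive answer: if the front of the pattern is pinned to offset 50 plus a whole number of 4-character slots, the target can only begin at offset 50, 54, or 58, so the part after the target does not need to be counted at all. This assumes every alternative in the target is exactly 4 characters long. The record layout and the helper make_record are assumptions, as is the exact spacing of the sample lines.

```perl
use strict;
use warnings;

# Hypothetical helper: 40 title + 10 dates + 12 numbers + 6 codes (ASSUMED layout).
sub make_record {
    my ($title, $dates, $numbers, $codes) = @_;
    return sprintf '%-40s%-10s%-12s%-6s', $title, $dates, $numbers, $codes;
}

my @foo1 = (
    make_record('The Brethren',         '2001021028', '2239',     'Fa'),
    make_record('Gorilla, My Love',     '1981021104', '2240',     'HF'),
    make_record('KeitaiDenwaNoHimitsu', '2001021107', '22412242', 'JaChCS'),
    make_record("Harry Potter and the Philosopher's Stone",
                '1997021117', '22362243', 'Fa'),
);

my $target = '1224|1357|2239|2243|2468';

# Unaligned window search: 1224 is found across the 2241/2242 boundary.
my @unaligned = grep { substr($_, 50, 12) =~ /$target/ } @foo1;

# Aligned search: skip 50 characters, then zero, one, or two whole slots.
my @aligned = grep { /^.{50}(?:.{4}){0,2}(?:$target)/ } @foo1;

# Same idea without counting in the regex: split the window into three
# 4-character slots and require one slot to equal the target exactly.
my @by_slots = grep {
    my $line = $_;
    grep { /^(?:$target)$/ } unpack('A4 A4 A4', substr($line, 50, 12));
} @foo1;

print scalar(@unaligned), " unaligned, ", scalar(@aligned), " aligned, ",
      scalar(@by_slots), " by slots\n";
```

The regex form keeps the "one grep per refinement" style; the unpack form may be easier to read if the slots ever need separate handling.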
Shannon said:
Yes, the first $ there was a typo left over from the actual code
sample that was included later in the OP (where that part of the
search target was stored in a variable). My apologies for not
including a sample of the data, but I had attempted to constrain the
question in a way that I hoped limited the need for reference to the
actual data. Here is a short sample from the file:
Irrational Numbers 1976770514 392 0 0SF
Maske: Thaery 1976770514 393 0 0SF
The Turning Place 1976770515 394 0 0SF
The October Circle 1975770516 395 0 0Fi
Our Invaded Universities 1974830410 671 EdPSHi
Space Mail 1980840607 8 564 565SF
There are spaces at the ends of the apparently short lines, so all of
them are actually of the same length. The embedded 0s are actually an
anachronism of the source programs, but they 'seem' harmless, so I've
always ignored them. I think it's irrelevant, but for the sake of
sizing, this data file is only around 200 KB in total.
Since substring operations have come up again, let me clarify that in
this part of the program it seemed easier and better to use a simple
regex at each stage of refinement. I didn't want to break the lines
apart and just put them together again (though later on I did break
the final filtered results apart for output). (In addition, I know
that the current approach allows me to usefully input search targets
consisting of regular expressions, such as ^.{44}07 to pull the
current year's entries.)
At this point I am most interested in the operation and probable
performance advantages of John Krahn's
@foo2 = grep substr( $_, 50, 12 ) =~ /1121|1217|1256|2033/, @foo1;
versus my (corrected) version of
@foo2 = grep(/^.{50,62}(1121|1217|1256|2033).{6,18}$/,@foo1);
I have the (fuzzy) intuition that his code more directly performs
the operation that I described in the OP. If so, I'd like to share it
with the Perler who led me to my probably awkward solution (but I'm
not sure yet which is better, nor why).
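
One way to get past intuition on the performance question is to time both forms with the core Benchmark module. The following is only a sketch: the data is synthetic (about 3000 made-up records in the assumed 68-character layout, roughly the size of a 200 KB file), the wrapper subs by_substr and by_ranges are hypothetical names, and the numbers will vary with the Perl version and the real data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Synthetic records in the ASSUMED layout: 40 title + 10 dates +
# three right-aligned 4-character numbers + 6 codes.
my @foo1 = map {
    sprintf '%-40s%-10s%4d%4d%4d%-6s',
        "Title $_", '1976770514', $_ % 3000, 0, 0, 'SF'
} 1 .. 3000;

# Hypothetical wrappers around the two statements being compared.
sub by_substr { return grep { substr($_, 50, 12) =~ /1121|1217|1256|2033/ } @_ }
sub by_ranges { return grep { /^.{50,62}(1121|1217|1256|2033).{6,18}$/ } @_ }

# Sanity check first: both forms should select the same records.
my @a = by_substr(@foo1);
my @b = by_ranges(@foo1);
print "same selection: ", ("@a" eq "@b" ? "yes" : "no"), "\n";

# Run each form for at least one CPU second and print a comparison table.
cmpthese(-1, {
    substr_then_match => sub { my @foo2 = by_substr(@foo1) },
    anchored_ranges   => sub { my @foo2 = by_ranges(@foo1) },
});
```

The wrapper subs add the same call overhead to both sides, so the table is best read as a relative comparison only.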
I am also still interested in understanding why this version failed:
@foo2 = grep(/^.{50}(1121|1217|1256|2033).{6}$/,@foo1);
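
A plausible explanation, sketched under the assumption that every record is 68 characters long with the numbers at offsets 50 to 61: the fixed counts require exactly 50 + 4 + 6 = 60 characters in total, so the pattern can never match a 68-character line, and the target would be allowed only at offset 50 in any case. The two sample records below are made up for illustration.

```perl
use strict;
use warnings;

# Made-up records in the ASSUMED 68-character layout.
my $first_slot = sprintf '%-40s%-10s%4s%4s%4s%-6s',
    'Made-up Title A', '1976770514', 1217, 0, 0, 'SF';
my $last_slot = sprintf '%-40s%-10s%4s%4s%4s%-6s',
    'Made-up Title B', '1976770514', 392, 0, 1217, 'SF';

my $fixed_counts = qr/^.{50}(1121|1217|1256|2033).{6}$/;         # needs 60 chars
my $with_ranges  = qr/^.{50,62}(1121|1217|1256|2033).{6,18}$/;   # tolerates 68

for my $line ($first_slot, $last_slot) {
    printf "length %d: fixed counts %s, ranges %s\n",
        length($line),
        ($line =~ $fixed_counts ? 'match' : 'no match'),
        ($line =~ $with_ranges  ? 'match' : 'no match');
}
```

In the first record the target sits at offset 50 but is followed by 14 characters rather than 6; in the second it sits at offset 58, which ^.{50} rules out.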
A minor point: I am wondering whether it is necessary to worry about
the end of the string (as mentioned by Uri Guttman). It seems to me
that there would still be a general risk of false positives in the
tail of the string unless the tail is explicitly ignored. Or is he
really saying that my version is still subject to that risk?
<older snip>