Perl-style regex advice please

Alan J. Flavell

Sorry, I'd love to put a pithy description of the problem in the
subject line, just like the smart-questions FAQ, but this time I
couldn't manage it. Here's what my problem boils down to, as best I
can simplify it (motivation at the end, for anyone who cares).

Simplified problem:

I've got a block of text that contains several dotted IP addresses in
the form [a.b.c.d]. What I need to do is find the first one of those
addresses which I don't recognise.

To define "which I don't recognise", I can provide a list of explicit
addresses, or a pattern, or whatever's convenient.

OK, I've no problem matching a dotted IP address and capturing the
result, that's easy. What I can't work out a strategy for, is how to
skip over a match if it matches a list of, or pattern of, addresses
which aren't of interest.

The constraint of the actual application is that I have to supply a
Perl-compatible regex, which will return the answer via the regex
capture mechanism (...). So, program loops through the text aren't
feasible, it seems.

There will be further such [...] in the text, so it's not the last one
that I'm looking for: it's the first one that doesn't match one of the
known addresses.

Advice please?


OK, the motivation. This wodge of text is in fact the concatenated
contents of a bunch of "Received:" headers from forwarded mail. We
know where the forwarded mail came from (those will be the addresses
which I already know about and am not interested in, so I want to skip
their matches), and it might have been forwarded several times between
different mail servers within the forwarding site (it varies between
examples), so we'll want to skip over a variable number of IPs that we
recognise, in order to pick up the first one that we don't recognise.

This will then be the IP address from which _they_ accepted the mail
before forwarding it to us, and I want to get that IP so that I can
look it up in a dnsRBL to help decide whether it is forwarded spam.

There are several (a small number) of forwarding sites of interest to
us, but if I can see a strategy for dealing with one, then I don't see
any problem with extending it to a few more. It's just that I don't
know how to make the match against, say, \[(\d+\.\d+\.\d+\.\d+)\] get
skipped if it happens to be one of the ones that I'm not interested in.

(yes, I have pored over perlretut, but perhaps I'm looking at the
problem in the wrong way...)

cheers
 
Uri Guttman

AJF> To define "which I don't recognise", I can provide a list of explicit
AJF> addresses, or a pattern, or whatever's convenient.

AJF> OK, I've no problem matching a dotted IP address and capturing the
AJF> result, that's easy. What I can't work out a strategy for, is how to
AJF> skip over a match if it matches a list of, or pattern of, addresses
AJF> which aren't of interest.

does this have to be in a single regex? in perl i would do a loop over a
grab match of the dotted stuff and then loop over each of the 'don't
recognise' things and check the grabbed stuff against that.

AJF> The constraint of the actual application is that I have to supply a
AJF> Perl-compatible regex, which will return the answer via the regex
AJF> capture mechanism (...). So, program loops through the text aren't
AJF> feasible, it seems.

ahh, pcre which is wrong since you can't run perl with /e modifiers :)


AJF> There will be further such [...] in the text, so it's not the last one
AJF> that I'm looking for: it's the first one that doesn't match one of the
AJF> known addresses.

possibly some combination of lookahead would look appealing:

<bogus pcre and perl re>

/(?=match_dotted_stuff)
(?!pattern_it_can't_match_or_alternation_of_such_patterns)/x

uri
 
Malcolm Dew-Jones

Alan J. Flavell wrote:

: Sorry, I'd love to put a pithy description of the problem in the
: subject line, just like the smart-questions FAQ, but this time I
: couldn't manage it. Here's what my problem boils down to, as best I
: can simplify it (motivation at the end, for anyone who cares).

: Simplified problem:

: I've got a block of text that contains several dotted IP addresses in
: the form [a.b.c.d]. What I need to do is find the first one of those
: addresses which I don't recognise.

: To define "which I don't recognise", I can provide a list of explicit
: addresses, or a pattern, or whatever's convenient.

: OK, I've no problem matching a dotted IP address and capturing the
: result, that's easy. What I can't work out a strategy for, is how to
: skip over a match if it matches a list of, or pattern of, addresses
: which aren't of interest.

: The constraint of the actual application is that I have to supply a
: Perl-compatible regex, which will return the answer via the regex
: capture mechanism (...). So, program loops through the text aren't
: feasible, it seems.

: There will be further such [...] in the text, so it's not the last one
: that I'm looking for: it's the first one that doesn't match one of the
: known addresses.

I can't try it as I don't have the version, but (??{}) and $^N could be
used

something like

sub checker {
    my $ip = shift;
    if ( ip_ok($ip) ) { return quotemeta($ip) }     # this will match
    else              { return 'this_wont_match' }
}

$your_string =~
    m/(($ip_regex)(?<=(??{checker($^N)})))/;


The intention (which I can't test) is that checker is called with $2 (the
ip just found) and then it returns either that ip or something else. The
(?<= uses that result to check what it just looked at. The look behind
will either succeed or fail based on what checker returns, and thereby $1
ends up getting set exactly when we want it to be set. (Checker is of
course designed to return either something that will match or something
that won't.)

Something other than (?<= using the found ip would probably be better for
the succeed/fail of the $1 match, oh well...

I don't see a way to do this without $^N

$0.02
 
Eric Wilhelm

> The constraint of the actual application is that I have to supply a
> Perl-compatible regex, which will return the answer via the regex
> capture mechanism (...). So, program loops through the text aren't
> feasible, it seems.

Maybe you should be looking at what causes this constraint.

If you are designing an OO function, your object could easily contain the
recognized addresses.

If the text can be butchered before the pattern match of which you speak,
you could replace all of the recognized addresses with ----- or
something.

If you can manage to get a list of all ip-like matches, you can loop over
that list (or easily: @unknown = grep({! $known{$_}} @list); ).

That is what I consider to be the Perl style. Don't ask a single regex
to do too much. In this case, you are basically asking it to do what the
rest of the language was designed to do.
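
A minimal sketch of that list-filtering idea, for cases where a loop is allowed. It is written in Python only because its `re` module accepts the same pattern syntax; the names `IP_IN_BRACKETS`, `first_unknown`, and `known` are made up for illustration and are not from the thread.

```python
import re

# The bracketed dotted-quad pattern quoted in the original question.
IP_IN_BRACKETS = re.compile(r"\[(\d+\.\d+\.\d+\.\d+)\]")

def first_unknown(text, known):
    """Return the first bracketed address not in `known`, or None."""
    # findall() returns the captured group for every match, in text order.
    for ip in IP_IN_BRACKETS.findall(text):
        if ip not in known:
            return ip
    return None

# Hypothetical set of addresses belonging to the forwarding site.
known = {"127.0.0.1", "192.168.1.1"}
text = "zzz [127.0.0.1] yyy [192.168.1.1] xxx [12.23.34.45] ppp [3.3.3.3]"

print(first_unknown(text, known))  # 12.23.34.45
```

This keeps the regex trivial and moves the "do I recognise it?" decision into ordinary code, at the cost of needing a loop that a regex-only interface may not allow.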

--Eric
 
Alan J. Flavell

> /\[((?!$UNINTERESTING_IP_ADDRESSES_PATTERN)$IP_ADDRESS_PATTERN)\]/

Beautiful, thanks.

So if I code e.g

/\[((?!127\.0|192\.168)\d+\.\d+\.\d+\.\d+)\]/

telling it (for example) that I'm not interested in addresses
127.0--- and 192.168--- , and feed it e.g

zzz [127.0.0.1] yyy [192.168.1.1] xxx [12.23.34.45] ppp [3.3.3.3]

then it produces a match

12.23.34.45

which is almost exactly what I want. (I just have to reverse the
order of the octets in order to look them up, but that's no big
deal...).
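
One possible way to generate such a pattern from a list rather than typing the alternation by hand is sketched below. It uses Python's `re` module, which accepts the same lookahead syntax; `build_pattern` and its `prefixes`/`exact` parameters are hypothetical names. The sketch also ends each prefix with `\.` and each full address with `\]`, on the assumption that a bare prefix such as `12\.2` should not reject 12.23.x.x.

```python
import re

def build_pattern(prefixes=(), exact=()):
    """Build the bracketed-IP regex with a negative lookahead.

    prefixes: leading octets to skip, e.g. "192.168" (hypothetical input format)
    exact:    full addresses to skip, e.g. "12.23.34.45"
    """
    # Terminate each alternative so it only matches on an octet boundary.
    alternatives = [re.escape(p) + r"\." for p in prefixes]
    alternatives += [re.escape(a) + r"\]" for a in exact]
    # With nothing to skip, omit the lookahead: an empty (?!) never matches.
    lookahead = "(?!" + "|".join(alternatives) + ")" if alternatives else ""
    return re.compile(r"\[(" + lookahead + r"\d+\.\d+\.\d+\.\d+)\]")

text = "zzz [127.0.0.1] yyy [192.168.1.1] xxx [12.23.34.45] ppp [3.3.3.3]"

pattern = build_pattern(prefixes=["127.0", "192.168"])
print(pattern.search(text).group(1))  # 12.23.34.45
```

The resulting `pattern.pattern` string is an ordinary regex, so it could be pasted into any tool that only accepts a Perl-compatible expression.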

I probably look pretty silly now, but I was kindof stuck. Thanks!

Btw, originally I had (\d+)\.(\d+)\.(\d+)\.(\d+) in the regex, and was
using $4.$3.$2.$1 as the reversed octets; but I couldn't see a way to
fold that into your negative lookahead recipe - so I'll tackle it as a
separate step.

cheers
 
Alan J. Flavell

> Maybe you should be looking at what causes this constraint.

It's a fair comment, indeed. I had admittedly simplified the
description; but as it turns out, Abigail seems to have pressed just
the right button ;-)

> Don't ask a single regex to do too much. In this case, you are
> basically asking it to do what the rest of the language was designed
> to do.

Point taken, for sure. Truth is, I was doing something with the PCRE
(perl compatible regex) package, on the borderlines of Perl.
Apologies for any offence caused, but it seems to me that the answer
has actually come out right ;-)

And my thanks to the other contributors - it's appreciated ;-)
 
Ben Morrow

Alan J. Flavell said:
> /\[((?!127\.0|192\.168)\d+\.\d+\.\d+\.\d+)\]/
>
> Btw, originally I had (\d+)\.(\d+)\.(\d+)\.(\d+) in the regex, and was
> using $4.$3.$2.$1 as the reversed octets; but I couldn't see a way to
> fold that into your negative lookahead recipe - so I'll tackle it as a
> separate step.

Ummm... brackets which aren't quantified don't make any difference to
what is matched, only to what is captured.

/\[(?!127\.0|192\.168)(\d+)\.(\d+)\.(\d+)\.(\d+)\]/
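
As a rough illustration of how the four captures could feed the reversed-octet lookup mentioned earlier in the thread, here is a sketch in Python (same regex syntax). The function name `dnsbl_query_name` and the zone `dnsbl.example` are placeholders, and no DNS query is actually made.

```python
import re

# Same pattern as above: the lookahead sits outside the four capture groups.
PATTERN = re.compile(r"\[(?!127\.0|192\.168)(\d+)\.(\d+)\.(\d+)\.(\d+)\]")

def dnsbl_query_name(text, zone="dnsbl.example"):
    """Return the reversed-octet name for the first unrecognised IP, or None."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # m.groups() is ('12', '23', '34', '45'); reversing it gives the
    # equivalent of $4.$3.$2.$1.
    return ".".join(reversed(m.groups())) + "." + zone

text = "zzz [127.0.0.1] yyy [192.168.1.1] xxx [12.23.34.45] ppp [3.3.3.3]"
print(dnsbl_query_name(text))  # 45.34.23.12.dnsbl.example
```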

Ben
 
Alan J. Flavell

> /\[(?!127\.0|192\.168)(\d+)\.(\d+)\.(\d+)\.(\d+)\]/

Oh gosh, thanks. I had been very close to that, at one point before,
but I had one pair of parentheses in the wrong place, it seems.

Here's your recipe in action (courtesy of pcretest):

re> /\[(?!127\.0|192\.168)(\d+)\.(\d+)\.(\d+)\.(\d+)\]/
data> zzz [127.0.0.1] yyy [192.168.1.1] xxx [12.23.34.45] ppp [3.3.3.3]
0: [12.23.34.45]
1: 12
2: 23
3: 34
4: 45

Thanks!
 
