re.compile and very specific searches

rbt · Feb 18, 2005

Is it possible to use re.compile to exclude certain numbers? For
example, this will find IP addresses:

ip = re.compile('\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}')

But it will also find 999.999.999.999 (something which could not
possibly be an IPv4 address). Can re.compile be configured to filter
results like this out?
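A quick sketch of the problem, assuming the pattern is used with re.search. It also shows a second issue the question does not mention: an unescaped "." matches any character, not just a literal dot. The names out_of_range, dashes and ip_escaped are made up for illustration.

```python
import re

# The pattern from the question, written as a raw string.
ip = re.compile(r'\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}')

# An out-of-range candidate is still matched, as the question states.
out_of_range = ip.search("host 999.999.999.999 down")

# The unescaped "." matches any character, so dashes pass as separators too.
dashes = ip.search("1-2-3-4")

# Escaping the dots fixes the separator issue, but not the range issue.
ip_escaped = re.compile(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}')

print(out_of_range.group())
print(dashes.group())
print(ip_escaped.search("1-2-3-4"))
```

So escaping the dots is worth doing either way, but the range check still has to come from somewhere else.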

Diez B. Roggisch · Feb 18, 2005

rbt said:
Is it possible to use re.compile to exclude certain numbers? For
example, this will find IP addresses:

ip = re.compile('\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}')

But it will also find 999.999.999.999 (something which could not
possibly be an IPv4 address). Can re.compile be configured to filter
results like this out?

You could use another regular expression, e.g. like this:

import re

# Pattern for a single number (one segment of the address).
rex = re.compile(r"^((\d)|(1\d{1,2})|(2[0-5]\d))$")

for i in range(1000):
    m = rex.match(str(i))
    if m:
        print(m.groups(), i)

This is of course only for one number. Extend it accordingly to the ip4
address format.

However, you won't be able to suppress the matching of e.g. 259 by that
regular expression. So I'd suggest you dump re and do it like this:

address = "192.168.1.1"

def validate_ip4(address):
    digits = address.split(".")
    if len(digits) == 4:
        for d in digits:
            if int(d) < 0 or int(d) > 255:
                return False
        return True

John Machin · Feb 18, 2005

Diez B. Roggisch wrote:

So I'd suggest you dump re and do it like this:

address = "192.168.1.1"

def validate_ip4(address):
    digits = address.split(".")
    if len(digits) == 4:
        for d in digits:
            if int(d) < 0 or int(d) > 255:
                return False
        return True

The OP wanted to "find" IP addresses -- unclear whether re.search or
re.match is required. Your solution doesn't address the search case.
For the match case, it needs some augmentation. It will fall apart if
presented with something like "..." or "comp.lang.python.announce". AND
while I'm at it ... in the event of a valid string of digits, it will
evaluate int(d) twice, rather unnecessarily & uglily.

So: match case:

    for s in strings_possibly_containing_digits:
        # if not (s.isdigit() and 0 <= int(s) <= 255):  # prettier, but the test on zero is now redundant
        if not s.isdigit() or int(s) > 255:
            return False

and the search case: DON'T dump re; it can find highly probable
candidates (using a regexp like the OP's original or yours) a damn
sight faster than anything else this side of C or Pyrex. Then you
validate the result, with a cut-down validator that relies on the fact
that there are 4 segments and they contain only digits:

    # no need to test length == 4
    for s in address.split('.'):
        if int(s) > 255:
            return False

HTH,
John

John Machin · Feb 18, 2005

Diez said:
You could use another regular expression, e.g. like this:

rex = re.compile(r"^((\d)|(1\d{1,2})|(2[0-5]\d))$")

This approach would actually work without the need for subsequent
validation, if implemented properly. Not only as you noted does it let
"259" through, but also it doesn't cover 2-digit numbers from 20 to
99. Assuming excess leading zeroes are illegal, the components required
are:

\d
[1-9]\d
1\d\d
2[0-4]\d
25[0-5]
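
As a sketch, the five components can be joined into one alternation per segment and repeated four times. The names OCTET, IPV4_FULL and is_ipv4 are made up for this example, and re.ASCII is an added assumption so that \d means 0-9 only under Python 3.

```python
import re

# One segment: the five alternatives listed above. Longest first, which
# matters if the pattern is ever used without anchors.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)"

# Anchored pattern for the match case: the whole string must be an address.
IPV4_FULL = re.compile(r"\A" + r"\.".join([OCTET] * 4) + r"\Z", re.ASCII)

def is_ipv4(text):
    """Return True if text is exactly four dot-separated numbers in 0..255."""
    return IPV4_FULL.match(text) is not None

for candidate in ["192.168.1.1", "259.1.1.1", "25.1.1.1", "01.2.3.4", "1.2.3"]:
    print(candidate, is_ipv4(candidate))
```

With this pattern "259.1.1.1" and "01.2.3.4" are rejected, while "25.1.1.1" is accepted.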

Diez B. Roggisch · Feb 18, 2005

John Machin said:
The OP wanted to "find" IP addresses -- unclear whether re.search or
re.match is required. Your solution doesn't address the search case.
For the match case, it needs some augmentation. It will fall apart if
presented with something like "..." or "comp.lang.python.announce". AND
while I'm at it ... in the event of a valid string of digits, it will
evaluate int(d) twice, rather unnecessarily & uglily.

You are right of course. I concentrated on the right value range, but bogus
entries should be dealt with, too.

    for s in strings_possibly_containing_digits:
        # if not (s.isdigit() and 0 <= int(s) <= 255):  # prettier, but the test on zero is now redundant
        if not s.isdigit() or int(s) > 255:
            return False

Instead of this, I'd go for

def validate_ip4(address):
    digits = address.split(".")
    if len(digits) == 4:
        try:
            for d in digits:
                d = int(d)
                if d < 0 or d > 255:
                    return False
            return True
        except ValueError:
            pass
    return False

And I don't think isdigit() is necessarily faster than int(). They
basically do the same thing.
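
One caveat, sketched here with a hypothetical helper accepted_by_int: the two checks do not accept exactly the same strings. int() tolerates surrounding whitespace and a leading sign, which isdigit() rejects, so a validator built on int() alone would let a segment such as "+12" or "-0" through.

```python
def accepted_by_int(s):
    """Return True if int(s) succeeds (hypothetical helper for this comparison)."""
    try:
        int(s)
        return True
    except ValueError:
        return False

samples = ["12", " 12 ", "+12", "-0", ""]

# Strings on which the two checks disagree.
differences = [s for s in samples if accepted_by_int(s) != s.isdigit()]
print(differences)
```

Whether that matters depends on how strict the validation needs to be.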

and the search case: DON'T dump re; it can find highly probable
candidates (using a regexp like the OP's original or yours) a damn
sight faster than anything else this side of C or Pyrex. Then you
validate the result, with a cut-down validator that relies on the fact
that there are 4 segments and they contain only digits:

The search case needs a regular expression. But the OP didn't say much about
what he actually wants.

Diez B. Roggisch · Feb 18, 2005

John Machin said:
This approach would actually work without the need for subsequent
validation, if implemented properly. Not only as you noted does it let
"259" through, but also it doesn't cover 2-digit numbers from 20 to
99. Assuming excess leading zeroes are illegal, the components required
are:

Damn. Certainly not my glory regular expression day.

rbt · Feb 18, 2005

John said:
Diez B. Roggisch wrote:

The OP wanted to "find" IP addresses -- unclear whether re.search or
re.match is required. Your solution doesn't address the search case.
For the match case, it needs some augmentation. It will fall apart if
presented with something like "..." or "comp.lang.python.announce". AND
while I'm at it ... in the event of a valid string of digits, it will
evaluate int(d) twice, rather unnecessarily & uglily.

So: match case:

    for s in strings_possibly_containing_digits:
        # if not (s.isdigit() and 0 <= int(s) <= 255):  # prettier, but the test on zero is now redundant
        if not s.isdigit() or int(s) > 255:
            return False

and the search case: DON'T dump re; it can find highly probable
candidates (using a regexp like the OP's original or yours) a damn
sight faster than anything else this side of C or Pyrex. Then you
validate the result, with a cut-down validator that relies on the fact
that there are 4 segments and they contain only digits:

This is what I ended up doing... re.compile and then findall(data) does an excellent
job finding all strings that look like ipv4 addys, then the split works just as well
in weeding out strings that are not actual ipv4 addys.
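
A minimal sketch of that two-step approach. The final code was not posted, so the pattern (the original regex with the dots escaped) and the names CANDIDATE, find_ipv4 and sample are assumptions for illustration.

```python
import re

# Loose candidate pattern: four dot-separated runs of 1-3 digits.
CANDIDATE = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def find_ipv4(data):
    """Return candidates found in data whose four segments are all <= 255."""
    found = []
    for candidate in CANDIDATE.findall(data):
        # The regex guarantees four all-digit segments, so only the
        # upper bound needs checking here.
        if all(int(part) <= 255 for part in candidate.split(".")):
            found.append(candidate)
    return found

sample = "up: 192.168.1.1, bogus: 999.999.999.999, gw: 10.0.0.254"
print(find_ipv4(sample))
```

The loose pattern can still match inside a longer run of digits, so wrapping it in \b word boundaries would be a reasonable refinement.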

Thanks to all for the advice!

Denis S. Otkidach · Feb 19, 2005

rbt said:
Is it possible to use re.compile to exclude certain numbers? For
example, this will find IP addresses:

ip = re.compile('\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}')

But it will also find 999.999.999.999 (something which could not
possibly be an IPv4 address). Can re.compile be configured to filter
results like this out?

Try this one:
re.compile(r'\b%s\b' % r'\.'.join(['(?:(?:2[0-4]|1\d|[1-9])?\d|25[0-5])']*4))
