re.compile and very specific searches

rbt

Is it possible to use re.compile to exclude certain numbers? For
example, this will find IP addresses:

ip = re.compile('\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}')

But it will also find 999.999.999.999 (something which could not
possibly be an IPv4 address). Can re.compile be configured to filter
results like this out?
 
Diez B. Roggisch

rbt said:
Is it possible to use re.compile to exclude certain numbers? For
example, this will find IP addresses:

ip = re.compile('\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}')

But it will also find 999.999.999.999 (something which could not
possibly be an IPv4 address). Can re.compile be configured to filter
results like this out?

You could use another regular expression, e.g. like this:

import re

rex = re.compile(r"^((\d)|(1\d{1,2})|(2[0-5]\d))$")

for i in range(1000):
    m = rex.match(str(i))
    if m:
        print(m.groups(), i)


This of course covers only one number. Extend it accordingly to the full
IPv4 address format.

However, you won't be able to suppress the matching of e.g. 259 by that
regular expression. So I'd suggest you dump re and do it like this:

address = "192.168.1.1"

def validate_ip4(address):
    digits = address.split(".")
    if len(digits) == 4:
        for d in digits:
            if int(d) < 0 or int(d) > 255:
                return False
        return True
 
John Machin

Diez B. Roggisch wrote:

So I'd suggest you dump re and do it like this:

address = "192.168.1.1"

def validate_ip4(address):
    digits = address.split(".")
    if len(digits) == 4:
        for d in digits:
            if int(d) < 0 or int(d) > 255:
                return False
        return True

The OP wanted to "find" IP addresses -- unclear whether re.search or
re.match is required. Your solution doesn't address the search case.
For the match case, it needs some augmentation. It will fall apart if
presented with something like "..." or "comp.lang.python.announce". AND
while I'm at it ... in the event of a valid string of digits, it will
evaluate int(d) twice, rather unnecessarily & uglily.

So: match case:

    for s in strings_possibly_containing_digits:
        # Prettier, but the test on zero is now redundant:
        # if not (s.isdigit() and 0 <= int(s) <= 255):
        if not s.isdigit() or int(s) > 255:
            return False

and the search case: DON'T dump re; it can find highly probable
candidates (using a regexp like the OP's original or yours) a damn
sight faster than anything else this side of C or Pyrex. Then you
validate the result, with a cut-down validator that relies on the fact
that there are 4 segments and they contain only digits:

    # no need to test length == 4
    for s in address.split('.'):
        if int(s) > 255:
            return False
HTH,
John
 
John Machin

Diez said:
You could use another regular expression, e.g. like this:


rex = re.compile(r"^((\d)|(1\d{1,2})|(2[0-5]\d))$")

This approach would actually work without the need for subsequent
validation, if implemented properly. Not only as you noted does it let
"259" through, but also it doesn't cover 2-digit numbers starting with
2. Assuming excess leading zeroes are illegal, the components required
are:

\d
[1-9]\d
1\d\d
2[0-4]\d
25[0-5]
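
As a rough sketch, the five components above can be joined into a single alternation and checked against every integer from 0 to 999. The names OCTET and octet_re are made up for this illustration, and Python 3 is assumed:

```python
import re

# The five components listed above, written longest first for readability.
# OCTET and octet_re are illustrative names, not from the thread.
OCTET = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]\d|\d)"
octet_re = re.compile(OCTET)

# fullmatch() requires the whole string to match, so no ^ or $ is needed.
accepted = [n for n in range(1000) if octet_re.fullmatch(str(n))]

print(accepted == list(range(256)))   # exactly 0..255 get through
print(octet_re.fullmatch("01"))       # excess leading zero is rejected
```

Under these assumptions the pattern accepts exactly 0 through 255, including the two-digit numbers starting with 2 that the earlier expression missed.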
 
Diez B. Roggisch

The OP wanted to "find" IP addresses -- unclear whether re.search or
re.match is required. Your solution doesn't address the search case.
For the match case, it needs some augmentation. It will fall apart if
presented with something like "..." or "comp.lang.python.announce". AND
while I'm at it ... in the event of a valid string of digits, it will
evaluate int(d) twice, rather unnecessarily & uglily.


You are right of course. I concentrated on the right value range, but bogus
entries should be dealt with, too.
    for s in strings_possibly_containing_digits:
        # Prettier, but the test on zero is now redundant:
        # if not (s.isdigit() and 0 <= int(s) <= 255):
        if not s.isdigit() or int(s) > 255:
            return False

Instead of this, I'd go for

def validate_ip4(address):
    digits = address.split(".")
    if len(digits) == 4:
        try:
            for d in digits:
                d = int(d)
                if d < 0 or d > 255:
                    return False
            return True
        except ValueError:
            pass
    return False

And I don't think that isdigit() is necessarily faster than int(). They
basically do the same thing.
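
For what it's worth, the two checks overlap but do not accept exactly the same strings, which matters for the validators in this thread. A small illustration, assuming Python 3 (where str is Unicode); the helper name int_accepts is hypothetical:

```python
def int_accepts(s):
    """Return True if int(s) succeeds (hypothetical helper for this sketch)."""
    try:
        int(s)
        return True
    except ValueError:
        return False

# Print each sample with the verdict of str.isdigit() and of int().
samples = ["12", "-1", "+5", " 12 ", "", "\u00b2"]  # "\u00b2" is SUPERSCRIPT TWO
for s in samples:
    print(repr(s), s.isdigit(), int_accepts(s))
```

int() tolerates a sign and surrounding whitespace, which is why the int()-based version still needs its d < 0 test, while isdigit() rejects both up front.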
and the search case: DON'T dump re; it can find highly probable
candidates (using a regexp like the OP's original or yours) a damn
sight faster than anything else this side of C or Pyrex. Then you
validate the result, with a cut-down validator that relies on the fact
that there are 4 segments and they contain only digits:

The search case needs a regular expression. But the OP didn't say much about
what he actually wants.
 
Diez B. Roggisch

This approach would actually work without the need for subsequent
validation, if implemented properly. Not only as you noted does it let
"259" through, but also it doesn't cover 2-digit numbers starting with
2. Assuming excess leading zeroes are illegal, the components required
are:

Damn. Certainly not my glory regular expression day.
 
rbt

John said:
Diez B. Roggisch wrote:





The OP wanted to "find" IP addresses -- unclear whether re.search or
re.match is required. Your solution doesn't address the search case.
For the match case, it needs some augmentation. It will fall apart if
presented with something like "..." or "comp.lang.python.announce". AND
while I'm at it ... in the event of a valid string of digits, it will
evaluate int(d) twice, rather unnecessarily & uglily.

So: match case:

    for s in strings_possibly_containing_digits:
        # Prettier, but the test on zero is now redundant:
        # if not (s.isdigit() and 0 <= int(s) <= 255):
        if not s.isdigit() or int(s) > 255:
            return False

and the search case: DON'T dump re; it can find highly probable
candidates (using a regexp like the OP's original or yours) a damn
sight faster than anything else this side of C or Pyrex. Then you
validate the result, with a cut-down validator that relies on the fact
that there are 4 segments and they contain only digits:

This is what I ended up doing... re.compile and then findall(data) does an excellent
job finding all strings that look like ipv4 addys, then the split works just as well
in weeding out strings that are not actual ipv4 addys.
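
A minimal sketch of that find-then-validate approach, assuming Python 3. The names candidate_re and find_ip4 and the sample text are invented for illustration, and the pattern is the original one with the dots escaped so that they match only a literal ".":

```python
import re

# Loose candidate pattern: four runs of 1-3 digits separated by literal dots.
candidate_re = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")

def find_ip4(data):
    """Return the candidates in data whose four segments are all <= 255."""
    return [c for c in candidate_re.findall(data)
            if all(int(s) <= 255 for s in c.split("."))]

# Made-up sample text for illustration.
data = "ok 192.168.1.1, bad 999.999.999.999, edge 255.255.255.0"
print(find_ip4(data))
```

Because the regex guarantees four all-digit segments, the validator only has to check the upper bound. This sketch does not guard against candidates embedded in longer digit runs.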

Thanks to all for the advice!
 
Denis S. Otkidach

Is it possible to use re.compile to exclude certain numbers? For
example, this will find IP addresses:

ip = re.compile('\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}')

But it will also find 999.999.999.999 (something which could not
possibly be an IPv4 address). Can re.compile be configured to filter
results like this out?

Try this one:
re.compile(r'\b%s\b' % r'\.'.join(['(?:(?:2[0-4]|1\d|[1-9])?\d|25[0-5])']*4))
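
A short usage sketch for this pattern, assuming Python 3. The names octet and ip_re and the sample text are made up for illustration:

```python
import re

# The same pattern as above, with the per-octet part pulled into a variable.
octet = r'(?:(?:2[0-4]|1\d|[1-9])?\d|25[0-5])'
ip_re = re.compile(r'\b%s\b' % r'\.'.join([octet] * 4))

# Made-up sample text for illustration.
data = "ok 192.168.1.1, bad 999.999.999.999, edge 255.255.255.0"

# The pattern has no capturing groups, so findall() returns whole matches.
print(ip_re.findall(data))
```

Here the range check is done by the regular expression itself, so no separate validation pass is needed for the segments.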
 
