Help beautify ugly heuristic code

  • Thread starter Stuart D. Gathman
  • Start date
S

Stuart D. Gathman

I have a function that recognizes PTR records for dynamic IPs. There is
no hard and fast rule for this - every ISP does it differently, and may
change their policy at any time, and use different conventions in
different places. Nevertheless, it is useful to apply stricter
authentication standards to incoming email when the PTR for the IP
indicates a dynamic IP (namely, the PTR record is ignored since it doesn't
mean anything except to the ISP). This is because Windoze Zombies are the
favorite platform of spammers.

Here is the very ugly code so far. It offends me to look at it, but
haven't had any better ideas. I have lots of test data from mail logs.

# examples we don't yet recognize:
#
# 1Cust65.tnt4.atl4.da.uu.net at ('67.192.40.65', 4588)
# 1Cust200.tnt8.bne1.da.uu.net at ('203.61.67.200', 4144)
# 1Cust141.tnt30.rtm1.nld.da.uu.net at ('213.116.154.141', 2036)
# user64.net2045.mo.sprint-hsd.net at ('67.77.185.64', 3901)
# wiley-268-8196.roadrunner.nf.net at ('205.251.174.46', 4810)
# 221.fib163.satnet.net at ('200.69.163.221', 3301)
# cpc2-ches1-4-0-cust8.lutn.cable.ntl.com at ('80.4.105.8', 61099)
# user239.res.openband.net at ('65.246.82.239', 1392)
# xdsl-2449.zgora.dialog.net.pl at ('81.168.237.145', 1238)
# spr1-runc1-4-0-cust25.bagu.broadband.ntl.com at ('80.5.10.25', 1684)
# user-0c6s7hv.cable.mindspring.com at ('24.110.30.63', 3720)
# user-0c8hvet.cable.mindspring.com at ('24.136.253.221', 4529)
# user-0cdf5j8.cable.mindspring.com at ('24.215.150.104', 3783)
# mmds-dhcp-11-143.plateautel.net at ('63.99.131.143', 4858)
# ca-santaanahub-cuda3-c6b-134.anhmca.adelphia.net at ('68.67.152.134', 62047)
# cbl-sd-02-79.aster.com.do at ('200.88.62.79', 4153)
# h105n6c2o912.bredband.skanova.com at ('213.67.33.105', 3259)

import re

ip3 = re.compile('([0-9]{1,3})[.x-]([0-9]{1,3})[.x-]([0-9]{1,3})')
rehmac = re.compile(
'h[0-9a-f]{12}[.]|pcp[0-9]{6,10}pcs[.]|no-reverse|S[0-9a-f]{16}[.][a-z]{2}[.]'
)

def is_dynip(host,addr):
"""Return True if hostname is for a dynamic ip.
Examples:
>>> is_dynip('post3.fabulousdealz.com','69.60.99.112') False
>>> is_dynip('adsl-69-208-201-177.dsl.emhril.ameritech.net','69.208.201.177') True
>>> is_dynip('[1.2.3.4]','1.2.3.4')
True
"""
if host.startswith('[') and host.endswith(']'):
return True
if addr:
if host.find(addr) >= 0: return True
a = addr.split('.')
ia = map(int,a)
m = ip3.search(host)
if m:
g = map(int,m.groups())
if g == ia[1:] or g == ia[:3]: return True
if g[0] == ia[3] and g[1:] == ia[:2]: return True
g.reverse()
if g == ia[1:] or g == ia[:3]: return True
if rehmac.search(host): return True
if host.find("%s." % '-'.join(a[2:])) >= 0: return True
if host.find("w%s." % '-'.join(a[:2])) >= 0: return True
if host.find("dsl%s-" % '-'.join(a[:2])) >= 0: return True
if host.find(''.join(a[:3])) >= 0: return True
if host.find(''.join(a[1:])) >= 0: return True
x = "%02x%02x%02x%02x" % tuple(ia)
if host.lower().find(x) >= 0: return True
z = [n.zfill(3) for n in a]
if host.find('-'.join(z)) >= 0: return True
if host.find("-%s." % '-'.join(z[2:])) >= 0: return True
if host.find("%s." % ''.join(z[2:])) >= 0: return True
if host.find(''.join(z)) >= 0: return True
a.reverse()
if host.find("%s." % '-'.join(a[:2])) >= 0: return True
if host.find("%s." % '.'.join(a[:2])) >= 0: return True
if host.find("%s." % a[0]) >= 0 and \
host.find('.adsl.') > 0 or host.find('.dial-up.') > 0: return True
return False

if __name__ == '__main__':
import fileinput
for ln in fileinput.input():
a = ln.split()
if len(a) == 2:
ip,host = a
if host.startswith('[') and host.endswith(']'):
continue # no PTR
if is_dynip(host,ip):
print ip,host
 
M

Mitja

I have a function that recognizes PTR records for dynamic IPs....
Here is the very ugly code so far.
...
# examples we don't yet recognize:
...

This doesn't help much; post example of all the possible patterns you have
to match (kind of like the docstring at the beginning, only more
elaborate), otherwise it's hard to know what kind of code you're trying to
implement.
 
S

Stuart D. Gathman

This doesn't help much; post example of all the possible patterns you
have to match (kind of like the docstring at the beginning, only more
elaborate), otherwise it's hard to know what kind of code you're trying
to implement.

This is a heuristic, so there is no exhaustive list or hard rule.
However, I have posted 23K+ examples at http://bmsi.com/python/dynip.samp
with DYN appended for examples which the current algorithm classifies
as dynamic.

Here are the last 20 (which my subjective judgement says are correct):

65.112.76.15 usfshlxmx01.myreg.net
201.128.108.41 dsl-201-128-108-41.prod-infinitum.com.mx DYN
206.221.177.128 mail128.tanthillyingyang.com
68.234.254.147 68-234-254-147.stmnca.adelphia.net DYN
63.110.30.30 mx81.goingwiththe-flow.info
62.178.226.189 chello062178226189.14.15.vie.surfer.at DYN
80.179.107.85 80.179.107.85.ispeednet.net DYN
200.204.68.52 200-204-68-52.dsl.telesp.net.br DYN
12.203.156.234 12-203-156-234.client.insightBB.com DYN
200.83.68.217 CM-lconC1-68-217.cm.vtr.net DYN
81.57.115.43 pauguste-3-81-57-115-43.fbx.proxad.net DYN
64.151.91.225 sv2a.entertainmentnewsclips.com
64.62.197.31 teenfreeway.sparklist.com
201.9.136.235 201009136235.user.veloxzone.com.br DYN
66.63.187.91 91.asandox.com
83.69.188.198 st11h07.ptambre.com
66.192.199.217 66-192-199-217.pyramidcollection.org DYN
69.40.166.49 h49.166.40.69.ip.alltel.net DYN
203.89.206.62 smtp.immigrationexpert.ca
80.143.79.97 p508F4F61.dip0.t-ipconnect.de DYN
 
L

Lonnie Princehouse

Regular expressions.

It takes a while to craft the expressions, but this will be more
elegant, more extensible, and considerably faster to compute (matching
compiled re's is fast).

Example using the top five from your function's comments:

.. host_patterns = [
.. '^1Cust\d+\.tnt\d+\..*\.da\.uu\.net$',
.. '^user\d+\.net\d+\.mo\.sprint-hsd\.net$',
.. '^.*\.roadrunner\.nf\.net$',
.. ]
..
.. host_expr = re.compile('|'.join(host_patterns))
..
.. # only implementing host string matching, but you get the idea.
.. def is_dynip(host):
.. # host names are case insensitive
.. host = host.lower()
.. return host_expr.match(host) is not None
False


(the dots preceding code are to fool indentation stripping... please
ignore them)
 
S

Stuart D. Gathman

Regular expressions.

It takes a while to craft the expressions, but this will be more
elegant, more extensible, and considerably faster to compute (matching
compiled re's is fast).

I'm already doing that with the rehmac regex. I like your idea for making
it more readable, though. Looking for permutations of the IP address
gives much more bang for the line of code than most host only regexes
since it is ISP independent. At least one ISP uses roman numerals to code
the IP for their dynamic addresses! I tried matching a custom regex
computed from the IP, but compiling the regex for each test was too slow.

I could keep adding more patterns, but I was hoping for a tool that
"learns" from a database of preclassified examples how to recognize the
pattern. And the resulting data would be reasonably compact. I don't ask
for much, do I? A Bayesian classifier would have too big of a database, I
think. I've seen neural nets do amazing things with only 100 or so
neurons - a small weight database. But they are slow in software.

I have posted 10K preclassified (by current algorithm) examples here:
http://bmsi.com/python/dynip.samp
 
C

Carlos Ribeiro

Regular expressions.

It takes a while to craft the expressions, but this will be more
elegant, more extensible, and considerably faster to compute (matching
compiled re's is fast).

I think that this problem is probably a little bit harder. As the OP
noted, each ISP uses a different notation. I think that a better
solution is to use a statistical approach, possibly using a custom
Bayesian filter that could "learn" a little bit about some patters.

The basic idea is as follows:

-- break the URL in pieces, using not only the dots, but also hyphens
and underscores in the name.

-- classify each part, using REs to identify common patterns: frequent
strings (com, gov, net, org); normal words (sequences of letters);
normal numbers; combinations of numbers & letters; common substrings
can also be identified (such as isp, in the middle of one of the
strings).

-- check these pieces against the Bayesian filter, pretty much as it's
done for spam.

I think that this approach is promising. It relies on the fact that
real servers usually do not have numbers in their names; however,
exact identification either by a match or a regular expression is very
difficult. I'm willing to try it, but first, more data is needed.

--
Carlos Ribeiro
Consultoria em Projetos
blog: http://rascunhosrotos.blogspot.com
blog: http://pythonnotes.blogspot.com
mail: (e-mail address removed)
mail: (e-mail address removed)
 
L

Lonnie Princehouse

I don't think a Bayesian classifier is going to be very helpful here,
unless you have tens of thousands of examples to feed it, or unless it
was specially coded to first break addresses into better tokens for
classification (such as alphanumeric strings and numbers).

The series of if host.find(...) lines in is_dynip() is equivalent to a
regular expression, but much more expensive to execute because of all
the list slicing, and it won't benefit from the re module's speedy
native implementation of regular expressions.

Try building a host_expr (as per my previous post) in the following
way:

# suppose dynamic_host_list is a list of all the host strings already
known to be
# dynamic.

.. host_patterns = {} # use a dict to guarantee uniqueness. sets would
also work.
.. number_expr = re.compile("\d+")
.. for dynamic_host in dynamic_host_list:
.. pattern = '^' + number_expr.sub("\d+", dynamic_host) + '$'
.. host_patterns[pattern] = True
..
.. host_expr = re.compile('|'.join(host_patterns.keys()))

This will catch any hostname that differs only in numbers from any
other host you've
already classified.

For IP addresses, you really just need a mechanism to filter blocks of
IP addresses. It might be easiest to first convert them into hex and
then make liberal use of [0-f] in regular expressions.
 
S

Stuart D. Gathman

I don't think a Bayesian classifier is going to be very helpful here,
unless you have tens of thousands of examples to feed it, or unless it

We do have tens of thousands of examples to feed it.
The series of if host.find(...) lines in is_dynip() is equivalent to a
regular expression, but much more expensive to execute because of all

It is not equivalent, because the patterns are based on the IP address.
As I mentioned before, I tried building a custom regex from the IP for
each test - but compiling the regex is way too slow to be done for each
test.
For IP addresses, you really just need a mechanism to filter blocks of
IP addresses. It might be easiest to first convert them into hex and
then make liberal use of [0-f] in regular expressions.

The point of the ip address is *not* to recognize ip addresses. The
point is to look for transformations of the ip address in the hostname.
This gives a *huge* bang for the buck. I have been working on this
problem for a while. If the hostname has a transformation of the ip
address - it is (almost certainly) a dynamic address. The ISPs are very
creative in their transformations, using the parts of the ip in various
orders and encoding in hex, base64, decimal with or without zerofill, and
even roman numerals.

The regex engine is just not powerful enough to handle parameterized
regexe (that I know of).
 
L

Lonnie Princehouse

Doh! I misread "a" as host instead of ip in your first post. I'm
sorry about that; I really must slow down. Anyhow,

I believe you can still do this with only compiling a regex once and
then performing a few substitutions on the hostname.

Substitutions:

1st byte of IP => (0)
2nd byte of IP => (1)
3rd byte of IP => (2)
4th byte of IP => (3)
and likewise for hex => (x0) (x1) (x2) (x3)

Each host string will possibly map into multiple expansions, esp. if a
number repeats itself in the IP, or if an IP byte is less than 10 (such
that the decimal and hex representations are the same). Zero-padded
and unpadded will both have to be substituted, and it's probably best
to not to alter the last two fields in the host name since ISPs can't
change those.

With this scheme, here are a few expansions of (ip,host) tuples:

172.182.240.186 ACB6F0BA.ipt.aol.com
becomes
(x0)(x1)(x2)(x3).ipt.aol.com

67.119.55.77 adsl-67-119-55-77.dsl.lsan03.pacbell.net
becomes
adsl-(0)-(1)-(2)-(3).dsl.lsan03.pacbell.net
adsl-(0)-(1)-(2)-(x1).dsl.lsan03.pacbell.net


81.220.220.143 ip-143.net-81-220-220.henin.rev.numericable.fr
becomes
ip-(3).net-(0)-(1)-(1).henin.rev.numericable.fr
ip-(3).net-(0)-(1)-(2).henin.rev.numericable.fr
ip-(3).net-(0)-(2)-(1).henin.rev.numericable.fr
ip-(3).net-(0)-(2)-(2).henin.rev.numericable.fr

etcetera.

Now you can run a precompiled regular expression against these hostname
permutations, i.e. ".*\(0\).*\(1\).*\(2\).*\(3\).*" would match any
host in which the IP address numbers appeared in the correct order.

There are only a handful dynamic addresses in your sample data that
don't match a decimal or hexadecimal IP-based pattern, e.g.

68.53.109.99 pcp03902856pcs.nash01.tn.comcast.net
68.147.136.167 s01060050bf91c1e4.cg.shawcable.net
 
M

Mitja

I have a function that recognizes PTR records for dynamic IPs. There is
no hard and fast rule for this - every ISP does it differently, and may
change their policy at any time, and use different conventions in
different places. Nevertheless, it is useful to apply stricter
authentication standards to incoming email when the PTR for the IP
indicates a dynamic IP (namely, the PTR record is ignored since it
doesn't
mean anything except to the ISP). This is because Windoze Zombies are
the
favorite platform of spammers.

This is roughly it.... you'll have to experiment and find the right
numbers for different pattern matches, maybe even add some extra criteria
etc. I don't have the time for it right now, but I'd be interested to know
how much my code and yours differ in the detection process (i.e. where are
the return values different).

Hope the indentation makes it through alright.

#!/usr/bin/python

import re
reNum = re.compile(r'\d+')
reWord = re.compile(r'(?<=[^a-z])[a-z]+(?=[^a-z])|^[a-z]+(?=[^a-z])')
#words that imply a dynamic ip
dynWords = ('dial','dialup','dialin','adsl','dsl','dyn','dynamic')
#words that imply a static ip
staticWords = ('cable','static')

def isDynamic(host, ip):
"""
Heuristically checks whether hostname is likely to represent
a dynamic ip.
Returns True or False.
"""

#for easier matching
ip=[int(p) for p in ip.split('.')]
host=host.lower()

#since it's heuristic, we'll give the hostname
#(de)merits for every pattern it matches further on.
#based on the value of these points, we'll decide whether
#it's dynamic or not
points=0;

#the ip numbers; finding those in the hostname speaks
#for itself; also include hex and oct representations
#lowest ip byte is even more suggestive, give extra points
#for matching that
for p in ip[:3]:
#bytes 0, 1, 2
if (host.find(`p`) != -1) or (host.find(oct(p)[1:]) != -1): points+=20
#byte 3
if (host.find(`ip[3]`) != -1) or (host.find(oct(ip[3])[1:]) != -1):
points+=60
#it's hard to distinguish hex numbers from "normal"
#chars, so we simplify it a bit and only search for
#last two bytes of ip concatenated
if host.find(hex(ip[3])[2:]+hex(ip[3])[2:]) != -1: points+=60

#long, seemingly random serial numbers in the hostname are also a hint
#search for all numbers and "award" points for longer ones
for num in reNum.findall(host):
points += min(len(num)**2,60);

#substrings that are more than just a hint of a dynamic ip
for word in reWord.findall(host):
if word in dynWords: points+=30
if word in staticWords: points-=30

print '[[',points,']]'
return points>80

if __name__=='__main__':
for line in open('dynip.samp').readlines()[:50]:
(ip,host) = line.rstrip('DYN').split()[:2]
if host.find('.') != -1:
print host, ip, ['','DYNAMIC'][isDynamic(host,ip)]
 
J

Jeremy Sanders

Here are the last 20 (which my subjective judgement says are correct):

65.112.76.15 usfshlxmx01.myreg.net 201.128.108.41 [snip]
80.143.79.97 p508F4F61.dip0.t-ipconnect.de DYN

Looks like you could do something like look for a part of the dns name
which is over a certain length with a high fraction of numbers and
punctuation characters.

Of course, you could build up a probability tree of the chance of a
character being followed by another character in a dynamic name and a
static name.

This might work well, as many are sequences of numbers (high chance of
number followed by number), and mixed characters and numbers, which are
unusual in normal dns names.

Jeremy
 
S

Stuart D. Gathman

I believe you can still do this with only compiling a regex once and
then performing a few substitutions on the hostname.

Cool idea. Convert ip matches to fixed patterns before matching a fixed
regex. The leftovers like shaw cable (which has the MAC address of the
cable modem instead of the IP) can still be handled with regex patterns.

I had an idea last night to compile 254 regexes, one for each possible
last IP byte - but I think your idea is better.

Mitja suggested a socring system reminiscent of SpamAssassin.

That gives me a few things to try.
 
J

JanC

Stuart D. Gathman schreef:
I have a function that recognizes PTR records for dynamic IPs. There
is no hard and fast rule for this - every ISP does it differently, and
may change their policy at any time, and use different conventions in
different places. Nevertheless, it is useful to apply stricter
authentication standards to incoming email when the PTR for the IP
indicates a dynamic IP (namely, the PTR record is ignored since it
doesn't mean anything except to the ISP). This is because Windoze
Zombies are the favorite platform of spammers.

Did you also think about ISPs that use such a PTR record for both dynamic
and fixed IPs?
 
S

Stuart D. Gathman

Stuart D. Gathman schreef:
Did you also think about ISPs that use such a PTR record for both dynamic
and fixed IPs?

There seems to be a lot of misunderstanding about this. I am not blocking
anyones mail because they have a dynamic looking PTR. I simply don't
accept such a PTR as MTA authentication. You see, MTAs *SHOULD* provide a
fully qualified domain as their HELO name which resolves to the IP of the
MTA. Sadly, however, many internet facing MTAs don't do this, but I
accept a meaningful PTR as a substitute. I also waive the requirement for
MTA authentication if the MAIL FROM has an SPF record
(http://spf.pobox.com).

So, if your MTA complies with RFC HELO recommendations, you'll have no
trouble sending me mail. You can even use a dynamic IP with a dynamic DNS
service.

I 'do* block PTR names of "." or "localhost". I would like to block all
single word HELO names - but there are too many clueless mail admins out
there. People seem to be unsure of what to send for HELO.
 
S

Stuart D. Gathman

I believe you can still do this with only compiling a regex once and
then performing a few substitutions on the hostname.

That is a interesting idea. Convert ip matches to fixed patterns, and
*then* match the regex. I think I would convert hex matches to the same
pattern as decimal (and roman numeral). How would you handle zero fill?

1.2.3.4 001002003004foo.isp.com

An idea I had last night is to precompile 254 regexes - one for each of
the possible last ip bytes. However, your idea is cleaner - except, how
would it handle ip bytes that are the same: 1.2.2.2

Mitja has proposed a scoring system reminiscent of SpamAssassin.

This gives me a few things to try.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,901
Latest member
Noble71S45

Latest Threads

Top