URL parsing for the hard cases

John Nagle · Jul 22, 2007

Is there something available that will parse the "netloc" field as
returned by URLparse, including all the hard cases? The "netloc" field
can potentially contain a port number and a numeric IP address. The
IP address may take many forms, including an IPv6 address.

I'm parsing URLs used by hostile sites, and the weird cases come up
all too frequently.

John Nagle

Miles · Jul 22, 2007

Is there something available that will parse the "netloc" field as
returned by URLparse, including all the hard cases? The "netloc" field
can potentially contain a port number and a numeric IP address. The
IP address may take many forms, including an IPv6 address.

What do you mean by "parse" the field? What do you want to get back
from the parser function?

memracom · Jul 22, 2007

Is there something available that will parse the "netloc" field as
returned by URLparse, including all the hard cases? The "netloc" field
can potentially contain a port number and a numeric IP address. The
IP address may take many forms, including an IPv6 address.

I'm parsing URLs used by hostile sites, and the weird cases come up
all too frequently.

I assume that when you say "netloc" you are referring to the second
field returned by the urlparse module. If this netloc contains an IPv6
address then it will also contain square brackets. The colons inside
the [] belong to the IPv6 address and the single possible colon
outside the brackets belongs to the port number. Of course, you might
want to try to help people who do not follow the RFCs and failed to
wrap the IPv6 address in square brackets. In that case, try...except
comes in handy. You can try to parse an IPv6 address and if it fails
because of too many segments, then fall back to some other behaviour.

The worst case is a URL like http://2001::123:4567:abcd:8080/something.
Does the 8080 refer to a port number or part of the IPv6 address? If I
had to support non-bracketed IPv6 addresses, then I would interpret
this as http://[2001::123:4567:abcd]:8080/something.
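
A minimal sketch of the bracket rule and the fallback described above, written for Python 3 with the standard ipaddress module (both newer than this thread). The name split_hostport is hypothetical, the userinfo part ("user@") is assumed to be stripped already, and treating a trailing all-digit segment as the port is just the interpretation suggested for the unbracketed example, not something the RFC defines.

```python
import ipaddress


def split_hostport(netloc):
    """Split a netloc (without userinfo) into (host, port); port may be None.

    Hypothetical helper: raises ValueError when the netloc cannot be split.
    """
    if netloc.startswith("["):
        # RFC 3986 form: colons inside [] belong to the IPv6 address.
        host, bracket, rest = netloc[1:].partition("]")
        if not bracket:
            raise ValueError("missing closing bracket")
        ipaddress.IPv6Address(host)  # raises ValueError if malformed
        if rest == "":
            return host, None
        if rest.startswith(":") and rest[1:].isdigit():
            return host, int(rest[1:])
        raise ValueError("unexpected text after bracket")
    if netloc.count(":") > 1:
        # Non-conforming unbracketed IPv6: try "address:port" first,
        # then fall back to treating the whole string as an address.
        head, _, tail = netloc.rpartition(":")
        try:
            ipaddress.IPv6Address(head)
            if tail.isdigit():
                return head, int(tail)
        except ValueError:
            pass
        ipaddress.IPv6Address(netloc)  # raises ValueError if malformed
        return netloc, None
    # At most one colon: plain host with an optional port.
    host, sep, port = netloc.partition(":")
    if sep and not port.isdigit():
        raise ValueError("port is not numeric")
    return host, int(port) if sep else None


print(split_hostport("[2001:db8::1]:8080"))
print(split_hostport("2001::123:4567:abcd:8080"))
```

Note that the unbracketed example is itself a valid IPv6 address, so this ordering is a policy choice: it prefers the "address plus port" reading whenever both readings parse.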

RFC 3986 is the reference for correct URL formats.

Once you eliminate IPv6 addresses, parsing is simple. Is there a
colon? Then there is a port number. Does the left over have any
characters not in [0123456789.]? Then it is a name, not an IPv4
address.
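
The two tests in that paragraph can be sketched in a few lines of Python 3. The name classify_simple is hypothetical, and the input is assumed to have no userinfo and no IPv6 literal. The last call shows the limitation raised in the next reply: a hex-encoded address is reported as a name.

```python
def classify_simple(netloc):
    """Return (kind, host, port) using only the colon and digits-and-dots tests."""
    # Is there a colon? Then what follows it is the port.
    host, sep, port = netloc.partition(":")
    # Any character outside 0-9 and "." means a name, not an IPv4 address.
    kind = "ipv4" if host and set(host) <= set("0123456789.") else "name"
    return kind, host, (port if sep else None)


print(classify_simple("192.0.2.7:8080"))       # ('ipv4', '192.0.2.7', '8080')
print(classify_simple("www.python.org"))       # ('name', 'www.python.org', None)
print(classify_simple("0x52.0x5e.0xed.0xda"))  # ('name', '0x52.0x5e.0xed.0xda', None)
```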

--Michael Dillon

John Nagle · Jul 22, 2007

Once you eliminate IPv6 addresses, parsing is simple. Is there a
colon? Then there is a port number. Does the left over have any
characters not in [0123456789.]? Then it is a name, not an IPv4
address.

--Michael Dillon

You wish. Hex input of IP addresses is allowed:

http://0x525eedda

and

http://0x52.0x5e.0xed.0xda

are both "Python.org". Or just put

0x52.0x5e.0xed.0xda

into the address bar of a browser. All these work in Firefox on Windows and
are recognized as valid IP addresses.
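
As a quick check of the arithmetic, this Python 3 snippet converts both hex spellings to dotted decimal. It only illustrates the encoding; it says nothing about which spellings a given browser or resolver accepts.

```python
import ipaddress

# A single 32-bit hex number maps directly onto four octets.
single = ipaddress.IPv4Address(int("0x525eedda", 16))

# Dotted hex: convert each term on its own.
dotted = ".".join(str(int(term, 16)) for term in "0x52.0x5e.0xed.0xda".split("."))

print(single)  # 82.94.237.218
print(dotted)  # 82.94.237.218
```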

On the other hand,

0x52.com

is a valid domain name, in use by PairNIC.

But

http://test.0xda

is handled by Firefox on Windows as a domain name. It doesn't resolve, but it's
sent to DNS.

So I think the question is whether every term between dots can be parsed as
a decimal or hex number. If all terms can be parsed as a number, and there are
no more than four of them, it's an IP address. Otherwise it's a domain name.
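
A minimal Python 3 sketch of that rule, under the assumption that "number" means plain decimal digits or 0x followed by hex digits. The name looks_like_ip is hypothetical, and the check is purely syntactic: it does not verify that each term is in range.

```python
import re

# One term: decimal digits, or 0x followed by hex digits.
_NUMERIC_TERM = re.compile(r"0[xX][0-9a-fA-F]+|[0-9]+")


def looks_like_ip(host):
    """Apply the rule above: at most four dot-separated terms, all numeric."""
    terms = host.split(".")
    return len(terms) <= 4 and all(_NUMERIC_TERM.fullmatch(t) for t in terms)


for host in ("0x525eedda", "0x52.0x5e.0xed.0xda", "0x52.com", "test.0xda"):
    print(host, looks_like_ip(host))
```

With the examples from this post, the first two hosts come out as IP addresses and the last two as domain names.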

There are phishing sites that pull stuff like this, and I'm parsing a long list
of such sites. So I really do need to get the hard cases right.

Is there any library function that correctly tests for an IP address vs. a
domain name based on syntax, i.e. without looking it up in DNS?

John Nagle

John Nagle · Jul 23, 2007

Here's another hard case. This one might be a bug in urlparse:

import urlparse

s = 'ftp://administrator[email protected]/originals/6 june 07/ebay/login/ebayisapi.html'

urlparse.urlparse(s)

yields:

(u'ftp', u'administrator[email protected]', u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')

That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password" fields
of the object. So it looks like urlparse doesn't really understand FTP URLs.

That's a real URL, from a search for phishing sites. There are lots
of hostile URLs out there, some of which can fool some parsers.

John Nagle

Miles · Jul 23, 2007

Here's another hard case. This one might be a bug in urlparse:

import urlparse

s = 'ftp://administrator[email protected]/originals/6 june 07/ebay/login/ebayisapi.html'

urlparse.urlparse(s)

yields:

(u'ftp', u'administrator[email protected]', u'/originals/6 june 07/ebay/login/ebayisapi.html', '', '', '')

That second field is supposed to be the "hostport" (per the RFC usage
of the term; Python uses the term "netloc"), and the username/password
should have been parsed and moved to the "username" and "password" fields
of the object. So it looks like urlparse doesn't really understand FTP URLs.

Those values aren't "moved" to the fields; they're extracted on the
fly from the netloc. Use the .hostname property of the result tuple
to get just the hostname.
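
For illustration, here is how those properties look in Python 3, where the urlparse module lives on as urllib.parse. The URL below is made up (it is not the one from the thread), so the credentials and host are assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical URL; the credentials and host are made up for illustration.
parts = urlsplit("ftp://administrator:secret@ftp.example.com:2121/originals/login.html")

print(parts.netloc)    # administrator:secret@ftp.example.com:2121
print(parts.username)  # administrator
print(parts.password)  # secret
print(parts.hostname)  # ftp.example.com
print(parts.port)      # 2121
```

The netloc field keeps the full text, and username, password, hostname and port are computed from it when accessed.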

-Miles

Miles · Jul 23, 2007

Is there any library function that correctly tests for an IP address vs. a
domain name based on syntax, i.e. without looking it up in DNS?

import re, string

NETLOC_RE = re.compile(r'''^      # start of string
    (?:([^@])+@)?                 # 1:
    (?:\[([0-9a-fA-F:]+)\]|       # 2: IPv6 addr
    ([^\[\]:]+))                  # 3: IPv4 addr or reg-name
    (?::(\d+))?                   # 4: optional port
    $''', re.VERBOSE)             # end of string

def normalize_IPv4(netloc):
    try: # Assume it's an IP; if it's not, catch the error and return None
        host = NETLOC_RE.match(netloc).group(3)
        octets = [string.atoi(o, 0) for o in host.split('.')]
        assert len(octets) <= 4
        for i in range(len(octets), 4):
            octets[i-1:] = divmod(octets[i-1], 256**(4-i))
        for o in octets: assert o < 256
        host = '.'.join(str(o) for o in octets)
    except (AssertionError, ValueError, AttributeError): return None
    return host

def is_ip(netloc):
    if normalize_IPv4(netloc) is None:
        match = NETLOC_RE.match(netloc)
        # IPv6 validation could be stricter
        if match and match.group(2): return True
        else: return False
    return True

The first function, I'd imagine, is the more interesting of the two.
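
To make the divmod loop in normalize_IPv4 easier to follow, here is a sketch of the same expansion in Python 3: when fewer than four terms are given, the last term is spread over the remaining octets. The name expand_terms is hypothetical, and the function takes already-parsed integers rather than a netloc. It is meant to agree with the code above for non-negative terms, but it is not the original implementation.

```python
def expand_terms(terms):
    """Expand 1 to 4 integer terms into dotted decimal, or return None."""
    if not 1 <= len(terms) <= 4:
        return None
    *leading, last = terms
    width = 4 - len(leading)  # number of octets the last term must fill
    if any(not 0 <= t <= 255 for t in leading) or not 0 <= last < 256 ** width:
        return None
    # Big-endian bytes of the last term become the trailing octets.
    octets = leading + list(last.to_bytes(width, "big"))
    return ".".join(str(o) for o in octets)


print(expand_terms([0x525eedda]))  # 82.94.237.218
print(expand_terms([127, 1]))      # 127.0.0.1
print(expand_terms([10, 1, 258]))  # 10.1.1.2
```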

-Miles

Miles · Jul 23, 2007

Is there any library function that correctly tests for an IP address vs. a
domain name based on syntax, i.e. without looking it up in DNS?

Apparently this will generally work as well:

import re, socket

NETLOC_RE = ...

def normalize_IPv4(netloc):
    try:
        host = NETLOC_RE.match(netloc).group(3)
        return socket.inet_ntoa(socket.inet_aton(host))
    except (AttributeError, socket.error):
        return None
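
A self-contained Python 3 variant of the same idea, for trying it on bare host strings. The name normalize_host is hypothetical. Which spellings inet_aton accepts depends on the platform's C library, so the hex example is expected to give 82.94.237.218 on common systems, but that is not guaranteed everywhere.

```python
import socket


def normalize_host(host):
    """Return dotted-decimal text for an IPv4 literal, or None for anything else."""
    try:
        return socket.inet_ntoa(socket.inet_aton(host))
    except OSError:  # socket.error is an alias of OSError in Python 3
        return None


print(normalize_host("0x52.0x5e.0xed.0xda"))
print(normalize_host("www.python.org"))
```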

Thanks to http://mail.python.org/pipermail/python-list/2007-July/450317.html

-Miles
