Usable street address parser in Python?

John Nagle

Is there a usable street address parser available? There are some
bad ones out there, but nothing good that I've found other than commercial
products with large databases. I don't need 100% accuracy, but I'd like
to be able to extract street name and street number for at least 98% of
US mailing addresses.

There's pyparsing, of course. There's a street address parser as an
example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
It's not very good. It gets all of the following wrong:

1500 Deer Creek Lane (Parses "Creek" as a street type)
186 Avenue A (NYC street)
2081 N Webb Rd (Parses N Webb as a street name)
2081 N. Webb Rd (Parses N as street name)
1515 West 22nd Street (Parses "West" as name)
2029 Stierlin Court (Street names starting with "St" misparse.)

Some special cases that don't work, unsurprisingly.
P.O. Box 33170
The Landmark @ One Market, Suite 200
One Market, Suite 200
One Market

Much of the problem is that this parser starts at the beginning of the string.
US street addresses are best parsed from the end, says the USPS. That's why
things like "Deer Creek Lane" are mis-parsed. It's not clear that regular
expressions are the right tool for this job.

There must be something out there a little better than this.

John Nagle
 
John Roth

You have my sympathy. I used to work on the address parser module at
Trans Union, and I've never seen another piece of code that had as
many special cases, odd rules and stuff that absolutely didn't make
any sense until one of the old hands showed you the situation it was
supposed to handle.

And most of those files were supposed to be up to USPS mass mailing
standards.

When the USPS says that addresses are best parsed from the end, they
aren't talking about the street address; they're talking about the
address as a whole, where it's easiest if you look for a zip first,
then the state, etc. The best approach I know of for the street
address is simply to tokenize the thing, and then do some pattern
matching. Trying to use any kind of deterministic parser is going to
fail big time.
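
A minimal sketch of the "whole address from the end" idea, assuming a one-line US address that ends in a two-letter state and a ZIP or ZIP+4. The helper name `split_from_end` and the regular expression are illustrative assumptions, not anything from the USPS or from Trans Union's code:

```python
import re

# Hypothetical helper: peel "STATE ZIP" off the END of a one-line address.
# The state is only checked for shape (two letters), not against a real list.
_TAIL_RE = re.compile(r"[\s,]+([A-Za-z]{2})[\s,]+(\d{5})(?:-\d{4})?\s*$")

def split_from_end(address):
    """Return (rest, state, zip5); state and zip5 are None if no tail is found."""
    m = _TAIL_RE.search(address)
    if m is None:
        return address.strip(" ,"), None, None
    return address[:m.start()].strip(" ,"), m.group(1).upper(), m.group(2)

print(split_from_end("3340 Astoria Way NE, Salem, OR 97303"))
print(split_from_end("P.O. Box 33170"))
```

Requiring state and ZIP together keeps a trailing box or suite number from being mistaken for a ZIP; whatever is left in `rest` still has to go through the tokenize-and-match step.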

IMO, 98% is way too high for any module except one that's been given a
lot of love by a company that does this as part of their core
business. There's a reason why commercial products come with huge data
bases -- it's impossible to parse everything correctly with a single
set of rules. Those data bases also contain the actual street names
and address ranges by zip code, so that direct marketing files can be
cleansed to USPS standards.

That said, I don't see any reason why any of the examples in your
first group should be misparsed by a competent parser.

Sorry I don't have any real help for you.

John Roth
 
Paul McGuire

Please take a look at the updated form of this parser. It turns out
there actually *were* some bugs in the old form, plus there was no
provision for PO Boxes, avenues that start with "Avenue" instead of
ending with it, or house numbers spelled out as words. The only one
I consider a "special case" is the support for "Avenue X" instead of
"X Avenue"; support for the rest was added in a fairly general
way. With these bug fixes, I hope this improves your hit rate. (There
are also some simple attempts at adding apt/suite numbers, and APO and
AFP in addition to PO boxes. If that is not exactly what you need, the
means to extend it to support other options should be pretty straightforward.)

-- Paul
 
Stefan Behnel

John Nagle, 17.04.2010 21:23:
Is there a usable street address parser available?

What kind of street address are you talking about? Only US-American ones?

Because street addresses are spelled differently all over the world. Some
have house numbers, some use letters or a combination, some have no house
numbers at all. Some use ordinal numbers, others use regular numbers. Some
put the house number before the street name, some after it. And this is
neither a comprehensive list, nor is this topic finished after parsing the
line that gives you the street (assuming there is such a thing in the first
place).

Stefan
 
John Nagle

John said:
Is there a usable street address parser available? There are some
bad ones out there, but nothing good that I've found other than commercial
products with large databases. I don't need 100% accuracy, but I'd like
to be able to extract street name and street number for at least 98% of
US mailing addresses.

There's pyparsing, of course. There's a street address parser as an
example at
"http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".

The author of that module has changed the code, and it has some
new features. This is much better.

Unfortunately, now it won't run with the released
version of "pyparsing" (1.5.2, from April 2009), because it uses
"originalTextFor", a feature introduced since then. I worked around that,
but discovered that the new version is case-sensitive. Changed
"Keyword" to "CaselessKeyword" where appropriate.

I put in the full list of USPS street types, and discovered
that "1500 DEER CREEK LANE" still parses with a street name
of "DEER", and a street type of "CREEK", because "CREEK" is a
USPS street type. Need to do something to pick up the last street
type, not the first. I'm not sure how to do that with pyparsing.
Maybe if I buy the book...

There's still a problem with: "2081 N Webb Rd", where the street name
comes out as "N WEBB".
Addresses like "1234 5th St. S." yield a street name of "5 TH",
but if the directional is before the name, it ends up with the name.
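
One way to pick up the last street type rather than the first is to tokenize the line and peel items off the right-hand end. The sketch below uses plain Python rather than pyparsing, and its `STREET_TYPES` and `DIRECTIONALS` sets are tiny illustrative subsets, not the real USPS lists; `parse_street` is a hypothetical name:

```python
# Tiny illustrative subsets; the real USPS lists are much longer.
STREET_TYPES = {"AVENUE", "AVE", "COURT", "CT", "CREEK", "LANE", "LN",
                "ROAD", "RD", "STREET", "ST", "WAY"}
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}

def parse_street(line):
    """Return a dict of fields, or None when the line does not fit the pattern."""
    tokens = [t.strip(".,").upper() for t in line.split()]
    tokens = [t for t in tokens if t]
    # Only handle "<number> <something>"; everything else is reported as a failure.
    if len(tokens) < 2 or not tokens[0][0].isdigit():
        return None
    number, rest = tokens[0], tokens[1:]
    pre_dir = street_type = post_dir = None
    # Work inward from the END: trailing directional, then the LAST street type.
    if len(rest) > 1 and rest[-1] in DIRECTIONALS:
        post_dir = rest.pop()
    if len(rest) > 1 and rest[-1] in STREET_TYPES:
        street_type = rest.pop()
    # Only then look for a directional in front of the name.
    if len(rest) > 1 and rest[0] in DIRECTIONALS:
        pre_dir = rest.pop(0)
    return {"number": number, "pre_dir": pre_dir, "name": " ".join(rest),
            "type": street_type, "post_dir": post_dir}

for example in ("1500 Deer Creek Lane", "2081 N. Webb Rd", "1234 5th St. S."):
    print(parse_street(example))
```

Because each step refuses to consume the last remaining token, "186 Avenue A" comes out with the name "AVENUE A" and no street type, and lines such as "P.O. Box 33170" or "One Market" return None so they can be routed elsewhere.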

Getting closer, though. If I can get to 95% of common cases, I'll
be happy.


John Nagle
 
John Yeung

My response is similar to John Roth's. It's mainly just sympathy. ;)

I deal with addresses a lot, and I know that a really good parser is
both rare/expensive to find and difficult to write yourself. We have
commercial, USPS-certified products where I work, and even with those
I've written a good deal of pre-processing and post-processing code,
consisting almost entirely of very silly-looking fixes for special
cases.

I don't have any experience whatsoever with pyparsing, but I will say
I agree that you should try to get the street type from the end of the
line. Just be aware that it can be valid to leave off the street type
completely. And of course it's a plus if you can handle suites that
are on the same line as the street (which is where the USPS prefers
them to be).

I would take the approach which John R. seems to be suggesting, which
is to tokenize and then write a whole bunch of very hairy, special-
case-laden logic. ;) I'm almost positive this is what all the
commercial packages are doing, and I have a tough time imagining what
else you could do. Addresses inherently have a high degree of
irregularity.

Good luck!

John Y.
 
Iain King

Not sure on the volume of addresses you're working with, but as an
alternative you could try grabbing the zip code, looking up all
addresses in that zip code, and then finding whatever one of those
address strings most closely resembles your address string (smallest
Levenshtein distance?).
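
A rough sketch of that closest-match idea, assuming the candidate address strings for the ZIP code have already been fetched from somewhere (the list in the usage line is made up). The `levenshtein` function is the standard dynamic-programming edit distance; `closest_address` is a hypothetical helper name:

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,              # drop ca
                               current[j - 1] + 1,           # drop cb
                               previous[j - 1] + (ca != cb)))  # substitute
        previous = current
    return previous[-1]

def closest_address(query, candidates):
    """Return the candidate with the smallest distance to query.

    Raises ValueError if candidates is empty.
    """
    query = query.upper()
    return min(candidates, key=lambda c: levenshtein(query, c.upper()))

# Made-up candidates standing in for "all addresses in that zip code".
print(closest_address("1500 Deer Creek Ln",
                      ["1500 DEER CREEK LANE", "1500 DEER PARK LANE"]))
```

This compares whole strings, so it costs one distance computation per candidate; for a large ZIP code that may be too slow without first narrowing the list by street number.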

Iain
 
Grant Edwards

This is a very tricky problem. Consider Salem, Oregon, which puts the
direction after the street:

3340 Astoria Way NE
Salem, OR 97303

In Minneapolis, the direction comes before the street in some
quadrants and after it in others. I used to live on W 43rd Street.
Now I live on 24th Ave NE. And just to be more inconsistent, only the
"NE" section uses two directions, everywhere else it's just W, S, N,
or E.
 
John Nagle

Iain said:
Not sure on the volume of addresses you're working with, but as an
alternative you could try grabbing the zip code, looking up all
addresses in that zip code, and then finding whatever one of those
address strings most closely resembles your address string (smallest
Levenshtein distance?).

The parser doesn't have to be perfect, but it should
reliably report when it fails. Then I can run the hard cases through
one of the commercial online address standardizers. I'd like to
be able to knock off the easy cases cheaply.

What I want to do is to first extract the street number and
undecorated street name only, match that to a large database of US businesses
stored in MySQL, and then find the best match from the database
hits. So I need reliable extraction of undecorated street name and number. The
other fields are less important.

John Nagle
 
Albert van der Horst

John Nagle said:
The parser doesn't have to be perfect, but it should
reliably report when it fails. Then I can run the hard cases through
one of the commercial online address standardizers. I'd like to
be able to knock off the easy cases cheaply.

In a similar situation I did the exact reverse (analysing
assembler code sequences for their stack effect).
I made a list of all exceptions, and checked against that first.
If the input is not an exception, the rule should apply.
If it doesn't, call Houston.
(Of course, one starts by making the input canonical: all upper case,
maybe reordering, etc.)
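
Applied to addresses, that order of checks might look like the sketch below. Everything in it is an assumption made for illustration: the `EXCEPTIONS` table, the `UnparsedAddress` error, and the deliberately simple "number, then everything else" rule.

```python
import re

# Hypothetical exception table, keyed by canonical input.
EXCEPTIONS = {
    "ONE MARKET": ("1", "MARKET"),
}

class UnparsedAddress(ValueError):
    """Raised when neither the exception list nor the rule applies."""

# Deliberately simple rule: leading digits, then the rest of the line.
_RULE = re.compile(r"^(\d+)\s+(.+)$")

def canonical(line):
    """Upper-case, drop periods, turn commas into spaces, squeeze whitespace."""
    return " ".join(line.upper().replace(".", "").replace(",", " ").split())

def extract(line):
    key = canonical(line)
    if key in EXCEPTIONS:        # 1. known exceptions first
        return EXCEPTIONS[key]
    m = _RULE.match(key)         # 2. otherwise the general rule
    if m:
        return m.group(1), m.group(2)
    raise UnparsedAddress(line)  # 3. otherwise, call Houston

print(extract("One Market"))
print(extract("2081 N. Webb Rd"))
```

Raising instead of guessing gives the reliable failure report asked for above: anything that raises can be sent on to a commercial standardizer.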

John Nagle said:
What I want to do is to first extract the street number and
undecorated street name only, match that to a large database of US businesses
stored in MySQL, and then find the best match from the database
hits. So I need reliable extraction of undecorated street name and number. The
other fields are less important.

This kind of problem remains very tricky ...

At least in the Netherlands we have a book that specifies how a street
name should officially be spelled using a limited number of characters.

Groetjes Albert
 
