Usable street address parser in Python?

Discussion in 'Python' started by John Nagle, Apr 17, 2010.

  1. John Nagle

    John Nagle Guest

    Is there a usable street address parser available? There are some
    bad ones out there, but nothing good that I've found other than commercial
    products with large databases. I don't need 100% accuracy, but I'd like
    to be able to extract street name and street number for at least 98% of
    US mailing addresses.

    There's pyparsing, of course. There's a street address parser as an
    example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
    It's not very good. It gets all of the following wrong:

    1500 Deer Creek Lane (Parses "Creek" as a street type)
    186 Avenue A (NYC street)
    2081 N Webb Rd (Parses N Webb as a street name)
    2081 N. Webb Rd (Parses N as street name)
    1515 West 22nd Street (Parses "West" as name)
    2029 Stierlin Court (Street names starting with "St" misparse.)

    Some special cases that don't work, unsurprisingly.
    P.O. Box 33170
    The Landmark @ One Market, Suite 200
    One Market, Suite 200
    One Market

    Much of the problem is that this parser starts at the beginning of the string.
    US street addresses are best parsed from the end, says the USPS. That's why
    things like "Deer Creek Lane" are mis-parsed. It's not clear that regular
    expressions are the right tool for this job.

    There must be something out there a little better than this.

    John Nagle
     
    John Nagle, Apr 17, 2010
    #1

  2. John Roth

    John Roth Guest

    On Apr 17, 1:23 pm, John Nagle <> wrote:
    >    Is there a usable street address parser available?  There are some
    > bad ones out there, but nothing good that I've found other than commercial
    > products with large databases.  I don't need 100% accuracy, but I'd like
    > to be able to extract street name and street number for at least 98% of
    > US mailing addresses.
    >
    >    There's pyparsing, of course. There's a street address parser as an
    > example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
    > It's not very good.  It gets all of the following wrong:
    >
    >         1500 Deer Creek Lane    (Parses "Creek" as a street type)
    >         186 Avenue A            (NYC street)
    >         2081 N Webb Rd          (Parses N Webb as a street name)
    >         2081 N. Webb Rd         (Parses N as street name)
    >         1515 West 22nd Street   (Parses "West" as name)
    >         2029 Stierlin Court     (Street names starting with "St" misparse.)
    >
    > Some special cases that don't work, unsurprisingly.
    >         P.O. Box 33170
    >         The Landmark @ One Market, Suite 200
    >         One Market, Suite 200
    >         One Market
    >
    > Much of the problem is that this parser starts at the beginning of the string.
    > US street addresses are best parsed from the end, says the USPS.  That's why
    > things like "Deer Creek Lane" are mis-parsed.  It's not clear that regular
    > expressions are the right tool for this job.
    >
    > There must be something out there a little better than this.
    >
    >                                         John Nagle


    You have my sympathy. I used to work on the address parser module at
    Trans Union, and I've never seen another piece of code that had as
    many special cases, odd rules and stuff that absolutely didn't make
    any sense until one of the old hands showed you the situation it was
    supposed to handle.

    And most of those files were supposed to be up to USPS mass mailing
    standards.

    When the USPS says that addresses are best parsed from the end, they
    aren't talking about the street address; they're talking about the
    address as a whole, where it's easiest if you look for a zip first,
    then the state, etc. The best approach I know of for the street
    address is simply to tokenize the thing, and then do some pattern
    matching. Trying to use any kind of deterministic parser is going to
    fail big time.
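
    A minimal sketch of the tokenize-then-pattern-match idea, assuming tiny
    hand-picked vocabularies in place of the full USPS lists. Each token is
    reduced to a one-letter class (N number, D directional, T street type,
    W any other word), and the regular expressions run over that class
    string, not over the raw text. The names classify, tokenize, PATTERNS
    and parse_street are illustrative only and come from no existing library.

```python
import re

# Tiny illustrative vocabularies (assumption: a real parser needs full lists).
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW",
                "NORTH", "SOUTH", "EAST", "WEST"}
STREET_TYPES = {"ST", "STREET", "AVE", "AVENUE", "RD", "ROAD",
                "LN", "LANE", "CT", "COURT", "CREEK", "WAY"}


def classify(token):
    """Map one token to a single-letter class code."""
    if token.isdigit():
        return "N"  # number
    if token in DIRECTIONALS:
        return "D"  # directional
    if token in STREET_TYPES:
        return "T"  # street type
    return "W"      # any other word


def tokenize(line):
    """Upper-case the line, drop periods and commas, split on whitespace."""
    return re.sub(r"[.,]", "", line.upper()).split()


# Patterns over the class string, tried in order. Because every token is
# exactly one character, group spans are also token index ranges.
PATTERNS = [
    # "186 Avenue A": the street type comes before the name.
    re.compile(r"^(?P<num>N)(?P<type>T)(?P<name>W)$"),
    # General shape. The lazy name group plus the '$' anchor force the
    # LAST street-type token to be taken as the type.
    re.compile(r"^(?P<num>N)(?P<pre>D?)(?P<name>[NDTW]+?)(?P<type>T)(?P<post>D?)$"),
]


def parse_street(line):
    """Return a dict of address fields, or None if no pattern applies."""
    tokens = tokenize(line)
    classes = "".join(classify(t) for t in tokens)
    for pattern in PATTERNS:
        m = pattern.match(classes)
        if m:
            return {name: " ".join(tokens[m.start(name):m.end(name)])
                    for name in m.groupdict()}
    return None
```

    With these word lists, "1500 Deer Creek Lane" comes out with name
    "DEER CREEK" and type "LANE", and "P.O. Box 33170" is rejected because
    no pattern fits. Every new special case means another pattern or another
    vocabulary entry, which is where the hairy logic accumulates.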

    IMO, 98% is way too high for any module except one that's been given a
    lot of love by a company that does this as part of their core
    business. There's a reason why commercial products come with huge data
    bases -- it's impossible to parse everything correctly with a single
    set of rules. Those data bases also contain the actual street names
    and address ranges by zip code, so that direct marketing files can be
    cleansed to USPS standards.

    That said, I don't see any reason why any of the examples in your
    first group should be misparsed by a competent parser.

    Sorry I don't have any real help for you.

    John Roth
     
    John Roth, Apr 18, 2010
    #2

  3. Paul McGuire

    Paul McGuire Guest

    On Apr 17, 2:23 pm, John Nagle <> wrote:
    >    Is there a usable street address parser available?  There are some
    > bad ones out there, but nothing good that I've found other than commercial
    > products with large databases.  I don't need 100% accuracy, but I'd like
    > to be able to extract street name and street number for at least 98% of
    > US mailing addresses.
    >
    >    There's pyparsing, of course. There's a street address parser as an
    > example at "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".
    > It's not very good.  It gets all of the following wrong:
    >
    >         1500 Deer Creek Lane    (Parses "Creek" as a street type)
    >         186 Avenue A            (NYC street)
    >         2081 N Webb Rd          (Parses N Webb as a street name)
    >         2081 N. Webb Rd         (Parses N as street name)
    >         1515 West 22nd Street   (Parses "West" as name)
    >         2029 Stierlin Court     (Street names starting with "St" misparse.)
    >
    > Some special cases that don't work, unsurprisingly.
    >         P.O. Box 33170
    >         The Landmark @ One Market, Suite 200
    >         One Market, Suite 200
    >         One Market
    >


    Please take a look at the updated form of this parser. It turns out
    there actually *were* some bugs in the old form, plus there was no
    provision for PO Boxes, avenues that start with "Avenue" instead of
    ending with it, or house numbers spelled out as words. The only one
    I consider a "special case" is the support for "Avenue X" instead of
    "X Avenue" - support for the rest was added in a fairly general
    way. With these bug fixes, I hope this improves your hit rate. (There
    are also some simple attempts at adding apt/suite numbers, and APO and
    AFP in addition to PO boxes - if not exactly what you need, the means
    to extend to support other options should be pretty straightforward.)

    -- Paul
     
    Paul McGuire, Apr 19, 2010
    #3
  4. John Nagle, 17.04.2010 21:23:
    > Is there a usable street address parser available?


    What kind of street address are you talking about? Only US-American ones?

    Because street addresses are spelled differently all over the world. Some
    have house numbers, some use letters or a combination, some have no house
    numbers at all. Some use ordinal numbers, others use regular numbers. Some
    put the house number before the street name, some after it. And this is
    neither a comprehensive list, nor is this topic finished after parsing the
    line that gives you the street (assuming there is such a thing in the first
    place).

    Stefan
     
    Stefan Behnel, Apr 19, 2010
    #4
  5. John Nagle

    John Nagle Guest

    John Nagle wrote:
    > Is there a usable street address parser available? There are some
    > bad ones out there, but nothing good that I've found other than commercial
    > products with large databases. I don't need 100% accuracy, but I'd like
    > to be able to extract street name and street number for at least 98% of
    > US mailing addresses.
    >
    > There's pyparsing, of course. There's a street address parser as an
    > example at
    > "http://pyparsing.wikispaces.com/file/view/streetAddressParser.py".


    The author of that module has changed the code, and it has some
    new features. This is much better.

    Unfortunately, now it won't run with the released
    version of "pyparsing" (1.5.2, from April 2009), because it uses
    "originalTextFor", a feature introduced since then. I worked around that,
    but discovered that the new version is case-sensitive. Changed
    "Keyword" to "CaselessKeyword" where appropriate.

    I put in the full list of USPS street types, and discovered
    that "1500 DEER CREEK LANE" still parses with a street name
    of "DEER", and a street type of "CREEK", because "CREEK" is a
    USPS street type. Need to do something to pick up the last street
    type, not the first. I'm not sure how to do that with pyparsing.
    Maybe if I buy the book...
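
    One way to take the last street type instead of the first, sketched in
    plain Python and not in pyparsing: peel the optional pieces off the
    right-hand end of the token list. The word lists are deliberately tiny,
    split_from_right is a made-up name, and a leading directional such as
    the "N" in "2081 N Webb Rd" is still left in the name here.

```python
# Illustrative vocabularies only; the full USPS lists are much longer.
DIRECTIONALS = {"N", "S", "E", "W", "NE", "NW", "SE", "SW"}
STREET_TYPES = {"ST", "STREET", "AVE", "RD", "LN", "LANE",
                "CT", "COURT", "CREEK", "WAY"}


def split_from_right(line):
    """Return (number, name, street_type, post_directional), upper-cased."""
    tokens = line.upper().replace(".", "").split()
    number = tokens.pop(0) if tokens and tokens[0].isdigit() else ""
    # Work from the right, always leaving at least one token for the name.
    post = ""
    if len(tokens) > 1 and tokens[-1] in DIRECTIONALS:
        post = tokens.pop()
    street_type = ""
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        street_type = tokens.pop()
    return number, " ".join(tokens), street_type, post
```

    Because only the final token is ever tested against the street types,
    "CREEK" in "1500 DEER CREEK LANE" stays in the name, and an address with
    no street type at all simply comes back with an empty type.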

    There's still a problem with "2081 N Webb Rd", where the street name
    comes out as "N WEBB".
    Addresses like "1234 5th St. S." yield a street name of "5 TH",
    but if the directional comes before the name, it ends up as part of the name.

    Getting closer, though. If I can get to 95% of common cases, I'll
    be happy.


    John Nagle
     
    John Nagle, Apr 20, 2010
    #5
  6. John Yeung

    John Yeung Guest

    My response is similar to John Roth's. It's mainly just sympathy. ;)

    I deal with addresses a lot, and I know that a really good parser is
    both rare/expensive to find and difficult to write yourself. We have
    commercial, USPS-certified products where I work, and even with those
    I've written a good deal of pre-processing and post-processing code,
    consisting almost entirely of very silly-looking fixes for special
    cases.

    I don't have any experience whatsoever with pyparsing, but I will say
    I agree that you should try to get the street type from the end of the
    line. Just be aware that it can be valid to leave off the street type
    completely. And of course it's a plus if you can handle suites that
    are on the same line as the street (which is where the USPS prefers
    them to be).

    I would take the approach which John R. seems to be suggesting, which
    is to tokenize and then write a whole bunch of very hairy, special-
    case-laden logic. ;) I'm almost positive this is what all the
    commercial packages are doing, and I have a tough time imagining what
    else you could do. Addresses inherently have a high degree of
    irregularity.

    Good luck!

    John Y.
     
    John Yeung, Apr 20, 2010
    #6
  7. Iain King

    Iain King Guest

    On Apr 20, 8:24 am, John Yeung <> wrote:
    > My response is similar to John Roth's.  It's mainly just sympathy. ;)
    >
    > I deal with addresses a lot, and I know that a really good parser is
    > both rare/expensive to find and difficult to write yourself.  We have
    > commercial, USPS-certified products where I work, and even with those
    > I've written a good deal of pre-processing and post-processing code,
    > consisting almost entirely of very silly-looking fixes for special
    > cases.
    >
    > I don't have any experience whatsoever with pyparsing, but I will say
    > I agree that you should try to get the street type from the end of the
    > line.  Just be aware that it can be valid to leave off the street type
    > completely.  And of course it's a plus if you can handle suites that
    > are on the same line as the street (which is where the USPS prefers
    > them to be).
    >
    > I would take the approach which John R. seems to be suggesting, which
    > is to tokenize and then write a whole bunch of very hairy, special-
    > case-laden logic. ;)  I'm almost positive this is what all the
    > commercial packages are doing, and I have a tough time imagining what
    > else you could do.  Addresses inherently have a high degree of
    > irregularity.
    >
    > Good luck!
    >
    > John Y.


    Not sure on the volume of addresses you're working with, but as an
    alternative you could try grabbing the zip code, looking up all
    addresses in that zip code, and then finding whatever one of those
    address strings most closely resembles your address string (smallest
    Levenshtein distance?).
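
    A rough sketch of that lookup, assuming the candidate addresses for a
    zip code have already been fetched from somewhere. The candidate list in
    the usage note is made up for illustration, and levenshtein and closest
    are hypothetical helper names. The distance is the textbook
    dynamic-programming edit distance, kept to two rows.

```python
def levenshtein(a, b):
    """Edit distance between a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def closest(address, candidates):
    """Return the candidate with the smallest case-insensitive distance."""
    key = address.upper()
    return min(candidates, key=lambda c: levenshtein(key, c.upper()))
```

    For example, closest("3340 Astoria Wy NE", ["3340 ASTORIA WAY NE",
    "3340 ASTER WAY NE"]) picks the first candidate, which is one edit away.
    The cost grows with the number of candidates per zip code, so this only
    suits modest volumes.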

    Iain
     
    Iain King, Apr 20, 2010
    #7
  8. On 2010-04-20, Tim Roberts <> wrote:

    > This is a very tricky problem. Consider Salem, Oregon, which puts the
    > direction after the street:
    >
    > 3340 Astoria Way NE
    > Salem, OR 97303


    In Minneapolis, the direction comes before the street in some
    quadrants and after it in others. I used to live on W 43rd Street.
    Now I live on 24th Ave NE. And just to be more inconsistent, only the
    "NE" section uses two directions; everywhere else it's just W, S, N,
    or E.

    --
    Grant Edwards               grant.b.edwards at gmail.com
    Yow! Is it NOUVELLE CUISINE when 3 olives are struggling with
    a scallop in a plate of SAUCE MORNAY?
     
    Grant Edwards, Apr 20, 2010
    #8
  9. John Nagle

    John Nagle Guest

    Iain King wrote:
    > Not sure on the volume of addresses you're working with, but as an
    > alternative you could try grabbing the zip code, looking up all
    > addresses in that zip code, and then finding whatever one of those
    > address strings most closely resembles your address string (smallest
    > Levenshtein distance?).


    The parser doesn't have to be perfect, but it should
    reliably report when it fails. Then I can run the hard cases through
    one of the commercial online address standardizers. I'd like to
    be able to knock off the easy cases cheaply.

    What I want to do is to first extract the street number and
    undecorated street name only, match that to a large database of US businesses
    stored in MySQL, and then find the best match from the database
    hits. So I need reliable extraction of undecorated street name and number. The
    other fields are less important.
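
    A sketch of the "cheap first pass that admits defeat" idea, under the
    assumption that a deliberately strict pattern is acceptable: anything
    that is not an obvious number/name/type address is handed back as a hard
    case and never guessed at. The pattern covers only a handful of street
    types and directionals, and parse_simple and triage are made-up names.

```python
import re

# Strict toy pattern: house number, optional leading directional,
# street name, and one of a few street types at the very end.
SIMPLE = re.compile(
    r"^(\d+)\s+"
    r"(?:(?:NORTH|SOUTH|EAST|WEST|N|S|E|W)\s+)?"
    r"(.+?)\s+"
    r"(?:STREET|ST|ROAD|RD|LANE|LN|COURT|CT)$"
)


def parse_simple(address):
    """Return (number, undecorated_name), or None if this is not an easy case."""
    m = SIMPLE.match(address.upper().replace(".", "").strip())
    return (m.group(1), m.group(2)) if m else None


def triage(addresses):
    """Split addresses into parsed easy cases and a list of hard cases."""
    easy, hard = {}, []
    for address in addresses:
        result = parse_simple(address)
        if result is None:
            hard.append(address)   # candidates for the commercial service
        else:
            easy[address] = result
    return easy, hard
```

    The (number, name) pairs from the easy pile are what would be matched
    against the database. The hard list would go to the online standardizer.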

    John Nagle
     
    John Nagle, Apr 20, 2010
    #9
  10. In article <4bcddc5a$0$1630$>,
    John Nagle <> wrote:
    >Iain King wrote:
    >> Not sure on the volume of addresses you're working with, but as an
    >> alternative you could try grabbing the zip code, looking up all
    >> addresses in that zip code, and then finding whatever one of those
    >> address strings most closely resembles your address string (smallest
    >> Levenshtein distance?).

    >
    > The parser doesn't have to be perfect, but it should
    >reliably report when it fails. Then I can run the hard cases through
    >one of the commercial online address standardizers. I'd like to
    >be able to knock off the easy cases cheaply.


    In a similar situation I did the exact reverse (analysing
    assembler code sequences for their stack effect).
    I made a list of all exceptions, and checked against that first.
    If it is not an exception, the rule should apply.
    If it doesn't, call Houston.
    (Of course one starts by making the input canonical: all upper case,
    maybe reordering, etc.)
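
    In outline, that order of checks might look like the sketch below. The
    exception table holds a single illustrative entry, the general rule is a
    toy, and all of the names are hypothetical. The point is only the
    sequence: canonicalize, consult the exceptions, apply the rule, and fail
    loudly if the rule does not fit.

```python
# Hand-maintained exception table; this one entry is only an illustration.
EXCEPTIONS = {"ONE MARKET": ("ONE", "MARKET")}


def canonical(address):
    """Upper-case, drop periods, turn commas into spaces, collapse whitespace."""
    return " ".join(address.upper().replace(".", "").replace(",", " ").split())


def general_rule(address):
    """Toy rule: leading digits are the number, everything else is the name."""
    number, _, name = address.partition(" ")
    return (number, name) if number.isdigit() and name else None


def parse(address):
    key = canonical(address)
    if key in EXCEPTIONS:            # 1. known exceptions are checked first
        return EXCEPTIONS[key]
    result = general_rule(key)       # 2. otherwise the rule should apply
    if result is None:               # 3. if it does not, "call Houston"
        raise ValueError(f"cannot parse address: {address!r}")
    return result
```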

    >
    > What I want to do is to first extract the street number and
    >undecorated street name only, match that to a large database of US businesses
    >stored in MySQL, and then find the best match from the database
    >hits. So I need reliable extraction of undecorated street name and number. The
    >other fields are less important.


    This kind of problem remains very tricky ...

    At least in the Netherlands we have a book that specifies how each
    street name should officially be spelled using a limited number of
    characters.

    >
    > John Nagle


    Groetjes Albert

    --
    --
    Albert van der Horst, UTRECHT,THE NETHERLANDS
    Economic growth -- being exponential -- ultimately falters.
    albert@spe&ar&c.xs4all.nl &=n http://home.hccnet.nl/a.w.m.van.der.horst
     
    Albert van der Horst, Apr 21, 2010
    #10