newbie question: parsing street name from address

cjl

P:

I am working on a project that requires geocoding, and have written a
very simple geocoder that uses the Google service.

I would like to be able to extract the name of the street from the
addresses in my data; however, they vary significantly. Here are some
examples:

25 Main St
2500 14th St
12 Bennet Pkwy
Pearl St
Bennet Rd and Main st
19th St

As you can see, sometimes I have the house number, and sometimes I do
not. Sometimes the street name is a number. Sometimes I simply have
the names of intersecting streets.

I would like to be able to parse the above into the following:

Main St
14th St
Bennet Pkwy
Pearl St
Bennet Rd
Main St
19th St

How might I approach this complex parsing problem?

-CJL
 
Bell, Kevin

Look at the built-in str.split() method; then you could use s.isalpha()
and s.isdigit() to test each word. Regular expressions would be the way
to go, but that's a bit to chew on if you're getting started with string
methods. You'll need to look at list indexing as well.
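
A minimal sketch of that split-and-test idea, assuming the only formats are the ones in the original post (an optional all-digit house number, and "and" joining two streets). The function name street_names is made up for illustration, and the capitalisation of "st" is left as it appears in the input.

```python
def street_names(address):
    """Return the street name(s) found in a simple address string."""
    streets = []
    # An intersection such as "Bennet Rd and Main st" holds two streets.
    for part in address.split(" and "):
        words = part.split()
        # Drop a leading all-digit house number such as "25" or "2500".
        # "14th".isdigit() is False, so numbered streets are kept.
        if len(words) > 1 and words[0].isdigit():
            words = words[1:]
        streets.append(" ".join(words))
    return streets


for line in ["25 Main St", "2500 14th St", "12 Bennet Pkwy",
             "Pearl St", "Bennet Rd and Main st", "19th St"]:
    print(street_names(line))
```

House numbers such as "221B" or "33 1/2" are not all digits, so this sketch would leave them in place.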

Kev
SLC DOT GIS

jay graves

In the past, I used the data cleansing routines from Febrl to do this
with some success.

http://www.google.com/search?q=febrl

HTH.
....
Jay Graves
 
Paul McGuire

Parsing street addresses is a very complex parsing problem. Please
look at this example
(http://pyparsing.wikispaces.com/space/showimage/streetAddressParser.py)
from the pyparsing wiki, which includes support for these test cases:

100 South Street
123 Main
221B Baker Street
10 Downing St
1600 Pennsylvania Ave
33 1/2 W 42nd St.
454 N 38 1/2
21A Deer Run Drive
256K Memory Lane
12-1/2 Lincoln
23N W Loop South
23 N W Loop South

I took your list and added them to the test cases, which broke a few
lines in the grammar. The current online version now includes support
for your new formats as well. Here is some sample output from the
pyparsing example:

100 South Street
['100', 'South', 'Street']
- name: South
- number: 100
- street: ['100', 'South', 'Street']
  - name: South
  - number: 100
  - type: Street
- type: Street
Street is South

221B Baker Street
['221B', 'Baker', 'Street']
- name: Baker
- number: 221B
- street: ['221B', 'Baker', 'Street']
  - name: Baker
  - number: 221B
  - type: Street
- type: Street
Street is Baker Street

10 Downing St
['10', 'Downing', 'St']
- name: Downing
- number: 10
- street: ['10', 'Downing', 'St']
  - name: Downing
  - number: 10
  - type: St
- type: St
Street is Downing St

1600 Pennsylvania Ave
['1600', 'Pennsylvania', 'Ave']
- name: Pennsylvania
- number: 1600
- street: ['1600', 'Pennsylvania', 'Ave']
  - name: Pennsylvania
  - number: 1600
  - type: Ave
- type: Ave
Street is Pennsylvania Ave

33 1/2 W 42nd St.
['33 1/2', 'W 42 nd', 'St']
- name: W 42 nd
- number: 33 1/2
- street: ['33 1/2', 'W 42 nd', 'St']
  - name: W 42 nd
  - number: 33 1/2
  - type: St
- type: St
Street is W 42 nd St

454 N 38 1/2
['454', 'N 38 1/2']
- name: N 38 1/2
- number: 454
- street: ['454', 'N 38 1/2']
  - name: N 38 1/2
  - number: 454
Street is N 38 1/2

25 Main St
['25', 'Main', 'St']
- name: Main
- number: 25
- street: ['25', 'Main', 'St']
  - name: Main
  - number: 25
  - type: St
- type: St
Street is Main St

2500 14th St
['2500', '14 th', 'St']
- name: 14 th
- number: 2500
- street: ['2500', '14 th', 'St']
  - name: 14 th
  - number: 2500
  - type: St
- type: St
Street is 14 th St

12 Bennet Pkwy
['12', 'Bennet', 'Pkwy']
- name: Bennet
- number: 12
- street: ['12', 'Bennet', 'Pkwy']
  - name: Bennet
  - number: 12
  - type: Pkwy
- type: Pkwy
Street is Bennet Pkwy

Pearl St
['Pearl', 'St']
- name: Pearl
- street: ['Pearl', 'St']
  - name: Pearl
  - type: St
- type: St
Street is Pearl St

Bennet Rd and Main St
['Bennet', 'Rd', 'and', 'Main', 'St']
- crossStreet: ['Bennet', 'Rd']
  - name: Bennet
  - type: Rd
- name: Main
- street: ['Main', 'St']
  - name: Main
  - type: St
- type: St
Street is Main St

19th St
['19 th', 'St']
- name: 19 th
- street: ['19 th', 'St']
  - name: 19 th
  - type: St
- type: St
Street is 19 th St


-- Paul
 
Eric

You might be able to use consistencies in your data to make this
simpler. If the examples you have there are representative, it looks
like what you should do is look for a word like 'St' or 'Rd' and then
return that word and the previous word.
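
A rough sketch of that rule, assuming a short hand-picked set of street-type words (STREET_TYPES below is illustrative and far from complete) and that the street name is always the single word before the type:

```python
# Hypothetical, deliberately short list of street-type words.
STREET_TYPES = {"st", "rd", "pkwy", "ave", "dr", "blvd", "ln"}


def streets_by_suffix(address):
    """Return every 'previous word + street type' pair in the address."""
    words = address.split()
    found = []
    for i, word in enumerate(words):
        # Compare case-insensitively and ignore a trailing period ("St.").
        if i > 0 and word.lower().rstrip(".") in STREET_TYPES:
            found.append(words[i - 1] + " " + word)
    return found


print(streets_by_suffix("2500 14th St"))
print(streets_by_suffix("Bennet Rd and Main st"))
```

Because the house number is never adjacent to the street type in the sample data, it drops out without any special handling.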
 
John Machin

The OP's data already contains
[corner|cnr [of]] Foo Rd and|& Bar St
and real world data will contain things like
1234 John F Kennedy Memorial Drive
456 Broadway

As Paul wrote, "Parsing street addresses is a very complex parsing
problem", even when you restrict yourself to one mostly-English-speaking
country. Software written under such restrictions rapidly breaks down
elsewhere (Rue de la Paix, Wilhelmstrasse, Avenida 9 de Julio, etc.) and
blows up altogether when street names aren't used in postal addresses
(e.g. Japan).
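
To make those failure cases concrete, here is a small regular-expression version of the "type word plus previous word" rule run against the examples above. The pattern and its list of street types are assumptions made for illustration only.

```python
import re

# Assumed, incomplete list of street-type words; longer alternatives
# ("Drive") are listed before their prefixes ("Dr").
PATTERN = re.compile(r"\b(\w+)\s+(St|Rd|Ave|Pkwy|Drive|Dr)\b", re.IGNORECASE)


def two_word_rule(address):
    """Return each 'previous word + street type' pair found by the pattern."""
    return [" ".join(m.groups()) for m in PATTERN.finditer(address)]


for line in ["1234 John F Kennedy Memorial Drive", "456 Broadway",
             "Rue de la Paix"]:
    print(line, "->", two_word_rule(line))
```

The first address comes back as just "Memorial Drive", and the other two yield nothing at all, since neither contains a word from the type list.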
 
Eric

No doubt address parsing is, in general, a very difficult problem.
However, it may not be necessary for him to solve the general problem.
If his dataset is limited to a few formats, then his problem is much
simpler.
 
John Machin

Ignore the last sentence of my post. Restrict the application to
[sub]urban addresses in the USA. If the OP's dataset is real-world
data, it will contain cases of street addresses that don't fit "look
for a word like 'St' or 'Rd' and then return that word and the
previous word." To expect an OP in a newsgroup to provide
representative examples is charmingly naive :) and in any case the OP
had already provided a corner case [pun intended] that busted your
rule.

Cheers,
John
 
