Separate Address number and name

S

Shane Konings

I have the following sample from a data set and I am looking to split the address number and name into separate headings as seen below.

FarmID Address
1 1067 Niagara Stone
2 4260 Mountainview
3 25 Hunter
4 1091 Hutchinson
5 5172 Green Lane
6 500 Glenridge
7 471 Foss
8 758 Niagara Stone
9 3836 Main
10 1025 York


FarmID AddressNum AddressName
1 1067 Niagara Stone
2 4260 Mountainview
3 25 Hunter
4 1091 Hutchinson
5 5172 Green Lane
6 500 Glenridge
7 471 Foss
8 758 Niagara Stone
9 3836 Main
10 1025 York

I have struggled with this for a while and know there must be a simple method to achieve this result.
 
A

Anders Wegge Keller

Shane Konings said:
I have struggled with this for a while and know there must be a
simple method to achieve this result.

There are several. But without seeing the code you have already
written, it's har to help you improve it.
 
S

Shane Konings

I don't have any code to split that part up. There is other information following the street name such as street suffix, city, province, postal code, etc. I have been able to split the rest of it up based on certain criteria but have had no luck with splitting up the street name from the street number.
 
S

Shane Konings

inHandler = open(inFile, 'r')
outHandler = open(outFile, 'w')
outHandler.write('ID\tAddress\tStreetNum&Name\tSufType\tDir\tCity\tProvince\tPostalCode\n')
for line in inHandler:
str = line.replace('FarmID\tAddress','')
outHandler.write(str[0:-1])
str = str.replace(', ON', '\t ON\t')
str = str.replace(' Rd,', '\t Rd\t \t')
str = str.replace(' Rd ', '\t Rd\t \t')
str = str.replace(' St,', '\t St\t \t')
str = str.replace(' St ', '\t St\t \t')
str = str.replace(' Ave', '\t Ave\t \t')
str = str.replace(' Pky', '\t Pky\t \t')
str = str.replace(' Lane, ', '\t Lane\t \t')
str = str.replace(', Lane, , Rd,', ' Lane\t Rd\t')
str = str.replace(' Dr', '\t Dr\t \t')
str = str.replace(' Sq', '\t Sq\t \t')
str = str.replace(' Pl', '\t Pl\t \t')
str = str.replace('\t \tN,', '\tN\t')
str = str.replace('\t \t N,', '\tN\t')
str = str.replace(' , North ', ' N\t')
str = str.replace(' ,S ', ' S\t')
str = str.replace(' , South ', ' S\t')
str = str.replace('\t \tE,', '\tE\t')
str = str.replace(' , East ', ' E\t')
str = str.replace('\t \tW,', '\tW\t')
str = str.replace('\t \t West', '\tW\t')
str = str.replace(',.', '.')
str = str.replace(',', '\t')
str = str.replace(',,', '\t')
str = str.replace(', Service', ' Service')
str = str.replace('\t \t\t', '\t\t')
outHandler.write(str[1:])
inHandler.close()
outHandler.close()


That is the code i currently have.

The following is a sample of the data. There are hundreds of lines that need to have an automated process of splitting the strings into headings to be imported into excel with theses headings

ID Address StreetNum StreetName SufType Dir City Province PostalCode


1 1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0
2 4260 Mountainview Rd, Lincoln, ON L0R 1B2
3 25 Hunter Rd, Grimsby, E, ON L3M 4A3
4 1091 Hutchinson Rd, Haldimand, ON N0A 1K0
5 5172 Green Lane Rd, Lincoln, ON L0R 1B3
6 500 Glenridge Ave, East, St. Catharines, ON L2S 3A1
7 471 Foss Rd, Pelham, ON L0S 1C0
8 758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0
9 3836 Main St, North, Lincoln, ON L0R 1S0
10 1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0
 
A

Asaf Las

I have the following sample from a data set and I am
looking to split the address number and name into separate headings
as seen below.
I have struggled with this for a while and know there must be a simple method to achieve this result.

input = '''1 1067 Niagara Stone
2 4260 Mountainview
3 25 Hunter
4 1091 Hutchinson
5 5172 Green Lane
6 500 Glenridge
7 471 Foss
8 758 Niagara Stone
9 3836 Main
10 1025 York '''

tlist = input.splitlines()
for k in tlist:
print(k.split())



do with 'k' whatever you wish
 
A

Anders Wegge Keller

....
The following is a sample of the data. There are hundreds of lines
that need to have an automated process of splitting the strings into
headings to be imported into excel with theses headings
ID Address StreetNum StreetName SufType Dir City Province PostalCode


1 1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0
2 4260 Mountainview Rd, Lincoln, ON L0R 1B2
3 25 Hunter Rd, Grimsby, E, ON L3M 4A3
4 1091 Hutchinson Rd, Haldimand, ON N0A 1K0
5 5172 Green Lane Rd, Lincoln, ON L0R 1B3
6 500 Glenridge Ave, East, St. Catharines, ON L2S 3A1
7 471 Foss Rd, Pelham, ON L0S 1C0
8 758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0
9 3836 Main St, North, Lincoln, ON L0R 1S0
10 1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0

The input doesn't look consistent to me. Is Dir supposed to be an
optional value? If that is the only optional, it can be worked
around. But if the missing direction (I'm guessing) is due to
malformed input data, you have a hell of a job in front of you.

What do you want to do with incomplete or malformed data? Try to
parse it as a "best effort", or simply spew out an error message for
an operator to look at?

In the latter case, I suggest a stepwise approach:

* Split input by ',' ->res0

* Split the first result by ' ' -> res

-> Id = res[0]
-> Address = res[1:]
-> StreetNum = res[1]
-> StreetName= res [2:]
-> SufType = res[-1]

* Check if res0[1] looks like a cardinal direction
If so Dir = res0[1]
Otherwise, croak or use the default direction. Insert an element in
the list, so the remainder is shifted to match the following steps.

-> City = res0[2]

* Split res0[3] by ' ' -> respp

respp[0] -> Province
respp[1:] -> Postcode


And put in som basic sanitation of the resulting values, before
committing them as a parsed result. Provinces and post codes, should
be easy enough to validate against a fixed list.
 
J

John Gordon

In said:
I have the following sample from a data set and I am looking to split the address number and name into separate headings as seen below.
FarmID Address
1 1067 Niagara Stone
2 4260 Mountainview
3 25 Hunter
4 1091 Hutchinson
5 5172 Green Lane
6 500 Glenridge
7 471 Foss
8 758 Niagara Stone
9 3836 Main
10 1025 York

FarmID AddressNum AddressName
1 1067 Niagara Stone
2 4260 Mountainview
3 25 Hunter
4 1091 Hutchinson
5 5172 Green Lane
6 500 Glenridge
7 471 Foss
8 758 Niagara Stone
9 3836 Main
10 1025 York
I have struggled with this for a while and know there must be a simple
method to achieve this result.

for line in input_lines:
fields = line.split()
farm_id = fields[0]
address_num = fields[1]
address_name = ' '.join(fields[2:])
 
T

Tim Chase

FarmID AddressNum AddressName
1 1067 Niagara Stone
2 4260 Mountainview
3 25 Hunter
4 1091 Hutchinson
I have struggled with this for a while and know there must be a
simple method to achieve this result.

for line in input_lines:
fields = line.split()
farm_id = fields[0]
address_num = fields[1]
address_name = ' '.join(fields[2:])

Or, you can split of just the parts you need:

for line in input_lines:
farm_id, street_no, street_name = line.split(None, 2)

It doesn't address the issues that Ben raised about the crazy formats
you can find in addresses, but address-parsing is an twisty maze of
passages all alike.

-tkc
 
D

Denis McMahon

I have the following sample from a data set and I am looking to split
the address number and name into separate headings as seen below.

FarmID Address 1 1067 Niagara Stone 2 4260 Mountainview 3 25 Hunter 4
1091 Hutchinson 5 5172 Green Lane 6 500 Glenridge 7 471 Foss 8 758
Niagara Stone 9 3836 Main 10 1025 York


FarmID AddressNum AddressName 1 1067 Niagara Stone 2 4260
Mountainview 3 25 Hunter 4 1091 Hutchinson 5
5172 Green Lane 6 500 Glenridge 7 471 Foss
8 758 Niagara Stone 9 3836 Main 10 1025 York

I have struggled with this for a while and know there must be a simple
method to achieve this result.

Unfortunately the vagaries of nntp, my client and the google nntp posting
host are such that I can't discern the format of your data from your post.

However, if as I think you have a text field that is always:

<1 or more digits><1 or more spaces><the rest>

where you want to capture the initial "1 or more digits" and "the rest"
as 2 data elements then this should be possible with a simple re:

(\d+)\s+(.*)

If you have numeric id, whitespace, numeric addr bit, whitespace, the
rest, then you may need something more like:

(\d+)\s+(\d+)\s+(.*)

The assumption is that it's not necessary to hold your hand through the
whole looping through the input and applying the re to each line, then
reading the captured bits and using them process.
 
D

Denis McMahon

The following is a sample of the data. There are hundreds of lines that
need to have an automated process of splitting the strings into headings
to be imported into excel with theses headings

ID Address StreetNum StreetName SufType Dir City Province
PostalCode

Ok, the following general method seems to work:

First, use a regex to capture two numeric groups and the rest of the line
separated by whitespace. If you can't find all three fields, you have
unexpected data format.

re.search( r"(\d+)\s+(\d+)\s+(.*)", data )

Second, split the rest of the line on a regex of comma + 0 or more
whitespace.

re.split( r",\s+", data )

Check that the rest of the line has 3 or 4 bits, otherwise you have an
unexpected lack or excess of data fields.

Split the first bit of the rest of the line into street name and suffix/
type. If you can't split it, use it as the street name and set the suffix/
type to blank.

re.search( r"(.*)\s+(\w+)", data )

If there are 3 bits in rest of line, set direction to blank, otherwise
set direction to the second bit.

Set the city to the last but one bit of the rest of the line.

Capture one word followed by two words in the last bit of the rest of the
line, and use these as the province and postcode.

re.search( r"(\w+)\s+(\w+\s+\w+)", data )

Providing none of the searches or the split errored, you should now have
the data fields you need to write. The easiest way to write them might be
to assemble them as a list and use the csv module.

I'm assuming you're capable of working out from the help on the python re
module what to use for each data, and how to access the captured results
of a search, and the results of a split. I'm also assuming you're capable
of working out how to use the csv module from the documentation. If
you're not, then either go back and ask your lecturer for help, or tell
your boss to hire a real programmer for his quick and easy coding jobs.
 
D

Denis McMahon

A mechanism using regexes

Just to follow up, using regexes I transformed the sample data that I
believe is as follows:

inputData = [
"1 1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0",
"2 4260 Mountainview Rd, Lincoln, ON L0R 1B2",
"3 25 Hunter Rd, Grimsby, E, ON L3M 4A3",
"4 1091 Hutchinson Rd, Haldimand, ON N0A 1K0",
"5 5172 Green Lane Rd, Lincoln, ON L0R 1B3",
"6 500 Glenridge Ave, East, St. Catharines, ON L2S 3A1",
"7 471 Foss Rd, Pelham, ON L0S 1C0",
"8 758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0",
"9 3836 Main St, North, Lincoln, ON L0R 1S0",
"10 1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0"
]

Into the following:

1,1067,Niagara Stone,Rd,W,Niagara-On-The-Lake,ON,L0S 1J0
2,4260,Mountainview,Rd,,Lincoln,ON,L0R 1B2
3,25,Hunter,Rd,Grimsby,E,ON,L3M 4A3
4,1091,Hutchinson,Rd,,Haldimand,ON,N0A 1K0
5,5172,Green Lane,Rd,,Lincoln,ON,L0R 1B3
6,500,Glenridge,Ave,East,St. Catharines,ON,L2S 3A1
7,471,Foss,Rd,,Pelham,ON,L0S 1C0
8,758,Niagara Stone,Rd,,Niagara-On-The-Lake,ON,L0S 1J0
9,3836,Main,St,North,Lincoln,ON,L0R 1S0
10,1025,York,Rd,W,Niagara-On-The-Lake,ON,L0S 1P0

Which should then read into Excel as CSV just fine.

One final question though, why are you using excel to manipulate data
that looks as if it would be better held in and manipulated by a database?
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top