Newbie asking, interesting question

Wondering

I'm struggling to learn Perl, with some degree of success. I have a
question that's a bit more advanced than I am, but I hope someone can
help (thanks in advance to all who read this and bigger thanks to
responders).

I'm trying to match name and address records in a large (~300,000
record) database with potential new records to avoid duplicates. Anyone
who has tried this knows that there are problems with exact matching,
especially if no convention has been followed for entering data.
(Consider all the possible variations of "avenue" - "avenue", "av",
"ave", etc., and when you consider drive, boulevard, etc. and all their
possible abbreviations, you begin to get the picture). So, I want to be
able to extract just the numeric characters in a string so I can do
the matching on those (it's fuzzy, but with other fields being
considered, too, we can get a fairly high matching rate). Anyone know
how to extract just the numeric characters?
I'll also accept any other ideas for doing the match.
 
Wondering

Right on. I know tr from *nix, just didn't occur to me to use it for
this. Big thanks!
 
Tad McClellan

Wondering said:
Subject: Newbie asking, interesting question


Please put the subject of your article in the Subject of your article.

Your article was not about a newbie asking interesting questions.
 
Anno Siegel

Wondering said:
I'm struggling to learn Perl, with some degree of success. I have a
question that's a bit more advanced than I am, but I hope someone can
help (thanks in advance to all who read this and bigger thanks to
responders).

I'm trying to match name and address records in a large (~300,000
record) database with potential new records to avoid duplicates. Anyone
who has tried this knows that there are problems with exact matching,
especially if no convention has been followed for entering data.
(Consider all the possible variations of "avenue" - "avenue", "av",
"ave", etc., and when you consider drive, boulevard, etc. and all their
possible abbreviations, you begin to get the picture). So, I want to be
able to extract just the numeric characters in a string so I can do
the matching on those (it's fuzzy, but with other fields being
considered, too, we can get a fairly high matching rate). Anyone know
how to extract just the numeric characters?

tr/0-9//cd;

That will delete everything except digits.
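
A minimal sketch of that tr call applied to address strings might look like the following. The helper name digits_only and the sample addresses are made up for illustration; the only part taken from the advice above is the tr/0-9//cd operation itself.

```perl
use strict;
use warnings;

# digits_only() is a hypothetical helper name, used here for illustration.
sub digits_only {
    my ($text) = @_;
    # Work on a copy so the caller's string is left alone.
    # /c complements the search list (everything that is NOT 0-9),
    # /d deletes the characters selected that way.
    (my $digits = $text) =~ tr/0-9//cd;
    return $digits;
}

# Invented sample records that differ only in how the street is written.
my @records = (
    '123 Main Ave., Apt 4',
    '123 Main Avenue #4',
);

print digits_only($_), "\n" for @records;
```

Both sample records reduce to the same digit string, which is the point of the approach. The key is lossy, though: "12 3rd St" and "1 23rd St" also reduce to the same digits, so it probably works best combined with other fields, as the original post suggests.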

Wondering said:
I'll also accept any other ideas for doing the match.

There's the Soundex method, with a corresponding standard module,
Text::Soundex. It tries to map words so that similar-sounding ones
map to the same code. It may also map different-sounding words to
the same code, but you're not overly concerned about false positives.
Your fields may need some pre-processing first (such as breaking them
into words in a useful way).
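
To give a feel for what a Soundex-style code does, here is a rough, self-contained sketch. It is a simplified hand-rolled variant, not the Text::Soundex module itself: the function name simple_soundex is hypothetical, and it treats H, W and Y like vowels, which may differ from other Soundex implementations in edge cases. In real code the module's own soundex function would be the natural choice.

```perl
use strict;
use warnings;

# simple_soundex() is a hypothetical helper, not part of Text::Soundex.
# It keeps the first letter, codes the remaining consonants as digits,
# collapses runs of the same code, and pads the result to four characters.
sub simple_soundex {
    my ($word) = @_;
    my $s = uc $word;
    $s =~ tr/A-Z//cd;                  # keep letters only
    return '' if $s eq '';

    my $first = substr($s, 0, 1);
    # Vowels plus H, W, Y become 0; the consonant groups become 1 to 6.
    $s =~ tr/AEHIOUWYBFPVCGJKQSXZDTLMNR/00000000111122222222334556/;
    my $first_code = substr($s, 0, 1);
    $s =~ s/^\Q$first_code\E+//;       # drop the first letter's own code
    $s =~ tr/0-6//s;                   # squeeze runs of the same code
    $s =~ tr/0//d;                     # the zeros only served as separators
    return substr($first . $s . '000', 0, 4);
}

# Names that sound alike should end up with the same code.
for my $name (qw(Robert Rupert Smith Smyth)) {
    printf "%-7s %s\n", $name, simple_soundex($name);
}
```

With this sketch, Robert and Rupert both come out as R163, and Smith and Smyth both as S530. For address matching, one option is to split a field into words first and compare the codes word by word.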

Anno
 
