Newbie asking, interesting question

Discussion in 'Perl Misc' started by Wondering, Feb 4, 2005.

  1. Wondering

    Wondering Guest

    I'm struggling to learn Perl, with some degree of success. I have a
    question that's a bit more advanced than I am, but I hope someone can
    help (thanks in advance to all who read this and bigger thanks to
    responders).

    I'm trying to match name and address records in a large (~300,000
    record) database with potential new records to avoid duplicates. Anyone
    who has tried this knows that there are problems with exact matching,
    especially if no convention has been followed for entering data.
    (Consider all the possible variations of "avenue" - "avenue", "av",
    "ave", etc., and when you consider drive, boulevard, etc. and all their
    possible abbreviations, you begin to get the picture). So, I want to be
    able to extract just the numeric characters in a string so I can do
    the matching on those (it's fuzzy, but with other fields being
    considered, too, we can get a fairly high matching rate). Anyone know
    how to extract just the numeric characters?
    I'll also accept any other ideas for doing the match.
     
    Wondering, Feb 4, 2005
    #1

  2. Wondering

    Wondering Guest

    Right on. I know tr from *nix, just didn't occur to me to use it for
    this. Big thanks!
     
    Wondering, Feb 4, 2005
    #2

  3. Wondering <> wrote:

    > Subject: Newbie asking, interesting question



    Please put the subject of your article in the Subject of your article.

    Your article was not about a newbie asking interesting questions.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Feb 4, 2005
    #3
  4. Wondering

    Anno Siegel Guest

    Wondering <> wrote in comp.lang.perl.misc:
    > I'm struggling to learn Perl, with some degree of success. I have a
    > question that's a bit more advanced than I am, but I hope someone can
    > help (thanks in advance to all who read this and bigger thanks to
    > responders).
    >
    > I'm trying to match name and address records in a large (~300,000
    > record) database with potential new records to avoid duplicates. Anyone
    > who has tried this knows that there are problems with exact matching,
    > especially if no convention has been followed for entering data.
    > (Consider all the possible variations of "avenue" - "avenue", "av",
    > "ave", etc., and when you consider drive, boulevard, etc. and all their
    > possible abbreviations, you begin to get the picture). So, I want to be
    > able to extract just the numeric characters in a string so I can do
    > the matching on those (it's fuzzy, but with other fields being
    > considered, too, we can get a fairly high matching rate). Anyone know
    > how to extract just the numeric characters?


    tr/0..9//cd;

    That will delete everything except digits.

    > I'll also accept any other ideas for doing the match.


    There's the Soundex method with a corresponding standard module
    Text::Soundex. It tries to map words so that similar-sounding ones
    map to the same thing. It may also map different-sounding words to
    the same thing, but you're not overly concerned about false positives.
    Your fields may need some pre-processing (such as breaking them into
    words in a useful way).
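
    As a rough sketch of that idea, assuming Text::Soundex is installed
    (on recent Perl versions it may have to come from CPAN) and using a
    made-up street-name field split on whitespace, each word can be
    mapped to its Soundex code and the codes compared:

```perl
use strict;
use warnings;
use Text::Soundex;    # exports soundex() by default

# Hypothetical field values, invented for illustration only.
# split ' ' breaks each field into words on whitespace.
my @codes_a = map { soundex($_) } split ' ', 'Smith Street';
my @codes_b = map { soundex($_) } split ' ', 'Smyth Street';

# Similar-sounding spellings map to the same code.
print "@codes_a\n";    # prints S530 S363
print "@codes_b\n";    # prints S530 S363
```

    Soundex groups spelling variants, not abbreviations: "Ave" and
    "Avenue" get different codes, so abbreviations would still need
    separate normalization.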

    Anno
     
    Anno Siegel, Feb 6, 2005
    #4
  5. Anno Siegel <-berlin.de> wrote:

    > tr/0..9//cd;
    >
    > That will delete everything except digits.



    Make that

    tr/0-9//cd;

    please.
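
    For anyone wondering why the hyphen matters: inside tr///, a range
    is written with "-", so "0..9" is just the literal characters 0, .
    and 9. A small sketch with a made-up address string shows the
    difference (the variable names are for illustration only):

```perl
use strict;
use warnings;

# Hypothetical sample address, made up for illustration.
my $address = '1090 N. Main Ave., Apt 5B';

# Copy first, because tr/// modifies its target in place.
# c = complement the search list, d = delete the matched characters.
(my $digits = $address) =~ tr/0-9//cd;

# The typo version keeps only the characters 0, 9 and the dot.
(my $typo = $address) =~ tr/0..9//cd;

print "$digits\n";    # prints 10905
print "$typo\n";      # prints 090..
```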

    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Feb 6, 2005
    #5
  6. Wondering

    Anno Siegel Guest

    Tad McClellan <> wrote in comp.lang.perl.misc:
    > Anno Siegel <-berlin.de> wrote:
    >
    > > tr/0..9//cd;
    > >
    > > That will delete everything except digits.

    >
    >
    > Make that
    >
    > tr/0-9//cd;
    >
    > please.


    Yes. Oh boy. Looks like I violated the copy/paste rule.

    Anno
     
    Anno Siegel, Feb 6, 2005
    #6
