Address search algorithms

Discussion in 'Ruby' started by Mikel Lindsaar, May 30, 2008.

  1. [Note: parts of this message were removed to make it a legal post.]

    Heya all,
    Got an interesting little problem, not specifically Ruby, but I hope to
    solve it with some nice ruby :)

    I need to be able to do searches on a data set of a street address.

    The address has three fields, what is needed is an efficient way to search
    this address with partial data.

    The addresses are in a database, what I am thinking is going through and pre
    generating a search string based on the three concatenated address values
    and then index this search column and search off that using the output from
    the same algorithm on the entered search values.

    Would full text search be the best solution here? The DB is on postgres by
    the well.

    The problem is that someone might want to search for "back st" and expect to
    match "123 back street", or worse, "123 back lane" and get "123 back street"

    Anyone else run into these sorts of problems, any feed back?


    Mikel
    Mikel Lindsaar, May 30, 2008
    #1
    1. Advertising

  2. On 30 May 2008, at 10:09, Mikel Lindsaar wrote:
    > Heya all,
    > Got an interesting little problem, not specifically Ruby, but I hope
    > to
    > solve it with some nice ruby :)
    >
    > I need to be able to do searches on a data set of a street address.
    >
    > The address has three fields, what is needed is an efficient way to
    > search
    > this address with partial data.


    Assuming these are UK-only addresses, at the very least you should
    have four fields:

    * building name/number;
    * street;
    * town;
    * postcode

    and the national postcode database allows for seven if I remember
    correctly - although it's usual for a couple of those to be blank.
    Depending on how well normalised you want the database it can be a
    good idea to split out the last three into their own tables and only
    store keys in your main address table. You'll get the biggest gain
    from doing that with towns, but streets are normally a good gain as
    well. Oh, and obviously use indexes.

    > The addresses are in a database, what I am thinking is going through
    > and pre
    > generating a search string based on the three concatenated address
    > values
    > and then index this search column and search off that using the
    > output from
    > the same algorithm on the entered search values.
    >
    > Would full text search be the best solution here? The DB is on
    > postgres by
    > the well.
    >
    > The problem is that someone might want to search for "back st" and
    > expect to
    > match "123 back street", or worse, "123 back lane" and get "123 back
    > street"


    * Keep a dictionary of well known abbreviations and massage the query
    to expand these;
    * store the street number in its own column (which won't necessarily
    be numeric thanks to all the Flat 1's and 22B's lurking in the system,
    not to mention named properties without a number etc.);
    * store the postcode in its own column and filter by postcode whenever
    possible;
    * use a soundex algorithm to store all words as phonetic
    approximations to deal with simple misspellings (bock vs. back, bak
    vs. back);
    * AND MOST IMPORTANTLY: constrain the way the user can enter a query
    so that it keeps the query as simple as possible;
    * use whatever full text search facility postgres has

    > Anyone else run into these sorts of problems, any feed back?


    Many many years ago :)
    Out of interest, are you using the postcode database to generate your
    address information, or is it user-submitted data? If the latter, make
    sure you constrain how data can be submitted. And either way, do as
    much pre-processing as possible to get the addresses into a normalised
    form.


    Ellie

    Eleanor McHugh
    Games With Brains
    http://slides.games-with-brains.net
    ----
    raise ArgumentError unless @reality.responds_to? :reason
    Eleanor McHugh, May 30, 2008
    #2
    1. Advertising

  3. [Note: parts of this message were removed to make it a legal post.]

    >
    > Many many years ago :)
    >


    Thanks all for your tips!

    I have already implemented most of them... but some of them are also good
    ideas.

    > Out of interest, are you using the postcode database to generate your

    address
    > information, or is it user-submitted data? If the latter, make sure you

    constrain
    > how data can be submitted. And either way, do as much pre-processing as
    > possible to get the addresses into a normalised form.


    This is pre-existing data... the worst kind :)

    Mikel
    Mikel Lindsaar, May 30, 2008
    #3
  4. On 30 May 2008, at 13:08, Mikel Lindsaar wrote:
    > This is pre-existing data... the worst kind :)


    You have my sympathy ;)


    Ellie

    Eleanor McHugh
    Games With Brains
    http://slides.games-with-brains.net
    ----
    raise ArgumentError unless @reality.responds_to? :reason
    Eleanor McHugh, May 30, 2008
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. avnrao
    Replies:
    1
    Views:
    7,217
    avnrao
    May 4, 2004
  2. Patrice
    Replies:
    0
    Views:
    880
    Patrice
    May 4, 2004
  3. Raterus
    Replies:
    0
    Views:
    432
    Raterus
    Jun 2, 2004
  4. Chris M. Thomasson

    automated test for sub-string search algorithms...

    Chris M. Thomasson, Feb 20, 2010, in forum: C Programming
    Replies:
    3
    Views:
    522
    Chris M. Thomasson
    Feb 23, 2010
  5. Abby Lee
    Replies:
    5
    Views:
    393
    Abby Lee
    Aug 2, 2004
Loading...

Share This Page