Problem with comparing a huge number of strings

Discussion in 'Ruby' started by Jan Fischer, Oct 6, 2008.

  1. Jan Fischer

    Jan Fischer Guest

    Hello everyone,

    I have a problem grouping rows in a database by similarity. What I'm
    trying to achieve is the following. I have a table that looks like this
    (just an example):

    ID  Companyname        Group
    1   Mickeysoft Ltd.    NULL
    2   Mickysoft LTD      NULL
    3   Mickeysft Limited  NULL
    4   Aple Inc           NULL
    5   APPLE INC          NULL

    and so on - you get the point. Group should be 1 for IDs 1 to 3, and 2
    for IDs 4 and 5.

    At the moment I compare two strings by making them lowercase, deleting
    dots etc., deleting the substrings 'Inc', 'Ltd' etc., and then computing
    the Levenshtein distance of the metaphone keys of the two strings.
    It doesn't work very well and is damn slow, but it's okay and the best
    I could figure out. (Nevertheless, your hints on that are welcome too.)
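
    Roughly, my comparison looks like this (a simplified sketch - I'm using
    the text gem's Metaphone and Levenshtein here, and the real suffix list
    is longer):

      require 'text'   # gem install text

      SUFFIXES = /\b(inc|ltd|limited|llc)\b/

      # Lowercase, strip punctuation and legal-form suffixes.
      def normalize(name)
        name.downcase.gsub(/[.,]/, '').gsub(SUFFIXES, '').strip
      end

      # Two names match if the Levenshtein distance between their
      # metaphone keys is small enough.
      def match?(a, b, max_distance = 2)
        ka = Text::Metaphone.metaphone(normalize(a))
        kb = Text::Metaphone.metaphone(normalize(b))
        Text::Levenshtein.distance(ka, kb) <= max_distance
      end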

    My problem is that I don't know how to apply my compare method in an
    efficient way. What I'm doing now is selecting the first row where Group
    is NULL and then selecting each row (where Group is NULL) from my
    database in a loop again, comparing it with the first selected string
    and setting Group to a certain number if the compare method says they
    match.

    That takes a long time and - worse - my code looks very ugly, even to a
    beginner like me.

    So if any of you has a good idea how to improve this, some docs on how
    to implement such a thing, or even a totally different approach to get
    the results I want, I would be very glad.

    Jan
     
    Jan Fischer, Oct 6, 2008
    #1

  2. On Mon, Oct 6, 2008 at 1:03 AM, Jan Fischer <> wrote:
    > At the moment I compare two strings by making them lowercase, deleting
    > dots etc., deleting the substrings 'Inc', 'Ltd' etc., and then computing
    > the Levenshtein distance of the metaphone keys of the two strings.
    > It doesn't work very well and is damn slow, but it's okay and the best
    > I could figure out. (Nevertheless, your hints on that are welcome too.)
    >
    > My problem is that I don't know how to apply my compare method in an
    > efficient way. What I'm doing now is selecting the first row where Group
    > is NULL and then selecting each row (where Group is NULL) from my
    > database in a loop again, comparing it with the first selected string
    > and setting Group to a certain number if the compare method says they
    > match.


    The problem with relying on a compare function for grouping is that it
    leads naturally to an O(n^2) solution. Ideally, what you want is some
    sort of "characteristic" function, so that two entries with the same
    characteristic are likely to be in the same grouping. Of course,
    coming up with the characteristic is not always easy, and may not even
    be possible, but it's a useful direction to think in. Look up fuzzy
    searching techniques, too, and see if they have anything adaptable.
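
    For illustration (a toy sketch - characteristic() here stands for
    whatever key function you come up with):

      # Grouping by a characteristic key is one O(n) pass over a hash,
      # instead of O(n^2) pairwise comparisons.
      def group_by_characteristic(names)
        groups = Hash.new { |h, k| h[k] = [] }
        names.each { |name| groups[characteristic(name)] << name }
        groups.values
      end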

    martin
     
    Martin DeMello, Oct 6, 2008
    #2

  3. 2008/10/6 Martin DeMello <>:
    > On Mon, Oct 6, 2008 at 1:03 AM, Jan Fischer <> wrote:
    >> At the moment I compare two strings by making them lowercase, deleting
    >> dots etc., deleting the substrings 'Inc', 'Ltd' etc., and then computing
    >> the Levenshtein distance of the metaphone keys of the two strings.
    >> It doesn't work very well and is damn slow, but it's okay and the best
    >> I could figure out. (Nevertheless, your hints on that are welcome too.)
    >>
    >> My problem is that I don't know how to apply my compare method in an
    >> efficient way. What I'm doing now is selecting the first row where Group
    >> is NULL and then selecting each row (where Group is NULL) from my
    >> database in a loop again, comparing it with the first selected string
    >> and setting Group to a certain number if the compare method says they
    >> match.

    >
    > The problem with relying on a compare function for grouping is that it
    > leads naturally to an O(n^2) solution. Ideally, what you want is some
    > sort of "characteristic" function, so that two entries with the same
    > characteristic are likely to be in the same grouping. Of course,
    > coming up with the characteristic is not always easy, and may not even
    > be possible, but it's a useful direction to think in. Look up fuzzy
    > searching techniques, too, and see if they have anything adaptable.


    Also, look into your database's feature set. Most modern RDBMS come
    with text-searching capabilities, and dealing with large volumes of
    data is where those beasts excel. If such a characteristic function
    can be found, chances are that you can implement it as a function in
    the DB and resolve the issue via database queries / updates.
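
    For example (a hypothetical sketch - MySQL ships a built-in SOUNDEX();
    the companies table and name_key column are made-up names):

      require 'mysql'   # gem install mysql

      db = Mysql.real_connect('localhost', 'user', 'password', 'mydb')
      # Store a characteristic key once per row, then let the DB group:
      db.query("UPDATE companies SET name_key = SOUNDEX(companyname)")
      db.query("SELECT name_key, GROUP_CONCAT(id) FROM companies " \
               "GROUP BY name_key")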

    Kind regards

    robert

    --
    remember.guy do |as, often| as.you_can - without end
     
    Robert Klemme, Oct 6, 2008
    #3
  4. Martin DeMello wrote:
    > On Mon, Oct 6, 2008 at 1:03 AM, Jan Fischer <> wrote:
    >> certain number if the compare method says they match.

    > The problem with relying on a compare function for grouping is that it
    > leads naturally to an O(n^2) solution. Ideally, what you want is some
    > sort of "characteristic" function, so that two entries with the same
    > characteristic are likely to be in the same grouping.


    Soundex on the first word, perhaps? (See Wikipedia.)

    I don't know how appropriate it would be for your dataset, but it's easy
    to try out. Maybe at least it will give you some smaller groups which in
    turn can be subjected to more detailed analysis. If it works well, you
    can add an extra column for this value, with an index, so your DB can
    search and group on it efficiently.
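
    Something like this, with the text gem (a quick sketch; Text::Soundex
    ships with it):

      require 'text'   # gem install text

      # Soundex of the first word as a grouping key.
      def soundex_key(name)
        Text::Soundex.soundex(name.split.first)
      end

      soundex_key('Mickeysoft Ltd.')  # => "M221"
      soundex_key('Mickysoft LTD')    # => "M221"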
     
    Brian Candler, Oct 6, 2008
    #4
  5. Ragav Satish

    Ragav Satish Guest

    Jan Fischer wrote:
    > Hello everyone,
    >
    > I have a problem grouping rows in a database by similarity. What I'm
    > trying to achieve is the following. I have a table that looks like this
    > (just an example):
    >
    > ID  Companyname        Group
    > 1   Mickeysoft Ltd.    NULL
    > 2   Mickysoft LTD      NULL
    > 3   Mickeysft Limited  NULL
    > 4   Aple Inc           NULL
    > 5   APPLE INC          NULL
    >
    > and so on - you get the point. Group should be 1 for IDs 1 to 3, and 2
    > for IDs 4 and 5.
    >
    > At the moment I compare two strings by making them lowercase, deleting
    > dots etc., deleting the substrings 'Inc', 'Ltd' etc., and then computing
    > the Levenshtein distance of the metaphone keys of the two strings.
    > It doesn't work very well and is damn slow, but it's okay and the best
    > I could figure out. (Nevertheless, your hints on that are welcome too.)
    >
    > My problem is that I don't know how to apply my compare method in an
    > efficient way. What I'm doing now is selecting the first row where Group
    > is NULL and then selecting each row (where Group is NULL) from my
    > database in a loop again, comparing it with the first selected string
    > and setting Group to a certain number if the compare method says they
    > match.
    >
    > That takes a long time and - worse - my code looks very ugly, even to a
    > beginner like me.
    >
    > So if any of you has a good idea how to improve this, some docs on how
    > to implement such a thing, or even a totally different approach to get
    > the results I want, I would be very glad.
    >
    > Jan


    First, at a minimum, if you apply the Lev distance to each pair of keys
    you might want to implement this as a custom database function. E.g. in
    MySQL you can create custom user functions:
    http://dev.mysql.com/doc/refman/6.0/en/adding-functions.html

    The naive implementation of comparing each key with every other is
    quadratic. You need a way of "blocking/chunking" the dataset.

    1. For example, can you assume that the first letter will never be
    misspelled? Then you can group by first letter and create a smaller
    search space.
    2. Or, if you don't accept a Lev distance > 2, maybe you can first
    sort the names by length and only compute the Lev distance for those
    which differ in length by less than 3? (See the sketch below.)
    3. Perhaps the records have other fields which can be used individually
    or in combination to generate a sort key?
    4. There are more complicated blocking schemes, like n-grams (chunk by
    N common contiguous characters), or even sampling-based ones.

    A good blocking algorithm will be the key to scalability.
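
    A rough sketch of idea 2 (sorted-neighborhood style; match? here is
    assumed to be something like Lev distance on the metaphone keys):

      # Sort by length, then compare each name only against later names
      # whose length differs by less than 3.
      def candidate_pairs(names)
        sorted = names.sort_by(&:length)
        pairs = []
        sorted.each_with_index do |a, i|
          sorted[(i + 1)..-1].each do |b|
            break if b.length - a.length >= 3  # sorted, so stop early
            pairs << [a, b] if match?(a, b)
          end
        end
        pairs
      end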

    There is lots of literature out there on this - search for sorted
    neighborhood, record linkage, merge/purge, and hardening soft databases.

    --Cheers
    --Ragav
     
    Ragav Satish, Oct 6, 2008
    #5
  6. On Mon, Oct 6, 2008 at 12:12 PM, Ragav Satish <> wrote:
    > 4. There are more complicated blocking schemes, like n-grams (chunk
    > by N common contiguous characters), or even sampling-based ones.


    n-grams were my first thought - indeed, I'd started writing out a
    bigram-based scheme, then I realised that it would fail badly if there
    was a common word like "systems" or "computers" that a lot of the
    entries had. Maybe some sort of multipass scheme to first reduce each
    entry to a characteristic word, then do n-gram frequency analysis on
    those words. (My idea was this - pass 1: build a frequency table of
    bigrams; pass 2: characterise each entry by the presence/multiplicity
    of the six or so most common ones.)
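
    Something along these lines (an unpolished sketch of the two passes):

      # Pass 1: count bigram frequencies across all names.
      def bigram_counts(names)
        counts = Hash.new(0)
        names.each do |name|
          name.downcase.chars.each_cons(2) { |pr| counts[pr.join] += 1 }
        end
        counts
      end

      # Pass 2: characterise each name by how often it contains each of
      # the most common bigrams.
      def characterise(name, top_bigrams)
        top_bigrams.map { |bg| name.downcase.scan(bg).size }
      end

      # top = bigram_counts(names).sort_by { |_, n| -n }.first(6)
      #                           .map(&:first)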

    martin
     
    Martin DeMello, Oct 7, 2008
    #6
  7. Ragav Satish

    Ragav Satish Guest

    Martin DeMello wrote:
    > On Mon, Oct 6, 2008 at 12:12 PM, Ragav Satish <>
    > wrote:
    >> 4. There are more complicated blocking schemes, like n-grams (chunk
    >> by N common contiguous characters), or even sampling-based ones.

    >
    > n-grams were my first thought - indeed, I'd started writing out a
    > bigram-based scheme, then I realised that it would fail badly if there
    > was a common word like "systems" or "computers" that a lot of the
    > entries had. Maybe some sort of multipass scheme to first reduce each
    > entry to a characteristic word, then do n-gram frequency analysis on
    > those words. (My idea was this - pass 1: build a frequency table of
    > bigrams; pass 2: characterise each entry by the presence/multiplicity
    > of the six or so most common ones.)
    >
    > martin


    Yes, indeed. The choice of blocking scheme is largely dependent on
    what kinds of key variations are expected and on the size of the
    database.

    1. As a first stage of normalization, the OP mentioned he was stripping
    off INC, LLC, Limited and the like. As you mentioned earlier, it might
    be possible to use a single token - so a second token like Systems,
    Computers etc. can be stripped off during blocking and only get applied
    in phase II, when comparing intra-block records with a more exact
    scheme like Lev distance.

    2. Before we even go to a bigram scheme, simple schemes like Soundex,
    first-two-character truncation, NYSIIS etc. should be tried.

    3. Better yet, two or more of these should be applied in multiple
    passes, and the blocks built out of the union of the blocks produced
    by each method.

    4. If bigrams are used, then I would go with full bigram indexing (not
    just the six most common) with a threshold value and reverse indexes;
    see http://datamining.anu.edu.au/publications/2003/kdd03-3pages.pdf
    (sec 2.1). (That paper shows Python code, so it could probably be
    converted to Ruby easily.) A rough sketch follows below.
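
    My own illustration of the idea (not the paper's code - just a very
    rough sketch):

      # Reverse index: each bigram maps to the ids of the records that
      # contain it.
      def bigram_index(names)
        index = Hash.new { |h, k| h[k] = [] }
        names.each_with_index do |name, id|
          bigrams = name.downcase.chars.each_cons(2).map(&:join).uniq
          bigrams.each { |bg| index[bg] << id }
        end
        index
      end

      # Two records become candidate duplicates if they share at least a
      # threshold fraction of their bigrams.
      def candidates(names, threshold = 0.8)
        shared = Hash.new(0)
        bigram_index(names).each_value do |ids|
          ids.combination(2) { |pair| shared[pair] += 1 }
        end
        shared.select { |(a, b), n|
          n >= threshold * ([names[a].length, names[b].length].min - 1)
        }.keys
      end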

    --Cheers
    --Ragav

     
    Ragav Satish, Oct 7, 2008
    #7
  8. Jan Fischer wrote:
    > Hello everyone,
    >
    > I have a problem grouping rows in a database by similarity. What I'm
    > trying to achieve is the following. I have a table that looks like this
    > (just an example):
    >
    > ID  Companyname        Group
    > 1   Mickeysoft Ltd.    NULL
    > 2   Mickysoft LTD      NULL
    > 3   Mickeysft Limited  NULL
    > 4   Aple Inc           NULL
    > 5   APPLE INC          NULL
    >
    > and so on - you get the point. Group should be 1 for IDs 1 to 3, and 2
    > for IDs 4 and 5.


    If you are trying to catch every kind of typo, then I do not think
    there is an algorithm that will suffice.

    You said that they are in a database. What if you were to do a

    select distinct Companyname from theTable
    order by Companyname

    Then just go over it yourself. Perhaps you can add a field, calling it
    newID or something like that, and enter a value in that field to
    identify which ones go together. Perhaps you can even add another field
    that has the correct spelling. Then, when you have gone through it all,
    you can write something that uses the correct spelling and have users
    go through a pick list to prevent spelling errors.

    On the other hand, depending on how the table is set up, you could just
    mark all but one for deletion.

    You need to do this by hand IMO. GL!
     
    Lloyd Linklater, Oct 7, 2008
    #8
  9. Jan Fischer

    Jan Fischer Guest

    Hello again,

    Robert Klemme wrote on 06.10.2008 at 13:05:
    > 2008/10/6 Martin DeMello <>:
    >> On Mon, Oct 6, 2008 at 1:03 AM, Jan Fischer <> wrote:
    >>> At the moment I compare two strings by making them lowercase, deleting
    >>> dots etc., deleting the substrings 'Inc', 'Ltd' etc., and then computing
    >>> the Levenshtein distance of the metaphone keys of the two strings.
    >>> It doesn't work very well and is damn slow, but it's okay and the best
    >>> I could figure out. (Nevertheless, your hints on that are welcome too.)
    >>>
    >>> My problem is that I don't know how to apply my compare method in an
    >>> efficient way. What I'm doing now is selecting the first row where Group
    >>> is NULL and then selecting each row (where Group is NULL) from my
    >>> database in a loop again, comparing it with the first selected string
    >>> and setting Group to a certain number if the compare method says they
    >>> match.

    >> The problem with relying on a compare function for grouping is that it
    >> leads naturally to an O(n^2) solution. Ideally, what you want is some
    >> sort of "characteristic" function, so that two entries with the same
    >> characteristic are likely to be in the same grouping. Of course,
    >> coming up with the characteristic is not always easy, and may not even
    >> be possible, but it's a useful direction to think in. Look up fuzzy
    >> searching techniques, too, and see if they have anything adaptable.

    >
    > Also, look into your database's feature set. Most modern RDBMS come
    > with text-searching capabilities, and dealing with large volumes of
    > data is where those beasts excel. If such a characteristic function
    > can be found, chances are that you can implement it as a function in
    > the DB and resolve the issue via database queries / updates.


    thank you very much for your hints. I think you both suggest using
    something like the soundex function to characterize each single row in
    a certain way and then grouping by this characteristic. (Like Brian
    Candler suggests in another post too - thanks, Brian.)
    The fact is that I already tried the "soundex solution", but the
    results don't even come close to what I expected. I also tried building
    the metaphone key of every string and grouping by that, but that wasn't
    enough either.

    I think a mix of your hints and what Ragav Satish suggests in his post
    will probably lead me in the right direction:
    the first step is to "normalize" my strings (strip "Ltd." etc.), build
    the metaphone key, and then write that back into the database instead
    of recalculating everything for each comparison. Should be O(n), right?
    (I read about Landau notation for the first time...)
    After this step I'll try to group my strings by other information -
    maybe by country of origin, zip code (if available) or first letter -
    before I compare every string with each other.
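
    In code, that first step might look roughly like this (a sketch, using
    the text gem; the companies table and name_key column are just my
    working names):

      # Sketch: normalize each name, build its metaphone key once, and
      # write the key back to the database (pseudo-DB calls below).
      require 'text'   # gem install text

      def name_key(name)
        cleaned = name.downcase.gsub(/[.,]/, '')
                      .gsub(/\b(inc|ltd|limited|llc)\b/, '').strip
        Text::Metaphone.metaphone(cleaned)
      end

      # rows = db.query("SELECT id, companyname FROM companies")
      # rows.each do |id, name|
      #   db.query("UPDATE companies SET name_key = '#{name_key(name)}'
      #             WHERE id = #{id}")
      # end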

    I have to be honest: I had never heard of n-gram or bigram schemes, so
    I think I'll have to read up on them first. But it sounds very
    interesting and I'm very thankful for these hints.

    Kind regards
    Jan
     
    Jan Fischer, Oct 8, 2008
    #9
  10. Jan Fischer

    Jan Fischer Guest

    Hi - thank you very much, Ragav and Martin. You both helped me a lot!
    I replied to your posts in the other subthread. (MId:
    <gchq3f$1ai$>)

    Jan
     
    Jan Fischer, Oct 8, 2008
    #10
  11. On 08.10.2008 10:13, Jan Fischer wrote:

    > thank you very much for your hints. I think you both suggest using
    > something like the soundex function to characterize each single row
    > in a certain way and then grouping by this characteristic. (Like
    > Brian Candler suggests in another post too - thanks, Brian.)
    > The fact is that I already tried the "soundex solution", but the
    > results don't even come close to what I expected. I also tried
    > building the metaphone key of every string and grouping by that, but
    > that wasn't enough either.
    >
    > I think a mix of your hints and what Ragav Satish suggests in his
    > post will probably lead me in the right direction:


    Yep, I agree.

    > the first step is to "normalize" my strings (strip "Ltd." etc.),
    > build the metaphone key, and then write that back into the database
    > instead of recalculating everything for each comparison. Should be
    > O(n), right?


    Yes, kind of. In databases, O calculus is not so important, because SQL
    is declarative, not procedural. In other words, you have no control
    (in reality there is some control you can exert, but it is important to
    keep in mind that SQL is not a programming language) over which
    operations the DB will execute in order to give you the result set you
    requested. It is more important to care about access paths, so you
    definitely want to index the result of this function.

    You did not state which RDBMS you are using. I know Oracle very well
    and SQL Server pretty well, so I'll speak only about those two. Chances
    are that you can do similar things with other products. In both cases
    you can define a function for this. In SQL Server you can then store a
    computed column which uses this function, and in Oracle you can
    simulate the computed column with a trigger or create a function-based
    index. Then you can query that computed column / function efficiently
    and group by or order by that column. If you add analytic SQL to the
    mix you can even do some nice additional tricks, like counting per
    column value etc.
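
    For SQL Server, that might look roughly like this (a hypothetical
    sketch sent from Ruby; the companies table and the dbo.NameKey
    user-defined function are made up):

      # Hypothetical sketch: dbo.NameKey is a made-up, deterministic UDF
      # that returns the characteristic key for a company name.
      ddl = <<-SQL
        ALTER TABLE companies
          ADD name_key AS dbo.NameKey(companyname) PERSISTED;
        CREATE INDEX ix_companies_name_key ON companies (name_key);
      SQL
      # connection.execute(ddl)  # with whatever DB library you use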

    Cheers

    robert
     
    Robert Klemme, Oct 9, 2008
    #11
