Algorithm Question

Discussion in 'Python' started by Andrew McLean, Sep 10, 2006.

  1. This really an algorithm question more that a Python question, but it
    would be implemented in Python....

    I have a list of strings, A. I want to find a set of strings B such that
    for any "a in A" there exists "b in B" such that b is a sub-string of a.
    But I also want to minimise T = sum_j t_j where

    t_j = count of the number of elements in A which have b[j] as a sub-string

    My guess is that finding the smallest possible T satisfying the
    constraint would be hard. However, for my application just keeping it
    reasonably small would help.

    In my case the list A contains over two million addresses.

    The (top down) heuristic approach I am tempted to employ is to start by
    dividing the entries in A into sets of tokens, then take the union of
    all these sets as a starting point for B. Then I would try to trim B by

    1. looking for elements that I could remove while still satisfying the
    constraint

    2. replacing two elements by a common sub-string if that reduced T

    Anyway. It occurred to me that this might be a known problem. Any
    pointers gratefully received.

    - Andrew
     
    Andrew McLean, Sep 10, 2006
    #1
    1. Advertising

  2. Andrew McLean

    Neil Cerutti Guest

    On 2006-09-10, Andrew McLean <> wrote:
    > This really an algorithm question more that a Python question,
    > but it would be implemented in Python....


    In that case, try comp.programming. But still...

    > I have a list of strings, A. I want to find a set of strings B
    > such that for any "a in A" there exists "b in B" such that b is
    > a sub-string of a. But I also want to minimise T = sum_j t_j
    > where
    >
    > t_j = count of the number of elements in A which have b[j] as a
    > sub-string
    >
    > My guess is that finding the smallest possible T satisfying the
    > constraint would be hard. However, for my application just
    > keeping it reasonably small would help.
    >
    > In my case the list A contains over two million addresses.


    I'm afraid I don't understand your description.

    How about an example of B for some A?

    A = ["abb", "bbc"]
    B = ["a", "ab", "abb", "b", "bb", "bbc", "bc", "c"]

    Is that what you're after?

    --
    Neil Cerutti
     
    Neil Cerutti, Sep 11, 2006
    #2
    1. Advertising

  3. Andrew McLean

    Carl Banks Guest

    Andrew McLean wrote:
    > I have a list of strings, A. I want to find a set of strings B such that
    > for any "a in A" there exists "b in B" such that b is a sub-string of a.


    B=A?

    > But I also want to minimise T = sum_j t_j where
    > t_j = count of the number of elements in A which have b[j] as a sub-string


    If there not elements in A that are substrings of any other element in
    A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
    optimal (since elements of B have to be substring of at least one
    element in A). It looks like the B={set of all elements in A that are
    not a substring of any other element in A} is the generally optimal
    solution.

    I suspect you mistyped or omitted something--problem is underspecified
    at best.


    Carl Banks
     
    Carl Banks, Sep 11, 2006
    #3
  4. Carl Banks wrote:
    > Andrew McLean wrote:
    >> I have a list of strings, A. I want to find a set of strings B such that
    >> for any "a in A" there exists "b in B" such that b is a sub-string of a.

    >
    > B=A?
    >
    >> But I also want to minimise T = sum_j t_j where
    >> t_j = count of the number of elements in A which have b[j] as a sub-string

    >
    > If there not elements in A that are substrings of any other element in
    > A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
    > optimal (since elements of B have to be substring of at least one
    > element in A). It looks like the B={set of all elements in A that are
    > not a substring of any other element in A} is the generally optimal
    > solution.
    >
    > I suspect you mistyped or omitted something--problem is underspecified
    > at best.


    You are quite right. I was trying to generalise my real problem and
    missed out a constraint. I also want to keep length(B) small.
    Unfortunately, I'm a bit unsure about the relative importance of T and
    length(B), which makes the problem rather ill defined. I'll have to give
    this a bit more thought....
     
    Andrew McLean, Sep 11, 2006
    #4
  5. Andrew McLean

    John Machin Guest

    Andrew McLean wrote:
    > Carl Banks wrote:
    > > Andrew McLean wrote:
    > >> I have a list of strings, A. I want to find a set of strings B such that
    > >> for any "a in A" there exists "b in B" such that b is a sub-string of a.

    > >
    > > B=A?
    > >
    > >> But I also want to minimise T = sum_j t_j where
    > >> t_j = count of the number of elements in A which have b[j] as a sub-string

    > >
    > > If there not elements in A that are substrings of any other element in
    > > A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
    > > optimal (since elements of B have to be substring of at least one
    > > element in A). It looks like the B={set of all elements in A that are
    > > not a substring of any other element in A} is the generally optimal
    > > solution.
    > >
    > > I suspect you mistyped or omitted something--problem is underspecified
    > > at best.

    >
    > You are quite right. I was trying to generalise my real problem and
    > missed out a constraint. I also want to keep length(B) small.
    > Unfortunately, I'm a bit unsure about the relative importance of T and
    > length(B), which makes the problem rather ill defined. I'll have to give
    > this a bit more thought....


    A quick silly question: what is the problem that you are trying to
    solve?
     
    John Machin, Sep 12, 2006
    #5
  6. John Machin wrote:
    > A quick silly question: what is the problem that you are trying to
    > solve?


    A fair question :)

    The problem may seem a bit strange, but here it is:

    I have the ability to query a database in a legacy system and extract
    records which match a particular pattern. Specifically, I can perform
    queries for records that contain a given search term as a sub-string of
    a particular column. The specific column contains an address. This
    database can only be accessed through this particular interface (don't
    ask why, it's one of the reasons it's a *legacy* system).

    I also have access to a list that contains the vast majority (possibly
    all) the addresses which are stored in the database.

    Now I want to issue a series of queries, such that when I combine all
    the data returned I have accessed all the records in the database.
    However, I want to minimise the total number of queries and also want to
    keep the number of records returned by more than one query small.

    Now the current approach I use is to divide the addresses I have into
    tokens and take the last token in the address (excluding the postal
    code). The union of these "last tokens" forms my set of queries. The
    last token in the address is typically a county or a town in a UK address.

    This works, but I was wondering if I could do something more efficient.
    The problem is that while the search term "London" matches all the
    addresses in London it also returns all the addresses containing "London
    Road", and a lot of towns have a London Road. Perhaps I would be better
    off searching for "Road", "Street", "Avenue" ....

    It occurred to me that this my be isomorphic to a known problem, however
    given that I want to keep two things small, the problem isn't very well
    defined.

    The current approach works, I was just musing whether there was a faster
    approach, so don't think about it too hard.

    - Andrew
     
    Andrew McLean, Sep 15, 2006
    #6
  7. Andrew McLean wrote:

    > Now I want to issue a series of queries, such that when I combine all
    > the data returned I have accessed all the records in the database.
    > However, I want to minimise the total number of queries and also want to
    > keep the number of records returned by more than one query small.


    The first requirement makes sense, but what's the reason for the second
    one ? In any case, as you realized the problem is not well defined. For
    example, what's better:
    a) 2 queries each covering 2/3 of the total records (say R) whose
    overlap is 1/3*R, or
    b) 4 queries each covering 1/4*R with zero overlap ?

    George
     
    George Sakkis, Sep 15, 2006
    #7
  8. At Thursday 14/9/2006 20:31, Andrew McLean wrote:

    >Now I want to issue a series of queries, such that when I combine all
    >the data returned I have accessed all the records in the database.
    >However, I want to minimise the total number of queries and also want to
    >keep the number of records returned by more than one query small.


    This is known as a "set cover" algorithm. You have a set of subsets,
    and want to determine the smallest set of those subsets, whose union
    is the universal set - (uh, what a mess!)

    See "The Algorithm Design Manual" by Steven Skiena.
    http://www.cs.sunysb.edu/~algorith/files/set-cover.shtml
    Full text (not-so-legal, I presume):
    http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK/NODE4.HTM



    Gabriel Genellina
    Softlab SRL





    __________________________________________________
    Preguntá. Respondé. Descubrí.
    Todo lo que querías saber, y lo que ni imaginabas,
    está en Yahoo! Respuestas (Beta).
    ¡Probalo ya!
    http://www.yahoo.com.ar/respuestas
     
    Gabriel Genellina, Sep 15, 2006
    #8
  9. Gabriel Genellina wrote:

    > This is known as a "set cover" algorithm. You have a set of subsets,
    > and want to determine the smallest set of those subsets, whose union
    > is the universal set - (uh, what a mess!)


    I thought of that too, but he seems to be adding a second desired
    property: the intersections of the subsets should be as small as
    possible, or in other words the subsets should be as distinct as
    possible.

    George
     
    George Sakkis, Sep 15, 2006
    #9
  10. In message <eeconn$s2n$1$>, Andrew McLean wrote:

    > I have the ability to query a database in a legacy system and extract
    > records which match a particular pattern. Specifically, I can perform
    > queries for records that contain a given search term as a sub-string of
    > a particular column.


    What are the constraints on search terms? Are you allowed to specify the
    null search term, which would match everything?
     
    Lawrence D'Oliveiro, Sep 26, 2006
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. ortoped

    Search algorithm question

    ortoped, Aug 19, 2003, in forum: Java
    Replies:
    1
    Views:
    402
    Miguel De Anda
    Aug 19, 2003
  2. Ahmed Moustafa
    Replies:
    0
    Views:
    814
    Ahmed Moustafa
    Nov 15, 2003
  3. Anna Smith

    Graph Algorithm Question

    Anna Smith, Apr 24, 2004, in forum: Java
    Replies:
    2
    Views:
    938
    Mladen Adamovic
    Apr 24, 2004
  4. BlueBall
    Replies:
    12
    Views:
    654
    Dale King
    Apr 15, 2006
  5. Bapaiah Katepalli
    Replies:
    1
    Views:
    1,531
    Mike Treseler
    Jun 23, 2006
Loading...

Share This Page