Algorithm Question

Andrew McLean · Sep 10, 2006

This really an algorithm question more that a Python question, but it
would be implemented in Python....

I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
But I also want to minimise T = sum_j t_j where

t_j = count of the number of elements in A which have b[j] as a sub-string

My guess is that finding the smallest possible T satisfying the
constraint would be hard. However, for my application just keeping it
reasonably small would help.

In my case the list A contains over two million addresses.

The (top down) heuristic approach I am tempted to employ is to start by
dividing the entries in A into sets of tokens, then take the union of
all these sets as a starting point for B. Then I would try to trim B by

1. looking for elements that I could remove while still satisfying the
constraint

2. replacing two elements by a common sub-string if that reduced T

Anyway. It occurred to me that this might be a known problem. Any
pointers gratefully received.

- Andrew

Neil Cerutti · Sep 11, 2006

This really an algorithm question more that a Python question,
but it would be implemented in Python....

In that case, try comp.programming. But still...

I have a list of strings, A. I want to find a set of strings B
such that for any "a in A" there exists "b in B" such that b is
a sub-string of a. But I also want to minimise T = sum_j t_j
where

t_j = count of the number of elements in A which have b[j] as a
sub-string

My guess is that finding the smallest possible T satisfying the
constraint would be hard. However, for my application just
keeping it reasonably small would help.

In my case the list A contains over two million addresses.

I'm afraid I don't understand your description.

How about an example of B for some A?

A = ["abb", "bbc"]
B = ["a", "ab", "abb", "b", "bb", "bbc", "bc", "c"]

Is that what you're after?

Carl Banks · Sep 11, 2006

Andrew said:
I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
B=A?

But I also want to minimise T = sum_j t_j where
t_j = count of the number of elements in A which have b[j] as a sub-string

If there not elements in A that are substrings of any other element in
A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
optimal (since elements of B have to be substring of at least one
element in A). It looks like the B={set of all elements in A that are
not a substring of any other element in A} is the generally optimal
solution.

I suspect you mistyped or omitted something--problem is underspecified
at best.

Carl Banks

Andrew McLean · Sep 11, 2006

Carl said:
Andrew said:

I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
B=A?

But I also want to minimise T = sum_j t_j where
t_j = count of the number of elements in A which have b[j] as a sub-string

Click to expand...

If there not elements in A that are substrings of any other element in
A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
optimal (since elements of B have to be substring of at least one
element in A). It looks like the B={set of all elements in A that are
not a substring of any other element in A} is the generally optimal
solution.

I suspect you mistyped or omitted something--problem is underspecified
at best.

You are quite right. I was trying to generalise my real problem and
missed out a constraint. I also want to keep length(B) small.
Unfortunately, I'm a bit unsure about the relative importance of T and
length(B), which makes the problem rather ill defined. I'll have to give
this a bit more thought....

John Machin · Sep 12, 2006

Andrew said:
Carl said:

Andrew said:

I have a list of strings, A. I want to find a set of strings B such that
for any "a in A" there exists "b in B" such that b is a sub-string of a.
B=A?

But I also want to minimise T = sum_j t_j where
t_j = count of the number of elements in A which have b[j] as a sub-string

Click to expand...

If there not elements in A that are substrings of any other element in
A, and if B=A, then t_j would be 1 for all j. Which means B=A would be
optimal (since elements of B have to be substring of at least one
element in A). It looks like the B={set of all elements in A that are
not a substring of any other element in A} is the generally optimal
solution.

I suspect you mistyped or omitted something--problem is underspecified
at best.

Click to expand...

You are quite right. I was trying to generalise my real problem and
missed out a constraint. I also want to keep length(B) small.
Unfortunately, I'm a bit unsure about the relative importance of T and
length(B), which makes the problem rather ill defined. I'll have to give
this a bit more thought....

A quick silly question: what is the problem that you are trying to
solve?

Andrew McLean · Sep 15, 2006

John said:
A quick silly question: what is the problem that you are trying to
solve?

A fair question

The problem may seem a bit strange, but here it is:

I have the ability to query a database in a legacy system and extract
records which match a particular pattern. Specifically, I can perform
queries for records that contain a given search term as a sub-string of
a particular column. The specific column contains an address. This
database can only be accessed through this particular interface (don't
ask why, it's one of the reasons it's a *legacy* system).

I also have access to a list that contains the vast majority (possibly
all) the addresses which are stored in the database.

Now I want to issue a series of queries, such that when I combine all
the data returned I have accessed all the records in the database.
However, I want to minimise the total number of queries and also want to
keep the number of records returned by more than one query small.

Now the current approach I use is to divide the addresses I have into
tokens and take the last token in the address (excluding the postal
code). The union of these "last tokens" forms my set of queries. The
last token in the address is typically a county or a town in a UK address.

This works, but I was wondering if I could do something more efficient.
The problem is that while the search term "London" matches all the
addresses in London it also returns all the addresses containing "London
Road", and a lot of towns have a London Road. Perhaps I would be better
off searching for "Road", "Street", "Avenue" ....

It occurred to me that this my be isomorphic to a known problem, however
given that I want to keep two things small, the problem isn't very well
defined.

The current approach works, I was just musing whether there was a faster
approach, so don't think about it too hard.

- Andrew

George Sakkis · Sep 15, 2006

Andrew said:
Now I want to issue a series of queries, such that when I combine all
the data returned I have accessed all the records in the database.
However, I want to minimise the total number of queries and also want to
keep the number of records returned by more than one query small.

The first requirement makes sense, but what's the reason for the second
one ? In any case, as you realized the problem is not well defined. For
example, what's better:
a) 2 queries each covering 2/3 of the total records (say R) whose
overlap is 1/3*R, or
b) 4 queries each covering 1/4*R with zero overlap ?

George

Gabriel Genellina · Sep 15, 2006

Now I want to issue a series of queries, such that when I combine all
the data returned I have accessed all the records in the database.
However, I want to minimise the total number of queries and also want to
keep the number of records returned by more than one query small.

This is known as a "set cover" algorithm. You have a set of subsets,
and want to determine the smallest set of those subsets, whose union
is the universal set - (uh, what a mess!)

See "The Algorithm Design Manual" by Steven Skiena.
http://www.cs.sunysb.edu/~algorith/files/set-cover.shtml
Full text (not-so-legal, I presume):
http://www2.toki.or.id/book/AlgDesignManual/BOOK/BOOK/NODE4.HTM

Gabriel Genellina
Softlab SRL

__________________________________________________
Preguntá. Respondé. Descubrí.
Todo lo que querías saber, y lo que ni imaginabas,
está en Yahoo! Respuestas (Beta).
¡Probalo ya!
http://www.yahoo.com.ar/respuestas

George Sakkis · Sep 15, 2006

Gabriel said:
This is known as a "set cover" algorithm. You have a set of subsets,
and want to determine the smallest set of those subsets, whose union
is the universal set - (uh, what a mess!)

I thought of that too, but he seems to be adding a second desired
property: the intersections of the subsets should be as small as
possible, or in other words the subsets should be as distinct as
possible.

George

Lawrence D'Oliveiro · Sep 26, 2006

I have the ability to query a database in a legacy system and extract
records which match a particular pattern. Specifically, I can perform
queries for records that contain a given search term as a sub-string of
a particular column.

What are the constraints on search terms? Are you allowed to specify the
null search term, which would match everything?

Algorithm	1	Dec 15, 2019
Quick sort algorithm	1	Feb 22, 2023
Simplex Algorithm	10	Sep 2, 2013
While loop unclear, can someone help?	4	Dec 6, 2023
Rock paper scissors in python with "algorithm"	1	Feb 27, 2022
[newbie] Recursive algorithm - review	5	Jan 4, 2014
Math python question	10	Mar 31, 2023
How to create language translation program from dsl or bgl (by Open Source)	0	Nov 12, 2022

Algorithm Question

Andrew McLean

Neil Cerutti

Carl Banks

Andrew McLean

John Machin

Andrew McLean

George Sakkis

Gabriel Genellina

George Sakkis

Lawrence D'Oliveiro

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads