Algorithm on search

V

Vincent SHAO

Search engine have to record all of the query string. Now i have a
search engine log which contains 10 milllion query strings, but almost
of them are repeated, not more than 3 million of them are non-
repeated.
My task is to pick the top 10 most popular query string, memory < 1G,
the length of the query string is no more than 255.

The faster, the better.
the principal solutions, algorithm and data structure.

Thank you.:)
 
P

Pascal J. Bourguignon

Vincent SHAO said:
Search engine have to record all of the query string. Now i have a
search engine log which contains 10 milllion query strings, but almost
of them are repeated, not more than 3 million of them are non-
repeated.
My task is to pick the top 10 most popular query string, memory < 1G,
the length of the query string is no more than 255.

The faster, the better.
the principal solutions, algorithm and data structure.

Sounds like homework... Is it homework?


In real "search engine" queries, it's not strings, but _sets_ of
keywords that are searched, so already your data structures will be
suboptimal... That is, unless it's homework.

In anycase, have a look at http://en.wikipedia.org/wiki/Bloom_filter
 
J

Jim Langston

Vincent said:
Search engine have to record all of the query string. Now i have a
search engine log which contains 10 milllion query strings, but almost
of them are repeated, not more than 3 million of them are non-
repeated.
My task is to pick the top 10 most popular query string, memory < 1G,
the length of the query string is no more than 255.

The faster, the better.
the principal solutions, algorithm and data structure.

Thank you.:)

My first attempt would be to stuff the query strings into a map with the
query string (or a hash of it) as the key, the number of times it occurs as
the data.

Then a loop to read the data and sort, or simply compare counts and store
the keys for the top 10.
 
J

James Kanze

Sounds like homework... Is it homework?

It sounds like a problem that really could come up
professionally (which doesn't mean it isn't homework---a well
conceived course will use realistic examples).
In real "search engine" queries, it's not strings, but _sets_ of
keywords that are searched, so already your data structures will be
suboptimal... That is, unless it's homework.

What about SQL or LDAP? The problem is that he probably should
consider requests that different only in white space to be the
same; for that matter (considering SQL), something like "SELECT
* FROM table1 WHERE a > 5 AND b < 10" is the same as "select *
from table1 where b < 10 and a > 5". He probably needs to
normalize the input somewhat, but how far depends on the project
requirements.

Anyway, the obvious solution is to use an std::map< Query, long >
for starters, then copy the tabluation into a vector (of
Map::value_type) and sort that on the criteria he's interested
in. If that's not fast enough, replace std::map with a not
yet standard hash_map, and only if that's not fast enough, to
start worrying about other solutions.
In anycase, have a look athttp://en.wikipedia.org/wiki/Bloom_filter

That saves space, not time, and probably isn't appropriate for
such a small table. (Space savings become time savings when the
result in less disk accesses. In his case, however, the
complete table will almost certainly fit into real memory.)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,572
Members
45,046
Latest member
Gavizuho

Latest Threads

Top