Algorithm on search

Vincent SHAO · May 6, 2008

Search engine have to record all of the query string. Now i have a
search engine log which contains 10 milllion query strings, but almost
of them are repeated, not more than 3 million of them are non-
repeated.
My task is to pick the top 10 most popular query string, memory < 1G,
the length of the query string is no more than 255.

The faster, the better.
the principal solutions, algorithm and data structure.

Thank you.

Pascal J. Bourguignon · May 6, 2008

Vincent SHAO said:
Search engine have to record all of the query string. Now i have a
search engine log which contains 10 milllion query strings, but almost
of them are repeated, not more than 3 million of them are non-
repeated.
My task is to pick the top 10 most popular query string, memory < 1G,
the length of the query string is no more than 255.

The faster, the better.
the principal solutions, algorithm and data structure.

Sounds like homework... Is it homework?

In real "search engine" queries, it's not strings, but _sets_ of
keywords that are searched, so already your data structures will be
suboptimal... That is, unless it's homework.

In anycase, have a look at http://en.wikipedia.org/wiki/Bloom_filter

Jim Langston · May 6, 2008

Vincent said:
Search engine have to record all of the query string. Now i have a
search engine log which contains 10 milllion query strings, but almost
of them are repeated, not more than 3 million of them are non-
repeated.
My task is to pick the top 10 most popular query string, memory < 1G,
the length of the query string is no more than 255.

The faster, the better.
the principal solutions, algorithm and data structure.

Thank you.

My first attempt would be to stuff the query strings into a map with the
query string (or a hash of it) as the key, the number of times it occurs as
the data.

Then a loop to read the data and sort, or simply compare counts and store
the keys for the top 10.

James Kanze · May 7, 2008

Sounds like homework... Is it homework?

It sounds like a problem that really could come up
professionally (which doesn't mean it isn't homework---a well
conceived course will use realistic examples).

In real "search engine" queries, it's not strings, but _sets_ of
keywords that are searched, so already your data structures will be
suboptimal... That is, unless it's homework.

What about SQL or LDAP? The problem is that he probably should
consider requests that different only in white space to be the
same; for that matter (considering SQL), something like "SELECT
* FROM table1 WHERE a > 5 AND b < 10" is the same as "select *
from table1 where b < 10 and a > 5". He probably needs to
normalize the input somewhat, but how far depends on the project
requirements.

Anyway, the obvious solution is to use an std::map< Query, long >
for starters, then copy the tabluation into a vector (of
Map::value_type) and sort that on the criteria he's interested
in. If that's not fast enough, replace std::map with a not
yet standard hash_map, and only if that's not fast enough, to
start worrying about other solutions.

In anycase, have a look athttp://en.wikipedia.org/wiki/Bloom_filter

That saves space, not time, and probably isn't appropriate for
such a small table. (Space savings become time savings when the
result in less disk accesses. In his case, however, the
complete table will almost certainly fit into real memory.)

search algorithm	0	Jan 6, 2009
Optimize a discovery algorithm	15	Jan 1, 2014
Search algorithm for HTML files	2	Sep 29, 2010
FOSS or Freeware, Prefferably Runs on Linux Mint: Search US Goverment Records, Legally to Find Literarary Work	8	Apr 5, 2023
Fast String search algorithm	10	Dec 16, 2011
HCaptcha - How to stop page from refreshing on submit if captcha is not checked/validated	1	Aug 29, 2023
Just started coding and im stuck on a lesson?	1	Oct 30, 2022
two interesting data structure/algorithm questions	3	Nov 16, 2007

Algorithm on search

Vincent SHAO

Pascal J. Bourguignon

Jim Langston

James Kanze

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads