Algorithm on search

Discussion in 'C++' started by Vincent SHAO, May 6, 2008.

  1. Vincent SHAO

    Vincent SHAO Guest

    Search engine have to record all of the query string. Now i have a
    search engine log which contains 10 milllion query strings, but almost
    of them are repeated, not more than 3 million of them are non-
    repeated.
    My task is to pick the top 10 most popular query string, memory < 1G,
    the length of the query string is no more than 255.

    The faster, the better.
    the principal solutions, algorithm and data structure.

    Thank you.:)
    Vincent SHAO, May 6, 2008
    #1
    1. Advertising

  2. Vincent SHAO <> writes:

    > Search engine have to record all of the query string. Now i have a
    > search engine log which contains 10 milllion query strings, but almost
    > of them are repeated, not more than 3 million of them are non-
    > repeated.
    > My task is to pick the top 10 most popular query string, memory < 1G,
    > the length of the query string is no more than 255.
    >
    > The faster, the better.
    > the principal solutions, algorithm and data structure.


    Sounds like homework... Is it homework?


    In real "search engine" queries, it's not strings, but _sets_ of
    keywords that are searched, so already your data structures will be
    suboptimal... That is, unless it's homework.

    In anycase, have a look at http://en.wikipedia.org/wiki/Bloom_filter

    --
    __Pascal Bourguignon__
    Pascal J. Bourguignon, May 6, 2008
    #2
    1. Advertising

  3. Vincent SHAO

    Jim Langston Guest

    Vincent SHAO wrote:
    > Search engine have to record all of the query string. Now i have a
    > search engine log which contains 10 milllion query strings, but almost
    > of them are repeated, not more than 3 million of them are non-
    > repeated.
    > My task is to pick the top 10 most popular query string, memory < 1G,
    > the length of the query string is no more than 255.
    >
    > The faster, the better.
    > the principal solutions, algorithm and data structure.
    >
    > Thank you.:)


    My first attempt would be to stuff the query strings into a map with the
    query string (or a hash of it) as the key, the number of times it occurs as
    the data.

    Then a loop to read the data and sort, or simply compare counts and store
    the keys for the top 10.


    --
    Jim Langston
    Jim Langston, May 6, 2008
    #3
  4. Vincent SHAO

    James Kanze Guest

    On May 6, 2:15 pm, (Pascal J. Bourguignon)
    wrote:
    > Vincent SHAO <> writes:
    > > Search engine have to record all of the query string. Now i
    > > have a search engine log which contains 10 milllion query
    > > strings, but almost of them are repeated, not more than 3
    > > million of them are non-repeated. My task is to pick the
    > > top 10 most popular query string, memory < 1G, the length of
    > > the query string is no more than 255.


    > > The faster, the better. the principal solutions, algorithm
    > > and data structure.


    > Sounds like homework... Is it homework?


    It sounds like a problem that really could come up
    professionally (which doesn't mean it isn't homework---a well
    conceived course will use realistic examples).

    > In real "search engine" queries, it's not strings, but _sets_ of
    > keywords that are searched, so already your data structures will be
    > suboptimal... That is, unless it's homework.


    What about SQL or LDAP? The problem is that he probably should
    consider requests that different only in white space to be the
    same; for that matter (considering SQL), something like "SELECT
    * FROM table1 WHERE a > 5 AND b < 10" is the same as "select *
    from table1 where b < 10 and a > 5". He probably needs to
    normalize the input somewhat, but how far depends on the project
    requirements.

    Anyway, the obvious solution is to use an std::map< Query, long >
    for starters, then copy the tabluation into a vector (of
    Map::value_type) and sort that on the criteria he's interested
    in. If that's not fast enough, replace std::map with a not
    yet standard hash_map, and only if that's not fast enough, to
    start worrying about other solutions.

    > In anycase, have a look athttp://en.wikipedia.org/wiki/Bloom_filter


    That saves space, not time, and probably isn't appropriate for
    such a small table. (Space savings become time savings when the
    result in less disk accesses. In his case, however, the
    complete table will almost certainly fit into real memory.)

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, May 7, 2008
    #4
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Johann Blake
    Replies:
    0
    Views:
    328
    Johann Blake
    Jan 21, 2004
  2. Ben Fidge
    Replies:
    8
    Views:
    444
    Ben Fidge
    May 2, 2005
  3. Ahmed Moustafa
    Replies:
    0
    Views:
    774
    Ahmed Moustafa
    Nov 15, 2003
  4. Bapaiah Katepalli
    Replies:
    1
    Views:
    1,494
    Mike Treseler
    Jun 23, 2006
  5. Abby Lee
    Replies:
    5
    Views:
    398
    Abby Lee
    Aug 2, 2004
Loading...

Share This Page