Indexing large data

Discussion in 'C++' started by chandra.somesh@gmail.com, Jun 4, 2005.

  1. Guest

    Hi

    I am having a problem indexing a large amount of textual data (in the range of
    200,000 to 1 million). I tried using a map container, but the compiler
    just hung upon entering the data. After a bit of searching I found a reference
    to a paper on "Signature files" by C. Faloutsos (1992). I tried
    implementing it, but the results were far from satisfactory (maybe I
    implemented it wrong). Anyway, I would appreciate it if anyone could
    suggest a method to index large data (even using external
    storage). Also, if anyone has successfully tested signature files, I would
    welcome their comments.

    thanks in advance
    , Jun 4, 2005
    #1

  2. Rapscallion Guest

    wrote:
    > I am having problem indexing large amount of textual data(in range of
    > 200,000 to 1 million).


    What kind of 'textual data'? Structured or unstructured? What does
    'range of 200,000 to 1 million' mean?

    > I tried using map container but the compiler
    > just hanged upon entering the data. On bit of searching i got reference
    > to a paper regarding "Signature files" by FALOUTSOS C. 1992. I tried
    > implementing it but the results were far from satisfactory(maybe i
    > implemented it wrong). Anyways i would appreciate if anyone could
    > suggest method to index large data(even using external
    > storage). Also if anyone has successfully tested signature files could
    > give their comments.


    For structured data, SQLite (http://www.sqlite.org/) is a widely used
    library; for unstructured data, other free libraries probably exist.
    Rapscallion, Jun 4, 2005
    #2

  3. Calum Grant Guest

    wrote:
    > Hi
    >
    > I am having problem indexing large amount of textual data(in range of
    > 200,000 to 1 million). I tried using map container but the compiler
    > just hanged upon entering the data. On bit of searching i got reference
    > to a paper regarding "Signature files" by FALOUTSOS C. 1992. I tried
    > implementing it but the results were far from satisfactory(maybe i
    > implemented it wrong). Anyways i would appreciate if anyone could
    > suggest method to index large data(even using external
    > storage). Also if anyone has successfully tested signature files could
    > give their comments.
    >
    > thanks in advance
    >


    I assume you mean create a container

    word1 -> (doc1, pos1) -> (doc2, pos2) -> (doc3, pos3)
    word2 -> (doc4, pos4)
    etc

    What size? A million words or a million documents?

    Why do you believe the "hang" is in the container (or even the compiler
    as you say)? Why not the parser? Do you mean "hang" or just "too slow"?

    How long does it take to parse the documents? Why don't you try with 1
    document, or 10 documents to start with???

    The STL should be able to cope with that, and an in-memory container
    will be faster than external storage. The drawback is that you'd need
    to load and save your map on every run. Check http://tinyurl.com/77xax

    As for a data structure, I would suggest

    std::map<std::string, std::list<int> >

    where "int" is your document id. To add a document

    index[word].push_back(doc);

    Calum
    Calum Grant, Jun 4, 2005
    #3
  4. Bob Hairgrove Guest

    On 4 Jun 2005 00:23:06 -0700, wrote:

    >Anyways i would appreciate if anyone could
    >suggest method to index large data(even using external
    >storage).


    Most commercial databases seem to use some kind of B-tree indexing.

    --
    Bob Hairgrove
    Bob Hairgrove, Jun 4, 2005
    #4