Indexing large data

chandra.somesh

Hi

I am having a problem indexing a large amount of textual data (in the
range of 200,000 to 1 million). I tried using a map container, but the
compiler just hung upon entering the data. On a bit of searching I got
a reference to a paper on "signature files" by Faloutsos, C. (1992). I
tried implementing it, but the results were far from satisfactory
(maybe I implemented it wrong). Anyway, I would appreciate it if anyone
could suggest a method to index large data (even using external
storage). Also, if anyone has successfully tested signature files,
could they give their comments?

Thanks in advance
 

Rapscallion

I am having a problem indexing a large amount of textual data (in the
range of 200,000 to 1 million).

What kind of 'textual data'? Structured or unstructured? And what does
'in the range of 200,000 to 1 million' mean? 200,000 to 1 million of
what?
I tried using a map container, but the compiler just hung upon entering
the data. On a bit of searching I got a reference to a paper on
"signature files" by Faloutsos, C. (1992). I tried implementing it, but
the results were far from satisfactory (maybe I implemented it wrong).
Anyway, I would appreciate it if anyone could suggest a method to index
large data (even using external storage). Also, if anyone has
successfully tested signature files, could they give their comments?

For structured data, SQLite (http://www.sqlite.org/) is a widely used
library; for unstructured data, other free libraries probably exist.
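
If the data is table-shaped, SQLite gives you an on-disk index almost
for free. Here is a minimal sketch using its C API (the file name and
the postings/word/doc schema are made up for illustration):

#include <cstdio>
#include <sqlite3.h>

int main()
{
    sqlite3* db = 0;
    if (sqlite3_open("index.db", &db) != SQLITE_OK)
        return 1;

    // One row per (word, document) pair, plus an index on "word"
    // so lookups don't have to scan the whole table.
    const char* sql =
        "CREATE TABLE postings (word TEXT, doc INTEGER);"
        "CREATE INDEX idx_word ON postings(word);";

    char* err = 0;
    if (sqlite3_exec(db, sql, 0, 0, &err) != SQLITE_OK) {
        std::fprintf(stderr, "sqlite error: %s\n", err);
        sqlite3_free(err);
    }

    sqlite3_close(db);
    return 0;
}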
 

Calum Grant

Hi

I am having a problem indexing a large amount of textual data (in the
range of 200,000 to 1 million). I tried using a map container, but the
compiler just hung upon entering the data. On a bit of searching I got
a reference to a paper on "signature files" by Faloutsos, C. (1992). I
tried implementing it, but the results were far from satisfactory
(maybe I implemented it wrong). Anyway, I would appreciate it if anyone
could suggest a method to index large data (even using external
storage). Also, if anyone has successfully tested signature files,
could they give their comments?

Thanks in advance

I assume you mean creating a container like:

word1 -> (doc1, pos1) -> (doc2, pos2) -> (doc3, pos3)
word2 -> (doc4, pos4)
etc
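
Written out as types, that might look something like this (treating
each entry as a (document id, position) pair is just a guess at what
you need):

#include <map>
#include <string>
#include <utility>
#include <vector>

// word -> postings: (document id, position of the word in that document)
typedef std::pair<int, int> Posting;
typedef std::map<std::string, std::vector<Posting> > Index;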

What size? A million words, or a million documents?

Why do you believe the "hang" is in the container (or even the compiler
as you say)? Why not the parser? Do you mean "hang" or just "too slow"?

How long does it take to parse the documents? Why don't you try with 1
document, or 10 documents, to start with?

The STL should be able to cope with that, and it will be faster than
external storage. Unfortunately, you'd need to load/save your map all
the time. Check http://tinyurl.com/77xax
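
A minimal way to do that load/save for the word -> document-list map
suggested below (the plain-text format here is just one simple choice):

#include <cstddef>
#include <iostream>
#include <list>
#include <map>
#include <string>

typedef std::map<std::string, std::list<int> > Index;

// Write the index as "word n doc1 ... docn", one entry per line.
void save(const Index& idx, std::ostream& os)
{
    for (Index::const_iterator it = idx.begin(); it != idx.end(); ++it) {
        os << it->first << ' ' << it->second.size();
        for (std::list<int>::const_iterator d = it->second.begin();
             d != it->second.end(); ++d)
            os << ' ' << *d;
        os << '\n';
    }
}

// Read back what save() wrote; assumes words contain no whitespace.
void load(Index& idx, std::istream& is)
{
    std::string word;
    std::size_t n;
    while (is >> word >> n) {
        std::list<int>& docs = idx[word];
        for (std::size_t i = 0; i < n; ++i) {
            int doc;
            is >> doc;
            docs.push_back(doc);
        }
    }
}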

As for a data structure, I would suggest

std::map<std::string, std::list<int> >

where "int" is your document id. To add a document

index[word].push_back(doc);
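
For example (the two hard-coded strings and the whitespace tokenizing
are just for illustration):

#include <iostream>
#include <list>
#include <map>
#include <sstream>
#include <string>

int main()
{
    std::map<std::string, std::list<int> > index;

    // Two tiny "documents"; the id is just the position in the array.
    const char* docs[] = { "the quick brown fox", "the lazy dog" };

    for (int doc = 0; doc < 2; ++doc) {
        std::istringstream in(docs[doc]);
        std::string word;
        while (in >> word)              // naive whitespace tokenizer
            index[word].push_back(doc);
    }

    // Which documents contain "the"?
    std::list<int>& hits = index["the"];
    for (std::list<int>::const_iterator it = hits.begin();
         it != hits.end(); ++it)
        std::cout << "doc " << *it << '\n';   // prints doc 0 and doc 1
}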

Calum
 

Bob Hairgrove

Anyway, I would appreciate it if anyone could suggest a method to index
large data (even using external storage).

Most commercial databases seem to use some kind of B-tree indexing.
 
