huge dictionary

barbaros

Hello everybody,

I need some advice on the following problem.

I want to write a program (in C or C++) which will
build a huge dictionary (hash array, if you prefer).
The keys are strings, the values are floating point numbers.
The dictionary will soon become larger than RAM
(I expect more than ten gigabytes). So I need to save it
(or parts of it) to the disk, while still being able to
read from and write to it (efficiently). A detail which
may be useful: no entries will ever be deleted. That is,
changes to the dictionary consist only in appending an
entry or modifying an existing value (no key will disappear).
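
For concreteness, here is roughly the interface I have in mind,
as a plain in-core map (which is exactly what stops working once
the data outgrows RAM):

#include <map>
#include <string>

int main()
{
    std::map<std::string, double> dict;   // keys: strings, values: floats

    dict["some key"] = 3.14;    // append a new entry
    dict["some key"] += 1.0;    // modify an existing value

    // fine until the map outgrows RAM (more than 10 GB here)
}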

One solution would be to have a very large swap partition
under Linux. But this is not a good idea in my case,
since I need to keep the dictionary for future analysis.
Unless I save it at the end to another partition ...
why not, actually ?

Another solution would be to use an external database
like SQLite or Metakit.

Which solution would you recommend ?
Speed of execution is very important.

Thanks in advance for any hint !

Yours, Cristian Barbarosie
http://cmaf.fc.ul.pt/~barbaros
 
Willem

barbaros wrote:
) Hello everybody,
)
) I need some advice on the following problem.
)
) ...
)
) Another solution would be to use an external database
) like SQLite or Metakit.

Personally, I'd say that this is the recommended solution.
Someone else already solved all the difficult problems for you,
so why not use it ? Should be quite easy to implement, as well.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
moi

barbaros said:
> Hello everybody,
>
> I need some advice on the following problem.
>
> I want to write a program (in C or C++) which will
> build a huge dictionary (hash array, if you prefer).
> The keys are strings, the values are floating point numbers.
> The dictionary will soon become larger than RAM
> (I expect more than ten gigabytes). So I need to save it

How many records ?
[ most of the records' space will be consumed by
the (variable length ?) strings, which could be
compressed by using tries or skip lists ]

> (or parts of it) to the disk, while still being able to
> read from and write to it (efficiently). A detail which
> may be useful: no entries will ever be deleted. That is,
> changes to the dictionary consist only in appending an
> entry or modifying an existing value (no key will disappear).
>
> One solution would be to have a very large swap partition
> under Linux. But this is not a good idea in my case,

You could mmap, but the data still has to fit within your
address space, which will have to be bigger than 32 bits
if you want 10GB of data.
[ There will also be a problem with appends/inserts, but that
can probably be handled ]

> since I need to keep the dictionary for future analysis.
> Unless I save it at the end to another partition ...
> why not, actually ?

Syntax error.

> Another solution would be to use an external database
> like SQLite or Metakit.
>
> Which solution would you recommend ?
> Speed of execution is very important.

Which speed: throughput, or response ?

> Thanks in advance for any hint !
>
> Yours, Cristian Barbarosie
> http://cmaf.fc.ul.pt/~barbaros

One way or the other, your data will have to be stored on disk.

Even in the ideal case where you store the hashtable or index
in core, and the data on disk, each retrieve operation will
cost you one disk I/O.
You can only save on I/Os by clustering your data: putting data that
is accessed together on the same disk block, or at least close to it.
That depends on your access pattern (which you probably don't know
in advance).

BTW, for a similar problem:
http://www-db.stanford.edu/~backrub/google.html
They manage to keep the 'dictionary' in core.
(but they probably can afford to forget/ignore some 'strings')

HTH,
AvK
 
Ian

barbaros said:
> Hello everybody,
>
> I need some advice on the following problem.
>
> I want to write a program (in C or C++) which will
> build a huge dictionary (hash array, if you prefer).
> The keys are strings, the values are floating point numbers.
> The dictionary will soon become larger than RAM
> (I expect more than ten gigabytes). So I need to save it
> (or parts of it) to the disk, while still being able to
> read from and write to it (efficiently). A detail which
> may be useful: no entries will ever be deleted. That is,
> changes to the dictionary consist only in appending an
> entry or modifying an existing value (no key will disappear).

You don't provide enough detail.

Where do you require the performance: on create, read or modify? Are
you constrained by performance or budget?

If access performance is key and you have the budget, use enough RAM to
hold the data and serialise it to disk on modify.
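
A sketch of that approach (assuming the data fits in RAM): keep a
std::map in core and dump it to a binary file whenever you need a
persistent copy. The file format here is made up: key length, key
bytes, then the value.

#include <cstddef>
#include <cstdio>
#include <map>
#include <string>

bool save(const std::map<std::string, double>& dict, const char* path)
{
    std::FILE* f = std::fopen(path, "wb");
    if (!f) return false;
    std::map<std::string, double>::const_iterator it;
    for (it = dict.begin(); it != dict.end(); ++it) {
        std::size_t len = it->first.size();   // key length first,
        std::fwrite(&len, sizeof len, 1, f);  // then the key bytes,
        std::fwrite(it->first.data(), 1, len, f);
        std::fwrite(&it->second, sizeof it->second, 1, f);  // then the value
    }
    return std::fclose(f) == 0;
}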

Ian
 
Sandeep

> I want to write a program (in C or C++) which will

You can consider writing (or using existing code, if there is some
freely available) a B+ tree. This is typically how database writers
implement the indexing mechanism. You can then choose where to create
the file in this case.
Again, a B+ tree is just an example. Based on your needs you can
optimize it with different data structures.
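
To give an idea, a sketch of what one fixed-size on-disk B+ tree
leaf node might look like (layout, sizes and names are purely
illustrative, not a full implementation):

#include <cstring>

const int MAX_KEYS = 64;   // fan-out, sized so a node fits one 4 KB page
const int KEY_LEN  = 48;   // fixed-length key slot

// One leaf node per disk page. Interior nodes would hold child
// page numbers instead of values.
struct LeafNode {
    int  num_keys;
    char keys[MAX_KEYS][KEY_LEN];
    double values[MAX_KEYS];
    long next_leaf;   // page number of the next leaf, for range scans
};

// Binary search inside one node: a single disk read buys a whole
// node's worth of key comparisons.
int find_slot(const LeafNode& n, const char* key)
{
    int lo = 0, hi = n.num_keys;
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (std::strncmp(n.keys[mid], key, KEY_LEN) < 0) lo = mid + 1;
        else hi = mid;
    }
    return lo;   // insertion point; caller checks for an exact match
}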

Using an existing database is a great idea ! (If getting a license is
not an issue.)
 
Branimir Maksimovic

barbaros said:
> Hello everybody,
>
> I need some advice on the following problem.
>
> I want to write a program (in C or C++) which will
> build a huge dictionary (hash array, if you prefer).
> The keys are strings, the values are floating point numbers.
> The dictionary will soon become larger than RAM
> (I expect more than ten gigabytes). So I need to save it
> (or parts of it) to the disk, while still being able to
> read from and write to it (efficiently). A detail which
> may be useful: no entries will ever be deleted. That is,
> changes to the dictionary consist only in appending an
> entry or modifying an existing value (no key will disappear).
>
> One solution would be to have a very large swap partition
> under Linux. But this is not a good idea in my case,
> since I need to keep the dictionary for future analysis.

Well, no need for swap: just mmap an ordinary file and that's it.
Just remember to store offsets from the address mmap returns,
not real pointers. On a 64-bit system you can easily
mmap 10GB, but not on a 32-bit one; there you have to mmap
a piece at a time.
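
A sketch of that (POSIX mmap; error handling omitted, file name and
offsets illustrative). All "pointers" stored inside the file are byte
offsets from the base address that mmap returns:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("dict.dat", O_RDWR);   // hypothetical dictionary file
    struct stat st;
    fstat(fd, &st);

    // map the whole file; a 10 GB file needs a 64-bit address space
    char* base = (char*) mmap(0, st.st_size,
                              PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    // store offsets, not pointers: the base address may differ
    // between runs, the offsets stay valid
    unsigned long first_entry = 0;   // e.g. read from a file header
    double* value = (double*) (base + first_entry);
    *value += 1.0;                   // modify in place, hits the file

    munmap(base, st.st_size);
    close(fd);
}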
> Another solution would be to use an external database
> like SQLite or Metakit.

That would be easy.
> Which solution would you recommend ?
> Speed of execution is very important.

Go for mmap if you want speed, but it comes at the cost of
additional work :)

Greetings, Bane.
 
barbaros

Some say I don't provide enough detail.
Let me see what I can add ...
I need to build a large dictionary, in the sense
of the "map" class in the C++ STL. I need to build it
quickly (otherwise it may take years to get something
meaningful), so I need relatively quick write access.
Because it's so large I need to save (parts of) it on
the hard disk, and in the end to save it all to disk.
As for budget restrictions, I am committed to using
only free software (open source, to be more precise).

mmap seems to be the solution, since it seems to be
really fast (thank you, moi and bane). I confess I am
not familiar with mmap or address spaces. I understand
on a 64-bit machine there would be no problem.
How do I know whether Linux on a recent Intel box is
32 or 64 bits ?

Actually, now I can imagine a way to compress my
dictionary a little. Since the keys are all strings,
instead of keeping all strings in memory I can have
a map with keys 'a','b','c',...,'z'.
Then the values of this map would be other maps,
again having letters as keys, and so on. That is,
I could work with a map of maps of maps ...
I guess this is more or less what moi suggested.
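
A sketch of that "map of maps" idea (essentially a trie over the
characters; in-memory only, cleanup omitted):

#include <map>

// Each node maps one character to a child node, so a key is spelled
// out along a path and common prefixes are stored only once.
struct Node {
    std::map<char, Node*> children;
    double value;   // meaningful only when is_key is true
    bool   is_key;
    Node() : value(0.0), is_key(false) {}
};

// Insert or update a key: walk the path, creating nodes as needed.
void set(Node* root, const char* key, double v)
{
    Node* n = root;
    for (; *key; ++key) {
        Node*& child = n->children[*key];
        if (child == 0) child = new Node();
        n = child;
    }
    n->value = v;
    n->is_key = true;
}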
(By the way, moi, I see no syntax error in my message.)

Thank you everybody for your answers.
Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros
 
Maxim Yegorushkin

barbaros said:
> Some say I don't provide enough detail.
> Let me see what I can add ...
> I need to build a large dictionary, in the sense
> of the "map" class in the C++ STL. I need to build it
> quickly (otherwise it may take years to get something
> meaningful), so I need relatively quick write access.
> Because it's so large I need to save (parts of) it on
> the hard disk, and in the end to save it all to disk.
> As for budget restrictions, I am committed to using
> only free software (open source, to be more precise).

Use boost::multi_index for that; hashed indices might be what you need.
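
A sketch of that (a single hashed, unique index on the key field;
assumes Boost.MultiIndex is installed):

#include <string>
#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/member.hpp>

using namespace boost::multi_index;

struct Entry {
    std::string key;
    double value;
};

typedef multi_index_container<
    Entry,
    indexed_by<
        hashed_unique< member<Entry, std::string, &Entry::key> >
    >
> Dictionary;

int main()
{
    Dictionary d;
    Entry e; e.key = "some key"; e.value = 3.14;
    d.insert(e);
    Dictionary::iterator it = d.find("some key");
    // elements are const through iterators; update values via
    // d.modify(it, ...) or d.replace(it, ...)
}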
> mmap seems to be the solution, since it seems to be
> really fast (thank you, moi and bane). I confess I am
> not familiar with mmap or address spaces. I understand
> on a 64-bit machine there would be no problem.
> How do I know whether Linux on a recent Intel box is
> 32 or 64 bits ?

Check the output of the uname command.
In C/C++ code you can check sizeof(void*).

> Actually, now I can imagine a way to compress my
> dictionary a little. Since the keys are all strings,
> instead of keeping all strings in memory I can have
> a map with keys 'a','b','c',...,'z'.
> Then the values of this map would be other maps,
> again having letters as keys, and so on. That is,
> I could work with a map of maps of maps ...
> I guess this is more or less what moi suggested.
> (By the way, moi, I see no syntax error in my message.)

This is called a digital search tree (DST). It consumes a lot of
memory for its nodes. A ternary search tree has performance
characteristics similar to those of a DST but consumes less memory.
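
For reference, a minimal ternary search tree sketch (insert only;
each node holds one character plus three links, which is where the
memory saving over a DST comes from):

// One character and three children: less, equal, greater.
struct TstNode {
    char c;
    TstNode *lo, *eq, *hi;
    bool   is_key;
    double value;
    TstNode(char ch) : c(ch), lo(0), eq(0), hi(0), is_key(false), value(0.0) {}
};

// Recursive insert/update; returns the (possibly new) subtree root.
// Assumes the key is non-empty and NUL-terminated.
TstNode* insert(TstNode* n, const char* key, double v)
{
    if (n == 0) n = new TstNode(*key);
    if (*key < n->c)      n->lo = insert(n->lo, key, v);
    else if (*key > n->c) n->hi = insert(n->hi, key, v);
    else if (key[1])      n->eq = insert(n->eq, key + 1, v);
    else { n->is_key = true; n->value = v; }
    return n;
}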
 
Branimir Maksimovic

barbaros said:
> Some say I don't provide enough detail.
> Let me see what I can add ...

> mmap seems to be the solution, since it seems to be
> really fast (thank you, moi and bane). I confess I am
> not familiar with mmap or address spaces. I understand
> on a 64-bit machine there would be no problem.
> How do I know whether Linux on a recent Intel box is
> 32 or 64 bits ?

Well, you don't need to. Just organize pages yourself.
Let's say you have a page table in memory with a pointer
to the mmapped block and a flag for in-memory /
out-of-memory status. The index into the table represents
the page number. Then you only need to store a page number
and an offset into the page, in say two integers, which will
represent a pointer in your map object. Let's say page = 0x1,
offset = 0x4000; then table[1].mmaped_pointer + 0x4000 gives
the real address. You have to check table[1].in_memory; if it
is false, mmap that page and update
table[page_num].mmaped_pointer = mmap(...,page_num*page_size);
then set table[1].in_memory = true. You can also keep an access
counter table[1].acc_count and unmap pages that are not used, so
you keep some maximum number of in-memory pages.
This is normally done by the OS, but if you don't have enough
address space you have to do it yourself :)
When retrieving data you just have to know where the first node of
your data structure is. Let's say it's on page 0, offset 0x0.
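
A sketch of that scheme (POSIX; the eviction policy and error
handling are left out, and all names and sizes are illustrative):

#include <cstddef>
#include <sys/mman.h>
#include <sys/types.h>

const std::size_t PAGE_SIZE = 1 << 20;   // 1 MB per mmapped piece

struct PageEntry {
    char*    mmaped_pointer;   // base address of this piece, if mapped
    bool     in_memory;
    unsigned acc_count;        // usage counter, for choosing what to unmap
};

PageEntry table[10 * 1024];   // enough entries for ~10 GB
int fd;                       // the dictionary file, opened elsewhere

// Turn a (page number, offset) pair into a real address,
// mapping the page in on demand.
char* resolve(unsigned page, std::size_t offset)
{
    if (!table[page].in_memory) {
        table[page].mmaped_pointer =
            (char*) mmap(0, PAGE_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, (off_t) page * PAGE_SIZE);
        table[page].in_memory = true;
        // if too many pages are mapped, munmap() the least used one here
    }
    ++table[page].acc_count;
    return table[page].mmaped_pointer + offset;
}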
> Actually, now I can imagine a way to compress my
> dictionary a little. Since the keys are all strings,
> instead of keeping all strings in memory I can have
> a map with keys 'a','b','c',...,'z'.

You can use some hashing algorithm and just store
hash values instead of the strings.
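
For instance (a sketch using the classic FNV-1a string hash; note
that two different strings can collide on the same hash value, so
this trades exactness for space):

#include <map>
#include <string>

// 64-bit FNV-1a: 8 bytes per key instead of the whole string.
unsigned long long fnv1a(const std::string& s)
{
    unsigned long long h = 14695981039346656037ULL;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        h ^= (unsigned char) s[i];
        h *= 1099511628211ULL;
    }
    return h;
}

std::map<unsigned long long, double> dict;   // keyed by hash, not string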

Greetings, Bane.
 
moi

barbaros said:
> Some say I don't provide enough detail.
> Let me see what I can add ...
> I need to build a large dictionary, in the sense
> of the "map" class in the C++ STL. I need to build it
> quickly (otherwise it may take years to get something
> [...]
> mmap seems to be the solution, since it seems to be
> really fast (thank you, moi and bane). I confess I am
> not familiar with mmap or address spaces. I understand
> on a 64-bit machine there would be no problem.

Mmap() *can* be elegant, but gets ugly in the
presence of writes/appends to the same file (as in your
case).
Also, there is *no performance benefit* over write().
Mmap just maps a piece of your disk file into memory,
but underneath just as much disk I/O is needed.
In the worst case (unclustered read access) you end up with one read
per record fetch.
Writing/appending is always faster (since multiple records can fit
into one disk block).
Using a DB library does not change this; the library still has to
do the reading/writing on your behalf, and you end up with ~10 ms
read time per block. It *can* save you some development effort.

> How do I know whether Linux on a recent Intel box is
> 32 or 64 bits ?

What another replier suggested:
printf("%u", (unsigned) sizeof (void*));

If the machine is 64-bit you can install/compile a 64-bit
Linux distribution on it.
Linux has had 64-bit support since about 1994 (Alpha/PPC).

> Actually, now I can imagine a way to compress my
> dictionary a little. Since the keys are all strings,
> instead of keeping all strings in memory I can have
> a map with keys 'a','b','c',...,'z'.
> Then the values of this map would be other maps,
> again having letters as keys, and so on. That is,
> I could work with a map of maps of maps ...

This is basically a tree-like structure.
Google for tree, trie, patricia or skip list for more.
As another replier suggested: tree-like structures can take
a lot of (memory) space when badly designed/applied (eg when most of
the pointers are NULL).

> I guess this is more or less what moi suggested.
> (By the way, moi, I see no syntax error in my message.)

It was more about semantics. The paragraph made no sense to me.

> Thank you everybody for your answers.
> Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros


Good luck,
AvK
 
Branimir Maksimovic

moi said:
> Mmap() *can* be elegant, but gets ugly in the
> presence of writes/appends to the same file (as in your
> case).

Not necessarily: ftruncate, then mmap additional page(s)
(or lseek, then write zeros, to avoid fragmentation of the file).
But it is better to just preallocate a large enough file
(writing zero bytes, not ftruncate, because of fragmentation)
and expand it if that is not enough.
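
A sketch of the preallocation (writing real zero bytes so the
filesystem allocates the blocks up front, instead of the sparse
file that ftruncate() would create):

#include <cstring>
#include <fcntl.h>
#include <unistd.h>

// Create 'path' and fill it with 'size' zero bytes.
// Returns the open descriptor, ready to be mmap()ed, or -1 on error.
int preallocate(const char* path, unsigned long long size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return -1;
    char buf[1 << 16];
    std::memset(buf, 0, sizeof buf);
    unsigned long long done = 0;
    while (done < size) {
        unsigned long long left = size - done;
        size_t chunk = left < sizeof buf ? (size_t) left : sizeof buf;
        ssize_t n = write(fd, buf, chunk);
        if (n <= 0) { close(fd); return -1; }
        done += (unsigned long long) n;
    }
    return fd;
}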
> Also, there is *no performance benefit* over write().

A disk write is a disk write, but with mmap you don't
have to manually read/write between the file and app memory.
> Mmap just maps a piece of your disk file into memory,
> but underneath just as much disk I/O is needed.

See the difference: no need for memory-to-file buffering
and data conversion in application code; therefore mmap
is the most natural way to implement a persistent data structure.
> In the worst case (unclustered read access) you end up with one read
> per record fetch.

Same as with read. Of course, caching strategies can differ,
so either mmap or read can be faster depending on the situation.
In my experience mmap is better when dealing with large random-access
files (a large swap file).
For pure sequential reading, read() should be better.
> Writing/appending is always faster (since multiple records can fit
> into one disk block).
> Using a DB library does not change this; the library still has to
> do the reading/writing on your behalf, and you end up with ~10 ms
> read time per block. It *can* save you some development effort.

Of course, except that everything goes through the db interface and
the application itself is limited by it. In this case that is not a
problem, since the db eliminates the need for a hash table
implementation; one just uses the db interface instead.

Greetings, Bane.
 
