huge dictionary

Discussion in 'C Programming' started by barbaros, Oct 9, 2005.

  1. barbaros

    barbaros Guest

    Hello everybody,

    I need some advice on the following problem.

    I want to write a program (in C or C++) which will
    build a huge dictionary (hash array, if you prefer).
    The keys are strings, the values are floating point numbers.
    The dictionary will soon become larger than the available RAM
    (I expect more than ten gigabytes). So I need to save it
    (or parts of it) to disk, while still being able to
    read from and write to it (efficiently). A detail which
    may be useful: no entries will ever be deleted. That is,
    changes to the dictionary consist only of appending an
    entry or modifying an existing value (no key will disappear).

    One solution would be to have a very large swap partition
    under Linux. But this is not a good idea in my case,
    since I need to keep the dictionary for future analysis.
    Except if I save it in the end to another partition ...
    why not, actually?

    Another solution would be to use an external database
    like SQLite or Metakit.

    Which solution would you recommend?
    Speed of execution is very important.

    Thanks in advance for any hint!

    Yours, Cristian Barbarosie
    http://cmaf.fc.ul.pt/~barbaros
     
    barbaros, Oct 9, 2005
    #1

  2. Willem

    Willem Guest

    barbaros wrote:
    ) Hello everybody,
    )
    ) I need some advice on the following problem.
    )
    ) ...
    )
    ) Another solution would be to use an external database
    ) like SQLite or Metakit.

    Personally, I'd say that this is the recommended solution.
    Someone else has already solved all the difficult problems for you,
    so why not use it? It should be quite easy to implement, as well.


    SaSW, Willem
    --
    Disclaimer: I am in no way responsible for any of the statements
    made in the above text. For all I know I might be
    drugged or something..
    No I'm not paranoid. You all think I'm paranoid, don't you !
    #EOT
     
    Willem, Oct 9, 2005
    #2

  3. barbaros

    barbaros Guest

    > Personally, I'd say that this is the recommended solution.

    OK, but would you recommend either of the two (SQLite or Metakit)?
    Thanks. Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros
     
    barbaros, Oct 9, 2005
    #3
  4. If their license matches your needs, I would recommend
    Berkley bsd from sleepycat.

    http://www.sleepycat.com
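
    For illustration, a minimal sketch of what using it can look like
    (assuming the Berkeley DB 4.x C API; error handling is omitted, and
    the file and key names are invented):

    #include <db.h>
    #include <cstring>

    int main()
    {
        DB *dbp;
        DBT key, val;
        double score = 3.14;
        char keybuf[] = "some_word";

        db_create(&dbp, NULL, 0);               // allocate a DB handle
        dbp->open(dbp, NULL, "dict.db", NULL,
                  DB_BTREE, DB_CREATE, 0664);   // on-disk B-tree file

        std::memset(&key, 0, sizeof key);
        std::memset(&val, 0, sizeof val);
        key.data = keybuf;
        key.size = sizeof keybuf;               // includes trailing '\0'
        val.data = &score;
        val.size = sizeof score;

        dbp->put(dbp, NULL, &key, &val, 0);     // insert or overwrite
        dbp->get(dbp, NULL, &key, &val, 0);     // fetch it back

        dbp->close(dbp, 0);
        return 0;
    }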

    Mike


    barbaros wrote:
    > Hello everybody,
    >
    > I need some advice on the following problem.
    >
    > I want to write a program (in C or C++) which will
    > build a huge dictionary (hash array, if you prefer).
    > The keys are strings, the values are floating point numbers.
    > The dictionary will soon become larger than the available RAM
    > (I expect more than ten gigabytes). So I need to save it
    > (or parts of it) to disk, while still being able to
    > read from and write to it (efficiently). A detail which
    > may be useful: no entries will ever be deleted. That is,
    > changes to the dictionary consist only of appending an
    > entry or modifying an existing value (no key will disappear).
    >
    > One solution would be to have a very large swap partition
    > under Linux. But this is not a good idea in my case,
    > since I need to keep the dictionary for future analysis.
    > Except if I save it in the end to another partition ...
    > why not, actually?
    >
    > Another solution would be to use an external database
    > like SQLite or Metakit.
    >
    > Which solution would you recommend?
    > Speed of execution is very important.
    >
    > Thanks in advance for any hint!
    >
    > Yours, Cristian Barbarosie
    > http://cmaf.fc.ul.pt/~barbaros
    >



    --
    The greatest performance improvement occurs on the transition from
    the non-working state to the working state.
     
    Michael Schneider, Oct 9, 2005
    #4
  5. moi

    moi Guest

    barbaros wrote:
    > Hello everybody,
    >
    > I need some advice on the following problem.
    >
    > I want to write a program (in C or C++) which will
    > build a huge dictionary (hash array, if you prefer).
    > The keys are strings, the values are floating point numbers.
    > The dictionary will soon become larger than the available RAM
    > (I expect more than ten gigabytes). So I need to save it


    How many records?
    [Most of the records' space will be consumed by
    the (variable length?) strings, which could be
    compressed by using tries or skip lists.]


    > (or parts of it) to disk, while still being able to
    > read from and write to it (efficiently). A detail which
    > may be useful: no entries will ever be deleted. That is,
    > changes to the dictionary consist only of appending an
    > entry or modifying an existing value (no key will disappear).
    >
    > One solution would be to have a very large swap partition
    > under Linux. But this is not a good idea in my case,


    You could mmap, but the data still has to fit within your
    address space, which will have to be bigger than 32 bits
    if you want 10 GB of data.
    [There will also be a problem with appends/inserts, but that
    can probably be handled.]

    > since I need to keep the dictionary for future analysis.
    > Except if I save it in the end to another partition ...
    > why not, actually?


    Syntax error.

    > Another solution would be to use an external database
    > like SQLite or Metakit.
    >
    > Which solution would you recommend?
    > Speed of execution is very important.


    Which speed: throughput, or response time?


    >
    > Thanks in advance for any hint !
    >
    > Yours, Cristian Barbarosie
    > http://cmaf.fc.ul.pt/~barbaros
    >


    One way or the other, your data will have to be stored on disk.

    Even in the ideal case where you store the hashtable or index
    in core, and the data on disk, each retrieve operation will
    cost you one disk I/O.
    You can only save on I/Os by clustering your data: putting data that
    is accessed together on the same disk block, or at least close to it.
    That depends on your access pattern (which you probably don't know in advance).

    BTW for a similar problem:
    http://www-db.stanford.edu/~backrub/google.html
    They manage to keep the 'dictionary' in core.
    (but they probably can afford to forget/ignore some 'strings')

    HTH,
    AvK
     
    moi, Oct 9, 2005
    #5
  6. In article <>,
    Michael Schneider <> wrote:

    >If their license matches your needs, I would recommend
    >Berkley bsd from sleepycat.


    You mean Berkeley DB.

    -- Richard
     
    Richard Tobin, Oct 9, 2005
    #6
  7. Ian

    Ian Guest

    barbaros wrote:
    > Hello everybody,
    >
    > I need some advice on the following problem.
    >
    > I want to write a program (in C or C++) which will
    > build a huge dictionary (hash array, if you prefer).
    > The keys are strings, the values are floating point numbers.
    > The dictionary will soon become larger than the available RAM
    > (I expect more than ten gigabytes). So I need to save it
    > (or parts of it) to disk, while still being able to
    > read from and write to it (efficiently). A detail which
    > may be useful: no entries will ever be deleted. That is,
    > changes to the dictionary consist only of appending an
    > entry or modifying an existing value (no key will disappear).
    >

    You don't provide enough detail.

    Where do you require the performance: on create, read, or modify? Are
    you constrained by performance or budget?

    If access performance is key and you have the budget, use enough RAM to
    hold the data and serialise it to disk on modify.

    Ian
     
    Ian, Oct 9, 2005
    #7
  8. Richard Tobin wrote:
    > In article <>,
    > Michael Schneider <> wrote:
    >
    >
    >>If their license matches your needs, I would recommend
    >>Berkley bsd from sleepycat.

    >
    >
    > You mean Berkeley DB.
    >
    > -- Richard


    Yes, I fat-fingered the DB.

    Thanks for the catch,
    Mike

    --
    The greatest performance improvement occurs on the transition from
    the non-working state to the working state.
     
    Michael Schneider, Oct 10, 2005
    #8
  9. Sandeep

    Sandeep Guest

    >>I want to write a program (in C or C++) which will
    >>build a huge dictionary (hash array, if you prefer).
    >>The keys are strings, the values are floating point numbers.


    You can consider writing (or using existing code if there is some
    freely available) a B+ tree. This is typically how database writers
    implement the indexing mechanism. You can then choose where to create
    the file.
    Again, a B+ tree is just an example. Based on your needs you can switch
    to different data structures.
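
    To make the idea concrete, here is a rough sketch of a B+ tree node
    laid out so that one node fills one disk block (the order, field
    names, and sizes are hypothetical, not taken from any real database):

    #include <cstdint>

    const int kOrder = 64;                  // max children per internal node

    struct Node {
        std::uint32_t is_leaf;              // 1 = leaf, 0 = internal
        std::uint32_t nkeys;                // number of keys currently used
        std::uint64_t keys[kOrder - 1];     // fixed-width (e.g. hashed) keys
        union {
            std::uint64_t children[kOrder];    // block numbers of children
            double        values[kOrder - 1];  // the floating point payloads
        } u;                                // leaves hold values, others children
    };
    // With kOrder = 64 this is 1024 bytes; tune kOrder so that sizeof(Node)
    // matches the filesystem block size (often 4096 bytes).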

    Using an existing database is a great idea! (If getting a license is
    not an issue.)
     
    Sandeep, Oct 10, 2005
    #9
  10. "barbaros" <> wrote in message
    news:...
    > Hello everybody,
    >
    > I need some advice on the following problem.
    >
    > I want to write a program (in C or C++) which will
    > build a huge dictionary (hash array, if you prefer).
    > The keys are strings, the values are floating point numbers.
    > The dictionary will soon become larger than the available RAM
    > (I expect more than ten gigabytes). So I need to save it
    > (or parts of it) to disk, while still being able to
    > read from and write to it (efficiently). A detail which
    > may be useful: no entries will ever be deleted. That is,
    > changes to the dictionary consist only of appending an
    > entry or modifying an existing value (no key will disappear).
    >
    > One solution would be to have a very large swap partition
    > under Linux. But this is not a good idea in my case,
    > since I need to keep the dictionary for future analysis.


    Well, no need for swap: just mmap an ordinary file and that's it.
    Just remember not to store real pointers but offsets from
    what mmap returns. On a 64-bit system you can easily
    mmap 10 GB, but on 32 bits you cannot; there you have to mmap
    a piece at a time.

    >
    > Another solution would be to use an external database
    > like SQLite or Metakit.


    That would be easy.

    >
    > Which solution would you recommend?
    > Speed of execution is very important.


    Go for mmap if you want speed, but this comes at the cost of
    additional work :)

    Greetings, Bane.
     
    Branimir Maksimovic, Oct 10, 2005
    #10
  11. barbaros

    barbaros Guest

    Some say I don't provide enough detail.
    Let me see what I can add ...
    I need to build a large dictionary, in the sense
    of the "map" class in the C++ STL. I need to build it
    quickly (if not, it may take years to get something
    meaningful), so I need relatively quick write access.
    Because it's so large, I need to save (parts of) it on
    the hard disk, and in the end to save it all to disk.
    As for budget restrictions, I am committed to using
    only free software (open source, to be more precise).

    mmap seems to be the solution, since it seems to be
    really fast (thank you, moi and Bane). I confess I am
    not familiar with mmap or address spaces. I understand
    that on a 64-bit machine there would be no problem.
    How do I know whether Linux on a recent Intel box is
    32 or 64 bits?

    Actually, now I can imagine a way to compress my
    dictionary a little bit. Since the keys are all strings,
    instead of keeping all strings in memory I can have
    a map with keys 'a','b','c',...,'z'.
    Then the values of this map would be other maps,
    again having letters as keys, and so on. That is,
    I could work with a map of maps of maps ...
    I guess this is more or less what moi suggested.
    (By the way, moi, I see no syntax error in my message.)
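
    A literal rendering of this "map of maps" idea might look like the
    sketch below (purely illustrative; as a reply further down points out,
    one node per character can cost a lot of memory). Nodes are never
    freed, which matches the fact that no entries are ever deleted:

    #include <map>
    #include <string>

    struct TrieNode {
        std::map<char, TrieNode*> children;  // one edge per next letter
        double value;                        // meaningful only if has_value
        bool has_value;
        TrieNode() : value(0.0), has_value(false) {}
    };

    // Insert or update a key: walk down one letter at a time,
    // creating missing nodes along the way.
    void insert(TrieNode &root, const std::string &key, double v)
    {
        TrieNode *node = &root;
        for (std::string::size_type i = 0; i < key.size(); ++i) {
            TrieNode *&child = node->children[key[i]];
            if (!child) child = new TrieNode();   // create missing node
            node = child;
        }
        node->value = v;
        node->has_value = true;
    }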

    Thank you everybody for your answers.
    Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros
     
    barbaros, Oct 10, 2005
    #11
  12. barbaros wrote:
    > Some say I don't provide enough detail.
    > Let me see what I can add ...
    > I need to build a large dictionary, in the sense
    > of the "map" class in the C++ STL. I need to build it
    > quickly (if not, it may take years to get something
    > meaningful), so I need relatively quick write access.
    > Because it's so large, I need to save (parts of) it on
    > the hard disk, and in the end to save it all to disk.
    > As for budget restrictions, I am committed to using
    > only free software (open source, to be more precise).


    Use boost::multi_index for that. Hashed indexes might be what you need.
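
    For instance (a sketch assuming Boost 1.33's hashed indices; note that
    this container lives entirely in RAM, so by itself it does not solve
    the larger-than-memory part):

    #include <boost/multi_index_container.hpp>
    #include <boost/multi_index/hashed_index.hpp>
    #include <boost/multi_index/member.hpp>
    #include <string>

    struct Entry {
        std::string key;    // the dictionary key
        double value;       // the floating point payload
    };

    using namespace boost::multi_index;

    // A container of Entry with one hashed, unique index on Entry::key.
    typedef multi_index_container<
        Entry,
        indexed_by<
            hashed_unique< member<Entry, std::string, &Entry::key> >
        >
    > Dictionary;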

    > mmap seems to be the solution, since it seems to be
    > really fast (thank you, moi and Bane). I confess I am
    > not familiar with mmap or address spaces. I understand
    > that on a 64-bit machine there would be no problem.
    > How do I know whether Linux on a recent Intel box is
    > 32 or 64 bits?


    Check the uname command output.
    In C/C++ code you can check sizeof(void*).


    > Actually, now I can imagine a way to compress my
    > dictionary a little bit. Since the keys are all strings,
    > instead of keeping all strings in memory I can have
    > a map with keys 'a','b','c',...,'z'.
    > Then the values of this map would be other maps,
    > again having letters as keys, and so on. That is,
    > I could work with a map of maps of maps ...
    > I guess this is more or less what moi suggested.
    > (By the way, moi, I see no syntax error in my message.)


    This is called a digital search tree (DST). It consumes a lot of memory
    for nodes. A ternary search tree has performance characteristics similar
    to those of a DST but consumes less memory.
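
    A minimal ternary search tree sketch (one common formulation, not code
    from this thread): each node holds a single character plus low/equal/high
    links, so the node count tracks shared prefixes rather than paying for a
    full 26-way fanout at every level:

    #include <string>

    struct TstNode {
        char c;
        TstNode *lo, *eq, *hi;   // less-than, equal (next char), greater-than
        bool has_value;
        double value;
        explicit TstNode(char ch)
            : c(ch), lo(0), eq(0), hi(0), has_value(false), value(0.0) {}
    };

    // Recursive insert of a non-empty C string; returns the subtree root.
    TstNode *insert(TstNode *n, const char *s, double v)
    {
        if (!n) n = new TstNode(*s);
        if (*s < n->c)      n->lo = insert(n->lo, s, v);
        else if (*s > n->c) n->hi = insert(n->hi, s, v);
        else if (s[1])      n->eq = insert(n->eq, s + 1, v);
        else { n->value = v; n->has_value = true; }
        return n;
    }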
     
    Maxim Yegorushkin, Oct 10, 2005
    #12
  13. barbaros

    barbaros Guest

    moi wrote:
    > BTW for a similar problem:
    > http://www-db.stanford.edu/~backrub/google.html
    > They manage to keep the 'dictionary' in core.
    > (but they probably can afford to forget/ignore some 'strings')


    Yes, you are perfectly right to point out this similar problem.
    Thank you for the link.
     
    barbaros, Oct 10, 2005
    #13
  14. "barbaros" <> wrote in message
    news:...
    > Some say I don't provide enough detail.
    > Let me see what I can add ...
    >
    > mmap seems to be the solution, since it seems to be
    > really fast (thank you, moi and Bane). I confess I am
    > not familiar with mmap or address spaces. I understand
    > that on a 64-bit machine there would be no problem.
    > How do I know whether Linux on a recent Intel box is
    > 32 or 64 bits?


    Well, you don't need to. Just organize pages.
    Let's say you have a page table in memory with a pointer
    to each mmapped block, and a flag for in-memory/
    out-of-memory status. The index into the table represents
    the page number. Then you only need to store a page number
    and an offset into the page in, say, two integers, which together
    represent a pointer in your map object. Let's say page = 0x1
    and offset = 0x4000; then table[1].mmaped_pointer + 0x4000 gives
    the real address.
    You have to check whether table[1].in_memory == true; if not,
    mmap that page and update
    table[page_num].mmaped_pointer = mmap(...,page_num*page_size);
    then set table[1].in_memory = true. You can also add an access counter
    table[1].acc_count and unmap pages that are not used, so you
    keep to some maximum number of in-memory pages.
    This is normally done by the OS, but if you don't have enough address
    space you have to do it yourself :)
    When retrieving data you just have to know where the first node of
    your data structure is. Let's say it's on page 0, at offset 0x0.
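
    In code, the scheme might look roughly like this (POSIX mmap assumed;
    the page size, table length, and names are illustrative, and error
    checking plus the unmap/eviction policy are left out):

    #include <sys/types.h>
    #include <sys/mman.h>
    #include <cstddef>
    #include <vector>

    const std::size_t kPageSize = 4u << 20;    // 4 MiB per mapped piece

    struct PageEntry {
        char *mapped;                          // address returned by mmap
        bool  in_memory;                       // currently mapped?
        long  acc_count;                       // usage counter for eviction
        PageEntry() : mapped(0), in_memory(false), acc_count(0) {}
    };

    std::vector<PageEntry> table(4096);        // 4096 * 4 MiB covers 16 GiB
    int data_fd = -1;                          // fd of the big backing file

    // Turn a (page, offset) pair into a real address, mapping on demand.
    char *resolve(std::size_t page, std::size_t offset)
    {
        PageEntry &e = table[page];
        if (!e.in_memory) {
            e.mapped = static_cast<char *>(
                mmap(0, kPageSize, PROT_READ | PROT_WRITE, MAP_SHARED,
                     data_fd, (off_t)(page * kPageSize)));
            e.in_memory = true;
        }
        ++e.acc_count;                         // bookkeeping for later unmaps
        return e.mapped + offset;
    }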

    >
    > Actually, now I can imagine a way to compress my
    > dictionary a little bit. Since the keys are all strings,
    > instead of keeping all strings in memory I can have
    > a map with keys 'a','b','c',...,'z'.


    You can use some hashing algorithm, and just store
    hash values instead of strings.
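
    For example, something like FNV-1a (my illustration, not code from the
    thread) folds each key to a fixed 64-bit value. Beware: two different
    keys *can* collide, so this only works if collisions are acceptable or
    are handled separately:

    #include <stdint.h>
    #include <string>

    uint64_t fnv1a(const std::string &s)
    {
        uint64_t h = 0xcbf29ce484222325ull;    // FNV offset basis
        for (std::string::size_type i = 0; i < s.size(); ++i) {
            h ^= (unsigned char)s[i];
            h *= 0x100000001b3ull;             // FNV prime
        }
        return h;
    }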

    Greetings, Bane.
     
    Branimir Maksimovic, Oct 10, 2005
    #14
  15. moi

    moi Guest

    barbaros wrote:
    > Some say I don't provide enough detail.
    > Let me see what I can add ...
    > I need to build a large dictionary, in the sense
    > of the "map" class in the C++ STL. I need to build it
    > quickly (if not, it may take years to get something


    >
    > mmap seems to be the solution, since it seems to be
    > really fast (thank you, moi and Bane). I confess I am
    > not familiar with mmap or address spaces. I understand
    > that on a 64-bit machine there would be no problem.


    Mmap() *can* be elegant, but gets ugly in the
    presence of writes/appends to the same file (as in your
    case).
    Also, there is *no performance benefit* over write().
    Mmap just maps a piece of your disk file into memory,
    but underneath just as much disk I/O is needed.
    In the worst case (unclustered read access) you end up with one read
    per record fetch.
    Writing/appending is always faster (since multiple records can fit into
    one disk block).
    Using a DB library does not change this; the library still has to
    do the reading/writing on your behalf, and you end up with ~10 ms read
    time per block. It *can* save you some development effort.


    > How do I know whether Linux on a recent Intel box is
    > 32 or 64 bits?


    What another replier suggested:
    printf("%u", (unsigned) sizeof (void*));

    If the machine is 64-bit you can install/compile a 64-bit
    Linux distribution onto it.
    Linux has had 64-bit support since about 1994 (Alpha/PPC).


    > Actually, now I can imagine a way to compress my
    > dictionary a little bit. Since the keys are all strings,
    > instead of keeping all strings in memory I can have
    > a map with keys 'a','b','c',...,'z'.
    > Then the values of this map would be other maps,
    > again having letters as keys, and so on. That is,
    > I could work with a map of maps of maps ...


    This is basically a tree-like structure.
    Google for tree, trie, patricia, or skip list for more.
    As another replier suggested, tree-like structures can take
    a lot of (memory) space when badly designed/applied (e.g. when most of
    the pointers are NULL).

    > I guess this is more or less what moi suggested.
    > (By the way, moi, I see no syntax error in my message.)


    It was more about semantics. The paragraph made no sense to me.

    > Thank you everybody for your answers.
    > Cristian Barbarosie http://cmaf.fc.ul.pt/~barbaros
    >



    Good luck,
    AvK
     
    moi, Oct 10, 2005
    #15
  16. "moi" <avk@localhost> wrote in message
    news:...
    > barbaros wrote:
    >> Some say I don't provide enough detail.
    >> Let me see what I can add ...
    >> I need to build a large dictionary, in the sense
    >> of the "map" class in the C++ STL. I need to build it
    >> quickly (if not, it may take years to get something

    >
    >>
    >> mmap seems to be the solution, since it seems to be
    >> really fast (thank you, moi and Bane). I confess I am
    >> not familiar with mmap or address spaces. I understand
    >> that on a 64-bit machine there would be no problem.

    >
    > Mmap() *can* be elegant, but gets ugly in the
    > presence of writes/appends to the same file (as in your
    > case).


    Not necessarily: ftruncate, then mmap additional page(s)
    (or lseek and then write zeros to avoid fragmentation of the file).
    But it is better to just preallocate a large enough file
    (write zero bytes, not ftruncate, because of fragmentation)
    and expand it if that is not enough.
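
    A sketch of that preallocation (POSIX assumed, error handling omitted):
    writing real zero bytes forces the filesystem to allocate blocks now,
    unlike ftruncate, which can leave a sparse, fragmentation-prone file:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstddef>
    #include <vector>

    void preallocate(const char *path, std::size_t total)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0664);
        std::vector<char> zeros(1u << 20, 0);      // 1 MiB chunk of zeros
        std::size_t done = 0;
        while (done < total) {
            std::size_t n = total - done;
            if (n > zeros.size()) n = zeros.size();
            write(fd, &zeros[0], n);               // real blocks, not a hole
            done += n;
        }
        close(fd);
    }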

    > Also, there is *no performance benefit* over write().


    Disk write is disk write, but with mmap you don't
    have to manually read/write from file to app memory.

    > Mmap just maps a piece of your disk file into memory,
    > but underneath just as much disk I/O is needed.


    See the difference: no need for memory-to-file buffering
    and data conversion in application code; therefore mmap
    is the most natural way to implement a persistent data structure.

    > In the worst case (unclustered read access) you end up with one read
    > per record fetch.


    Same as with read. Of course, caching strategies can be different,
    with the result that either mmap or read can be faster depending on the
    situation. In my experience mmap is better when dealing with large
    random-access files (a large swap file).
    For pure sequential reading, read() should be better.

    > Writing/appending is always faster (since multiple records can fit into
    > one disk block).
    > Using a DB library does not change this; the library still has to
    > do the reading/writing on your behalf, and you end up with ~10 ms read
    > time per block. It *can* save you some development effort.


    Of course, except that everything goes through the db interface and the
    application itself is limited by it. In this case that is not a problem,
    since the db eliminates the need for a hash table implementation; one
    just uses the db interface instead.

    Greetings, Bane.
     
    Branimir Maksimovic, Oct 10, 2005
    #16
