How to store the data of a large map<string, int>?

Discussion in 'C++' started by liujiaping, Aug 6, 2007.

  1. liujiaping

    liujiaping Guest

    Hi, all.
    I have a dictionary-like file which has the following format:
    first 4
    column 7
    is 9
    a 23
    word 134
    ....

    Every line has two columns. The first column is always an English
    word, and the second is an integer number. The file is a text file,
    and contains tens of thousands of lines. My program has to read this
    file, and I use the container map<string, int> to store the data.
    Every time my program runs, the file is read. But since the file is
    too large, the speed is very slow. Is there any other efficient way to
    organize the data of the file to make it fast to read?
    Any help is appreciated. Thank you.
    liujiaping, Aug 6, 2007
    #1

  2. liujiaping

    Ian Collins Guest

    liujiaping wrote:
    > Hi, all.
    > I have a dictionary-like file which has the following format:
    > first 4
    > column 7
    > is 9
    > a 23
    > word 134
    > ....
    >
    > Every line has two columns. The first column is always an English
    > word, and the second is an integer number. The file is a text file,
    > and contains tens of thousands of lines. My program has to read this
    > file, and I use the container map<string, int> to store the data.
    > Every time my program runs, the file is read. But since the file is
    > too large, the speed is very slow. Is there any other efficient way to
    > organize the data of the file to make it fast to read?
    > Any help is appreciated. Thank you.
    >

    How are you reading the data and where is your current bottleneck?

    --
    Ian Collins.
    Ian Collins, Aug 6, 2007
    #2

  3. liujiaping

    Barry Guest

    liujiaping wrote:
    > Hi, all.
    > I have a dictionary-like file which has the following format:
    > first 4
    > column 7
    > is 9
    > a 23
    > word 134
    > ...
    >
    > Every line has two columns. The first column is always an English
    > word, and the second is an integer number. The file is a text file,
    > and contains tens of thousands of lines. My program has to read this
    > file, and I use the container map<string, int> to store the data.
    > Every time my program runs, the file is read. But since the file is
    > too large, the speed is very slow. Is there any other efficient way to
    > organize the data of the file to make it fast to read?
    > Any help is appreciated. Thank you.
    >


    I think you should convert your text file to binary, built as an
    indexed dictionary file.

    You can have such structure to serialize your data into the dictionary file

    struct Foo {
        unsigned int word_len;
        char* word;
        int key;
    };

    and map each Foo object to an integral value (hashing, for example) so
    you can search it quickly, and build an index file.

    You can look at StarDict, an open-source dictionary; it will give you
    some hints.
    Barry, Aug 6, 2007
    #3
  4. liujiaping

    liujiaping Guest

    On Aug 6, 11:25 am, Ian Collins <> wrote:
    > liujiaping wrote:
    > > Hi, all.
    > > I have a dictionary-like file which has the following format:
    > > first 4
    > > column 7
    > > is 9
    > > a 23
    > > word 134
    > > ....

    >
    > > Every line has two columns. The first column is always an English
    > > word, and the second is an integer number. The file is a text file,
    > > and contains tens of thousands of lines. My program has to read this
    > > file, and I use the container map<string, int> to store the data.
    > > Every time my program runs, the file is read. But since the file is
    > > too large, the speed is very slow. Is there any other efficient way to
    > > organize the data of the file to make it fast to read?
    > > Any help is appreciated. Thank you.

    >
    > How are you reading the data and where is your current bottleneck?
    >
    > --
    > Ian Collins.


    I just use the file stream to read the data, simply like this:

    fstream data_file("data.txt");
    string word;
    int key;
    map<string, int> key_map;
    while(!data_file.eof())
    {
        data_file >> word;
        data_file >> key;
        key_map[word] = key;
    }
    liujiaping, Aug 6, 2007
    #4
  5. liujiaping

    Ian Collins Guest

    liujiaping wrote:
    > On Aug 6, 11:25 am, Ian Collins <> wrote:
    >> How are you reading the data and where is your current bottleneck?
    >>


    Please don't quote signatures.

    >
    > I just use the file stream to read the data, simply like this:
    >
    > fstream data_file("data.txt");
    > string word;
    > int key;
    > map<string, int> keymap;
    > while(!data_file.eof())
    > {
    > data_file >> word;
    > data_file >> key;
    > key_map[word] = key;
    > }
    >

    Have you profiled the reading to see where most of the time is being used?

    --
    Ian Collins.
    Ian Collins, Aug 6, 2007
    #5
  6. liujiaping

    liujiaping Guest

    On Aug 6, 11:31 am, Barry <> wrote:
    > liujiaping wrote:
    > > Hi, all.
    > > I have a dictionary-like file which has the following format:
    > > first 4
    > > column 7
    > > is 9
    > > a 23
    > > word 134
    > > ...

    >
    > > Every line has two columns. The first column is always an English
    > > word, and the second is an integer number. The file is a text file,
    > > and contains tens of thousands of lines. My program has to read this
    > > file, and I use the container map<string, int> to store the data.
    > > Every time my program runs, the file is read. But since the file is
    > > too large, the speed is very slow. Is there any other efficient way to
    > > organize the data of the file to make it fast to read?
    > > Any help is appreciated. Thank you.

    >
    > I think you have to convert your text file into binary mode, built as a
    > dictionary indexed file.
    >
    > You can have such structure to serialize your data into the dictionary file
    >
    > struct Foo {
    > unsigned int word_len;
    > char* word;
    > int key;
    >
    > };
    >
    > and index to Foo object into an integral value so you can search it
    > fast, like hashing, to build a index file.
    >
    > You can reference StarDict, an open source dictionary, it will give you
    > some hints.


    Thanks for your advice. But how do I write the struct Foo to a binary
    file? Can the function fwrite() do that? And given the binary file, how
    do I read it back into the struct Foo? Is there any example of it?
    Thank you~
    liujiaping, Aug 6, 2007
    #6
  7. liujiaping

    Barry Guest

    liujiaping wrote:
    > On Aug 6, 11:31 am, Barry <> wrote:
    >> liujiaping wrote:
    >>> Hi, all.
    >>> I have a dictionary-like file which has the following format:
    >>> first 4
    >>> column 7
    >>> is 9
    >>> a 23
    >>> word 134
    >>> ...
    >>> Every line has two columns. The first column is always an English
    >>> word, and the second is an integer number. The file is a text file,
    >>> and contains tens of thousands of lines. My program has to read this
    >>> file, and I use the container map<string, int> to store the data.
    >>> Every time my program runs, the file is read. But since the file is
    >>> too large, the speed is very slow. Is there any other efficient way to
    >>> organize the data of the file to make it fast to read?
    >>> Any help is appreciated. Thank you.

    >> I think you have to convert your text file into binary mode, built as a
    >> dictionary indexed file.
    >>
    >> You can have such structure to serialize your data into the dictionary file
    >>
    >> struct Foo {
    >> unsigned int word_len;
    >> char* word;
    >> int key;
    >>
    >> };
    >>
    >> and index to Foo object into an integral value so you can search it
    >> fast, like hashing, to build a index file.
    >>
    >> You can reference StarDict, an open source dictionary, it will give you
    >> some hints.

    >
    > Thanks for ur advice. But how to write the struct Foo to a binary
    > file?


    Since you load the text file into a map<string, int>,
    say word_map:

    struct Serializer {
        Serializer(ofstream& ofs) : ofs(ofs) {}
        void operator() (pair<string, int> const& p) const {
            string::size_type len = p.first.size();
            ofs.write((char const*)&len, sizeof(string::size_type));
            ofs.write(p.first.data(), len);
            ofs.write((char const*)&p.second, sizeof(int));
        }
        ofstream& ofs;
    };

    ofstream ofs("out.dict", ios::binary);
    for_each(word_map.begin(), word_map.end(), Serializer(ofs));

    > Can the function fwrite() do that? And given the binary file, how do
    > you
    > read from it to the struct Foo? Is there any example about it?


    map<string, int> word_map;

    void load(ifstream& ifs) {
        string::size_type len = 0;
        while (ifs.read((char*)&len, sizeof(string::size_type))) {
            assert(len <= 1024);
            char buf[1024]; // maximum buffer for a word
            ifs.read(buf, len);
            string word(buf, len);
            int key;
            ifs.read((char*)&key, sizeof(int));
            word_map.insert(make_pair(word, key));
        }
    }

    ////////

    In addition, if you want an index file for your data (so that you can
    load just one specific word's data), you have to write the hash code
    (it has to be unique) and the file offset into the dict data file. When
    you're loading one word, just look it up in the index file, then locate
    the word's data with the offset in the data file.
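    A minimal sketch of that idea (DictIndex and all its member names are
    mine, not from StarDict; an in-memory string stands in for the data
    file, and a plain map from word to offset stands in for the hashed
    index file):

    ```cpp
    #include <cstddef>
    #include <map>
    #include <sstream>
    #include <string>

    // Index of word -> offset into a flat record buffer, so one word can be
    // located and parsed without loading the whole dictionary.
    struct DictIndex {
        std::string data;                          // serialized records
        std::map<std::string, std::size_t> index;  // word -> record offset

        void add(const std::string& word, int key) {
            std::size_t off = data.size();
            std::ostringstream rec;
            rec << word << ' ' << key << '\n';
            data += rec.str();
            index[word] = off;
        }

        // Look up one word: find its offset, then parse only that record.
        bool lookup(const std::string& word, int& key) const {
            std::map<std::string, std::size_t>::const_iterator it =
                index.find(word);
            if (it == index.end()) return false;
            std::istringstream rec(data.substr(it->second));
            std::string w;
            return (rec >> w >> key) && w == word;
        }
    };
    ```

    A real version would write `index` to its own file (hash plus offset,
    as described above) and seek in the data file instead of substr-ing a
    string.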
    Barry, Aug 6, 2007
    #7
  8. liujiaping

    Guest

    On Aug 6, 12:38 pm, liujiaping <> wrote:
    > ...
    > fstream data_file("data.txt");
    > string word;
    > int key;
    > map<string, int> keymap;
    > while(!data_file.eof())
    > {
    > data_file >> word;
    > data_file >> key;
    > key_map[word] = key;
    >
    > }


    Ian Collins is giving good advice: when trying to performance tune,
    use your profiler.

    Re Barry's suggestion: a binary format may help, but the variable-
    length strings are a significant complication, which makes the whole
    thing unlikely to offer any benefit. I think it's barking up the wrong
    tree.

    Some other ideas:

    - are you building with optimisation enabled?

    - try: "while (data_file >> word >> key) key_map[word] = key;", which
    may help a little, especially if you're not building with
    optimisation.

    - if the map insertion is the bottleneck, try a hash map /
    unordered_map (there's another recent thread about them).

    - if the I/O is the bottleneck, in my experience using memory-mapped
    I/O can speed things up by an order of magnitude or two. You can google
    for details.
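    A POSIX-only sketch of the memory-mapped approach (load_mapped and the
    demo file name are mine, not from the thread; it parses "word number"
    pairs straight out of the mapped bytes and silently stores 0 for a
    missing number, so treat it as a starting point, not production code):

    ```cpp
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    #include <cstddef>
    #include <fstream>
    #include <map>
    #include <string>

    // Map the whole file and scan it once, avoiding per-token stream overhead.
    bool load_mapped(const char* path, std::map<std::string, int>& out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return false;
        struct stat st;
        if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return false; }
        std::size_t len = static_cast<std::size_t>(st.st_size);
        void* mem = mmap(0, len, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);
        if (mem == MAP_FAILED) return false;
        const char* cur = static_cast<const char*>(mem);
        const char* end = cur + len;
        while (cur < end) {
            while (cur < end && (*cur == ' ' || *cur == '\t' ||
                                 *cur == '\n' || *cur == '\r')) ++cur;
            const char* start = cur;                 // the word
            while (cur < end && *cur != ' ' && *cur != '\t' &&
                   *cur != '\n' && *cur != '\r') ++cur;
            if (cur == start) break;
            std::string word(start, cur);
            while (cur < end && (*cur == ' ' || *cur == '\t')) ++cur;
            int key = 0;                             // the number (digits only)
            while (cur < end && *cur >= '0' && *cur <= '9')
                key = key * 10 + (*cur++ - '0');
            out[word] = key;
        }
        munmap(mem, len);
        return true;
    }
    ```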

    Cheers,

    Tony
    , Aug 6, 2007
    #8
  9. liujiaping

    liujiaping Guest

    On Aug 6, 12:27 pm, Barry <> wrote:
    > liujiaping wrote:
    > > On Aug 6, 11:31 am, Barry <> wrote:
    > >> liujiaping wrote:
    > >>> Hi, all.
    > >>> I have a dictionary-like file which has the following format:
    > >>> first 4
    > >>> column 7
    > >>> is 9
    > >>> a 23
    > >>> word 134
    > >>> ...
    > >>> Every line has two columns. The first column is always an English
    > >>> word, and the second is an integer number. The file is a text file,
    > >>> and contains tens of thousands of lines. My program has to read this
    > >>> file, and I use the container map<string, int> to store the data.
    > >>> Every time my program runs, the file is read. But since the file is
    > >>> too large, the speed is very slow. Is there any other efficient way to
    > >>> organize the data of the file to make it fast to read?
    > >>> Any help is appreciated. Thank you.
    > >> I think you have to convert your text file into binary mode, built as a
    > >> dictionary indexed file.

    >
    > >> You can have such structure to serialize your data into the dictionary file

    >
    > >> struct Foo {
    > >> unsigned int word_len;
    > >> char* word;
    > >> int key;

    >
    > >> };

    >
    > >> and index to Foo object into an integral value so you can search it
    > >> fast, like hashing, to build a index file.

    >
    > >> You can reference StarDict, an open source dictionary, it will give you
    > >> some hints.

    >
    > > Thanks for ur advice. But how to write the struct Foo to a binary
    > > file?

    >
    > since you load the text file into vector<map<string, int> >
    > say word_map
    >
    > struct Serializer {
    > Serializer(ofstream& ofs) : ofs(ofs) {}
    > void operator() (pair<string, int> const& p) const {
    > string::size_type len = p.first.size();
    > ofs.write((char const*)&len, sizeof(string::size_type));
    > ofs.write(p.first.data(), len);
    > ofs.write((char const*)&p.second, sizeof(int));
    > }
    > ofstream& ofs;
    >
    > };
    >
    > ofstream ofs("out.dict", ios::binary);
    > for_each(word_map.begin(), word_map.end(), Serializer(ofs));
    >
    > > Can the function fwrite() do that? And given the binary file, how do
    > > you
    > > read from it to the struct Foo? Is there any example about it?

    >
    > word_map;
    > void load(ifstream& ifs) {
    > while (!ifs.eof()) {
    > string::size_type len = -1;
    > ifs.read((char*)&len, sizeof(string::size_type));
    > assert(len <= 1024);
    > char buf[1024]; // maximum buffer for a word
    > ifs.read(buf, len);
    > string word(buf, len);
    > int key;
    > ifs.read((char*)&key, sizeof(int));
    > word_map.insert(make_pair(word, key));
    > }
    >
    > }
    >
    > ////////
    >
    > In addiction, if you wanna have index file for your data (then you can
    > load just one specific word data), you have to write the hash code(have
    > to be unique) and the file offset in the dict data file, when you're
    > loading one word, just look up in the index file, then locate the word
    > data with offset in the data file.


    Nice! Thanks for your help. I'll try it.
    liujiaping, Aug 6, 2007
    #9
  10. liujiaping

    Jerry Coffin Guest

    In article <>,
    says...

    [ ... ]

    > fstream data_file("data.txt");
    > string word;
    > int key;
    > map<string, int> keymap;
    > while(!data_file.eof())
    > {
    > data_file >> word;
    > data_file >> key;
    > key_map[word] = key;
    > }


    There are a number of possibilities to consider. First of all, in many
    implementations, stream-based I/O is fairly slow -- you may gain quite a
    bit of speed by reading the data with something like fscanf or
    fgets/strtok.

    Second, it sounds like the dictionary remains constant after you build
    it. If so, you're probably better off reading the data into a vector,
    sorting it, and then using a binary search to retrieve it later. This
    will often improve speed somewhat during the reading phase, and even
    more when you're using the data.
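    A sketch of the sorted-vector idea (SortedDict and its members are my
    names, not Jerry's): append every pair while reading, sort once at the
    end, then answer queries with std::lower_bound.

    ```cpp
    #include <algorithm>
    #include <climits>
    #include <string>
    #include <utility>
    #include <vector>

    typedef std::pair<std::string, int> Entry;

    // Load into a vector, sort once, then look words up by binary search.
    struct SortedDict {
        std::vector<Entry> entries;

        void add(const std::string& word, int key) {
            entries.push_back(Entry(word, key));
        }
        void finish() {  // call once, after the whole file has been read
            std::sort(entries.begin(), entries.end());
        }
        bool lookup(const std::string& word, int& key) const {
            // INT_MIN makes the probe sort before any real entry for word.
            std::vector<Entry>::const_iterator it = std::lower_bound(
                entries.begin(), entries.end(), Entry(word, INT_MIN));
            if (it == entries.end() || it->first != word) return false;
            key = it->second;
            return true;
        }
    };
    ```

    The win over std::map is one contiguous allocation and no per-node
    overhead, at the cost of the single sort after loading.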

    --
    Later,
    Jerry.

    The universe is a figment of its own imagination.
    Jerry Coffin, Aug 6, 2007
    #10
  11. liujiaping

    liujiaping Guest

    On Aug 6, 11:41 am, Ian Collins <> wrote:
    > liujiaping wrote:
    > > On Aug 6, 11:25 am, Ian Collins <> wrote:
    > >> How are you reading the data and where is your current bottleneck?

    >
    > Please don't quote signatures.
    >
    >
    >
    > > I just use the file stream to read the data, simply like this:

    >
    > > fstream data_file("data.txt");
    > > string word;
    > > int key;
    > > map<string, int> keymap;
    > > while(!data_file.eof())
    > > {
    > > data_file >> word;
    > > data_file >> key;
    > > key_map[word] = key;
    > > }

    >
    > Have you profiled the reading to see where most of the time is being used?
    >
    > --
    > Ian Collins.


    Nope. Is there any other way to read this file?
    liujiaping, Aug 6, 2007
    #11
  12. liujiaping

    BobR Guest

    liujiaping <> wrote in message...
    > On Aug 6, 11:25 am, Ian Collins <> wrote:
    > > How are you reading the data and where is your current bottleneck?

    >
    > I just use the file stream to read the data, simply like this:
    >
    > fstream data_file("data.txt");
    > string word;
    > int key;
    > map<string, int> keymap;

    /*
    > while(!data_file.eof())
    > {
    > data_file >> word;
    > data_file >> key;
    > key_map[word] = key;
    > }
    >

    */
    while( data_file >> word >> key ){
        key_map[word] = key;
    }
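    Wrapped up as a self-contained function (load() is my wrapper name, not
    from the thread), the point of Bob's loop is that the extraction itself
    is the condition, so the loop stops exactly when a read fails and never
    processes the last line twice:

    ```cpp
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    // Read "word number" pairs from any istream until extraction fails.
    std::map<std::string, int> load(std::istream& in) {
        std::map<std::string, int> key_map;
        std::string word;
        int key;
        while (in >> word >> key)
            key_map[word] = key;
        return key_map;
    }
    ```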

    --
    Bob R
    POVrookie
    BobR, Aug 6, 2007
    #12
  13. liujiaping

    Ian Collins Guest

    liujiaping wrote:
    > On Aug 6, 11:41 am, Ian Collins <> wrote:
    >> liujiaping wrote:
    >>> On Aug 6, 11:25 am, Ian Collins <> wrote:
    >>>> How are you reading the data and where is your current bottleneck?

    >> Please don't quote signatures.
    >>
    >>
    >>
    >>> I just use the file stream to read the data, simply like this:
    >>> fstream data_file("data.txt");
    >>> string word;
    >>> int key;
    >>> map<string, int> keymap;
    >>> while(!data_file.eof())
    >>> {
    >>> data_file >> word;
    >>> data_file >> key;
    >>> key_map[word] = key;
    >>> }

    >> Have you profiled the reading to see where most of the time is being used?
    >>


    As I said before, *please don't quote signatures*.

    >
    > Nope. Is there any other way to read this file?
    >

    Why do you refuse to profile it? Before you do, any optimisation
    suggestions are just guesses. The bottleneck might be in the I/O, or it
    might be in loading the map. How you optimise your code should be based
    on real data, not speculation.

    --
    Ian Collins.
    Ian Collins, Aug 6, 2007
    #13
  14. liujiaping

    liujiaping Guest

    On Aug 6, 1:44 pm, Ian Collins <> wrote:
    > liujiaping wrote:
    > > On Aug 6, 11:41 am, Ian Collins <> wrote:
    > >> liujiaping wrote:
    > >>> On Aug 6, 11:25 am, Ian Collins <> wrote:
    > >>>> How are you reading the data and where is your current bottleneck?
    > >> Please don't quote signatures.

    >
    > >>> I just use the file stream to read the data, simply like this:
    > >>> fstream data_file("data.txt");
    > >>> string word;
    > >>> int key;
    > >>> map<string, int> keymap;
    > >>> while(!data_file.eof())
    > >>> {
    > >>> data_file >> word;
    > >>> data_file >> key;
    > >>> key_map[word] = key;
    > >>> }
    > >> Have you profiled the reading to see where most of the time is being used?

    >
    > As I said before, *please don't quote signatures*.
    >
    >
    >
    > > Nope. Is there any other way to read this file?



    > Why do you refuse to profile it? Before you do, any optimisation
    > suggestions are just guesses. The bottleneck might be in the I/O, or it
    > might be in loading the map. How you optimise your code should be based
    > on real data, not speculation.
    >

    Thanks for your advice. I haven't used gprof before. Now I'm reading
    the gprof documentation, and I'll profile it as you suggested.
    liujiaping, Aug 6, 2007
    #14
  15. liujiaping

    James Kanze Guest

    On Aug 6, 5:41 am, Ian Collins <> wrote:
    > liujiaping wrote:
    > > On Aug 6, 11:25 am, Ian Collins <> wrote:
    > >> How are you reading the data and where is your current bottleneck?


    > Please don't quote signatures.


    > > I just use the file stream to read the data, simply like this:


    > > fstream data_file("data.txt");
    > > string word;
    > > int key;
    > > map<string, int> keymap;
    > > while(!data_file.eof())
    > > {
    > > data_file >> word;
    > > data_file >> key;
    > > key_map[word] = key;
    > > }


    > Have you profiled the reading to see where most of the time is
    > being used?


    Before profiling it, he should probably modify it to ensure that
    it is correct. As it is, it happens to work if the file has no
    syntax errors in it, but only because processing the last line
    twice doesn't cause an error. I'd write something more along
    the lines of:

    std::string line ;
    while ( std::getline( data_file, line ) ) {
        std::istringstream s( line ) ;
        s >> word >> key >> std::ws ;
        if ( ! s || s.get() != EOF ) {
            // Syntax error...
        } else {
            key_map.insert( Map::value_type( word, key ) ) ;
        }
    }

    That's likely to be even slower:). But at least it's correct.

    Once that's done, of course, there's no way to reliably speed it
    up without the profiling data. And depending on how much
    speeding up is necessary, there may not even be a portable
    solution.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Aug 6, 2007
    #15
  16. liujiaping

    James Kanze Guest

    On Aug 6, 7:06 am, Jerry Coffin <> wrote:
    > In article <>,
    > says...


    > [ ... ]


    > > fstream data_file("data.txt");
    > > string word;
    > > int key;
    > > map<string, int> keymap;
    > > while(!data_file.eof())
    > > {
    > > data_file >> word;
    > > data_file >> key;
    > > key_map[word] = key;
    > > }


    > There are a number of possibilities to consider. First of all, in many
    > implementations, stream-based I/O is fairly slow -- you may gain quite a
    > bit of speed by reading the data with something like fscanf or
    > fgets/strtok.


    That's not been my experience (except with Sun CC), but it
    depends on the implementation. Given that he's having trouble
    getting iostream input correct, however, I definitely wouldn't
    suggest his trying stdio.h.

    > Second, it sounds like the dictionary remains constant after you build
    > it. If so, you're probably better off reading the data into a vector,
    > sorting it, and then using a binary search to retrieve it later. This
    > will often improve speed somewhat during the reading phase, and even
    > more when you're using the data.


    Another question is where the file is coming from. If he could
    generate it sorted to begin with, or sort it off line, that
    could speed up the initialization of the std::vector a lot. Or
    even of std::map, since he could give an accurate hint for each
    insertion.
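    The hint trick looks like this (insert_sorted is my name, not James's):
    when the input is already sorted by word, every new element belongs at
    the end of the map, so insert with an end() hint is amortized constant
    time instead of O(log n) per element.

    ```cpp
    #include <map>
    #include <string>

    // Hinted insert for input that arrives in sorted order.
    void insert_sorted(std::map<std::string, int>& key_map,
                       const std::string& word, int key) {
        key_map.insert(key_map.end(),
                       std::map<std::string, int>::value_type(word, key));
    }
    ```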

    But before considering any of that, we really need profiling
    data. It's also possible (or even probable) that his
    initialization is IO bound anyway, so no changes in code can
    improve it.

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    James Kanze, Aug 6, 2007
    #16
  17. liujiaping

    Ismo Salonen Guest

    James Kanze wrote:
    > On Aug 6, 7:06 am, Jerry Coffin <> wrote:
    >> In article <>,
    >> says...

    >
    >> [ ... ]

    >
    >>> fstream data_file("data.txt");
    >>> string word;
    >>> int key;
    >>> map<string, int> keymap;
    >>> while(!data_file.eof())
    >>> {
    >>> data_file >> word;
    >>> data_file >> key;
    >>> key_map[word] = key;
    >>> }

    >
    >> There are a number of possibilities to consider. First of all, in many
    >> implementations, stream-based I/O is fairly slow -- you may gain quite a
    >> bit of speed by reading the data with something like fscanf or
    >> fgets/strtok.

    >
    > That's not been my experience (except with Sun CC), but it
    > depends on the implementation. Given that he's having trouble
    > getting iostream input correct, however, I definitly wouldn't
    > suggest his trying stdio.h.
    >
    >> Second, it sounds like the dictionary remains constant after you build
    >> it. If so, you're probably better off reading the data into a vector,
    >> sorting it, and then using a binary search to retrieve it later. This
    >> will often improve speed somewhat during the reading phase, and even
    >> more when you're using the data.

    >
    > Another question is where the file is coming from. If he could
    > generate it sorted to begin with, or sort it off line, that
    > could speed up the initialization of the std::vector or lot. Or
    > even of std::map, since he could give an accurate hint for each
    > insertion.
    >
    > But before considering any of that, we really need profiling
    > data. It's also possible (or even probable) that his
    > initialization is IO bound anyway, so no changes in code can
    > improve it.
    >
    > --
    > James Kanze (GABI Software) email:
    > Conseils en informatique orientée objet/
    > Beratung in objektorientierter Datenverarbeitung
    > 9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
    >


    A long time ago I tested map with sorted data and it did not perform
    well; performance was much better with randomly ordered data. But maybe
    current implementations are better. Always measure (= profile) before
    taking any action.

    ismo
    Ismo Salonen, Aug 6, 2007
    #17
