How to store the data of a large map<string, int>?

liujiaping

Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
so large, this is very slow. Is there a more efficient way to
organize the data in the file so that it can be read quickly?
Any help is appreciated. Thank you.
 
Ian Collins

liujiaping said:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
....

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
so large, this is very slow. Is there a more efficient way to
organize the data in the file so that it can be read quickly?
Any help is appreciated. Thank you.
How are you reading the data and where is your current bottleneck?
 
Barry

liujiaping said:
Hi, all.
I have a dictionary-like file which has the following format:
first 4
column 7
is 9
a 23
word 134
...

Every line has two columns. The first column is always an English
word, and the second is an integer number. The file is a text file,
and contains tens of thousands of lines. My program has to read this
file, and I use the container map<string, int> to store the data.
Every time my program runs, the file is read. But since the file is
so large, this is very slow. Is there a more efficient way to
organize the data in the file so that it can be read quickly?
Any help is appreciated. Thank you.

I think you should convert your text file to a binary format, built as
an indexed dictionary file.

You could use a structure like this to serialize your data into the
dictionary file:

struct Foo {
unsigned int word_len;
char* word; // points to word_len bytes; serialize the bytes, not the pointer
int key;
};

and map each Foo object to an integral value, e.g. by hashing, so you
can look it up fast, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.
 
liujiaping

How are you reading the data and where is your current bottleneck?

I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}
 
Ian Collins

Please don't quote signatures.
I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?
 
liujiaping

I think you should convert your text file to a binary format, built as
an indexed dictionary file.

You could use a structure like this to serialize your data into the
dictionary file:

struct Foo {
unsigned int word_len;
char* word; // points to word_len bytes; serialize the bytes, not the pointer
int key;
};

and map each Foo object to an integral value, e.g. by hashing, so you
can look it up fast, building an index file.

You can look at StarDict, an open-source dictionary; it will give you
some hints.

Thanks for your advice. But how do I write the struct Foo to a binary
file? Can the function fwrite() do that? And given the binary file, how
do I read it back into the struct Foo? Is there any example of this?
Thank you~
 
Barry

liujiaping said:
Thanks for ur advice. But how to write the struct Foo to a binary
file?

Since you load the text file into a map<string, int>,
say word_map:

struct Serializer {
Serializer(ofstream& ofs) : ofs(ofs) {}
void operator()(pair<string, int> const& p) const {
// record layout: [word length][word bytes][key]
string::size_type len = p.first.size();
ofs.write((char const*)&len, sizeof(string::size_type));
ofs.write(p.first.data(), len);
ofs.write((char const*)&p.second, sizeof(int));
}
ofstream& ofs;
};

ofstream ofs("out.dict", ios::binary);
for_each(word_map.begin(), word_map.end(), Serializer(ofs));
Can the function fwrite() do that? And given the binary file, how do I
read it back into the struct Foo? Is there any example of this?

map<string, int> word_map;

void load(ifstream& ifs) {
string::size_type len = 0;
// read the length first; the stream state tells us when we hit EOF
while (ifs.read((char*)&len, sizeof(string::size_type))) {
assert(len <= 1024);
char buf[1024]; // maximum buffer for a word
ifs.read(buf, len);
string word(buf, len);
int key;
ifs.read((char*)&key, sizeof(int));
word_map.insert(make_pair(word, key));
}
}
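
To answer the fwrite() question: yes, <cstdio> can write the same
bytes. A minimal sketch of the stdio equivalent (untested;
write_record is a name made up here):

#include <cstdio>
#include <string>

// Hypothetical stdio version of the Serializer above: write one
// length-prefixed record per (word, key) pair.
void write_record(std::FILE* fp, std::string const& word, int key)
{
    std::string::size_type len = word.size();
    std::fwrite(&len, sizeof len, 1, fp);
    std::fwrite(word.data(), 1, len, fp);
    std::fwrite(&key, sizeof key, 1, fp);
}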

////////

In addition, if you want an index file for your data (so you can
load just one specific word's data), you have to write the hash code
(which has to be unique) and each record's offset in the dict data
file. When you're loading one word, just look it up in the index file,
then locate the word's data by its offset in the data file.
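
For illustration, a rough sketch of that offset-index idea (untested;
build_index is a name made up here, and it stores the word itself
rather than a hash code to keep the example simple):

#include <fstream>
#include <map>
#include <string>
using namespace std;

// Sketch: while serializing each (word, key) record, remember the
// record's byte offset so one word can be loaded alone later.
void build_index(map<string, int> const& word_map,
                 char const* dict_path, char const* index_path)
{
    ofstream dict(dict_path, ios::binary);
    ofstream index(index_path, ios::binary);
    for (map<string, int>::const_iterator it = word_map.begin();
         it != word_map.end(); ++it) {
        long offset = (long)dict.tellp(); // record starts here
        string::size_type len = it->first.size();
        dict.write((char const*)&len, sizeof len);
        dict.write(it->first.data(), len);
        dict.write((char const*)&it->second, sizeof(int));
        // index entry: [word length][word bytes][offset into dict file]
        index.write((char const*)&len, sizeof len);
        index.write(it->first.data(), len);
        index.write((char const*)&offset, sizeof offset);
    }
}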
 
tony_in_da_uk

...
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}

Ian Collins is giving good advice: when trying to performance tune,
use your profiler.

Re Barry's suggestion: a binary format may help, but the variable-length
strings are a significant complication, which makes the whole
thing unlikely to offer any benefit. I think it's barking up the wrong
tree.

Some other ideas:

- are you building with optimisation enabled?

- try: "while (data_file >> word >> key) key_map{word] = key;", which
may just help a little especially if you're not building with
optimisation.

- if the hashing is the bottleneck, try a hash-map / unordered-set
(there's another recent thread about them).

- if the I/O is the bottleneck, in my experience using memory-mapped
I/O can speed things up by an order of magnitude or two (rough sketch
below). You can google for details.
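
A rough POSIX sketch of the memory-mapped approach (untested; no error
handling, and load_mapped is a name made up here):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <map>
#include <string>

// Map the whole file into memory and parse it in place, avoiding
// per-extraction stream overhead. POSIX-only.
std::map<std::string, int> load_mapped(char const* path)
{
    std::map<std::string, int> key_map;
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    char const* base =
        (char const*)mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    char const* p = base;
    char const* end = base + st.st_size;
    while (p < end) {
        char const* word_start = p;
        while (p < end && *p != ' ') ++p;   // word runs up to the space
        std::string word(word_start, p - word_start);
        // number after the space; assumes the file ends with a newline
        // so strtol stops in-bounds
        int key = (int)std::strtol(p, (char**)&p, 10);
        while (p < end && (*p == '\n' || *p == '\r')) ++p;
        key_map[word] = key;
    }
    munmap((void*)base, (size_t)st.st_size);
    close(fd);
    return key_map;
}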

Cheers,

Tony
 
liujiaping

Thanks for your advice. But how do I write the struct Foo to a binary
file?

Since you load the text file into a map<string, int>,
say word_map:

struct Serializer {
Serializer(ofstream& ofs) : ofs(ofs) {}
void operator()(pair<string, int> const& p) const {
// record layout: [word length][word bytes][key]
string::size_type len = p.first.size();
ofs.write((char const*)&len, sizeof(string::size_type));
ofs.write(p.first.data(), len);
ofs.write((char const*)&p.second, sizeof(int));
}
ofstream& ofs;
};

ofstream ofs("out.dict", ios::binary);
for_each(word_map.begin(), word_map.end(), Serializer(ofs));
Can the function fwrite() do that? And given the binary file, how do I
read it back into the struct Foo? Is there any example of this?

map<string, int> word_map;

void load(ifstream& ifs) {
string::size_type len = 0;
// read the length first; the stream state tells us when we hit EOF
while (ifs.read((char*)&len, sizeof(string::size_type))) {
assert(len <= 1024);
char buf[1024]; // maximum buffer for a word
ifs.read(buf, len);
string word(buf, len);
int key;
ifs.read((char*)&key, sizeof(int));
word_map.insert(make_pair(word, key));
}
}

////////

In addition, if you want an index file for your data (so you can
load just one specific word's data), you have to write the hash code
(which has to be unique) and each record's offset in the dict data
file. When you're loading one word, just look it up in the index file,
then locate the word's data by its offset in the data file.

Nice! Thanks for your help. I'll try it.
 
Jerry Coffin

[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}

There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.
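
For instance, a quick sketch of the stdio approach (untested;
load_stdio is a name made up here):

#include <cstdio>
#include <map>
#include <string>

// Read "word number" pairs with C stdio instead of iostreams.
// %63s caps each word at 63 chars so the buffer stays safe.
std::map<std::string, int> load_stdio(char const* path)
{
    std::map<std::string, int> key_map;
    std::FILE* fp = std::fopen(path, "r");
    char word[64];
    int key;
    while (fp && std::fscanf(fp, "%63s %d", word, &key) == 2)
        key_map[word] = key;
    if (fp) std::fclose(fp);
    return key_map;
}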

Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.
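
A minimal sketch of the vector-plus-binary-search approach (untested):

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

typedef pair<string, int> Entry;

int main()
{
    // Read every entry into a vector, then sort once up front.
    ifstream data_file("data.txt");
    vector<Entry> entries;
    string word;
    int key;
    while (data_file >> word >> key)
        entries.push_back(make_pair(word, key));
    sort(entries.begin(), entries.end()); // orders by word, then key

    // Later lookups use a binary search instead of a tree walk.
    Entry probe("column", 0);
    vector<Entry>::const_iterator it =
        lower_bound(entries.begin(), entries.end(), probe);
    if (it != entries.end() && it->first == probe.first)
        cout << it->first << " " << it->second << "\n"; // "column 7"
}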
 
liujiaping

Please don't quote signatures.


I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}

Have you profiled the reading to see where most of the time is being used?

Nope. Is there any other way to read this file?
 
BobR

liujiaping said:
How are you reading the data and where is your current bottleneck?

I just use the file stream to read the data, simply like this:

fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map; /*
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}
*/
while( data_file >> word >> key ){
key_map[word] = key;
}
 
Ian Collins

liujiaping said:
Ian Collins said:
How are you reading the data and where is your current bottleneck?
Please don't quote signatures.


I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?

As I said before, *please don't quote signatures*.
Nope. Is there any other way to read this file?
Why do you refuse to profile it? Before you do, any optimisation
suggestions are just guesses. The bottleneck might be in the I/O, or it
might be in loading the map. How you optimise your code should be based
on real data, not speculation.
 
liujiaping

Ian Collins said:
liujiaping wrote:
How are you reading the data and where is your current bottleneck?
Please don't quote signatures.
I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}
Have you profiled the reading to see where most of the time is being used?

As I said before, *please don't quote signatures*.


Nope. Is there any other way to read this file?

Why do you refuse to profile it? Before you do, any optimisation
suggestions are just guesses. The bottleneck might be in the I/O, or it
might be in loading the map. How you optimise your code should be based
on real data, not speculation.
Thanks for your advice. I haven't used gprof before. Now I'm reading
the gprof documentation. I'll profile it as you suggested.
 
James Kanze

Please don't quote signatures.
I just use the file stream to read the data, simply like this:
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}
Have you profiled the reading to see where most of the time is
being used?

Before profiling it, he should probably modify it to ensure that
it is correct. As it is, it happens to work if the file has no
syntax errors in it, but only because processing the last line
twice doesn't cause an error. I'd write something more along
the lines of:

std::string line ;
std::string word ;
int key ;
while ( std::getline( data_file, line ) ) {
std::istringstream s( line ) ;
s >> word >> key >> std::ws ;
if ( ! s || s.get() != EOF ) {
// Syntax error...
} else {
// assuming: typedef std::map< std::string, int > Map ;
key_map.insert( Map::value_type( word, key ) ) ;
}
}

That's likely to be even slower:). But at least it's correct.

Once that's done, of course, there's no way to reliably speed it
up without the profiling data. And depending on how much
speeding up is necessary, there may not even be a portable
solution.
 
James Kanze

[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}
There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.

That's not been my experience (except with Sun CC), but it
depends on the implementation. Given that he's having trouble
getting iostream input correct, however, I definitely wouldn't
suggest his trying stdio.h.
Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.

Another question is where the file is coming from. If he could
generate it sorted to begin with, or sort it off line, that
could speed up the initialization of the std::vector a lot. Or
even of the std::map, since he could give an accurate hint for each
insertion.
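
For instance, a minimal sketch of the hinted insertion (untested;
assumes data.txt is already sorted by word, so each new element
belongs at the end of the map):

#include <fstream>
#include <map>
#include <string>
using namespace std;

int main()
{
    ifstream data_file("data.txt");
    map<string, int> key_map;
    string word;
    int key;
    // With sorted input, hinting with end() makes each insertion
    // amortized constant time instead of logarithmic.
    while (data_file >> word >> key)
        key_map.insert(key_map.end(), make_pair(word, key));
}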

But before considering any of that, we really need profiling
data. It's also possible (or even probable) that his
initialization is IO bound anyway, so no changes in code can
improve it.
 
Ismo Salonen

James said:
[ ... ]
fstream data_file("data.txt");
string word;
int key;
map<string, int> key_map;
while(!data_file.eof())
{
data_file >> word;
data_file >> key;
key_map[word] = key;
}
There are a number of possibilities to consider. First of all, in many
implementations, stream-based I/O is fairly slow -- you may gain quite a
bit of speed by reading the data with something like fscanf or
fgets/strtok.

That's not been my experience (except with Sun CC), but it
depends on the implementation. Given that he's having trouble
getting iostream input correct, however, I definitely wouldn't
suggest his trying stdio.h.
Second, it sounds like the dictionary remains constant after you build
it. If so, you're probably better off reading the data into a vector,
sorting it, and then using a binary search to retrieve it later. This
will often improve speed somewhat during the reading phase, and even
more when you're using the data.

Another question is where the file is coming from. If he could
generate it sorted to begin with, or sort it off line, that
could speed up the initialization of the std::vector a lot. Or
even of the std::map, since he could give an accurate hint for each
insertion.

But before considering any of that, we really need profiling
data. It's also possible (or even probable) that his
initialization is IO bound anyway, so no changes in code can
improve it.


A long time ago I tested a map with sorted data and it did not perform
well; performance was much better with randomly ordered data. But maybe
current implementations are better. Always measure (=profile) before
taking any action.

ismo
 
