Storing pairs of (int, int) in a database: which db to choose?


Stormbringer

Hi,

I want to implement a fulltext search for messages in a forum. More
exactly for each message I store pairs (wordId, msgId) for each
identified word and when I search something I want to be able to
retrieve very quickly all msgId for a given wordId.

For starters I used PySqLite, using a table with index on wordId. It
works fine but it consumes way too much space on the disk. For example
for 19103 pairs it takes 1686528 bytes, which amounts to around 88
bytes per pair of integers, way too much in my opinion (especially
considering the range of those integers - one is in the range
1..100000 and the other in the range 1..500000).

So my problem now is what should I use to store my data and still be
able to retrieve all messages for a given word very quickly. Either I
find a workaround in sqlite to make it store values on less bytes, or
I use something else.

I looked at the (g)dbm family, but they store one value per key and
both the key and the value must be strings. Using them would mean
storing the list of messages for a word as a string, then whenever I
find a new message containing this word replacing this string with the
old string + the new message id -> and I think this is inefficient.
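
For concreteness, that workaround would look roughly like this (the
file name and the comma-separated encoding are only illustrative,
using the standard dbm module):

    import dbm

    db = dbm.open("wordindex", "c")        # illustrative file name

    def add_pair(wordId, msgId):
        key = str(wordId)
        try:
            old = db[key]                  # existing comma-separated msgId list
        except KeyError:
            old = b""
        # the whole value gets rewritten on every insert - that is the
        # inefficiency mentioned above
        if old:
            db[key] = old + b"," + str(msgId).encode()
        else:
            db[key] = str(msgId).encode()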

At the moment I am looking at BerkeleyDB - still checking if this will
suit my needs.

Any advice and hints? I'm looking only for embedded solutions like
PySqLite; if a database needs a server to be installed then it's not
an option.

Thanks in advance,
Andrei
 

Paul Rubin

Stormbringer said:
I want to implement a fulltext search for messages in a forum. More
exactly for each message I store pairs (wordId, msgId) for each
identified word and when I search something I want to be able to
retrieve very quickly all msgId for a given wordId.

Using any kind of database in a big message system may thrash your
disk pretty badly if you update the database every time a new message
is entered. That means making hundreds of new database entries, one
for each word in the message, each entry needing a disk seek. If
you're adding a few new messages a second, it will be hard for your
disk to keep up.

Anyway you don't need a real database. You're not doing that kind of
access, you don't need multi-phase transactions, blah blah blah. I
think the way to do it instead is to keep several tables:

1) a static table on disk, simply (wordId, msgId) pairs of 4-byte
binary ints, sorted on wordId. You can also have an index for
it: a table with one entry for each wordId, saying where in
the big table the msgId's for that word start.

2) an update log on disk, (wordId, msgId) pairs like above, but
unsorted.

3) an in-memory table (Python dict) containing the same pairs as
the update log, but using a dict structure for instant lookup.

When a new message comes in, you just append all its (wordId,msgId)
pairs to the update log (just append, no random seeking needed), and
also update the in-memory table. Actually you don't need to store
(wordId,msgId) for every pair. Just append a record that says
"next nnn records are for msgId xxx" and then only store the wordId's.

When a search query comes in, look in the in-memory table by hashing,
and if the word isn't found, look in the static table by binary search.

Then once a night, update the static table by running the update log
through a sort utility and then merging it with the static table.
Then you can delete the update log and in-memory table and begin them
again. The sort-merge phase is relatively fast because sorting
utilities are written to sling around large buffers, minimizing disk
head motion.

If the system crashes or you stop the server for some reason, on
restart you can simply read the update log into memory.

You probably don't even need the in-memory structure if you're not
taking too many search queries--the update log should be small enough
that you can just sequentially scan it when a query comes in.

The main idea is that you replace a lot of random small updates with
occasional big sorting operations. The big sorts are far more
i/o-efficient than the small updates. These are the methods they
always used in the old days, when even random-access disk space was
very expensive, so big data sets were always stored on completely
sequential media (magnetic tape).
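
A rough sketch of this scheme in Python (the file names and the fixed
8-byte record layout are only illustrative; the in-memory sort below
stands in for the external sort utility, and it stores full
(wordId, msgId) pairs in the log rather than the compressed
"next nnn records are for msgId xxx" form):

    import os, struct

    REC = struct.Struct("<II")   # one (wordId, msgId) pair as 4-byte unsigned ints
    STATIC = "pairs.sorted"      # big table, sorted on wordId (illustrative name)
    LOG = "pairs.log"            # unsorted append-only update log (illustrative)

    mem = {}                     # wordId -> [msgId, ...], mirrors the update log

    def add_message(msgId, wordIds):
        # sequential append to the log, plus the in-memory table for instant lookup
        with open(LOG, "ab") as f:
            for w in wordIds:
                f.write(REC.pack(w, msgId))
                mem.setdefault(w, []).append(msgId)

    def lookup(wordId):
        hits = list(mem.get(wordId, []))      # recent postings, from the dict
        hits.extend(static_lookup(wordId))    # older postings, binary search on disk
        return hits

    def static_lookup(wordId):
        if not os.path.exists(STATIC):
            return []
        out = []
        with open(STATIC, "rb") as f:
            nrecs = os.path.getsize(STATIC) // REC.size
            lo, hi = 0, nrecs
            while lo < hi:                    # leftmost record with this wordId
                mid = (lo + hi) // 2
                f.seek(mid * REC.size)
                w, _ = REC.unpack(f.read(REC.size))
                if w < wordId:
                    lo = mid + 1
                else:
                    hi = mid
            f.seek(lo * REC.size)
            while lo < nrecs:                 # collect the consecutive matches
                w, m = REC.unpack(f.read(REC.size))
                if w != wordId:
                    break
                out.append(m)
                lo += 1
        return out

    def nightly_merge():
        # merge the update log into the static table, then start fresh
        pairs = []
        for name in (STATIC, LOG):
            if os.path.exists(name):
                data = open(name, "rb").read()
                for off in range(0, len(data), REC.size):
                    pairs.append(REC.unpack_from(data, off))
        pairs.sort()
        with open(STATIC, "wb") as f:
            for w, m in pairs:
                f.write(REC.pack(w, m))
        if os.path.exists(LOG):
            os.remove(LOG)
        mem.clear()

On restart you rebuild mem by reading the log back with the same
unpack loop used in nightly_merge.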
 

John Hunter

Stormbringer> Hi, I want to implement a fulltext search for
Stormbringer> messages in a forum. More exactly for each message I
Stormbringer> store pairs (wordId, msgId) for each identified word
Stormbringer> and when I search something I want to be able to
Stormbringer> retrieve very quickly all msgId for a given wordId.

Stormbringer> For starters I used PySqLite, using a table with
Stormbringer> index on wordId. It works fine but it consumes way
Stormbringer> too much space on the disk. For example for 19103
Stormbringer> pairs it takes 1686528 bytes, which amounts to
Stormbringer> around 88 bytes per pair of integers, way too much
Stormbringer> in my opinion (especially considering the range of
Stormbringer> those integers - one is in the range 1..100000 and
Stormbringer> the other in the range 1..500000).

What about using a binary file of unsigned ints which you load into a
python dictionary and do everything in memory? There would be no
extra overhead in the file and it would be very fast, if you are able
to hold the 100,000 ints in memory.
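
For example, assuming the file is just consecutive (wordId, msgId)
pairs of 4-byte unsigned ints (the file name and layout here are only
illustrative):

    import struct

    index = {}                            # wordId -> list of msgIds
    with open("pairs.bin", "rb") as f:    # illustrative file name
        data = f.read()
    for off in range(0, len(data), 8):
        wordId, msgId = struct.unpack("<II", data[off:off+8])
        index.setdefault(wordId, []).append(msgId)

    # a query is then just: index.get(someWordId, [])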

JDH
 

Jp Calderone

Hi,

Stormbringer said:
I want to implement a fulltext search for messages in a forum. More
exactly for each message I store pairs (wordId, msgId) for each
identified word and when I search something I want to be able to
retrieve very quickly all msgId for a given wordId.

A pure Python fulltext indexer - http://divmod.org/Lupy/index.html

Jp

 

Paul Rubin

John Hunter said:
Stormbringer> in my opinion (especially considering the range of
Stormbringer> those integers - one is in the range 1..100000 and
Stormbringer> the other in the range 1..500000).

What about using a binary file of unsigned ints which you load into a
python dictionary and do everything in memory? There would be no
extra overhead in the file and it would be very fast, if you are able
to hold the 100,000 ints in memory.

No it's much worse than that. The 100,000 ints are index numbers
for individual words. The 500,000 ints are articles and there can
be thousands of words in each article. So you may need to store
billions of ints, not just 100,000 of them.
 

Skip Montanaro

andrei> So my problem now is what should I use to store my data and
andrei> still be able to retrieve all messages for a given word very
andrei> quickly.

There's the rub. PySQLite consumes a lot of disk space presumably to
support efficient access to the data (hashing keys, etc). If you want
compact storage you aren't likely to do much better than

wordid1:msgid1
wordid2:msgid2
...

Unfortunately, searching that will probably be a bit slow.

If the word ids and msg ids have fixed maximum sizes you might try zero- or
space-padding them so your records are of fixed length. You can then store
the file sorted by wordid and do a binary search in the file looking for
records of interest.
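
For example (a minimal sketch; the 6-digit field widths and the exact
record layout are assumptions based on the ranges mentioned above):

    # Each record is "wwwwww:mmmmmm\n" (14 bytes), zero-padded, sorted by wordid.
    REC_LEN = 14

    def find_msgids(path, wordid):
        key = "%06d" % wordid
        out = []
        with open(path, "rb") as f:
            f.seek(0, 2)
            nrecs = f.tell() // REC_LEN
            lo, hi = 0, nrecs
            while lo < hi:               # leftmost record whose wordid >= key
                mid = (lo + hi) // 2
                f.seek(mid * REC_LEN)
                if f.read(6).decode() < key:
                    lo = mid + 1
                else:
                    hi = mid
            while lo < nrecs:            # collect consecutive matching records
                f.seek(lo * REC_LEN)
                rec = f.read(REC_LEN).decode()
                if rec[:6] != key:
                    break
                out.append(int(rec[7:13]))
                lo += 1
        return out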

Another possibility, assuming your database keys are integers, is to try the
bsddb recno format database. I've never used it before (not as much call
for a database file with integer keys), but it seems to be *relatively*
space-efficient. I believe it doesn't actually need to store the keys,
since they are implicit.

I built a recno db file containing the pickles of the squares of the
numbers in range(1, 100000), roughly like this (the file name is just
illustrative):

    import bsddb, pickle                      # Python 2's bsddb module
    db1 = bsddb.rnopen("squares.db", "c")
    for n in range(1, 100000):
        db1[n] = pickle.dumps(n*n)

The total string length of the pickled squares was 1.3MB. The file size was
2.1MB.

Skip
 

Stormbringer

Jp Calderone said:
A pure Python fulltext indexer - http://divmod.org/Lupy/index.html

Thanks ! This is exactly what I needed, and the size of the indexes is
around 30%, much much less than what I could have achieved with my
code. Not to mention the fact that I get phrase search and some other
goodies :)

The only thing that bothers me a little is the speed for building the
index, I tried with around 5000 messages and I am not quite thrilled,
it's not _extremely_ slow but it has to be faster for what I need.
Perhaps I'll use the C++ version with some Python bindings.

Cheers,
Andrei
 

Stormbringer

Paul Rubin said:
No it's much worse than that. The 100,000 ints are index numbers
for individual words. The 500,000 ints are articles and there can
be thousands of words in each article. So you may need to store
billions of ints, not just 100,000 of them.

Yes, I agree. The messages aren't very large (under 5K each, although
I've seen one or two of 500K) but there are lots of words.

Turns out there is a much better solution than what I could have
written, and it's named Lupy.

Andrei
 

Paul Rubin

Stormbringer said:
The only thing that bothers me a little is the speed for building the
index, I tried with around 5000 messages and I am not quite thrilled,
it's not _extremely_ slow but it has to be faster for what I need.
Perhaps I'll use the C++ version with some Python bindings.

Why not do some profiling first. Maybe it's limited by i/o traffic
rather than cpu cycles. I don't know how Lupy works but the one time
I messed with full text indexing, the bottleneck was definitely the
random disk accesses needed for every word of each update. The
solution is to batch the updates. Sorting is much less seek intensive
than random updates.
 

Jp Calderone

Stormbringer said:
Thanks ! This is exactly what I needed, and the size of the indexes is
around 30%, much much less than what I could have achieved with my
code. Not to mention the fact that I get phrase search and some other
goodies :)

The only thing that bothers me a little is the speed for building the
index, I tried with around 5000 messages and I am not quite thrilled,
it's not _extremely_ slow but it has to be faster for what I need.
Perhaps I'll use the C++ version with some Python bindings.

Yea, I hear that. Work is being done on speeding it up (pretty much the
only development on it now is optimization). I don't know how it will end
up, but things look promising so far. On the other hand, if you don't want
to wait for that to be finished...

Jp


 

Stormbringer

Paul Rubin said:
Why not do some profiling first. Maybe it's limited by i/o traffic
rather than cpu cycles. I don't know how Lupy works but the one time
I messed with full text indexing, the bottleneck was definitely the
random disk accesses needed for every word of each update. The
solution is to batch the updates. Sorting is much less seek intensive
than random updates.

Sounds like the system you messed with was poorly designed.
Even if it were limited by i/o traffic I am not sure how to do profiling
to find out if this is the case (although I do have doubts about this)
- any suggestions how?

Andrei
 

Stormbringer

Jp Calderone said:
Yea, I hear that. Work is being done on speeding it up (pretty much the
only development on it now is optimization). I don't know how it will end
up, but things look promising so far. On the other hand, if you don't want
to wait for that to be finished...

Well - that depends. If there is a faster version of Lupy when
I really need it in 1-2 months then I will use that. Otherwise, if I
can find a faster equivalent I will use that. Just being practical.

Andrei
 

Stormbringer

Skip Montanaro said:
andrei> So my problem now is what should I use to store my data and
andrei> still be able to retrieve all messages for a given word very
andrei> quickly.

There's the rub. PySQLite consumes a lot of disk space presumably to
support efficient access to the data (hashing keys, etc). If you want
compact storage you aren't likely to do much better than

wordid1:msgid1
wordid2:msgid2
...

Unfortunately, searching that will probably be a bit slow.

Well - I want to use this storage to speed up my searches so searching
will really need to be fast.

I didn't want the most compact storage, but 88 bytes to store a
couple of integers is just way too much.

Another reason I wanted to use a database like sqlite is that it
provides some safety guarantees. If something goes wrong when updating
(like no disk space left on the end-user's disk) then the data still
stays safe, i.e. the latest transaction is not committed and that's it.

I wrote a similar version of what I wanted a couple of years ago not
using a database, just C/C++ code, something like this (I think a
similar method was suggested in this thread): I was processing
messages in batches, so for each 10000 messages or so I would make in
memory a list of words and in which messages they occur, and for each
word I would write all messages that contain it as consecutive entries
in a file. Then I would update the global word index (kept in another
file), which for each word kept a linked list of zones in the other
file where msgIds containing this word were.

Worked pretty well, but this was for a server I controlled, i.e. at
all times I was sure I had enough space and I was controlling it. For
an application to be deployed to end-users I need more safety. That is
why I am playing with sql.

If the word ids and msg ids have fixed maximum sizes you might try zero- or
space- padding them so your records are of fixed length. You can then store
the file sorted by wordid and do a binary search in the file looking for records
of interest.

That would still be too slow for a search, and search is the main
reason for this structure.

Another possibility, assuming your database keys are integers, is to try the
bsddb recno format database. I've never used it before (not as much call
for a database file with integer keys), but it seems to be *relatively*
space-efficient. I believe it doesn't actually need to store the keys,
since they are implicit.

I built a recno db file containing the pickle of the squares of the numbers
in range(1, 100000):
... db1[n] = pickle.dumps(n*n)
...

The total string length of the pickled squares was 1.3MB. The file size was
2.1MB.

Interesting.
Fortunately I found Lupy, which suits me, but if I hadn't I might
have been tempted to experiment with that.

Andrei
 

Paul Rubin

Stormbringer said:
Sounds like the system you messed with was poorly designed.

The system I messed with wasn't designed for high volumes of updates.

Even if it were limited by i/o traffic I am not sure how to do profiling
to find out if this is the case (although I do have doubts about this)
- any suggestions how?

A crude but maybe useful thing you can do is just observe the CPU load
while the program is running. If the CPU load is low but you hear the
disk banging around, you're i/o bound. As for profiling, there's a
Python profiler documented in the Python library manual, that's the
first thing to try.
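
For example, something along these lines, where build_index() is just
a placeholder for whatever function drives your indexing run:

    import profile, pstats

    profile.run("build_index()", "index.prof")   # build_index is a placeholder
    pstats.Stats("index.prof").sort_stats("cumulative").print_stats(20)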

There's a good article series on full text indexing at tbray.com which
might interest you, by the way. It was linked from Slashdot a couple
days ago.
 

Fred Pacquier

Paul Rubin said:
A crude but maybe useful thing you can do is just observe the CPU load
while the program is running. If the CPU load is low but you hear the
disk banging around, you're i/o bound.

I love this one. It would make a great intro quote for the next DrDobbs
weekly summary ! ;-)
 

Stormbringer

Fred Pacquier said:
I love this one. It would make a great intro quote for the next DrDobbs
weekly summary ! ;-)

I meant I didn't know how to do profiling in this case, to find out if
the program was cpu or disk bound, although if I had sat down and
thought about it for 5 seconds it would have been obvious.

Profiling in general is another kettle of fish, and I consider myself
able to spot the places where my programs consume cpu cycles.

Cheers,
Andrei
 

Karol Zadora

Paul Rubin said:
The system I messed with wasn't designed for high volumes of updates.


A crude but maybe useful thing you can do is just observe the CPU load
while the program is running. If the CPU load is low but you hear the
disk banging around, you're i/o bound. As for profiling, there's a
Python profiler documented in the Python library manual, that's the
first thing to try.

There's a good article series on full text indexing at tbray.com which
might interest you, by the way. It was linked from Slashdot a couple
days ago.

On Windows, to determine if you are I/O bound, you might try to watch
these two counters:
- Physical Disk: Avg. Disk Queue Length (should be no more than three
times the number of physical disks (spindles) used by your program)
- Physical Disk: Avg. Disk sec/Transfer (should be less than 15-18 ms)

HTH, Karol
 

Samuel Walters

Jp Calderone said:
Yea, I hear that. Work is being done on speeding it up (pretty much the
only development on it now is optimization). I don't know how it will end
up, but things look promising so far. On the other hand, if you don't
want to wait for that to be finished...

Ye olde time mantra:
Make it work
Make it right
Make it fast
 

Samuel Walters

Stormbringer said:
I wrote a similar version of what I wanted a couple of years ago not
using a database, just C/C++ code, something like this (I think a
similar method was suggested in this thread): I was processing messages
in batches, so for each 10000 messages or so I would make in memory a
list of words and in which messages they occur, and for each word I would
write all messages that contain it as consecutive entries in a file.
Then I would update the global word index (kept in another file), which
for each word kept a linked list of zones in the other file where msgIds
containing this word were.

Worked pretty well, but this was for a server I controlled, i.e. at all
times I was sure I had enough space and I was controlling it. For an
application to be deployed to end-users I need more safety. That is why
I am playing with sql.

For future reference:
If you're like me, and have a lot of c/c++ code that you've built and are
happy with, pyrex ( http://www.cosc.canterbury.ac.nz/~greg/python/Pyrex/ )
can come in handy for reusing that code within python. Pyrex is easier to
write and maintain than the direct c-to-python API, and with a little bit
of practice, you can learn how to easily marshal data back and forth
between the two languages. Essentially, it takes all the grunt-work out
of writing making c/c++ calls from python. It's especially tasty when you
need to make use of some shared library that python has no interface to.

Sam Walters
 
