Databases and Python


Dan Stromberg

I've been putting a little bit of time into a file indexing engine in
python, which you can find here:
http://dcs.nac.uci.edu/~strombrg/pyindex.html

It'll do 40,000 mail messages of varying lengths pretty well now, but I
want more :)

So far, I've been taking the approach of using a single-table database
like gdbm or dbhash (actually a small number of them, to map filenames to
numbers, numbers to filenames, words to numbers, numbers to words,
and numbered words to numbered filenames), and making each entry keyed by
a word, and under the word in the database is a null terminated list of
filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
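
Roughly, that layout amounts to the sketch below (the helper names and the null-byte separator are illustrative only; pyindex's real code, its base255 encoding and its caching layer are not shown):

import anydbm

def add_file_to_word(db, word, file_id):
    """Append one file id to the string stored under a word.

    This is the read-modify-write step that gets slower as the stored
    string grows: O(n) per append, so O(n**2) per word overall.
    """
    if db.has_key(word):
        db[word] = db[word] + '\0' + file_id
    else:
        db[word] = file_id

def files_for_word(db, word):
    """Split the stored string back into a list of file ids."""
    if not db.has_key(word):
        return []
    return db[word].split('\0')

db = anydbm.open('words.db', 'c')
add_file_to_word(db, 'australopithecine', '17')
add_file_to_word(db, 'australopithecine', '42')
print files_for_word(db, 'australopithecine')   # ['17', '42']
db.close()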

However, despite using the http://dcs.nac.uci.edu/~strombrg/cachedb.html
module to make database access faster, bringing in psyco, and studying
various python optimization pages, the program just isn't performing like
I'd like it to.

And I think it's because despite the caching and minimal representation
conversion, it's still just too slow converting linear lists to arrays
back and forth and back and forth.

So this leads me to wonder - is there a python database interface that
would allow me to define a -lot- of tables? Like, each word becomes a
table, and then the fields in that table are just the filenames that
contained that word. That way adding filenames to a word shouldn't bog
down much at all.

-But-, are there any database interfaces for python that aren't going to
get a bit upset if you try to give them hundreds of thousands of tables?

Thanks!
 

Jonathan Gardner

I'm no expert in BDBs, but I have spent a fair amount of time working
with PostgreSQL and Oracle. It sounds like you need to put some
optimization into your algorithm and data representation.

I would do pretty much like you are doing, except I would only have the
following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID

You're going to want an index on pretty much every column in this
database. That's because you're going to lookup by any one of these
columns for the corresponding value.

I said I wasn't an expert in BDBs. But I do have some experience
building up large databases. In the first stage, you just accumulate
the data. Then you build the indexes only as you need them. Let's say
you are scanning your files. You won't need an index on the
filename-to-ID table. That's because you are just putting data in
there. The word-to-ID table needs an index on the word, but not ID
(you're not looking up by ID yet.) And the word ID-to-filename ID table
doesn't need any indexes yet either. So build up the data without the
indexes. Once your scan is complete, then build up the indexes you'll
need for regular operation. You can probably incrementally add data as
you go.

As far as filename ID and word IDs go, just use a counter to generate
the next number. If you use base255 as the number, you're really not
going to save much space.
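
To make those three relations and the counter-generated IDs concrete, here is a minimal in-memory sketch (plain dicts; the names are made up, and a real run would back these with on-disk tables):

word_ids = {}        # word -> word ID
file_ids = {}        # filename -> filename ID
postings = set()     # (word ID, filename ID) pairs

def intern_id(table, key):
    """Return the ID for key, assigning the next integer if it's new."""
    if key not in table:
        table[key] = len(table) + 1
    return table[key]

def index_file(filename, words):
    fid = intern_id(file_ids, filename)
    for word in words:
        postings.add((intern_id(word_ids, word), fid))

index_file('junkfile.txt', 'this is the beast the beast so sly'.split())
print len(word_ids), len(file_ids), len(postings)   # 6 1 6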

And your idea of hundreds of thousands of tables? Very bad. Don't do
it.
 

Rene Pijlman

Dan Stromberg:
is there a python database interface that would allow me to define a
-lot- of tables? Like, each word becomes a table, and then the fields
in that table are just the filenames that contained that word.

Give ZODB a try.
http://www.zope.org/Wikis/ZODB/FrontPage
http://www.python.org/workshops/2000-01/proceedings/papers/fulton/zodb3.html

My first attempt would be: a BTree with the word as key, and a 'list of
filenames' as value.
http://www.zope.org/Wikis/ZODB/FrontPage/guide/node6.html#SECTION000630000000000000000
 

bruno at modulix

Jonathan said:
I'm no expert in BDBs, but I have spent a fair amount of time working
with PostgreSQL and Oracle. It sounds like you need to put some
optimization into your algorithm and data representation.

I would do pretty much like you are doing, except I would only have the
following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID

You're going to want an index on pretty much every column in this
database.

Stop!

I'm not a DB expert either, but putting indexes everywhere is a
well-known DB antipattern. An index is only useful if the indexed field
is discriminant enough (i.e. as few records as possible should share the
same value for that field). Otherwise, the indexed lookup may end up
taking far more time than a simple linear lookup. Also, indexes slow
down write operations.

That's because you're going to lookup by any one of these
columns for the corresponding value.

I said I wasn't an expert in BDBs. But I do have some experience
building up large databases. In the first stage, you just accumulate
the data. Then you build the indexes only as you need them.

Yes. And only where it makes sense.

(snip)
And your idea of hundreds of thousands of tables? Very bad. Don't do
it.

+100 on this
 

Bryan Olson

Dan said:
I've been putting a little bit of time into a file indexing engine [...]

So far, I've been taking the approach of using a single-table database
like gdbm or dbhash [...] and making each entry keyed by
a word, and under the word in the database is a null terminated list of
filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
[...]
the program just isn't performing like I'd like it to.

And I think it's because despite the caching and minimal representation
conversion, it's still just too slow converting linear lists to arrays
back and forth and back and forth.

I think I follow. The straightforward method of building the
list of files associated with a word is order n**2 time. On each
new entry, you look up the entire string, append one file-id to
it, and write the new version back.

So this leads me to wonder - is there a python database interface that
would allow me to define a -lot- of tables? Like, each word becomes a
table, and then the fields in that table are just the filenames that
contained that word. That way adding filenames to a word shouldn't bog
down much at all.

Well, you could use simple files instead of fancy database tables.

Below is a demo of an alternate technique that uses bsddb B-Trees,
and puts both the word and the file-id in the key. I don't know
how efficient it is for real data, but at least the time won't grow
as Theta(n**2).

--Bryan



import bsddb
import urllib


def add_words_from_file(index, fname, word_iterator):
    """ Pass the open-for-write bsddb B-Tree, a filename, and a list
        (or any iterable) of the words in the file.
    """
    fname = urllib.quote_plus(fname)
    s = set()
    for word in word_iterator:
        if word not in s:
            s.add(word)
            key = '%s %s' % (urllib.quote_plus(word), fname)
            index[key] = ''
    index.sync()


def lookup(index, word):
    """ Pass the B-Tree (as built with add_words_from_file) and a
        word to look up. Returns list of files containing the word.
    """
    word = urllib.quote_plus(word)
    fname_list = []
    try:
        current = index.set_location(word)
        while True:
            (key, _) = current
            (w, fname) = key.split()
            if w != word:
                break
            fname_list.append(urllib.unquote_plus(fname))
            current = index.next()
    except KeyError:
        pass
    return fname_list


def test():
    index = bsddb.btopen('junktest.bdb', 'n')
    data = [
        ('bryfile.txt', 'nor heed the rumble of a distant drum'),
        ('junkfile.txt', 'this is the beast, the beast so sly'),
        ('word file.txt', 'is this the way it always is here in Baltimore')
    ]
    for (fname, text) in data:
        words = text.split()
        add_words_from_file(index, fname, words)

    for word in ['is', 'the', 'heed', 'this', 'way']:
        print '"%s" is in files: %s' % (word, lookup(index, word))

test()
 

Dan Stromberg

Dan said:
I've been putting a little bit of time into a file indexing engine [...]

So far, I've been taking the approach of using a single-table database
like gdbm or dbhash [...] and making each entry keyed by
a word, and under the word in the database is a null terminated list of
filenames (in http://dcs.nac.uci.edu/~strombrg/base255.html representation).
[...]
the program just isn't performing like I'd like it to.

And I think it's because despite the caching and minimal representation
conversion, it's still just too slow converting linear lists to arrays
back and forth and back and forth.

I think I follow. The straightforward method of building the
list of files associated with a word is order n**2 time. On each
new entry, you look up the entire string, append one file-id to
it, and write the new version back.

Yes, essentially, with the twist that the most recently used words are
kept cached in memory, so they don't have to be converted from string to
list and back to string every time a filename is added to a word.

Well, you could use simple files instead of fancy database tables.

That's an interesting thought. Perhaps especially if australopithecine
were saved in a filename like:

~/indices/au/st/ra/lo/pi/th/ec/in/e
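
A throwaway sketch of that directory layout, two characters per level (purely illustrative, not part of pyindex):

import os

def word_to_path(base, word):
    # 'australopithecine' -> base/au/st/ra/lo/pi/th/ec/in/e
    parts = [word[i:i + 2] for i in range(0, len(word), 2)]
    return os.path.join(base, *parts)

print word_to_path(os.path.expanduser('~/indices'), 'australopithecine')
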
Below is a demo of an alternate technique that uses bsddb B-Trees,
and puts both the word and the file-id in the key. I don't know
how efficient it is for real data, but at least the time won't grow
as Theta(n**2).

Perhaps I'm missing something, but is it not roughly O(1) for
individual insertions, but O(n*m) (n == number of files, m == number of
words) for lookups?
 

Dan Stromberg


Rene Pijlman:
Give ZODB a try.

Aside from object persistence, what would this add for me?

Is each object saved in a separate file, or are they bunched together into
one or more files?

My first attempt would be: a BTree with the word as key, and a 'list of
filenames' as value.
http://www.zope.org/Wikis/ZODB/FrontPage/guide/node6.html#SECTION000630000000000000000

This is basically what I'm doing now, except the most recently used "lists
of filenames" are kept in RAM to avoid going from "on disk string" to "in
memory list" to "on disk string" quite as much.
 

Dan Stromberg

Jonathan Gardner:
I'm no expert in BDBs, but I have spent a fair amount of time working
with PostgreSQL and Oracle. It sounds like you need to put some
optimization into your algorithm and data representation.

I would do pretty much like you are doing, except I would only have the
following relations:

- word to word ID
- filename to filename ID
- word ID to filename ID

I think I could ditch the "word ID to word" table, but I think I'll
probably eventually need the "filename ID to filename" table, so when I
pull a list of filename ID's by word I can get back the filenames.

You're going to want an index on pretty much every column in this
database. That's because you're going to lookup by any one of these
columns for the corresponding value.

I do need some pretty heavy indexing.

In fact, so far everything seems to be clicking along pretty well, except
the part that I'm having trouble indexing - the per-word list of
filenames. And that's because I already have one relation (word
number -> filenames' numbers) that's complicating getting another relation
for the list of filenames (just filename ID to '', but it'd be soooo much
faster if I could nest the relations somehow).

I said I wasn't an expert in BDBs. But I do have some experience
building up large databases. In the first stage, you just accumulate the
data. Then you build the indexes only as you need them. Let's say you
are scanning your files. You won't need an index on the filename-to-ID
table. That's because you are just putting data in there. The word-to-ID
table needs an index on the word, but not ID (you're not looking up by
ID yet.) And the word ID-to-filename ID table doesn't need any indexes
yet either. So build up the data without the indexes. Once your scan is
complete, then build up the indexes you'll need for regular operation.
You can probably incrementally add data as you go.

Interesting thought! I think I need to ponder this a bit more.

I think the filename <-> filename number tables are pretty inexpensive,
and the word <-> word number tables seem pretty inexpensive too. I think
it's primarily the word number to filename numbers table that's messing
up the performance really bad, and that's probably because the list of
filename numbers is growing so large in some cases, and adding a filename
to a word tends to be O(n), n==the number of filenames already on the word.

May I ask: In what representation would you keep this word number to
filename numbers relation stashed away until later, to build your indexes
from more rapidly in a later pass? Are you thinking to keep them all in
memory, or store them in a flat file somehow?

As far as filename ID and word IDs go, just use a counter to generate
the next number. If you use base255 as the number, you're really not
going to save much space.

That's basically what I'm doing - incrementing a counter for each word,
except I'm converting that counter to base255. The base255 representation
of that counter brings the following benefits:

1) Converting a set of numbers to a string, without losing number
boundaries, becomes just mapping them to base255, and
string.joinfields'ing them.

2) Converting a string to a set of numbers, again without losing number
boundaries, is just string.splitfields'ing them, and mapping them from
base255 back to a list or set of numbers.

3) The numbers can be arbitrarily large, unlike with, say, a 4 or 8 byte
fixed-length record representation.
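
Points 1 and 2 above amount to something like the sketch below. This is only a guess at the scheme the base255 page describes, using bytes 1-255 as digits so that the NUL byte stays free as the separator; the real module may differ:

def to_base255(n):
    """Encode a non-negative integer using bytes 1..255 as digits."""
    digits = []
    while True:
        digits.append(chr(n % 255 + 1))
        n //= 255
        if n == 0:
            break
    digits.reverse()
    return ''.join(digits)

def from_base255(s):
    n = 0
    for ch in s:
        n = n * 255 + (ord(ch) - 1)
    return n

def pack(numbers):
    """Join encoded numbers with NUL bytes, preserving number boundaries."""
    return '\0'.join([to_base255(n) for n in numbers])

def unpack(s):
    return [from_base255(part) for part in s.split('\0')]

print unpack(pack([0, 1, 254, 255, 70000]))   # [0, 1, 254, 255, 70000]
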
And your idea of hundreds of thousands of tables? Very bad. Don't do it.

Sigh. I was afraid of that. :-S

I'm thinking though... what if I were to use the current representation
for words that don't appear in that many files, and a table for each word
that has >= 1000 (or whatever threshold) filenames associated with it...?

That should greatly cut down on the number of tables, while still giving
the more efficient representation for the words that need it.

Or not?
 

Jonathan Gardner

About the filename ID - word ID table: Any good database (good with
large amounts of data) will handle the memory management for you. If
you get enough data, it may make sense to bother with PostgreSQL.
That has a pretty good record on handling very large sets of data, and
intermediate sets as well. Again, I can't speak about BDBs, but
something in the back of my head is saying that the entire table is
loaded into memory. If so, that's not good for large sets of data.

About base255: You can think of strings as numbers. At least, that's
how your computer sees them. Converting a number to base255 is really
silly. If it's a series of bytes (like a Python string is) then base255
is a string. You want to use an incremented number for the ID because
there are some indexes that deal with this kind of data very, very
well. If your numbers are spread out like a shotgun blast, with clusters
here and there, the index might not perform very well.

About using many tables: The best answer you are going to get is a
bunch of files---one for each word---with the names or IDs of the
files. You can store these in a single directory, or (as you'll find
out to be more efficient) a tree of directories. But this kind of data
is useless for reverse lookups. If you really do have so much data,
start using a real database that is designed to handle this kind of
data efficiently. The overhead of SQL parsing and network latency soon
gets dwarfed by lookup times anyway.
 

Jonathan Gardner

About indexes everywhere: Yes, you don't have to be a DB expert to know
that indexes everywhere is bad. But look at this example. There are
really two ways that the data is going to get accessed in regular use.
Either they are going to ask for all files that have a word (most
likely) or they are going to see which words are in a file.

I'm going to have to name the tables now, aren't I? Here's a simple
schema:

words
--------
word_id
word

files
------
file_id
filename

word_files
--------------
file_id
word_id

If you are going to lookup by word, you'll need an index on words.word.
You'll also need an index on word_files.word_id. And then you'll need
an index on files.file_id.

If you are going to lookup by file, you'll need an index on
files.filename, word_files.file_id, and words.word_id.

So it ends up in this situation you need indexes everywhere.

Now, when you are doing the initial population, you should drop all the
indexes you don't need during population. That means everything but
words.word has to go. (You'll need to find the word_id for previously
seen words.) After the initial population, then is the time to build
and add the indexes. It's much faster to build an index when you have
the entire set of data in front of you than to do it piece-by-piece.
Some indexes actually end up better built than they would have piecemeal.
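
For what it's worth, the populate-first, index-later flow looks something like this with the schema above (sketched with Python's sqlite3 module purely for illustration; Jonathan is talking about PostgreSQL/Berkeley DB, and only the column names come from his schema):

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE words (word_id INTEGER PRIMARY KEY, word TEXT);
    CREATE TABLE files (file_id INTEGER PRIMARY KEY, filename TEXT);
    CREATE TABLE word_files (file_id INTEGER, word_id INTEGER);
    -- Only the index needed during the scan: looking up word IDs by word.
    CREATE INDEX words_word_idx ON words (word);
""")

# ... populate words, files and word_files while scanning ...

# After the scan, build the indexes needed for regular operation.
conn.executescript("""
    CREATE INDEX files_filename_idx ON files (filename);
    CREATE INDEX word_files_word_idx ON word_files (word_id);
    CREATE INDEX word_files_file_idx ON word_files (file_id);
""")
conn.commit()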

Unfortunately this is no longer strictly topical to Python. But if you
free your mind from thinking in terms of SQL databases and look at
indexes as dicts or whatnot, then you can see that this is really a
general programming problem.
 

Bryan Olson

Dan said:
Bryan Olson wrote: [...]
Well, you could use simple files instead of fancy database tables.

That's an interesting thought. Perhaps especially if australopithecine
were saved in a filename like:

~/indices/au/st/ra/lo/pi/th/ec/in/e

Right, though the better filesystems can efficiently support large
numbers of files per directory.

Perhaps I'm missing something, but is it not roughly O(1) for
individual insertions, but O(n*m) (n == number of files, m == number of
words) for lookups?

Insertion into a B-Tree with n elements is Theta(lg(n)). Finding the
files associated with a word, call the number of them 'k', should
be Theta(lg(n) + k). That assumes words and files are roughly
constant size.
 

Rene Pijlman

Dan Stromberg:
This is basically what I'm doing now,

Right. My second attempt would be: a BTree with the word as key, and a
BTree of filenames as value (or some other persistent container type
that's not (un)serialized as a whole when only one element changes).
 

Dan Stromberg

Rene Pijlman:

Right. My second attempt would be: a BTree with the word as key, and a
BTree of filenames as value (or some other persistent container type
that's not (un)serialized as a whole when only one element changes).

This is a fantastic mapping from the problem domain to the solution
domain, but... I don't know how to do a btree of btrees in python... Nor
any other database of databases.

Would ZODB let me do that?

I'm puzzled, because:

strombrg@blee:~$ python
Python 2.4.2 (#2, Sep 30 2005, 21:19:01)
[GCC 4.0.2 20050808 (prerelease) (Ubuntu 4.0.1-4ubuntu8)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> d1={}
>>> d={}
>>> d[d1] = ''
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: dict objects are unhashable

Would I have to come up with some sort of hashable dict first?
 

Rene Pijlman

Dan Stromberg:
Would ZODB let me do that?

Yes.

I'm puzzled, because:
d1={}
d={}
d[d1] = ''
TypeError: dict objects are unhashable

This is using a dict as _key_, whereas I suggested to use a BTree as
_value_.

I don't know how to do a btree of btrees in python...

Read this first:
http://www.zope.org/Wikis/ZODB/guide/node6.html#SECTION000630000000000000000

On second thought, a BTree of TreeSet's may be a better idea.

The key of your outer BTree is a word. It's value is a TreeSet. The key of
that TreeSet is a filename. It has no values.

This provides you with a persistent scalable mapping:

word -> set of filenames

I estimate it takes 2 hours to learn ZODB, and 20 minutes to program a
benchmark. The hours are reusable :)
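
A minimal sketch of that word -> set-of-filenames mapping in ZODB (module and class names as in current ZODB/BTrees releases; untested here, so treat it as a starting point rather than the benchmark itself):

import transaction
from ZODB.FileStorage import FileStorage
from ZODB.DB import DB
from BTrees.OOBTree import OOBTree, OOTreeSet

db = DB(FileStorage('index.fs'))
conn = db.open()
root = conn.root()

if 'index' not in root:
    root['index'] = OOBTree()      # word -> TreeSet of filenames
index = root['index']

def add_word(index, word, filename):
    if word not in index:
        index[word] = OOTreeSet()
    index[word].insert(filename)   # only the changed part is re-serialized

add_word(index, 'heed', 'bryfile.txt')
add_word(index, 'heed', 'junkfile.txt')
transaction.commit()

print list(index['heed'])          # ['bryfile.txt', 'junkfile.txt']

conn.close()
db.close()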
 

Bryan Olson

Dan said:
I've been putting a little bit of time into a file indexing engine
[...]

To solve the O.P.'s first problem, the facility we need is an
efficient externally-stored multimap. A multimap is like a map,
except that each key is associated with a collection of values,
not just a single value. Obviously we could simply encode
multiple values into a single string -- and that's what the
O.P. did -- but updating large strings is inefficient.

Fortunately, the standard Python distribution now includes an
efficient multimap facility, though the standard library doc
does not yet say so. The bsddb module is, in the current
version, built on bsddb3, which exposes far more features of
the Berkeley DB library than the bsddb module.

http://pybsddb.sourceforge.net/bsddb3.html


Sleepycat Software's Berkeley DB library supports an option
of mapping keys to multiple values:

http://sleepycat.com/docs/ref/am_conf/dup.html


Below is a simple example.
--Bryan



import bsddb

def add_words_from_file(index, fname, word_iterator):
    """ Pass the open-for-write bsddb database, a filename, and a list
        (or any iterable) of the words in the file.
    """
    s = set()
    for word in word_iterator:
        if word not in s:
            s.add(word)
            index.put(word, fname)
    index.sync()


def lookup(index, word):
    """ Pass the index (as built with add_words_from_file) and a
        word to look up. Returns list of files containing the word.
    """
    l = []
    cursor = index.cursor()
    item = cursor.set(word)
    while item is not None:
        l.append(item[1])
        item = cursor.next_dup()
    cursor.close()
    return l


def test():
    env = bsddb.db.DBEnv()
    env.open('.', bsddb.db.DB_CREATE | bsddb.db.DB_INIT_MPOOL)
    db = bsddb.db.DB(env)
    db.set_flags(bsddb.db.DB_DUP)
    db.open(
        'junktest.bdb',
        None,
        bsddb.db.DB_HASH,
        bsddb.db.DB_CREATE | bsddb.db.DB_TRUNCATE)

    data = [
        ('bryfile.txt', 'nor heed the rumble of a distant drum'),
        ('junkfile.txt', 'this is the beast, the beast so sly'),
        ('word file.txt', 'is this the way it always is here in Baltimore')
    ]
    for (fname, text) in data:
        words = text.split()
        add_words_from_file(db, fname, words)

    for word in ['is', 'the', 'heed', 'this', 'way']:
        print '"%s" is in files: %s' % (word, lookup(db, word))


test()
 
