Berkeley DB: how to iterate over a large number of keys "quickly"


lazy

I have a Berkeley DB and I'm using the bsddb module to access it. The DB
is quite huge (anywhere from 2-30 GB). I want to iterate over the keys
serially.
I tried something basic like

for key in db.keys()

but this takes a lot of time. I guess Python is trying to fetch the list
of all keys first and probably keeps it in memory. Is there a way to
avoid this, since I just want to access the keys serially? I mean, is there
a way I can tell Python not to load all keys, but to fetch them as
the loop progresses (like in a linked list)? I couldn't find any accessor
methods on bsddb to do this with my initial search.
I am guessing a B-tree might be a good choice here, but since the
DBs were written with hashopen, I'm not able to use
btopen when I want to iterate over the db.
 

lazy

Sorry, just a small correction: where I wrote "I could find any accessor
methods on bsddb", I meant "I couldn't find any accessor methods on bsddb
to do this (i.e. accessing the keys as in a linked list) with my initial
search".
 

marduk

lazy said:
I want to iterate over the keys serially. [...] Is there a way I can tell
Python not to load all keys, but to access them as the loop progresses
(like in a linked list)?

try instead:

key = db.firstkey()
while key is not None:
    # do something with db[key]
    key = db.nextkey(key)
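The firstkey()/nextkey() loop can also be wrapped in a generator so callers get ordinary for-loop syntax while still holding only one key in memory at a time. Since bsddb is a Python-2-only module, the sketch below uses a minimal stand-in class (FakeDB, an illustrative assumption, not part of bsddb) with the same two methods; on a real handle you would pass the object returned by bsddb.hashopen instead. Note a hash database enumerates keys in an arbitrary but complete order; the stand-in sorts only so the example is deterministic.

```python
import bisect

def iter_keys(db):
    """Yield keys one at a time via the firstkey()/nextkey() cursor API,
    without ever building the full key list in memory."""
    key = db.firstkey()
    while key is not None:
        yield key
        key = db.nextkey(key)

class FakeDB:
    """Illustrative stand-in exposing the same two methods as a bsddb handle."""
    def __init__(self, keys):
        self._keys = sorted(keys)  # real hash DBs are unordered; sorted for determinism

    def firstkey(self):
        return self._keys[0] if self._keys else None

    def nextkey(self, key):
        # Return the key that follows `key`, or None when exhausted.
        i = bisect.bisect_right(self._keys, key)
        return self._keys[i] if i < len(self._keys) else None

db = FakeDB([b'banana', b'apple', b'cherry'])
print(list(iter_keys(db)))  # [b'apple', b'banana', b'cherry']
```

The generator never asks the database for more than the current key, which is exactly the linked-list-style access the original question asks for.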
 

Christoph Haas

lazy said:
I tried "for key in db.keys()" but this takes a lot of time. [...] Is
there a way to avoid this, since I just want to access the keys serially?

Does db.iterkeys() work better?

Christoph
 

Ian Clark

lazy said:
I want to iterate over the keys serially. [...] Is there a way I can tell
Python not to load all keys, but to access them as the loop progresses
(like in a linked list)?

db.iterkeys()

Looking at the doc for bsddb objects[1] it mentions that "Once
instantiated, hash, btree and record objects support the same methods as
dictionaries." Then looking at the dict documentation[2] you'll find the
dict.iterkeys() method that should do what you're asking.

Ian

[1] http://docs.python.org/lib/bsddb-objects.html
[2] http://docs.python.org/lib/typesmapping.html
 

lazy

Ian Clark said:
Looking at the doc for bsddb objects[1] it mentions that "Once
instantiated, hash, btree and record objects support the same methods as
dictionaries." Then looking at the dict documentation[2] you'll find the
dict.iterkeys() method that should do what you're asking.

[1] http://docs.python.org/lib/bsddb-objects.html
[2] http://docs.python.org/lib/typesmapping.html


Thanks. I tried using db.first and then db.next for the subsequent keys;
it seems to be faster. Thanks for the pointers.
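For reference, first()/next() behave differently from firstkey()/nextkey(): they return (key, value) pairs, and they signal exhaustion by raising an exception rather than returning None (in real bsddb that is DBNotFoundError, which is a KeyError subclass). A hedged sketch of that loop shape, again using an illustrative stand-in class (FakeDB, not part of bsddb) since the module is Python-2-only:

```python
def iter_items(db):
    """Yield (key, value) pairs via the first()/next() cursor API."""
    try:
        item = db.first()
        while True:
            yield item
            item = db.next()
    except KeyError:  # real bsddb raises DBNotFoundError (a KeyError subclass) at the end
        return

class FakeDB:
    """Illustrative stand-in with first()/next() semantics."""
    def __init__(self, items):
        self._items = sorted(items.items())
        self._pos = 0

    def first(self):
        self._pos = 0
        if not self._items:
            raise KeyError('empty database')
        return self._items[0]

    def next(self):
        self._pos += 1
        if self._pos >= len(self._items):
            raise KeyError('end of database')
        return self._items[self._pos]

db = FakeDB({b'a': b'1', b'b': b'2'})
print(list(iter_items(db)))  # [(b'a', b'1'), (b'b', b'2')]
```

Because each step also fetches the value, this avoids the separate db[key] lookup that the firstkey()/nextkey() loop needs, which may explain the speedup reported above.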
 
