bsddb3 database file: what are the __db.001, __db.002, __db.003 files for?

Claudio Grondi

I have just started to play around with the bsddb3 module, which
interfaces to the Berkeley DB.

Besides the intended database file
databaseFile.bdb
I also see in the same directory the files
__db.001
__db.002
__db.003
where
__db.003 is ten times as large as databaseFile.bdb
and
__db.001 has the same size as databaseFile.bdb.

What are these files for, and why do they occur?

If I delete them, access to the database in databaseFile.bdb
still works as expected.

Any hints toward enlightenment?

Is there any _good_ documentation of the bsddb3 module around, besides
what is provided with the module itself, where it is not necessary e.g.
to guess that the C integer value of zero (0) is represented in Python
by the value None returned on success by db.open()?
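For example, a minimal sketch of what I mean (the file name, database
type and flags here are only illustrative):

    from bsddb3 import db

    d = db.DB()
    # DB.open() maps the C API's zero return code to None on success;
    # a failure raises a DBError exception instead of returning a code.
    result = d.open('databaseFile.bdb', None, db.DB_BTREE, db.DB_CREATE)
    print(result)   # prints: None
    d.close()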

Claudio
 
Klaas

Claudio said:
Besides the intended database file
databaseFile.bdb
I also see in the same directory the files
__db.001
__db.002
__db.003
where
__db.003 is ten times as large as databaseFile.bdb
and
__db.001 has the same size as databaseFile.bdb.

I can't tell you exactly what each is, but they are the files that the
shared environment (DBEnv) uses to coordinate multi-process access to
the database. In particular, the big one is likely the mmap'd cache
(which defaults to 5 MB, I believe).

You can safely delete them, but probably shouldn't while your program
is executing.
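Roughly speaking, the region files belong to the environment, not to the
.bdb file itself. A minimal sketch, assuming an explicit DBEnv (the
directory name, flags and the 5 MB cache below are just an illustration):

    import os
    from bsddb3 import db

    os.makedirs('dbdir', exist_ok=True)

    env = db.DBEnv()
    # the shared regions behind __db.001, __db.002, ... are sized mainly
    # by the cache; 5 MB here is purely an example value
    env.set_cachesize(0, 5 * 1024 * 1024)
    env.open('dbdir', db.DB_CREATE | db.DB_INIT_MPOOL | db.DB_INIT_LOCK)

    d = db.DB(env)
    d.open('databaseFile.bdb', None, db.DB_BTREE, db.DB_CREATE)
    d.close()
    env.close()
    # a fresh handle's remove() discards the __db.00* region files once
    # no other process is using the environment
    db.DBEnv().remove('dbdir')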
Is there any _good_ documentation of the bsddb3 module around, besides
what is provided with the module itself, where it is not necessary e.g.
to guess that the C integer value of zero (0) is represented in Python
by the value None returned on success by db.open()?

This is the only documentation available, AFAIK:
http://pybsddb.sourceforge.net/bsddb3.html

For most of the important stuff it is necessary to dig into the bdb
docs themselves.

-Mike
 
Claudio Grondi

Klaas said:
I can't tell you exactly what each is, but they are the files that the
shared environment (DBEnv) uses to coordinate multi-process access to
the database. In particular, the big one is likely the mmap'd cache
(which defaults to 5 MB, I believe).

You can safely delete them, but probably shouldn't while your program
is executing.




This is the only documentation available, AFAIK:
http://pybsddb.sourceforge.net/bsddb3.html

For most of the important stuff it is necessary to dig into the bdb
docs themselves.
Thank you for the reply.

Probably to avoid admitting that the documentation is weak, a positive
way of stating this was found in the phrase:

"Berkeley DB was designed by programmers, for programmers."

So I have to try to get an excavator ;-) to speed up digging through the
docs and maybe even the source, right?

Are there any useful, simple examples of applications using Berkeley DB
available online somewhere that I could learn from?

I am especially interested in the multimap feature activated with
db.set_flags(bsddb3.db.DB_DUPSORT); a sketch of that usage follows below.
My fear is that, as the database file grows while mapping tokens to the
files they occur in (I have approx. 10 million files for which I want to
build a search index), I will hit some unexpected limit and the project
will fail, as happened to me once in the past when I tried to use MySQL
for a similar purpose (after the database file had grown over 2 GByte,
MySQL just began to hang when trying to add some more records).
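The sketch of what I have in mind (the file and token names are only
illustrative, not actual code from my project):

    from bsddb3 import db

    tokens = db.DB()
    tokens.set_flags(db.DB_DUPSORT)   # one token key -> many sorted file names
    tokens.open('tokens.bdb', None, db.DB_BTREE, db.DB_CREATE)

    tokens.put(b'python', b'file_0000001.txt')
    tokens.put(b'python', b'file_0000002.txt')   # second value, same key

    # walk all duplicates of one key with a cursor
    cur = tokens.cursor()
    rec = cur.set(b'python')
    while rec is not None:
        print(rec)             # (b'python', b'file_0000001.txt'), ...
        rec = cur.next_dup()   # None when the duplicates are exhausted
    cur.close()
    tokens.close()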
I am on Windows using the NTFS file system, so I don't expect problems
with too large file sizes. In the meantime I also already have working
Python code performing the basic database operations I will need to feed
and query the database.

Has someone used Berkeley DB for a similar purpose and can tell me that
in actual practice (not just the theory stated in the Berkeley DB feature
list) I need not fear any problems?

It took me some days of continuously updating the MySQL database to see
that there is an unexpected, strange limit on the database file size. I
still have no idea what the actual cause of the MySQL problem was
(I suspect it was having only 256 MB RAM available at that time), as it
is known that MySQL databases larger than 2 GByte exist and are in daily
use :-( .

These are the reasons why I would be glad to hear how to avoid running
into a similar problem again _before_ I start to torture my machine by
filling the Berkeley DB database with entries.

Claudio
 
Klaas

Claudio said:
I am on Windows using the NTFS file system, so I don't expect problems
with too large file sizes.

How large can files grow on NTFS? I know little about it.

(I suspect it was having only 256 MB RAM available at that time), as it
is known that MySQL databases larger than 2 GByte exist and are in daily
use :-( .

Do you have more RAM now? I've used Berkeley DBs up to around 5 GB
in size and they performed fine. However, it is quite important that
the working set of the database (its internal index pages) can fit
into available RAM. If they are swapping in and out, there will be
problems.

-Mike
 
Dennis Lee Bieber

How large can files grow on NTFS? I know little about it.
From the XP help topic "Choosing between..." this is said for NTFS:
File size limited only by size of volume.

--
 
Claudio Grondi

Klaas said:
How large can files grow on NTFS? I know little about it.
No practical limit on current hard drives, i.e.:
Maximum file size
  Theory: 16 exabytes minus 1 KB (2**64 bytes minus 1 KB)
  Implementation: 16 terabytes minus 64 KB (2**44 bytes minus 64 KB)
Maximum volume size
  Theory: 2**64 clusters minus 1 cluster
  Implementation: 256 terabytes minus 64 KB (2**32 clusters minus 1 cluster)
Files per volume
  4,294,967,295 (2**32 minus 1 file)
Do you have more RAM now?
I now have 3 GByte of RAM on my best machine, but Windows does not allow
a process to exceed 2 GByte, so in practice a little less than 2 GByte is
the actual upper limit.

I've used Berkeley DBs up to around 5 GB
in size and they performed fine. However, it is quite important that
the working set of the database (its internal index pages) can fit
into available RAM. If they are swapping in and out, there will be
problems.
Thank you very much for your reply.

In my current project I expect the data to have much less volume than
the indexes. In my failed MySQL project the size of the indexes was
approximately the same as the size of the indexed data (1 GByte); this
time I expect the total size of the indexes to far exceed the size of
the data indexed. But because Berkeley DB does not support multiple
indexed columns (i.e. there is only one key value column as index), if
I access the database files one after another (not simultaneously), it
should work without RAM problems, right?

Does the data volume required to store the key values have an impact on
the size of the index pages, or does the size of the index pages depend
only on the number of records and the kind of index (btree, hash)?

In the latter case, I would be free to use larger-sized data for the key
values without running into RAM-size problems for the index itself;
otherwise I would be forced to use keys storing a kind of hash to get
their size down (and two dictionaries instead of one). A sketch of that
idea follows below.
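What I mean by hashed keys is something like the following (the names
and values are purely illustrative):

    import hashlib
    from bsddb3 import db

    long_key = b'some/very/long/token/or/path/used/as/a/key'
    short_key = hashlib.sha1(long_key).digest()   # fixed 20-byte key

    # first "dictionary": hashed key -> indexed data
    index_db = db.DB()
    index_db.open('index.bdb', None, db.DB_BTREE, db.DB_CREATE)
    index_db.put(short_key, b'payload data')

    # second "dictionary": hashed key -> original (long) key
    keymap_db = db.DB()
    keymap_db.open('keys.bdb', None, db.DB_BTREE, db.DB_CREATE)
    keymap_db.put(short_key, long_key)

    index_db.close()
    keymap_db.close()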

What is the upper limit on the number of records in practice?

In theory, as given in the tutorial, Berkeley DB is capable of holding
billions of records, each single record up to 4 GB in size, with tables
up to a total storage size of 256 TB of data.
By the way: are the billions in the given context multiples of
1.000.000.000 or of 1.000.000.000.000, i.e. in the US or the British sense?

I expect the number of records in my project to be on the order of tens
of millions (multiples of 10.000.000).

I would be glad to hear whether someone has already successfully run
Berkeley DB with this or a larger number of records, and how much RAM
and which OS the machine used for it had (I am on Windows XP with
3 GByte RAM).

Claudio
 
Klaas

In my current project I expect the total size of the indexes to far
exceed the size of the data indexed. But because Berkeley DB does not
support multiple indexed columns (i.e. there is only one key value
column as index), if I access the database files one after another (not
simultaneously), it should work without RAM problems, right?

You can maintain multiple secondary indices on a primary database. BDB
isn't a "relational" database, though, so speaking of columns confuses
the issue. But you can have one database with primary key -> value,
then multiple secondary key -> primary key databases (with bdb
transparently providing the secondary key -> value mapping if you
desire).
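For illustration, a minimal sketch of such a setup (the database names
and the key-extraction callback are made up, not anything from your
project):

    import os
    from bsddb3 import db

    os.makedirs('dbdir', exist_ok=True)
    env = db.DBEnv()
    env.open('dbdir', db.DB_CREATE | db.DB_INIT_MPOOL)

    # primary database: primary key -> value
    primary = db.DB(env)
    primary.open('files.bdb', None, db.DB_BTREE, db.DB_CREATE)

    # secondary database: secondary key -> primary key (duplicates allowed)
    by_ext = db.DB(env)
    by_ext.set_flags(db.DB_DUPSORT)
    by_ext.open('files_by_ext.bdb', None, db.DB_BTREE, db.DB_CREATE)

    def extension_of(primary_key, primary_value):
        # derive the secondary key from the primary record
        return primary_value.rsplit(b'.', 1)[-1]

    # from here on BDB keeps the secondary in sync automatically
    primary.associate(by_ext, extension_of)

    primary.put(b'1', b'report.txt')
    primary.put(b'2', b'photo.jpg')

    # a lookup in the secondary transparently returns the primary value
    print(by_ext.get(b'txt'))   # b'report.txt'

    by_ext.close()
    primary.close()
    env.close()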
Does the data volume required to store the key values have an impact on
the size of the index pages, or does the size of the index pages depend
only on the number of records and the kind of index (btree, hash)?

For btree, it is the size of the keys that matters. I presume the same
is true for the hashtable, but I'm not certain.
What is the upper limit on the number of records in practice?

Depends on sizes of the keys and values, page size, cache size, and
physical limitations of your machine.
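A rough illustration of those knobs (the directory name and the numbers
are arbitrary examples, not recommendations):

    import os
    from bsddb3 import db

    os.makedirs('dbdir', exist_ok=True)
    env = db.DBEnv()
    env.set_cachesize(1, 0)      # 1 GB cache: (gigabytes, bytes)
    env.open('dbdir', db.DB_CREATE | db.DB_INIT_MPOOL)

    d = db.DB(env)
    d.set_pagesize(8192)         # must be set before DB.open()
    d.open('index.bdb', None, db.DB_BTREE, db.DB_CREATE)

    stats = d.stat()             # btree statistics as a dict
    print(stats['nkeys'])        # number of unique keys stored

    d.close()
    env.close()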

-Mike
 
