personal document mgmt system idea

Sandy Norton · Jan 20, 2004

Hi folks,

I have been mulling over an idea for a very simple python-based
personal document management system. The source of this possible
solution is the following typical problem:

I accumulate a lot of files (documents, archives, pdfs, images, etc.)
on a daily basis and storing them in a hierarchical file system is
simple but unsatisfactory:

- deeply nested hierarchies are a pain to navigate
and to reorganize
- different file systems have inconsistent and weak schemes
for storing metadata, e.g. compare the variety of incompatible
schemes in Windows alone (Office docs vs. PDFs etc.).

I would like a personal document management system that:

- is of adequate and usable performance
- can accommodate data files of up to 50MB
- is simple and easy to use
- promotes maximum programmability
- allows for the selective replication (or backup) of data
over a network
- allows for multiple (custom) classification schemes
- is portable across operating systems

The system should promote the following simple pattern:

receive file -> drop it into 'special' folder

after an arbitrary period of doing the above n times -> run application

for each file in folder:
    if automatic metadata extraction is possible:
        scan file for metadata and populate fields accordingly
        fill in missing metadata
    else:
        enter metadata
    store file

every now and then:
    run replicator function of application -> will backup data
    over a network
    # this will make specified files available to co-workers
    # accessing a much larger web-based non-personal version of the
    # docmanagement system.
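
The pattern above could be sketched in Python roughly as follows. This is only an illustration, written for current Python 3 rather than 2.3: the folder arguments, the function names, and the choice of metadata fields are all assumptions, and the "automatic extraction" step merely reads what the filesystem already knows (name, size, modification time) and guesses the MIME type from the extension.

```python
import mimetypes
import shutil
from datetime import datetime, timezone
from pathlib import Path


def extract_metadata(path):
    """Collect the metadata that can be determined automatically.

    The field names are hypothetical, chosen only for this sketch.
    """
    stat = path.stat()
    mime, _encoding = mimetypes.guess_type(path.name)
    return {
        "name": path.name,
        "size": stat.st_size,
        "modified": datetime.fromtimestamp(stat.st_mtime, timezone.utc).isoformat(),
        "mime": mime,  # None when the type cannot be guessed
    }


def process_inbox(inbox, store, ask=None):
    """Move every file from `inbox` into `store`; return the metadata records.

    `ask(field, path)` is an optional callback that supplies a value for any
    field the automatic step left as None (the "enter metadata" branch).
    """
    inbox, store = Path(inbox), Path(store)
    store.mkdir(parents=True, exist_ok=True)
    records = []
    for path in sorted(p for p in inbox.iterdir() if p.is_file()):
        meta = extract_metadata(path)
        if ask is not None:
            for field, value in list(meta.items()):
                if value is None:
                    meta[field] = ask(field, path)
        shutil.move(str(path), str(store / path.name))
        records.append(meta)
    return records
```

Calling `process_inbox("inbox", "store", ask=my_prompt)` would empty the drop folder in one pass; the replication step is left out because it depends entirely on the network setup.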

My initial prototyping efforts involved creating a single test table
in MySQL (later to include fields for Dublin Core metadata elements)
and a BLOB field for the data itself. My present dev platform is
Windows XP Pro, MySQL 4.1.1-alpha, MySQL-python connector v.0.9.2
and Python 2.3.3. However, I will be testing the same app on Mac OS X
and Linux Mandrake 9.2 as well.

The first problem I've run into is that MySQL or the MySQL
connector crashes when the size of one BLOB reaches a certain point:
in this case an .avi file of 7.2 MB.

Here's the code:

<code>

import sys, time, os, zlib
import MySQLdb, _mysql

def initDB(db='test'):
    connection = MySQLdb.Connect("localhost", "sa")
    cursor = connection.cursor()
    cursor.execute("use %s;" % db)
    return (connection, cursor)

def close(connection, cursor):
    connection.close()
    cursor.close()

def drop_table(cursor):
    try:
        cursor.execute("drop table tstable")
    except:
        pass

def create_table(cursor):
    cursor.execute('''create table tstable
        ( id INTEGER PRIMARY KEY AUTO_INCREMENT,
          name VARCHAR(100),
          data BLOB
        );''')

def process(data):
    data = zlib.compress(data, 9)
    return _mysql.escape_string(data)

def populate_table(cursor):
    files = [(f, os.path.join('testdocs', f)) for f in
             os.listdir('testdocs')]
    for filename, filepath in files:
        t1 = time.time()
        data = open(filepath, 'rb').read()
        data = process(data)
        # IMPORTANT: you have to quote the binary txt even after
        # escaping it.
        cursor.execute('''insert into tstable (id, name, data)
            values (NULL, '%s', '%s')''' % (filename, data))
        print time.time() - t1, 'seconds for ', filepath

def main ():
    connection, cursor = initDB()
    # doit
    drop_table(cursor)
    create_table(cursor)
    populate_table(cursor)
    close(connection, cursor)

if __name__ == "__main__":
    t1 = time.time()
    main ()
    print '=> it took total ', time.time() - t1, 'seconds to complete'

</code>

<traceback>

pythonw -u "test_blob.py"

0.155999898911 seconds for testdocs\business plan.doc
0.0160000324249 seconds for testdocs\concept2businessprocess.pdf
0.0160000324249 seconds for testdocs\diagram.vsd
0.0149998664856 seconds for testdocs\logo.jpg
Traceback (most recent call last):
  File "test_blob.py", line 59, in ?
    main ()
  File "test_blob.py", line 53, in main
    populate_table(cursor)
  File "test_blob.py", line 44, in populate_table
    cursor.execute('''insert into tstable (id, name, data) values (NULL, '%s', '%s')''' % (filename, data))
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 95, in execute
    return self._execute(query, args)
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\cursors.py", line 114, in _execute
    self.errorhandler(self, exc, value)
  File "C:\Engines\Python23\Lib\site-packages\MySQLdb\connections.py", line 33, in defaulterrorhandler
    raise errorclass, errorvalue
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone away')

Exit code: 1

</traceback>

My Questions are:

- Is my test code at fault?

- Is this the wrong approach to begin with: i.e. is it a bad idea to
store the data itself in the database?

- Am I using the wrong database? (or is the connector just buggy?)

Thanks to all.

best regards,

Sandy Norton

John J. Lee · Jan 20, 2004

I have been mulling over an idea for a very simple python-based
personal document management system. The source of this possible
solution is the following typical problem:

I accumulate a lot of files (documents, archives, pdfs, images, etc.)
on a daily basis and storing them in a hierarchical file system is
simple but unsatisfactory:

- deeply nested hierarchies are a pain to navigate
and to reorganize
- different file systems have inconsistent and weak schemes
for storing metadata, e.g. compare the variety of incompatible
schemes in Windows alone (Office docs vs. PDFs etc.).

I would like a personal document management system that: [...]
The system should promote the following simple pattern:

[...]

Pybliographer 2 is aiming at these features (but a lot more besides).
Work has been slow for a long while, but several new releases of
pyblio 1 have come out recently, and work is taking place on pyblio 2.
There are design documents on the web at pybliographer.org. Why not
muck in and implement what you want with Pyblio?

[...]

My initial prototyping efforts involved creating a single test table
in MySQL (later to include fields for Dublin Core metadata elements)
and a BLOB field for the data itself. My present dev platform is
Windows XP Pro, MySQL 4.1.1-alpha, MySQL-python connector v.0.9.2
and Python 2.3.3. However, I will be testing the same app on Mac OS X
and Linux Mandrake 9.2 as well.

ATM Pyblio only runs on GNOME, but that's going to change.

The first problem I've run into is that MySQL or the MySQL
connector crashes when the size of one BLOB reaches a certain point:
in this case an .avi file of 7.2 MB.

Here's the code: [...]
_mysql_exceptions.OperationalError: (2006, 'MySQL server has gone
away')

Exit code: 1

</traceback>

My Questions are:

- Is my test code at fault?

- Is this the wrong approach to begin with: i.e. is it a bad idea to
store the data itself in the database?

Haven't read your code, but the error certainly strongly suggests a
MySQL configuration problem.
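
Two size limits are worth checking here. The MySQL manual lists a statement larger than the server's `max_allowed_packet` as one cause of error 2006, and a plain BLOB column holds at most 65,535 bytes, so a payload of several megabytes needs MEDIUMBLOB or LONGBLOB. The sketch below only does the arithmetic, in current Python 3. The function names are made up, and the 1 MB packet limit is an assumed value that should be replaced with whatever `SHOW VARIABLES LIKE 'max_allowed_packet'` reports on the server in question.

```python
# Maximum payload sizes of the MySQL BLOB column types, in bytes.
BLOB_LIMITS = [
    ("TINYBLOB", 2**8 - 1),
    ("BLOB", 2**16 - 1),
    ("MEDIUMBLOB", 2**24 - 1),
    ("LONGBLOB", 2**32 - 1),
]


def smallest_blob_type(size):
    """Return the smallest BLOB column type that can hold `size` bytes."""
    for name, limit in BLOB_LIMITS:
        if size <= limit:
            return name
    raise ValueError("payload of %d bytes exceeds LONGBLOB" % size)


def fits_in_packet(statement_size, max_allowed_packet=1024 * 1024):
    """True if a statement of `statement_size` bytes fits in one packet.

    The 1 MB default is an assumption, not a value read from any server.
    """
    return statement_size <= max_allowed_packet


payload = int(7.2 * 1024 * 1024)  # roughly the size of the failing .avi
print(smallest_blob_type(payload), fits_in_packet(payload))
```

With these assumed numbers a 7.2 MB payload fits neither a plain BLOB column nor a 1 MB packet, which would be consistent with the small test files succeeding and the large one failing.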

John

John Roth · Jan 20, 2004

I wouldn't put the individual files in a database - that's what
file systems are for. The exception is small files (and by the
time you say ".doc" in MS Word, it's no longer a small
file) where you can save substantial space by consolidating
them.

John Roth

Stephan Diehl · Jan 20, 2004

Hi Sandy,

looks like this will be the year of personal document management projects.
Since I'm involved in a similar project (hope I can go Open Source with it),
here are some of my thoughts.

Sandy Norton wrote:

Hi folks,

I have been mulling over an idea for a very simple python-based
personal document management system. The source of this possible
solution is the following typical problem:

[...]

The first problem I've run into is that MySQL or the MySQL
connector crashes when the size of one BLOB reaches a certain point:
in this case an .avi file of 7.2 MB.

Just dump your files somewhere in the filesystem and keep a record of it in
your database.
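
A minimal sketch of that idea, in current Python 3. It uses the standard-library `sqlite3` module instead of MySQL purely so the example is self-contained; the table layout, the column names, and the `open_catalog` and `register_file` helpers are invented for illustration.

```python
import shutil
import sqlite3
from pathlib import Path

# Hypothetical catalog table: one row per stored file, no file contents.
SCHEMA = """
CREATE TABLE IF NOT EXISTS documents (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    name  TEXT NOT NULL,
    path  TEXT NOT NULL UNIQUE,
    size  INTEGER NOT NULL
)
"""


def open_catalog(db_path):
    """Open (or create) the catalog database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def register_file(conn, source, store_dir):
    """Copy `source` into `store_dir` and record its location in the catalog.

    Returns the id of the new catalog row.
    """
    source, store_dir = Path(source), Path(store_dir)
    store_dir.mkdir(parents=True, exist_ok=True)
    target = store_dir / source.name
    shutil.copy2(source, target)
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO documents (name, path, size) VALUES (?, ?, ?)",
            (source.name, str(target), target.stat().st_size),
        )
    return cur.lastrowid
```

Because only the path and a few fields go through the database, the size of the file never matters to the connector. Passing the values as query parameters also avoids escaping anything by hand.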

In addition, a real (text) search engine might be of help. I'm using swish-e
(www.swish-e.org) and am very pleased with it.

Maybe, before you invest too much time into such a project, you should check
out the following:

Chandler (http://www.osafoundation.org)
if it's finished, it will do exactly what you are aiming for (and it's
written in Python)

ReiserFS (see www.namesys.com -> Future Vision)

Gnome Storage (http://www.gnome.org/~seth/storage)

WinFS
(http://msdn.microsoft.com/Longhorn/understanding/pillars/WinFS/default.aspx)

Hope that helps

Stephan

Sandy Norton · Jan 21, 2004

John J. Lee:

Pybliographer 2 is aiming at these features (but a lot more besides).
Work has been slow for a long while, but several new releases of
pyblio 1 have come out recently, and work is taking place on pyblio 2.
There are design documents on the web at pybliographer.org. Why not
muck in and implement what you want with Pyblio?

Thanks for the reference, Pyblio definitely seems interesting and I
will be looking into this project closely.

cheers.

Sandy

Sandy Norton · Jan 21, 2004

John Roth wrote:

I wouldn't put the individual files in a database - that's what
file systems are for. The exception is small files (and by the
time you say ".doc" in MS Word, it's no longer a small
file) where you can save substantial space by consolidating
them.

There seems to be consensus that I shouldn't store files in the
database. This makes sense as filesystems seem to be optimized for,
um, files (-;

As I want to get away from deeply nested directories, I'm going to
test two approaches:

1. store everything in a single folder and hash each file name to give
a unique id

2. create a directory structure based upon a calendar year and store
the daily downloads automatically.
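
The first approach might look something like the sketch below, written in current Python 3. SHA-1 via `hashlib` is just one possible hash, and the helper names are invented for this example. Hashing the file name alone gives the same id to two different files that share a name, so a content-based id is included as a possible alternative.

```python
import hashlib
from pathlib import Path


def name_id(filename):
    """Hex SHA-1 digest of the file name (UTF-8 encoded)."""
    return hashlib.sha1(filename.encode("utf-8")).hexdigest()


def content_id(path, chunk_size=1 << 20):
    """Hex SHA-1 digest of the file contents, read in chunks to limit memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stored_name(filename):
    """Name to use inside the single storage folder: id plus the original extension."""
    return name_id(filename) + Path(filename).suffix.lower()
```

Keeping the extension on the stored name means the files still open with the right application when browsed directly.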

For the second approach, I can finally use some code I'd written
earlier for a similar purpose:

<code>

from pprint import pprint
import os
import calendar

class Calendirs:

    months = {
        1 : 'January',
        2 : 'February',
        3 : 'March',
        4 : 'April',
        5 : 'May',
        6 : 'June',
        7 : 'July',
        8 : 'August',
        9 : 'September',
        10 : 'October',
        11 : 'November',
        12 : 'December'
    }

    wkdays = {
        0 : 'Monday',
        1 : 'Tuesday',
        2 : 'Wednesday',
        3 : 'Thursday',
        4 : 'Friday',
        5 : 'Saturday',
        6 : 'Sunday'
    }

    def __init__(self, year):
        self.year = year

    def calendir(self):
        '''returns list of calendar matrices'''
        mc = calendar.monthcalendar
        cal = [(self.year, m) for m in range(1,13)]
        return [mc(y,m) for (y, m) in cal]

    def yearList(self):
        res=[]
        weekday = calendar.weekday
        m = 0
        for month in self.calendir():
            lst = []
            m += 1
            for week in month:
                for day in week:
                    if day:
                        day_str = Calendirs.wkdays[weekday(self.year, m, day)]
                        lst.append( (str(m)+'.'+Calendirs.months[m],
                                     str(day)+'.'+day_str) )
            res.append(lst)
        return res

    def make(self):
        for month in self.yearList():
            for m, day in month:
                path = os.path.join(str(self.year), m, day)
                os.makedirs(path)

Calendirs(2004).make()

</code>
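
Pre-creating every day folder for the year is one option; another is to create each day's folder only when a file actually arrives. A small sketch of that variant in current Python 3, keeping the same `year/month/day` naming as above (for example `2004/1.January/21.Wednesday`). The `daily_dir` and `file_today` names are made up for this example, and the month and weekday names come from the `calendar` module, so they follow the process locale (English by default).

```python
import calendar
import shutil
from datetime import date
from pathlib import Path


def daily_dir(root, day):
    """Return the folder for `day`, e.g. root/2004/1.January/21.Wednesday."""
    month = "%d.%s" % (day.month, calendar.month_name[day.month])
    weekday = "%d.%s" % (day.day, calendar.day_name[day.weekday()])
    return Path(root) / str(day.year) / month / weekday


def file_today(source, root, day=None):
    """Move `source` into the folder for `day` (today if omitted).

    The folder is created on demand; the new path of the file is returned.
    """
    day = day or date.today()
    target_dir = daily_dir(root, day)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(source), str(target_dir)))
```

Unlike `os.makedirs(path)` in a pre-creation loop, `mkdir(parents=True, exist_ok=True)` does not fail when the folder already exists, so the function can be called repeatedly on the same day.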

I don't know which method will perform better or be more usable...
testing testing testing.

regards,

Sandy

Sandy Norton · Jan 21, 2004

Stephan Diehl wrote:

[...]

Just dump your files somewhere in the filesystem and keep a record of it in
your database.

I think I will go with this approach. (see other posting for details)

In addition, a real (text) search engine might be of help. I'm using swish-e
(www.swish-e.org) and am very pleased with it.

Just downloaded it... looks good. Now if it also had a python api (-;

Maybe, before you invest too much time into such a project, you should check
out the following:

Chandler (http://www.osafoundation.org)
if it's finished, it will do exactly what you are aiming for (and
it's written in Python)

Still early stages... I see they dropped the ZODB.

ReiserFS (see www.namesys.com -> Future Vision)
Gnome Storage (http://www.gnome.org/~seth/storage)
WinFS
(http://msdn.microsoft.com/Longhorn/understanding/pillars/WinFS/default.aspx)

Wow! Very exciting stuff... I guess we'll just have to wait and see what develops.

Hope that helps

Yes. Very informative. Cheers for the help.

Stephan

Sandy

John Abel · Jan 21, 2004

Have you looked at the modules available from divmod.org for your text
searching?

Stephan Diehl · Jan 21, 2004

Sandy said:
Stephan Diehl wrote:

[...]
[...]

In addition, a real (text) search engine might be of help. I'm using
swish-e (www.swish-e.org) and am very pleased with it.

Just downloaded it... looks good. Now if it also had a python api (-;

I'm just using the command line interface via os.system and the popenX
calls.
The only thing that is (unfortunately) not possible is to remove a document
from the index :-(
If you need any help, just drop me a line.
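
For illustration, the same kind of call made with the `subprocess` module (which replaces `os.system` and the `popenX` calls in later Python versions) might look like the sketch below. It assumes the `swish-e` binary is on the PATH and that an index has already been built; `-w` (search words) and `-f` (index file) are swish-e's search switches, while the two function names are just assumptions for this example. The raw output is returned unparsed.

```python
import subprocess


def build_search_command(words, index_file="index.swish-e", binary="swish-e"):
    """Build the argument list for a swish-e search; nothing is executed here."""
    return [binary, "-w", words, "-f", index_file]


def search(words, index_file="index.swish-e"):
    """Run the search and return swish-e's standard output as text.

    Raises subprocess.CalledProcessError if swish-e exits with an error,
    and FileNotFoundError if the binary is not installed.
    """
    result = subprocess.run(
        build_search_command(words, index_file),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids going through a shell, so search words containing quotes or spaces need no extra escaping.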

Still early stages... I see they dropped the ZODB.

Did they? If they succeed, Chandler will rock. My personal opinion is that
they try doing too much at once. I guess that a better filesystem will make
most of the document management type applications obsolete.
The big problem, of course, is to define 'better' in a meaningful way.

(http://msdn.microsoft.com/Longhorn/understanding/pillars/WinFS/default.aspx)

Wow! Very exciting stuff... I guess we'll just have to wait and see what
develops.

Or go the other way: build a new filesystem prototype application in Python,
see if it works out as intended, and then build a proper file system.
