Sync database table based on current directory data without losing previous values

  • Thread starter Νίκος Γκρ33κ
  • Start date

Νίκος Γκρ33κ

I am using this snippet to read the current directory, insert all filenames into a database, and then display them.

But what happens when files are removed from the directory?
The records already inserted into the database remain.
How can I update the database to only contain the existing filenames without losing the previously stored data?

Here is what I have so far:

==================================
path = "/home/nikos/public_html/data/files/"

#read the containing folder and insert new filenames
for result in os.walk(path):
    for filename in result[2]:
        try:
            #find the needed counter for the page URL
            cur.execute('''SELECT URL FROM files WHERE URL = %s''', (filename,) )
            data = cur.fetchone() #URL is unique, so should only be one

            if not data:
                #first time for file; primary key is automatic, hit is defaulted
                cur.execute('''INSERT INTO files (URL, host, lastvisit) VALUES (%s, %s, %s)''', (filename, host, date) )
        except MySQLdb.Error, e:
            print ( "Query Error: ", sys.exc_info()[1].excepinfo()[2] )
======================

Thank you.
 

Lele Gaifax

Νίκος Γκρ33κ said:
How can I update the database to only contain the existing filenames without losing the previously stored data?

Basically you need to keep a list (or better, a set) containing all
current filenames that you are going to insert, and finally do another
"inverse" loop where you scan all the records and delete those that are
not present anymore.

Of course, this assumes you have a "bidirectional" identity between the
filenames you are loading and the records you are inserting, which is
not the case in the code you show:
#read the containing folder and insert new filenames
for result in os.walk(path):
    for filename in result[2]:

Here "filename" is just that, not the full path: this could result in
collisions if you are actually loading a *tree* instead of a flat
directory, that is, multiple source files are squeezed into a single
record in your database (imagine "/foo/index.html" and
"/foo/subdir/index.html").

With that in mind, I would do something like the following:

# Compute a set of current fullpaths
current_fullpaths = set()
for root, dirs, files in os.walk(path):
    for fullpath in files:
        current_fullpaths.add(os.path.join(root, file))

# Load'em
for fullpath in current_fullpaths:

    try:
        #find the needed counter for the page URL
        cur.execute('''SELECT URL FROM files WHERE URL = %s''', (fullpath,) )
        data = cur.fetchone() #URL is unique, so should only be one

        if not data:
            #first time for file; primary key is automatic, hit is defaulted
            cur.execute('''INSERT INTO files (URL, host, lastvisit) VALUES (%s, %s, %s)''', (fullpath, host, date) )
    except MySQLdb.Error, e:
        print ( "Query Error: ", sys.exc_info()[1].excepinfo()[2] )

# Delete spurious
cur.execute('''SELECT url FROM files''')
for rec in cur:
    fullpath = rec[0]
    if fullpath not in current_fullpaths:
        other_cur.execute('''DELETE FROM files WHERE url = %s''', (fullpath,))

Of course here I am assuming a lot (a typical thing we do to answer your
questions :), in particular that the "url" field content matches the
filesystem layout, which may not be the case. Adapt it to your use case.

hope this helps,
ciao, lele.
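
A self-contained sketch of the same idea, for readers who want to try it without a MySQL server. It swaps in the standard-library sqlite3 module in place of MySQLdb (so placeholders are ? rather than %s). The table layout (a unique url column plus a hits counter) and the function name sync_files_table are assumptions made for illustration, not the schema from the original post. Rows for files that still exist are never touched, so their stored values survive.

```python
import os
import sqlite3


def sync_files_table(conn, path):
    """Make the (assumed) files table mirror the tree under path.

    Returns the sets of full paths that were added and removed.
    """
    # Full paths of every file currently on disk
    current = set()
    for root, dirs, files in os.walk(path):
        for name in files:
            current.add(os.path.join(root, name))

    # Full paths currently recorded in the table
    stored = set(row[0] for row in conn.execute("SELECT url FROM files"))

    added = current - stored      # on disk, not yet in the table
    removed = stored - current    # in the table, no longer on disk

    conn.executemany("INSERT INTO files (url) VALUES (?)",
                     [(p,) for p in sorted(added)])
    conn.executemany("DELETE FROM files WHERE url = ?",
                     [(p,) for p in sorted(removed)])
    conn.commit()
    return added, removed
```

Given a connection on which CREATE TABLE files (url TEXT UNIQUE, hits INTEGER DEFAULT 0) has been run, calling sync_files_table(conn, path) after the directory changes inserts the new paths and deletes the vanished ones. A hits value stored for a file that is still present is left as it was.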
 
Νίκος Γκρ33κ

On Wednesday, 6 March 2013 at 10:19:06 AM UTC+2, Lele Gaifax wrote:
Basically you need to keep a list (or better, a set) containing all
current filenames that you are going to insert, and finally do another
"inverse" loop where you scan all the records and delete those that are
not present anymore.


You are fantastic! Your straightforward logic amazes me!

Thank you very much for making things clear to me!!

But there is a slight problem: when I try to run the code I get an error. You can see its output here:

http://superhost.gr/cgi-bin/files.py
 

Lele Gaifax

Νίκος Γκρ33κ said:
Thank you very much for making things clear to me!!

You're welcome, even more if you spend 1 second to trim your answers,
removing unneeded quotations :)

But there is a slight problem: when I try to run the code I get an error. You can see its output here:

http://superhost.gr/cgi-bin/files.py

Sorry, this seems completely unrelated, and from the little snippet that
appears on that page I cannot understand what's going on there.

ciao, lele.
 
Νίκος Γκρ33κ

It's about the following line of code:

current_fullpaths.add( os.path.join(root, files) )


that presents the following error:

<type 'exceptions.AttributeError'>: 'list' object has no attribute 'startswith'
args = ("'list' object has no attribute 'startswith'",)
message = "'list' object has no attribute 'startswith'"

join calls into a module that fails on this line:

/usr/lib64/python2.6/posixpath.py in join(a='/home/nikos/public_html/data/files/', *p=(['\xce\x9a\xcf\x8d\xcf\x81\xce\xb9\xce\xb5 \xce\x99\xce\xb7\xcf\x83\xce\xbf\xcf\x8d \xce\xa7\xcf\x81\xce\xb9\xcf\x83\xcf\x84\xce\xad \xce\x95\xce\xbb\xce\xad\xce\xb7\xcf\x83\xce\xbf\xce\xbd \xce\x9c\xce\xb5.mp3', '\xce\xa0\xce\xb5\xcf\x81\xce\xaf \xcf\x84\xcf\x89\xce\xbd \xce\x9b\xce\xbf\xce\xb3\xce\xb9\xcf\x83\xce\xbc\xcf\x8e\xce\xbd.mp3'],))
63 path = a
64 for b in p:
65 if b.startswith('/'):
 
 
Νίκος Γκρ33κ

Perhaps this error appears because my filenames are in Greek letters, but I'm not sure.

Maybe we can join root + files and store it in the set() some other way.
 

Wong Wah Meng-R32813

Hello there,

I am using Python 2.7.1 built on HP-UX 11.23, an Itanium 64-bit box.

I discovered the following behavior: the Python process doesn't seem to release the memory it used, even after a variable is set to None and "deleted". I use the glance tool to monitor the memory used by this process. As expected, after the for loop is executed, the memory used by this process has risen to a few MB. However, after "del" is executed on both the i and str variables, the memory of that process stays where it was.

Any idea why?

... str=str+"%s"%(i,)
...
 
Bryan Devaney

Perhaps this error appears because my filenames are in Greek letters, but I'm not sure.

Maybe we can join root + files and store it in the set() some other way.

Well, the error refers to the line "if b.startswith('/'):" and states "'list' object has no attribute 'startswith'".

So b is bound to a list, and a list does not have a 'startswith' method or attribute.

I thought .startswith() was a string method, but if it's your own method then I apologize (though if it is, I personally would have made a class that inherited from list rather than adding it to list itself).

Can you show where you are assigning b (or whether it's meant to be a list or a string object)?
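
The traceback can be reproduced without any database. A minimal sketch, assuming nothing beyond the standard library (the helper names are made up for illustration): os.walk yields a list of names as the third item of each tuple, and os.path.join expects one string per argument, so passing the whole list fails. On the Python 2.6 shown in the traceback this surfaces as AttributeError; newer Python versions raise TypeError instead, so the sketch catches both.

```python
import os


def join_with_list_fails(root, files):
    """Return True if os.path.join rejects a list as a path component."""
    try:
        os.path.join(root, files)   # files is a list, not a string
    except (AttributeError, TypeError):
        return True
    return False


def collect_fullpaths(path):
    """Join root with each individual name instead of the whole list."""
    fullpaths = set()
    for root, dirs, files in os.walk(path):
        for name in files:
            fullpaths.add(os.path.join(root, name))
    return fullpaths
```

The Greek filenames are not the cause: the same failure occurs with plain ASCII names, because the argument is a list rather than a string.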
 
 
Bryan Devaney

I discovered following behavior whereby the python process doesn't seem to release memory utilized even after a variable is set to None, and "deleted".

Any idea why?

Hi, I'm new here so I'm making mistakes too but I know they don't like it when you ask your question in someone else's question.

that being said, to answer your question:

Python uses a 'garbage collector'. When you delete something, all references are removed from the object in memory, the memory itself will not be freed until the next time the garbage collector runs. When that happens, all objects without references in memory are removed and the memory freed. If you wait a while you should see that memory free itself.
 
 
Lele Gaifax

Νίκος Γκρ33κ said:
It's about the following line of code:

current_fullpaths.add( os.path.join(root, files) )

I'm sorry, typo on my part.

That should have been "fullpath", not "file" (and not "files" either, as you
wrongly reported back!):

# Compute a set of current fullpaths
current_fullpaths = set()
for root, dirs, files in os.walk(path):
    for fullpath in files:
        current_fullpaths.add(os.path.join(root, fullpath))

ciao, lele.
 
Terry Reedy

I discovered following behavior whereby the python process doesn't
seem to release memory utilized even after a variable is set to None,
and "deleted".

Whether memory freed by deleting an object is returned to and taken by
the OS depends on the OS and other factors like the size and layout
of the freed memory, probably the history of memory use, and for
CPython, the C compiler's malloc/free implementation. At various times,
the Python memory handlers have been rewritten to encourage/facilitate
memory return, but Python cannot control the process.

for i in range(100000L):
    str=str+"%s"%(i,)
i=None; str=None # not necessary
del i; del str

Reusing built-in names for unrelated purposes is generally a bad idea,
although the final deletion does restore access to the builtin.
 

Mark Lawrence

#read the containing folder and insert new filenames
for result in os.walk(path):

You were told yesterday at least twice that os.walk returns a tuple but
you still insist on refusing to take any notice of our replies when it
suits you, preferring instead to waste everybody's time with these
questions. Or are you trying to get into the Guinness Book of World
Records for the laziest bastard on the planet?
 
Wong Wah Meng-R32813

Apologies: after being away from the group for a while, I had forgotten not to post a question on top of another question. Very sorry, and I appreciate your replies.

I tried explicitly calling gc.collect() and did not see the memory footprint reduced. I probably haven't left the process idle long enough to see the internal garbage collection take place, but I will leave it idle for more than 8 hours and check again. Thanks!

-----Original Message-----
From: Python-list [mailto:p[email protected]] On Behalf Of Bryan Devaney
Sent: Wednesday, March 06, 2013 6:25 PM
To: (e-mail address removed)
Cc: (e-mail address removed)
Subject: Re: Set x to to None and del x doesn't release memory in python 2.7.1 (HPUX 11.23, ia64)

 
Wong Wah Meng-R32813

Thanks for your reply. I built the Python 2.7.1 binary myself on the HP box, and I wasn't aware of any configuration or setup that I need to modify in order to activate or engage the garbage collection (or even to set the memory size used). You are probably right that it is left to the OS itself (in this case HP-UX) to clean it up: after Python removes the reference to the address of the variables, the OS still thinks the Python process owns that memory until the process exits.

Regards,
Wah Meng

-----Original Message-----
From: Python-list [mailto:p[email protected]] On Behalf Of Terry Reedy
Sent: Wednesday, March 06, 2013 7:00 PM
To: (e-mail address removed)
Subject: Re: Set x to to None and del x doesn't release memory in python 2.7.1 (HPUX 11.23, ia64)

 

Dave Angel

Python uses a 'garbage collector'. When you delete something, all references are removed from the object in memory, the memory itself will not be freed until the next time the garbage collector runs. When that happens, all objects without references in memory are removed and the memory freed. If you wait a while you should see that memory free itself.

Actually, no. The problem with monitoring memory usage from outside the
process is that memory "ownership" is hierarchical, and each hierarchy
deals in bigger chunks. So when the CPython runtime calls free() on a
particular piece of memory, the C runtime may or may not actually
release the memory for use by other processes. Since the C runtime
grabs big pieces from the OS, and parcels out little pieces to CPython,
a particular big piece can only be freed if ALL the little pieces are
free. And even then, it may or may not choose to do so.

Completely separate from that are the two mechanisms that CPython uses
to free its pieces. It does reference counting, and it does garbage
collecting. In this case, only the reference counting is relevant, as
when it's done there's no garbage left to collect. When an object is no
longer referenced by anything, its count will be zero, and it will be
freed by calling the C library function. GC is only interesting when
there are cycles in the references, such as when a list contains as one
of its elements a tuple, which in turn contains the original list.
Sound silly? No, it's quite common once complex objects are created
which reference each other. The counts don't go to zero, and the
objects wait for garbage collection.

OP: There's no need to set to None and also to del the name. Since
there's only one None object, keeping another named reference to that
object has very little cost.
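
The difference between the two mechanisms can be observed directly. The following is a sketch that relies on CPython's reference counting (other Python implementations may behave differently). The Node class and the function names are invented for the example, and a small class is used instead of the list-and-tuple cycle mentioned above because plain lists cannot be weakly referenced. A weak reference lets us check whether an object is still alive without keeping it alive.

```python
import gc
import weakref


class Node(object):
    """Tiny object that can hold a reference to another Node."""
    def __init__(self):
        self.other = None


def freed_by_refcount():
    """An object with no cycles is freed as soon as its last name is deleted."""
    node = Node()
    probe = weakref.ref(node)
    del node                      # reference count drops to zero here
    return probe() is None


def cycle_needs_gc():
    """Two objects referencing each other survive del until gc.collect()."""
    gc.disable()                  # keep the automatic collector out of the way
    try:
        a, b = Node(), Node()
        a.other, b.other = b, a   # a -> b -> a: a reference cycle
        probe = weakref.ref(a)
        del a, b                  # counts stay above zero because of the cycle
        alive_before = probe() is not None
        gc.collect()              # the cycle detector reclaims both objects
        alive_after = probe() is not None
        return alive_before, alive_after
    finally:
        gc.enable()
```

Neither function says anything about memory being returned to the operating system; they only show when CPython itself considers the objects gone.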
 
