4DOM eating all my memory

ewan

hello all -

I'm looping over a set of urls pulled from a database, fetching the
corresponding webpage, and building a DOM tree for it using
xml.dom.ext.reader.HtmlLib (then trying to match titles in a web library
catalogue). All the trees seem to be kept in memory, however: when I get
through fifty or so iterations the program has used about half my memory
and slowed the system to a crawl.

tried turning on all gc debugging flags. they produce lots of output, but it
all says 'collectable' - sounds fine to me.

I even tried doing gc.collect() at the end of every iteration. nothing.
everything seems to be being collected. so why does each iteration increase
the memory usage by several megabytes?
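
(the per-iteration check was roughly along these lines - a reconstruction
rather than the exact code, with handle_row standing in for the real
per-row work:)

import gc

gc.set_debug(gc.DEBUG_COLLECTABLE)   # source of the 'collectable' messages

for row in rows:
    handle_row(row)                  # stand-in for the real per-row work
    n = gc.collect()                 # force a full collection every iteration
    print 'found %d unreachable objects, %d uncollectable' % (n, len(gc.garbage))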

below is some code (and by the way, do I have those 'global's in the right
places?)

any suggestions would be appreciated immeasurably...
ewan



import MySQLdb

....

cursor = db.cursor()
result = cursor.execute("""SELECT CALLNO, TITLE FROM %s""" % table)
rows = cursor.fetchall()
cursor.close()

for row in rows:
    current_callno = row[0]
    title = row[1]
    url = construct_url(title)
    cf = callno_finder()
    cf.find(title.decode('latin-1'), url)
    ...

(meanwhile, in another file)
....

from xml.dom.ext.reader import HtmlLib

class callno_finder:
    def __init__(self):
        global root
        root = None

    def find(self, title, uri):
        global root

        reader = HtmlLib.Reader()
        root = reader.fromUri(uri)

        # find what we're looking for
        ...
 
John J. Lee

ewan said:
> I'm looping over a set of urls pulled from a database, fetching the
> corresponding webpage, and building a DOM tree for it using
> xml.dom.ext.reader.HtmlLib (then trying to match titles in a web library
> catalogue).

Hmm, if this is open-source and it's more than a quick hack, let me
know when you have it working, I maintain a page on open-source stuff
of this nature (bibliographic and cataloguing).

> All the trees seem to be kept in memory, however: when I get through
> fifty or so iterations the program has used about half my memory and
> slowed the system to a crawl.
>
> tried turning on all gc debugging flags. they produce lots of output, but
> it all says 'collectable' - sounds fine to me.

I've never had to resort to this... does it tell you what types /
classes are involved? IIRC, there was some code posted to python-dev
to give hints about this (though I guess that was mostly/always for
debugging leaks at the C level).
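
One rough way to get that kind of hint (not from the original thread, just
a sketch): count live objects by type once per iteration and see which
counts keep growing.

import gc

def type_counts():
    # histogram of live objects, keyed by type (or old-style class) name;
    # only container objects tracked by the collector show up here
    counts = {}
    for obj in gc.get_objects():
        name = type(obj).__name__
        if name == 'instance':               # old-style class instance
            name = obj.__class__.__name__
        counts[name] = counts.get(name, 0) + 1
    return counts

before = type_counts()
# ... one iteration of the URL loop goes here ...
after = type_counts()
for name, n in sorted(after.items()):
    grown = n - before.get(name, 0)
    if grown > 0:
        print name, '+%d' % grown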

> I even tried doing gc.collect() at the end of every iteration. nothing.
> everything seems to be being collected. so why does each iteration increase
> the memory usage by several megabytes?
>
> below is some code (and by the way, do I have those 'global's in the right
> places?)

Yes, they're in the right places. Not sure a global is really needed,
though...
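
(for what it's worth, an instance attribute would do the same job without
the module-level global - just a sketch, keeping the rest of the class as
posted:)

from xml.dom.ext.reader import HtmlLib

class callno_finder:
    def __init__(self):
        self.root = None              # instead of 'global root'

    def find(self, title, uri):
        reader = HtmlLib.Reader()
        self.root = reader.fromUri(uri)
        # find what we're looking for, as before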

> any suggestions would be appreciated immeasurably... [...]
>
> def find(self, title, uri):
>     global root
>
>     reader = HtmlLib.Reader()
>     root = reader.fromUri(uri)
>
>     # find what we're looking for
>     ...

+ reader.releaseNode(root)

?
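
If that is the culprit, applying it might look something like this - a
sketch only, assuming the class as posted, with the release done in a
finally: so it happens even if the matching code raises:

from xml.dom.ext.reader import HtmlLib

class callno_finder:
    def __init__(self):
        self.root = None

    def find(self, title, uri):
        reader = HtmlLib.Reader()
        self.root = reader.fromUri(uri)
        try:
            pass   # ... match the title against the parsed page here ...
        finally:
            # 4DOM nodes are heavily cross-linked (parent/child/ownerDocument
            # references), so tear the tree down explicitly when done with it
            # rather than leaving each page's tree hanging around.
            reader.releaseNode(self.root)
            self.root = None

Doing the release inside find() keeps each tree's lifetime bounded to one
iteration of the loop.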


John
 
