4DOM eating all my memory

ewan

hello all -

I'm looping over a set of urls pulled from a database, fetching the
corresponding webpage, and building a DOM tree for it using
xml.dom.ext.reader.HtmlLib (then trying to match titles in a web library
catalogue). All the trees seem to be kept in memory, however: when I get
through fifty or so iterations the program has used about half my memory
and slowed the system to a crawl.

tried turning on all gc debugging flags. they produce lots of output, but it
all says 'collectable' - sounds fine to me.

I even tried doing gc.collect() at the end of every iteration. nothing.
everything seems to be being collected. so why does each iteration increase
the memory usage by several megabytes?
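
(the per-iteration check was roughly along these lines - a reconstruction
rather than the exact code, with handle_row standing in for the real
per-row work:)

import gc

gc.set_debug(gc.DEBUG_COLLECTABLE)   # source of the 'collectable' messages

for row in rows:
    handle_row(row)                  # stand-in for the real per-row work
    n = gc.collect()                 # force a full collection every iteration
    print 'found %d unreachable objects, %d uncollectable' % (n, len(gc.garbage))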

below is some code (and by the way, do I have those 'global's in the right
places?)

any suggestions would be appreciated immeasurably...
ewan



import MySQLdb

....

cursor = db.cursor()
result = cursor.execute("""SELECT CALLNO, TITLE FROM %s""" % table)
rows = cursor.fetchall()
cursor.close()

for row in rows:
    current_callno = row[0]
    title = row[1]
    url = construct_url(title)
    cf = callno_finder()
    cf.find(title.decode('latin-1'), url)
    ...

(meanwhile, in another file)
....

from xml.dom.ext.reader import HtmlLib

class callno_finder:
    def __init__(self):
        global root
        root = None

    def find(self, title, uri):
        global root

        reader = HtmlLib.Reader()
        root = reader.fromUri(uri)

        # find what we're looking for
        ...
 
John J. Lee

ewan said:
> I'm looping over a set of urls pulled from a database, fetching the
> corresponding webpage, and building a DOM tree for it using
> xml.dom.ext.reader.HtmlLib (then trying to match titles in a web library
> catalogue).

Hmm, if this is open-source and it's more than a quick hack, let me
know when you have it working, I maintain a page on open-source stuff
of this nature (bibliographic and cataloguing).

> All the trees seem to be kept in memory, however: when I get through
> fifty or so iterations the program has used about half my memory and
> slowed the system to a crawl.
>
> tried turning on all gc debugging flags. they produce lots of output, but
> it all says 'collectable' - sounds fine to me.

I've never had to resort to this... does it tell you what types /
classes are involved? IIRC, there was some code posted to python-dev
to give hints about this (though I guess that was mostly/always for
debugging leaks at the C level).
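
One rough way to get that kind of hint (not from the original thread, just
a sketch): count live objects by type once per iteration and see which
counts keep growing.

import gc

def type_counts():
    # histogram of live objects, keyed by type (or old-style class) name;
    # only container objects tracked by the collector show up here
    counts = {}
    for obj in gc.get_objects():
        name = type(obj).__name__
        if name == 'instance':               # old-style class instance
            name = obj.__class__.__name__
        counts[name] = counts.get(name, 0) + 1
    return counts

before = type_counts()
# ... one iteration of the URL loop goes here ...
after = type_counts()
for name, n in sorted(after.items()):
    grown = n - before.get(name, 0)
    if grown > 0:
        print name, '+%d' % grown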

> I even tried doing gc.collect() at the end of every iteration. nothing.
> everything seems to be being collected. so why does each iteration increase
> the memory usage by several megabytes?
>
> below is some code (and by the way, do I have those 'global's in the right
> places?)

Yes, they're in the right places. Not sure a global is really needed,
though...
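
(for what it's worth, an instance attribute would do the same job without
the module-level global - just a sketch, keeping the rest of the class as
posted:)

from xml.dom.ext.reader import HtmlLib

class callno_finder:
    def __init__(self):
        self.root = None              # instead of 'global root'

    def find(self, title, uri):
        reader = HtmlLib.Reader()
        self.root = reader.fromUri(uri)
        # find what we're looking for, as before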

> any suggestions would be appreciated immeasurably... [...]
>
> def find(self, title, uri):
>     global root
>
>     reader = HtmlLib.Reader()
>     root = reader.fromUri(uri)
>
>     # find what we're looking for
>     ...

+ reader.releaseNode(root)

?
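
If that is the culprit, applying it might look something like this - a
sketch only, assuming the class as posted, with the release done in a
finally: so it happens even if the matching code raises:

from xml.dom.ext.reader import HtmlLib

class callno_finder:
    def __init__(self):
        self.root = None

    def find(self, title, uri):
        reader = HtmlLib.Reader()
        self.root = reader.fromUri(uri)
        try:
            pass   # ... match the title against the parsed page here ...
        finally:
            # 4DOM nodes are heavily cross-linked (parent/child/ownerDocument
            # references), so tear the tree down explicitly when done with it
            # rather than leaving each page's tree hanging around.
            reader.releaseNode(self.root)
            self.root = None

Doing the release inside find() keeps each tree's lifetime bounded to one
iteration of the loop.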


John
 
