Memory problems (garbage collection)

Carbon Man

Very new to Python, running 2.5 on Windows.
I am processing an XML file (7.2MB). Using the standard library I am
recursively processing each node and parsing it. The branches don't go
particularly deep. What is happening is that the program is running really,
really slowly, so slowly that even after running overnight it still doesn't
finish.
Stepping through it I have noticed that memory usage has shot up from 190MB
to 624MB and continues to climb. If I set a breakpoint and then stop the
program, the memory is not released. It is not until I shut down PythonWin
that the memory gets released.
I thought this might mean objects were not getting GCed, so through the
interactive window I imported gc. gc.garbage is empty. gc.collect() seems to
fix the problem (after much thinking) and reports 2524104. Running it again
returns 0.
I thought that garbage collection was automatic; if I use variables in a
method, do I have to del them?
I tried putting a "del node" in all my for node in .... loops but that
didn't help. collect() reports the same number. Tried putting gc.collect()
at the end of the loops but that didn't help either.
If I have the program at a break and do gc.collect() it doesn't fix it, so
whatever referencing is causing problems is still active.
My program is parsing the XML and generating a Python program for
SQLAlchemy, but the generated program never gets a chance to run; the memory
problem occurs before that. It probably has something to do with the way I am
building strings.

My apologies for the long post but without being able to see the code I
doubt anyone can give me a solid answer so here it goes (sorry for the lack
of comments):

from xml.dom import minidom
import os
import gc

class xmlProcessing:
    """ General class for XML processing"""

    def process(self, filename="", xmlString=""):
        if xmlString:
            pass
        elif filename:
            xmldoc = minidom.parse(filename)
            self.parse(xmldoc.documentElement)

    def parseBranch(self, parentNode):
        """ Process an XML branch """
        for node in parentNode.childNodes:
            try:
                parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
            except AttributeError:
                continue
            if parseMethod(node):
                continue
            self.parseBranch(node)
            del node

    def parse_Document(self, node):
        pass

    def parse_Text(self, node):
        pass

    def parse_Comment(self, node):
        pass

    def parse_Element(self, node):
        try:
            handlerMethod = getattr(self, "do_%s" % node.tagName)
        except AttributeError:
            return False
        handlerMethod(node)
        return True


class reptorParsing(xmlProcessing):
    """ Specific class for generating a SQLAlchemy program to create tables
    and populate them with data"""

    def __init__(self):
        self.schemaPreface = """from sqlalchemy import *
from sqlalchemy.ext.declarative import declarative_base
engine = create_engine('sqlite:///tutorial.db', echo=False)
metadata = MetaData()
Base = declarative_base()"""
        self.schemaTables = ""
        self.schemaFields = ""
        self.dataUpdate = ""
        self.tableDict = {}
        self.tableName = ""
        self.tables = ""

    def parse(self, parentNode):
        """Main entry point to begin processing a XML document"""
        self.parseBranch(parentNode)
        # Properties such as schemaTables and .tables are populated by the
        # various methods below
        fupdate = open(os.path.join(os.getcwd(), "update.py"), 'w')
        if self.schemaTables:
            fupdate.write("import schema\n")
            f = open(os.path.join(os.getcwd(), "schema.py"), 'w')
            f.write(self.schemaPreface + "\n" + self.schemaTables +
                    "\n" + "metadata.create_all(engine)\n" +
                    "print 'hello 2'")
            f.close()
        if self.tables:
            fupdate.write(self.tables)
        # f = open(os.path.join(os.getcwd(), "dataUpdate.py"), 'w')
        # f.write(self.dataUpdate)
        # f.close()
        fupdate.close()

    def do_TABLES(self, tableNode):
        """Process schema for tables"""
        for node in tableNode.childNodes:
            self.tableName = node.tagName
            # Define a declarative mapping class
            self.schemaTables += """\nclass %s(Base):
    __tablename__ = '%s'
""" % (self.tableName, self.tableName)
            self.schemaFields = ""
            # allow for userA = users("Billy","Bob") via a __init__()
            self.schemaInitPreface = "    def __init__(self"
            self.schemaInitBody = ""
            self.parseBranch(node)
            self.schemaInitPreface += "):\n"
            self.schemaTables += self.schemaFields + "\n" + \
                                 self.schemaInitPreface + \
                                 self.schemaInitBody + "\n"
            gc.collect()

    def do_FIELDS(self, fieldsNode):
        """Process schema for fields within tables"""
        for node in fieldsNode.childNodes:
            if self.schemaFields:
                self.schemaFields += "\n"
            cType = ""
            # The attribute type holds the type of field
            crType = node.attributes["type"].value
            if crType == u"C":
                cType = "String(length=%s)" % node.attributes["len"].value
            elif crType == u"N" and node.attributes["dec"].value == u'0':
                cType = "Integer"
            elif crType == u"N":
                cType = "Numeric(precision=%s, scale=%s)" % \
                        (node.attributes["len"].value, node.attributes["dec"].value)
            elif crType == u"L":
                cType = "Boolean"
            elif crType == u"T":
                cType = "DateTime"
            elif crType == u"D":
                cType = "Date"
            elif crType == u"M" or crType == u"G":
                cType = "Text"

            if node.attributes.getNamedItem("primary"):
                cType += ", primary_key=True"
            self.schemaFields += "    %s = Column(%s)" % (node.tagName, cType)
            self.schemaInitPreface += ", \\\n            %s" % (node.tagName)
            self.schemaInitBody += "        self.%s = %s\n" % (node.tagName, node.tagName)
            self.tableDict[self.tableName + "." + node.tagName] = crType
            del node

    def do_DATA(self, dataNode):
        """This is for processing actual data to be pushed into the tables

        Layout is DATA -> TABLE_NAME key='primary_field' -> TUPLE ->
        FIELD_NAME -> VALUE"""
        for node in dataNode.childNodes:
            self.dataUpdate = """
import time
from datetime import *
from sqlalchemy import *
from sqlalchemy.orm import *
engine = create_engine('sqlite:///tutorial.db', echo=False)
Session = sessionmaker()
Session.configure(bind=engine)
session = Session()
"""
            self.keyValue = ""
            self.keyField = node.attributes["key"].value
            self.tableName = node.tagName
            self.parseBranch(node)
            self.tables += "\nimport %s_update.py" % (self.tableName)
            f = open(os.path.join(os.getcwd(), self.tableName + "_update.py"), 'w')
            f.write(self.dataUpdate)
            f.close()
            gc.collect()

    def do_TUPLE(self, tupleNode):
        """ A TUPLE is what the XML file refers to as a table row
        Sits below a DATA child"""
        self.dataUpdate += """
entry = %s()
session.add(entry)
""" % (self.tableName)
        for node in tupleNode.childNodes:
            for dataNode in node.childNodes:
                crType = self.tableDict[self.tableName + "." + node.tagName]

                if crType == u"C" or crType == u"M":
                    cValue = '"""%s"""' % dataNode.data
                elif crType == u"T":
                    cValue = 'datetime.strptime("' + dataNode.data + '", "%Y-%m-%d %H:%M")'
                elif crType == u"D":
                    cValue = 'datetime.strptime("' + dataNode.data + '", "%Y-%m-%d")'
                else:
                    cValue = dataNode.data

                self.dataUpdate += "\nentry.%s = %s" % (node.tagName, cValue)
                del dataNode

        self.dataUpdate += "\nsession.commit()"
        del node


if __name__ == '__main__':
    replicate = reptorParsing()
    replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
    import update
 
Gerhard Häring

Carbon said:
Very new to Python, running 2.5 on Windows.
I am processing an XML file (7.2MB). Using the standard library I am
recursively processing each node and parsing it. The branches don't go
particularly deep. What is happening is that the program is running really,
really slowly, so slowly that even after running overnight it still doesn't
finish.
Stepping through it I have noticed that memory usage has shot up from 190MB
to 624MB and continues to climb.

That does indeed sound like a problem in the code. But even if the XML file
is only 7.2 MB, the XML structures and what you create out of them have
some overhead.
If I set a breakpoint and then stop the
program, the memory is not released. It is not until I shut down PythonWin
that the memory gets released.

Then you're apparently looking at VSIZE or whatever it's called on
Windows. It's the maximum memory the process ever allocated. And this
usually *never* decreases, no matter what the application (Python or
otherwise).
[GC experiments]

Unless you have circular references, in my experience automatic garbage
collection in Python works fine. I never had to mess with it myself in
10 years of Python usage.
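
For illustration (a minimal sketch, not from the original posts): a DOM tree
is exactly the circular-reference case, because every minidom node holds
references to its children *and* back to its parent. Reference counting alone
cannot reclaim such a cycle; only the cyclic collector can:

import gc

class Node(object):
    pass

# Build a tiny parent <-> child cycle, much like the parentNode/childNodes
# links inside a minidom tree.
parent = Node()
child = Node()
parent.child = child
child.parent = parent

# Drop our own references. The two objects still point at each other,
# so their reference counts never reach zero...
del parent
del child

# ...and only the cyclic garbage collector can free them.
print gc.collect()   # prints a number > 0: the cycle was found and collected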
If I have the program at a break and do gc.collect() it doesn't fix it, so
whatever referencing is causing problems is still active.
My program is parsing the XML and generating a Python program for
SQLAlchemy, but the generated program never gets a chance to run; the memory
problem occurs before that. It probably has something to do with the way I am
building strings.

Yes, you're apparently concatenating strings. A lot. Don't do that. At
least not this way:

s = ""
s += "something"
s += "else"

instead do this:

from cStringIO import StringIO

s = StringIO()
s.write("something")
s.write("else")
....
s.seek(0)
print s.read()

or

lst = []
lst.append("something")
lst.append("else")
print "".join(lst)

My apologies for the long post but without being able to see the code I
doubt anyone can give me a solid answer so here it goes (sorry for the lack
of comments): [...]

Code snipped.

Two tips: use one of the above methods for concatenating strings. This
is a common problem in Python (and in other languages; Java and C# have
StringBuilder classes for the same reason).

If you want to speed up your XML processing, use the ElementTree module
in the standard library. It's a lot easier to use and also faster than
what you're using currently. As a bonus, it can be swapped out for the
even faster lxml module (available externally, not in the standard
library) by changing a single import, for another noticeable performance
improvement.
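
A rough sketch of what that looks like (the TABLES/FIELDS layout and the
attribute names below are only assumed from the DOM code earlier in the
thread, not taken from a real request.xml):

# Python 2.5 ships a C-accelerated ElementTree in the standard library.
try:
    import xml.etree.cElementTree as ET
except ImportError:
    import xml.etree.ElementTree as ET
# Switching to the external lxml package is just another import:
# from lxml import etree as ET

tree = ET.parse("request.xml")
root = tree.getroot()

tables = root.find("TABLES")        # assumed layout, mirroring do_TABLES above
if tables is not None:
    for table in tables:
        print table.tag             # the table name
        fields = table.find("FIELDS")
        if fields is not None:
            for field in fields:
                print field.tag, field.get("type"), field.get("len")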

HTH

-- Gerhard
 
Peter Otten

Carbon said:
Very new to Python, running 2.5 on Windows.
I am processing an XML file (7.2MB). Using the standard library I am
recursively processing each node and parsing it. The branches don't go
particularly deep. What is happening is that the program is running really,
really slowly, so slowly that even after running overnight it still doesn't
finish.
Stepping through it I have noticed that memory usage has shot up from 190MB
to 624MB and continues to climb. If I set a breakpoint and then stop the
program, the memory is not released. It is not until I shut down PythonWin
that the memory gets released.
I thought this might mean objects were not getting GCed, so through the
interactive window I imported gc. gc.garbage is empty. gc.collect() seems
to fix the problem (after much thinking) and reports 2524104. Running it
again returns 0.
I thought that garbage collection was automatic; if I use variables in a
method, do I have to del them?

No. Deleting a local variable only decreases the reference count. In your
code the next iteration of the for loop, or the return from the method, has
the same effect and occurs directly after your del statements.
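
In other words (an illustrative sketch, not code from the thread; process()
is a stand-in for whatever work is done per node):

def walk(parent):
    for node in parent.childNodes:
        process(node)   # hypothetical per-node work
        del node        # redundant: the next loop iteration rebinds `node`,
                        # dropping this reference anyway, and the name
                        # disappears when the function returns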
I tried putting a "del node" in all my for node in .... loops but that
didn't help. collect() reports the same number. Tried putting gc.collect()
at the end of the loops but that didn't help either.
If I have the program at a break and do gc.collect() it doesn't fix it, so
whatever referencing is causing problems is still active.
My program is parsing the XML and generating a Python program for
SQLAlchemy, but the generated program never gets a chance to run; the memory
problem occurs before that. It probably has something to do with the way I am
building strings.

My apologies for the long post but without being able to see the code I
doubt anyone can give me a solid answer so here it goes (sorry for the
lack of comments):

First, use a small XML file to check whether your program terminates and
operates correctly. Then try disabling cyclic garbage collection with
gc.disable(). Remove the gc.collect() calls.

This will not help with the memory footprint, but sometimes, when you are
creating many new objects that you want to keep, Python spends a lot of time
looking in vain for unreachable objects -- so there may be a speedup.
from xml.dom import minidom
import os
import gc

gc.disable()

[snip more code]

Does this improve things?

Like Gerhard says, in the long run you are probably better off with
ElementTree.

Peter
 
Carbon Man

Thanks for the help.
I converted everything over to StringIO. Memory is still getting
chewed up. I will look at ElementTree later, but for now I believe the speed
issue must be related to the amount of memory that is getting used. It is
causing all of Windows to slow to a crawl. gc.collect() still reports the
same quantity as before.
Don't know what to try next. The updated program is below:

from xml.dom import minidom
import os
from cStringIO import StringIO

class xmlProcessing:
    """ General class for XML processing"""

    def process(self, filename="", xmlString=""):
        if xmlString:
            pass
        elif filename:
            xmldoc = minidom.parse(filename)
            self.parse(xmldoc.documentElement)

    def parseBranch(self, parentNode):
        """ Process an XML branch """
        for node in parentNode.childNodes:
            try:
                parseMethod = getattr(self, "parse_%s" % node.__class__.__name__)
            except AttributeError:
                continue
            if parseMethod(node):
                continue
            self.parseBranch(node)
            del node

    def parse_Document(self, node):
        pass

    def parse_Text(self, node):
        pass

    def parse_Comment(self, node):
        pass

    def parse_Element(self, node):
        try:
            handlerMethod = getattr(self, "do_%s" % node.tagName)
        except AttributeError:
            return False
        handlerMethod(node)
        return True


class reptorParsing(xmlProcessing):
    """ Specific class for generating a SQLAlchemy program to create tables
    and populate them with data"""

    def __init__(self):
        self.schemaPreface = StringIO()
        self.schemaPreface.write("""from sqlalchemy import *
from sqlalchemy.ext.declarative import declarative_base
engine = create_engine('sqlite:///tutorial.db', echo=False)
metadata = MetaData()
Base = declarative_base()""")
        self.schemaTables = StringIO()
        self.schemaFields = StringIO()
        self.dataUpdate = StringIO()
        self.tableDict = {}
        self.tableName = StringIO()
        self.tables = StringIO()

    def parse(self, parentNode):
        """Main entry point to begin processing a XML document"""
        self.parseBranch(parentNode)
        # Properties such as schemaTables and .tables are populated by the
        # various methods below
        fupdate = open(os.path.join(os.getcwd(), "update.py"), 'w')
        if self.schemaTables:
            fupdate.write("import schema\n")
            f = open(os.path.join(os.getcwd(), "schema.py"), 'w')
            f.write(self.schemaPreface + "\n" + self.schemaTables +
                    "\n" + "metadata.create_all(engine)\n" +
                    "print 'hello 2'")
            f.close()
        if self.tables:
            fupdate.write(self.tables)
        fupdate.close()

    def do_TABLES(self, tableNode):
        """Process schema for tables"""
        for node in tableNode.childNodes:
            self.tableName = node.tagName
            # Define a declarative mapping class
            self.schemaTables.write("""\nclass %s(Base):
    __tablename__ = '%s'
""" % (self.tableName, self.tableName))
            self.schemaFields = StringIO()
            # allow for userA = users("Billy","Bob") via a __init__()
            self.schemaInitPreface = StringIO()
            self.schemaInitPreface.write("    def __init__(self")
            self.schemaInitBody = StringIO()
            self.parseBranch(node)
            self.schemaInitPreface.write("):\n")
            self.schemaTables.write(self.schemaFields.read() + "\n" +
                                    self.schemaInitPreface.read() +
                                    self.schemaInitBody.read() + "\n")

    def do_FIELDS(self, fieldsNode):
        """Process schema for fields within tables"""
        for node in fieldsNode.childNodes:
            if self.schemaFields:
                self.schemaFields.write("\n")
            cType = ""
            # The attribute type holds the type of field
            crType = node.attributes["type"].value
            if crType == u"C":
                cType = "String(length=%s)" % node.attributes["len"].value
            elif crType == u"N" and node.attributes["dec"].value == u'0':
                cType = "Integer"
            elif crType == u"N":
                cType = "Numeric(precision=%s, scale=%s)" % \
                        (node.attributes["len"].value, node.attributes["dec"].value)
            elif crType == u"L":
                cType = "Boolean"
            elif crType == u"T":
                cType = "DateTime"
            elif crType == u"D":
                cType = "Date"
            elif crType == u"M" or crType == u"G":
                cType = "Text"

            if node.attributes.getNamedItem("primary"):
                cType += ", primary_key=True"
            self.schemaFields.write("    %s = Column(%s)" % (node.tagName, cType))
            self.schemaInitPreface.write(", \\\n            %s" % (node.tagName))
            self.schemaInitBody.write("        self.%s = %s\n" % (node.tagName, node.tagName))
            self.tableDict[self.tableName + "." + node.tagName] = crType

    def do_DATA(self, dataNode):
        """This is for processing actual data to be pushed into the tables

        Layout is DATA -> TABLE_NAME key='primary_field' -> TUPLE ->
        FIELD_NAME -> VALUE"""
        for node in dataNode.childNodes:
            self.tableName = node.tagName
            self.dataUpdate = open(os.path.join(os.getcwd(), self.tableName + "_update.py"), 'w')
            self.dataUpdate.write("""
import time
from datetime import *
from sqlalchemy import *
from sqlalchemy.orm import *
engine = create_engine('sqlite:///tutorial.db', echo=False)
Session = sessionmaker()
Session.configure(bind=engine)
session = Session()
""")
            self.keyValue = ""
            self.keyField = node.attributes["key"].value
            self.parseBranch(node)
            self.tables.write("\nimport %s_update.py" % (self.tableName))
            # f.write(self.dataUpdate)
            self.dataUpdate.close()

    def do_TUPLE(self, tupleNode):
        """ A TUPLE is what the XML file refers to as a table row
        Sits below a DATA child"""
        self.dataUpdate.write("""
entry = %s()
session.add(entry)
""" % (self.tableName))
        for node in tupleNode.childNodes:
            for dataNode in node.childNodes:
                crType = self.tableDict[self.tableName + "." + node.tagName]

                if crType == u"C" or crType == u"M":
                    cValue = u'"""%s"""' % dataNode.data
                elif crType == u"T":
                    cValue = 'datetime.strptime("' + dataNode.data + '", "%Y-%m-%d %H:%M")'
                elif crType == u"D":
                    cValue = 'datetime.strptime("' + dataNode.data + '", "%Y-%m-%d")'
                else:
                    cValue = dataNode.data
                self.dataUpdate.write(u"\nentry." + node.tagName + u" = " + cValue)

        self.dataUpdate.write("\nsession.commit()")


if __name__ == '__main__':
    replicate = reptorParsing()
    replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
    import update
 
Carbon Man

Thanks for the replies. I got my program working but the memory problem
remains. When the program finishes and I am brought back to PythonWin, the
memory is still tied up until I run gc.collect(). While my choice of platform
for XML processing may not be the best one (I will change it later), I am
still concerned about the memory issue. I can't believe that it could be an
ongoing issue with Python's xml.dom, but nobody was able to actually point to
anything in my code that may have caused it. I changed the way I was doing
string manipulation but, while it may have been beneficial for speed, it
didn't help the memory problem.
This is more nuts and bolts than perhaps a newbie needs to be getting into,
but it does concern me. What if my program were a web server and periodically
received these requests? I decided to try something:

if __name__ == '__main__':
    replicate = reptorParsing()
    replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
    import gc
    gc.collect()

That fixed the problem, though I wouldn't know why. I thought it might be
something to do with my program, but...

if __name__ == '__main__':
    replicate = reptorParsing()
    replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
    del replicate

Did not resolve the memory problem.

Any ideas?
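
One thing that would explain the behaviour (a sketch under that assumption,
not something posted in this thread): every minidom node keeps
parentNode/childNodes references, so a parsed document is one large web of
reference cycles that only the cyclic collector (gc.collect()) will pick up.
minidom's documented unlink() method breaks those cycles explicitly, so plain
reference counting can free the tree as soon as the last reference is gone:

from xml.dom import minidom
import os

if __name__ == '__main__':
    replicate = reptorParsing()   # class from the listings above
    # Hypothetical variant of process(): keep hold of the document so its
    # cycles can be broken explicitly once parsing is done.
    xmldoc = minidom.parse(os.path.join(os.getcwd(), "request.xml"))
    try:
        replicate.parse(xmldoc.documentElement)
    finally:
        xmldoc.unlink()   # break the parent/child cycles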
 
Paul Hemans

Taking into account that I am very new to Python and so must be missing
something important.... dumping xml.dom and going to lxml made a WORLD of
difference to the performance of the application.
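
For later readers (a sketch, not from the thread, assuming a reasonably
recent lxml): if memory is also a concern, lxml offers iterparse, which
streams the file and lets each element be discarded as soon as it has been
handled. The 'TUPLE' tag is borrowed from the layout described earlier;
adjust it to the real file:

from lxml import etree   # external package; not in the standard library

# Stream request.xml instead of building the whole tree in memory.
for event, elem in etree.iterparse("request.xml", tag="TUPLE"):
    # ... handle one row here ...
    elem.clear()                      # free the element's own children
    while elem.getprevious() is not None:
        del elem.getparent()[0]       # drop already-processed siblings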
 
