Memory problems (garbage collection)

Discussion in 'Python' started by Carbon Man, Apr 23, 2009.

  1. Carbon Man

    Carbon Man Guest

    Very new to Python, running 2.5 on Windows.
    I am processing an XML file (7.2MB). Using the standard library I am
    recursively processing each node and parsing it. The branches don't go
    particularly deep. The program runs really, really slowly: so slowly that
    even when I leave it running overnight it still doesn't finish.
    Stepping through it I have noticed that memory usage has shot up from 190MB
    to 624MB and continues to climb. If I set a breakpoint and then stop the
    program, the memory is not released. It is not until I shut down PythonWin
    that the memory gets released.
    I thought this might mean objects were not getting GCed, so through the
    interactive window I imported gc. gc.garbage is empty. gc.collect() seems to
    fix the problem (after much thinking) and reports 2524104. Running it again
    returns 0.
    I thought that garbage collection was automatic. If I use variables in a
    method, do I have to del them?
    I tried putting a "del node" in all my "for node in ..." loops but that
    didn't help. collect() reports the same number. I tried putting gc.collect()
    at the end of the loops but that didn't help either.
    If I have the program at a break and do gc.collect() it doesn't fix it, so
    whatever referencing is causing problems is still active.
    My program is parsing the XML and generating a Python program for
    SQLAlchemy, but the program never gets a chance to run; the memory problem
    occurs prior to that. It probably has something to do with the way I am
    building strings.

    My apologies for the long post but without being able to see the code I
    doubt anyone can give me a solid answer so here it goes (sorry for the lack
    of comments):

    from xml.dom import minidom
    import os
    import gc

    class xmlProcessing:
        """ General class for XML processing"""

        def process(self, filename="", xmlString=""):
            if xmlString:
                pass
            elif filename:
                xmldoc = minidom.parse(filename)
                self.parse( xmldoc.documentElement )

        def parseBranch(self, parentNode):
            """ Process an XML branch """
            for node in parentNode.childNodes:
                try:
                    parseMethod = getattr(self, "parse_%s" %
                                          node.__class__.__name__)
                except AttributeError:
                    continue
                if parseMethod(node):
                    continue
                self.parseBranch(node)
                del node

        def parse_Document(self, node):
            pass

        def parse_Text(self, node):
            pass

        def parse_Comment(self, node):
            pass

        def parse_Element(self, node):
            try:
                handlerMethod = getattr(self, "do_%s" % node.tagName)
            except AttributeError:
                return False
            handlerMethod(node)
            return True

    class reptorParsing(xmlProcessing):
        """ Specific class for generating a SQLalchemy program to create tables
        and populate them with data"""

        def __init__(self):
            self.schemaPreface = """from sqlalchemy import *
    from sqlalchemy.ext.declarative import declarative_base
    engine = create_engine('sqlite:///tutorial.db', echo=False)
    metadata = MetaData()
    Base = declarative_base()"""
            self.schemaTables = ""
            self.schemaFields = ""
            self.dataUpdate = ""
            self.tableDict = {}
            self.tableName = ""
            self.tables = ""

        def parse(self, parentNode):
            """Main entry point to begin processing a XML document"""
            self.parseBranch(parentNode)
            # Properties such as schemaTables and .tables are populated by the
            # various methods below
            fupdate=open(os.path.join(os.getcwd(), "update.py"), 'w')
            if self.schemaTables:
                fupdate.write("import schema\n")
                f=open(os.path.join(os.getcwd(), "schema.py"), 'w')
                f.write(self.schemaPreface+"\n"+self.schemaTables+
                    '\n' + "metadata.create_all(engine)\n"+
                    "print 'hello 2'")
                f.close()
            if self.tables:
                fupdate.write(self.tables)
            # f=open(os.path.join(os.getcwd(), "dataUpdate.py"), 'w')
            # f.write(self.dataUpdate)
            # f.close()
            fupdate.close()

        def do_TABLES(self, tableNode):
            """Process schema for tables"""
            for node in tableNode.childNodes:
                self.tableName = node.tagName
                # Define a declarative mapping class
                self.schemaTables += """\nclass %s(Base):
        __tablename__ = '%s'
    """ % (self.tableName, self.tableName)
                self.schemaFields = ""
                # allow for userA = users("Billy","Bob") via a __init__()
                self.schemaInitPreface = " def __init__(self"
                self.schemaInitBody = ""
                self.parseBranch(node)
                self.schemaInitPreface += "):\n"
                self.schemaTables += self.schemaFields + "\n" + \
                    self.schemaInitPreface + \
                    self.schemaInitBody + "\n"
                gc.collect()

        def do_FIELDS(self, fieldsNode):
            """Process schema for fields within tables"""
            for node in fieldsNode.childNodes:
                if self.schemaFields:
                    self.schemaFields += "\n"
                cType = ""
                # The attribute type holds the type of field
                crType = node.attributes["type"].value
                if crType==u"C":
                    cType = "String(length=%s)" % node.attributes["len"].value
                elif crType==u"N" and node.attributes["dec"].value==u'0':
                    cType = "Integer"
                elif crType==u"N":
                    cType = "Numeric(precision=%s, scale=%s)" % \
                        (node.attributes["len"].value,node.attributes["dec"].value)
                elif crType==u"L":
                    cType = "Boolean"
                elif crType==u"T":
                    cType = "DateTime"
                elif crType==u"D":
                    cType = "Date"
                elif crType==u"M" or crType==u"G":
                    cType = "Text"

                if node.attributes.getNamedItem("primary"):
                    cType += ", primary_key=True"
                self.schemaFields += " %s = Column(%s)" % (node.tagName,
                    cType)
                self.schemaInitPreface += ", \\\n %s" % (node.tagName)
                self.schemaInitBody += " self.%s = %s\n" % \
                    (node.tagName, node.tagName)
                self.tableDict[self.tableName + "." + node.tagName] = crType
                del node

        def do_DATA(self, dataNode):
            """This is for processing actual data to be pushed into the tables

            Layout is DATA -> TABLE_NAME key='primary_field' -> TUPLE ->
            FIELD_NAME -> VALUE"""
            for node in dataNode.childNodes:
                self.dataUpdate = """
    import time
    from datetime import *
    from sqlalchemy import *
    from sqlalchemy.orm import *
    engine = create_engine('sqlite:///tutorial.db', echo=False)
    Session = sessionmaker()
    Session.configure(bind=engine)
    session = Session()
    """
                self.keyValue = ""
                self.keyField = node.attributes["key"].value
                self.tableName = node.tagName
                self.parseBranch(node)
                self.tables += "\nimport %s_update" % (self.tableName)
                f=open(os.path.join(os.getcwd(), self.tableName + "_update.py"),
                    'w')
                f.write(self.dataUpdate)
                f.close()
                gc.collect()

        def do_TUPLE(self, tupleNode):
            """ A TUPLE is what the XML file refers to a table row
            Sits below a DATA child"""
            self.dataUpdate += """
    entry = %s()
    session.add(entry)
    """ % (self.tableName)
            for node in tupleNode.childNodes:
                for dataNode in node.childNodes:
                    crType = self.tableDict[self.tableName + "." + node.tagName]

                    if crType==u"C" or crType==u"M":
                        cValue = '"""%s"""' % dataNode.data
                    elif crType==u"T":
                        cValue = 'datetime.strptime("'+dataNode.data+'", "%Y-%m-%d %H:%M")'
                    elif crType==u"D":
                        cValue = 'datetime.strptime("'+dataNode.data+'", "%Y-%m-%d")'
                    else:
                        cValue = dataNode.data

                    self.dataUpdate += "\nentry.%s = %s" % (node.tagName,
                        cValue)
                    del dataNode

            self.dataUpdate += "\nsession.commit()"
            del node



    if __name__ == '__main__':
        replicate = reptorParsing()
        replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
        import update
     
    Carbon Man, Apr 23, 2009
    #1

  2. Carbon Man wrote:
    > Very new to Python, running 2.5 on Windows.
    > I am processing an XML file (7.2MB). Using the standard library I am
    > recursively processing each node and parsing it. The branches don't go
    > particularly deep. The program runs really, really slowly: so slowly that
    > even when I leave it running overnight it still doesn't finish.
    > Stepping through it I have noticed that memory usage has shot up from 190MB
    > to 624MB and continues to climb.


    That sounds indeed like a problem in the code. But even if the XML file
    is only 7.2 MB the XML structures and what you create out of them have
    some overhead.

    > If I set a breakpoint and then stop the
    > program, the memory is not released. It is not until I shut down PythonWin
    > that the memory gets released.


    Then you're apparently looking at VSIZE or whatever it's called on
    Windows. It's the maximum memory the process ever allocated. And this
    usually *never* decreases, no matter what the application (Python or
    otherwise).

    > [GC experiments]


    Unless you have circular references, in my experience automatic garbage
    collection in Python works fine. I never had to mess with it myself in
    10 years of Python usage.
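
    To make "circular references" concrete, here is a minimal sketch; it is
    not taken from the program in this thread. The Node class and the
    cycle_demo function are hypothetical names invented for the example, and
    the code is written for a current CPython 3 rather than the 2.5 discussed
    here. A parent and child that point at each other survive "del" because
    their reference counts never reach zero; only the cyclic collector
    reclaims them.

```python
import gc
import weakref


class Node(object):
    """Hypothetical tree node that points back at its parent, like a DOM node."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def cycle_demo():
    """Return (alive_after_del, found_by_collect, alive_after_collect)."""
    gc.collect()                      # start from a clean slate
    gc.disable()                      # keep the automatic collector out of the way
    try:
        root = Node()
        child = Node(root)            # root -> child -> root forms a cycle
        probe = weakref.ref(root)     # lets us check whether root still exists
        del root, child               # the names go away, the objects do not
        alive_after_del = probe() is not None
        found = gc.collect()          # number of unreachable objects found
        alive_after_collect = probe() is not None
    finally:
        gc.enable()
    return alive_after_del, found, alive_after_collect


print(cycle_demo())
```

    The middle value is what gc.collect() returns: the number of unreachable
    objects it found, which is the same kind of figure as the 2524104 quoted
    above.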

    > If I have the program at a break and do gc.collect() it doesn't fix it, so
    > whatever referencing is causing problems is still active.
    > My program is parsing the XML and generating a Python program for
    > SQLAlchemy, but the program never gets a chance to run; the memory problem
    > occurs prior to that. It probably has something to do with the way I am
    > building strings.


    Yes, you're apparently concatenating strings. A lot. Don't do that. At
    least not this way:

    s = ""
    s += "something"
    s += "else"

    instead do this:

    from cStringIO import StringIO

    s = StringIO()
    s.write("something")
    s.write("else")
    ....
    s.seek(0)
    print s.read()

    or

    lst = []
    lst.append("something")
    lst.append("else")
    print "".join(lst)
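
    For readers on a newer interpreter: the two snippets above are Python 2
    code, and cStringIO no longer exists in Python 3. A rough equivalent,
    assuming Python 3 and its io.StringIO, might look like the sketch below.
    The function names are made up for illustration.

```python
import io


def build_with_join(parts):
    """Collect the pieces in a list and join them once at the end."""
    pieces = []
    for part in parts:
        pieces.append(part)
    return "".join(pieces)


def build_with_stringio(parts):
    """Write the pieces into an in-memory text buffer."""
    buf = io.StringIO()
    for part in parts:
        buf.write(part)
    # getvalue() returns the whole buffer regardless of the current position,
    # so no seek(0) is needed as it would be before read().
    return buf.getvalue()


print(build_with_join(["something", "else"]))
print(build_with_stringio(["something", "else"]))
```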


    > My apologies for the long post but without being able to see the code I
    > doubt anyone can give me a solid answer so here it goes (sorry for the lack
    > of comments): [...]


    Code snipped.

    Two tips: Use one of the above methods for concatenating strings. This
    is a common problem in Python (and other languages, Java and C# also
    have StringBuilder classes because of this).

    If you want to speed up your XML processing, use the ElementTree module
    in the standard library. It's a lot easier to use and also faster than
    what you're using currently. A bonus is that it can be swapped out for the
    even faster lxml module (externally available, not in the standard
    library) by changing a single import, for another noticeable performance
    improvement.
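
    As a rough idea of what that could look like, here is a sketch using
    ElementTree's iterparse, which is one way (my choice, not necessarily
    what was meant above) to walk a large file without holding the whole
    tree. The element names follow the DATA -> TABLE_NAME -> TUPLE ->
    FIELD_NAME layout described in the original post; the sample document
    and the count_fields function are invented for the example, and the code
    assumes Python 3.

```python
import io
import xml.etree.ElementTree as ET

# Invented sample data in the layout described in the original post.
SAMPLE = b"""<DATA><users key="id">
<TUPLE><id>1</id><name>Bob</name></TUPLE>
<TUPLE><id>2</id></TUPLE>
</users></DATA>"""


def count_fields(source, tag="TUPLE"):
    """Stream through source and count the child elements of every <tag>."""
    counts = []
    # An "end" event fires once an element and all its children are parsed.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            counts.append(len(elem))   # len() of an element is its child count
            elem.clear()               # drop the subtree we have finished with
    return counts


print(count_fields(io.BytesIO(SAMPLE)))
```

    Calling elem.clear() after each row is what keeps memory flat on large
    inputs; without it the finished rows stay attached to their parent.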

    HTH

    -- Gerhard
     
    Gerhard Häring, Apr 23, 2009
    #2
  4. Peter Otten

    Peter Otten Guest

    Carbon Man wrote:

    > Very new to Python, running 2.5 on Windows.
    > I am processing an XML file (7.2MB). Using the standard library I am
    > recursively processing each node and parsing it. The branches don't go
    > particularly deep. The program runs really, really slowly: so slowly that
    > even when I leave it running overnight it still doesn't finish.
    > Stepping through it I have noticed that memory usage has shot up from
    > 190MB to 624MB and continues to climb. If I set a breakpoint and then
    > stop the program, the memory is not released. It is not until I shut down
    > PythonWin that the memory gets released.
    > I thought this might mean objects were not getting GCed, so through the
    > interactive window I imported gc. gc.garbage is empty. gc.collect() seems
    > to fix the problem (after much thinking) and reports 2524104. Running it
    > again returns 0.
    > I thought that garbage collection was automatic. If I use variables in a
    > method, do I have to del them?


    No. Deleting a local variable only decreases the reference count. In your
    code the next iteration of the for loop or returning from the method have
    the same effect and occur directly after your del statements.
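
    A small sketch of that point, written for Python 3 with made-up names
    (Payload, del_in_loop, loop_releases_without_del): "del node" removes
    only the local name, so an object that is still held by a list survives,
    while rebinding the loop variable already releases the previous object
    with no "del" at all.

```python
import weakref


class Payload(object):
    """Hypothetical stand-in for a DOM node."""


def del_in_loop(nodes):
    """Delete the loop variable each time; return how many objects survive."""
    probes = [weakref.ref(n) for n in nodes]
    for node in nodes:
        del node          # drops one reference; the list still holds the object
    return sum(1 for p in probes if p() is not None)


def loop_releases_without_del(n=3):
    """Return, per iteration, how many earlier loop objects are still alive."""
    probes = []
    alive_counts = []
    for _ in range(n):
        item = Payload()  # rebinding 'item' releases the previous Payload
        alive_counts.append(sum(1 for p in probes if p() is not None))
        probes.append(weakref.ref(item))
    return alive_counts


nodes = [Payload() for _ in range(4)]
print(del_in_loop(nodes))
print(loop_releases_without_del())
```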

    > I tried putting a "del node" in all my "for node in ..." loops but that
    > didn't help. collect() reports the same number. I tried putting gc.collect()
    > at the end of the loops but that didn't help either.
    > If I have the program at a break and do gc.collect() it doesn't fix it, so
    > whatever referencing is causing problems is still active.
    > My program is parsing the XML and generating a Python program for
    > SQLAlchemy, but the program never gets a chance to run; the memory problem
    > occurs prior to that. It probably has something to do with the way I am
    > building strings.
    >
    > My apologies for the long post but without being able to see the code I
    > doubt anyone can give me a solid answer so here it goes (sorry for the
    > lack of comments):


    First, use a small xml file to check if your program terminates and operates
    correctly. Then try disabling cyclic garbage collection with gc.disable().
    Remove the gc.collect() calls.

    This will not help with the memory footprint, but sometimes, when you are
    creating many new objects that you want to keep, Python spends a lot of time
    in vain looking for unreachable objects -- so there may be a speedup.
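
    A minimal sketch of that idea, assuming you want the collector switched
    off only while the objects are being built and restored afterwards. The
    function name and the dictionaries it builds are placeholders, not part
    of the program under discussion.

```python
import gc


def build_without_cyclic_gc(n):
    """Build n small containers with the cyclic collector switched off."""
    was_enabled = gc.isenabled()
    gc.disable()                 # reference counting still frees acyclic objects
    try:
        rows = [{"id": i, "tags": []} for i in range(n)]
    finally:
        if was_enabled:
            gc.enable()          # restore the previous setting
    return rows


rows = build_without_cyclic_gc(1000)
print(len(rows))
```

    The try/finally is there so that an exception during the build does not
    leave the collector disabled for the rest of the process.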

    > from xml.dom import minidom
    > import os
    > import gc


    gc.disable()

    [snip more code]

    Does this improve things?

    Like Gerhard says, in the long run you are probably better off with
    ElementTree.

    Peter
     
    Peter Otten, Apr 23, 2009
    #4
  5. Carbon Man

    Carbon Man Guest

    Thanks for the help.
    I converted everything into the StringIO() format. Memory is still getting
    chewed up. I will look at ElementTree later but for now I believe the speed
    issue must be related to the amount of memory that is getting used. It is
    causing all of Windows to slow to a crawl. gc.collect() still reports the
    same quantity as before.
    I don't know what to try next. The updated program is below:

    from xml.dom import minidom
    import os
    from cStringIO import StringIO

    class xmlProcessing:
        """ General class for XML processing"""

        def process(self, filename="", xmlString=""):
            if xmlString:
                pass
            elif filename:
                xmldoc = minidom.parse(filename)
                self.parse( xmldoc.documentElement )

        def parseBranch(self, parentNode):
            """ Process an XML branch """
            for node in parentNode.childNodes:
                try:
                    parseMethod = getattr(self, "parse_%s" %
                                          node.__class__.__name__)
                except AttributeError:
                    continue
                if parseMethod(node):
                    continue
                self.parseBranch(node)
                del node

        def parse_Document(self, node):
            pass

        def parse_Text(self, node):
            pass

        def parse_Comment(self, node):
            pass

        def parse_Element(self, node):
            try:
                handlerMethod = getattr(self, "do_%s" % node.tagName)
            except AttributeError:
                return False
            handlerMethod(node)
            return True

    class reptorParsing(xmlProcessing):
        """ Specific class for generating a SQLalchemy program to create tables
        and populate them with data"""

        def __init__(self):
            self.schemaPreface = StringIO()
            self.schemaPreface.write("""from sqlalchemy import *
    from sqlalchemy.ext.declarative import declarative_base
    engine = create_engine('sqlite:///tutorial.db', echo=False)
    metadata = MetaData()
    Base = declarative_base()""")
            self.schemaTables = StringIO()
            self.schemaFields = StringIO()
            self.dataUpdate = StringIO()
            self.tableDict = {}
            self.tableName = StringIO()
            self.tables = StringIO()

        def parse(self, parentNode):
            """Main entry point to begin processing a XML document"""
            self.parseBranch(parentNode)
            # Properties such as schemaTables and .tables are populated by the
            # various methods below
            fupdate=open(os.path.join(os.getcwd(), "update.py"), 'w')
            if self.schemaTables.getvalue():
                fupdate.write("import schema\n")
                f=open(os.path.join(os.getcwd(), "schema.py"), 'w')
                f.write(self.schemaPreface.getvalue()+"\n"+
                    self.schemaTables.getvalue()+
                    '\n' + "metadata.create_all(engine)\n"+
                    "print 'hello 2'")
                f.close()
            if self.tables.getvalue():
                fupdate.write(self.tables.getvalue())
            fupdate.close()

        def do_TABLES(self, tableNode):
            """Process schema for tables"""
            for node in tableNode.childNodes:
                self.tableName = node.tagName
                # Define a declarative mapping class
                self.schemaTables.write("""\nclass %s(Base):
        __tablename__ = '%s'
    """ % (self.tableName, self.tableName))
                self.schemaFields = StringIO()
                # allow for userA = users("Billy","Bob") via a __init__()
                self.schemaInitPreface = StringIO()
                self.schemaInitPreface.write(" def __init__(self")
                self.schemaInitBody = StringIO()
                self.parseBranch(node)
                self.schemaInitPreface.write("):\n")
                self.schemaTables.write(self.schemaFields.getvalue() + "\n" + \
                    self.schemaInitPreface.getvalue() + \
                    self.schemaInitBody.getvalue() + "\n")

        def do_FIELDS(self, fieldsNode):
            """Process schema for fields within tables"""
            for node in fieldsNode.childNodes:
                if self.schemaFields.getvalue():
                    self.schemaFields.write("\n")
                cType = ""
                # The attribute type holds the type of field
                crType = node.attributes["type"].value
                if crType==u"C":
                    cType = "String(length=%s)" % node.attributes["len"].value
                elif crType==u"N" and node.attributes["dec"].value==u'0':
                    cType = "Integer"
                elif crType==u"N":
                    cType = "Numeric(precision=%s, scale=%s)" % \
                        (node.attributes["len"].value,node.attributes["dec"].value)
                elif crType==u"L":
                    cType = "Boolean"
                elif crType==u"T":
                    cType = "DateTime"
                elif crType==u"D":
                    cType = "Date"
                elif crType==u"M" or crType==u"G":
                    cType = "Text"

                if node.attributes.getNamedItem("primary"):
                    cType += ", primary_key=True"
                self.schemaFields.write(" %s = Column(%s)" % (node.tagName,
                    cType))
                self.schemaInitPreface.write(", \\\n %s" %
                    (node.tagName))
                self.schemaInitBody.write(" self.%s = %s\n" %
                    (node.tagName, node.tagName))
                self.tableDict[self.tableName + "." + node.tagName] = crType

        def do_DATA(self, dataNode):
            """This is for processing actual data to be pushed into the tables

            Layout is DATA -> TABLE_NAME key='primary_field' -> TUPLE ->
            FIELD_NAME -> VALUE"""
            for node in dataNode.childNodes:
                self.tableName = node.tagName
                self.dataUpdate=open(os.path.join(os.getcwd(), self.tableName +
                    "_update.py"), 'w')
                self.dataUpdate.write("""
    import time
    from datetime import *
    from sqlalchemy import *
    from sqlalchemy.orm import *
    engine = create_engine('sqlite:///tutorial.db', echo=False)
    Session = sessionmaker()
    Session.configure(bind=engine)
    session = Session()
    """)
                self.keyValue = ""
                self.keyField = node.attributes["key"].value
                self.parseBranch(node)
                self.tables.write("\nimport %s_update" % (self.tableName))
                # f.write(self.dataUpdate)
                self.dataUpdate.close()

        def do_TUPLE(self, tupleNode):
            """ A TUPLE is what the XML file refers to a table row
            Sits below a DATA child"""
            self.dataUpdate.write("""
    entry = %s()
    session.add(entry)
    """ % (self.tableName))
            for node in tupleNode.childNodes:
                for dataNode in node.childNodes:
                    crType = self.tableDict[self.tableName + "." + node.tagName]

                    if crType==u"C" or crType==u"M":
                        cValue = u'"""%s"""' % dataNode.data
                    elif crType==u"T":
                        cValue = 'datetime.strptime("'+dataNode.data+'", "%Y-%m-%d %H:%M")'
                    elif crType==u"D":
                        cValue = 'datetime.strptime("'+dataNode.data+'", "%Y-%m-%d")'
                    else:
                        cValue = dataNode.data
                    self.dataUpdate.write(u"\nentry."+node.tagName+ u" = " +
                        cValue)

            self.dataUpdate.write("\nsession.commit()")

    if __name__ == '__main__':
        replicate = reptorParsing()
        replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
        import update
     
    Carbon Man, Apr 23, 2009
    #5
  6. Carbon Man

    Carbon Man Guest

    (UPDATE) Memory problems (garbage collection)

    Thanks for the replies. I got my program working but the memory problem
    remains. When the program finishes and I am brought back to PythonWin,
    the memory is still tied up until I run gc.collect(). While my choice of
    platform for XML processing may not be the best one (I will change it later),
    I am still concerned about the memory issue. I can't believe that it could be
    an ongoing issue for Python's xml.dom, but nobody was able to actually point
    to anything in my code that may have caused it. I changed the way I was
    doing string manipulation but, while it may have been beneficial for speed,
    it didn't help the memory problem.
    This is more nuts and bolts than perhaps a newbie needs to be getting into
    but it does concern me. What if my program was a web server and periodically
    received these requests? I decided to try something:

    if __name__ == '__main__':
        replicate = reptorParsing()
        replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
        import gc
        gc.collect()

    Fixed the problem, though I wouldn't know why. Thought it might be something
    to do with my program but...

    if __name__ == '__main__':
        replicate = reptorParsing()
        replicate.process(filename=os.path.join(os.getcwd(), "request.xml"))
        del replicate

    Did not resolve the memory problem.

    Any ideas?
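
    One possible explanation, offered as an assumption rather than a
    diagnosis of the program above: every minidom node refers to its parent
    and every parent refers to its children, so a parsed document is full of
    reference cycles. "del replicate" would not touch those, whereas
    gc.collect() runs the cyclic collector that can reclaim them. minidom
    also provides unlink() on nodes to break those links by hand once you
    are done with a document. The sketch below assumes Python 3; the
    function name and the sample XML are made up.

```python
from xml.dom import minidom


def parse_and_release(xml_text):
    """Parse, read what is needed, then break the DOM's internal cycles."""
    doc = minidom.parseString(xml_text)
    root = doc.documentElement
    # The document holds the root and the root points back at the document.
    had_cycle = root.parentNode is doc
    tags = [n.tagName for n in root.childNodes
            if n.nodeType == n.ELEMENT_NODE]
    doc.unlink()                               # break parent/child links
    cycle_broken = root.parentNode is None     # the back-reference is gone
    return tags, had_cycle, cycle_broken


print(parse_and_release("<r><a/><b/></r>"))
```

    After unlink() the document must not be used again, so anything needed
    from it has to be copied out first, as the tag list is here.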
     
    Carbon Man, Apr 24, 2009
    #6
  7. Paul Hemans

    Paul Hemans Guest

    Memory problems - fixed!

    Taking into account that I am very new to Python and so must be missing
    something important.... dumping xml.dom and going to lxml made a WORLD of
    difference to the performance of the application.
     
    Paul Hemans, Apr 29, 2009
    #7