Re: python gc performance in large apps

Discussion in 'Python' started by Robby Dermody, Nov 4, 2005.

  1. Hey guys,

    An update (along with a request for paid help at the end): Over the past
    week and a half I've improved the memory usage situation by quite a bit.
    Going to python 2.4, linux kernel 2.6, twisted 2.0 and altering some
    code have reduced the severity by an order of magnitude. However, on a
    simulated call center environment (constant 50 conversations, switching
    every 300 seconds) the director component still consumes an average of
    1MB more per hour and the harvester is taking an average of 4MB more per
    hour. With the director, 2/3 of this is resident (1/3 in swap). With the
    harvester, about 88% of this is resident (~ 12% in swap).

    After looking into things more, based on the helpful input I got from
    the very intelligent individuals who responded to the thread, I know a
    few more things:

    -Neither component has any uncollectable trash problems. gc.garbage
    never has any items after a full collection (and there are never any
    uncollectable cycles).

    -Running the harvester through valgrind under very light load showed 2
    minor memory leaks of my making in the "harvester core" (pyrex code).
    Both were fixed. Subsequent runs yielded nothing in my pyrex modules.

    -len(gc.get_objects()) will show linearly increasing counts over time
    with the director (that match the curve of the director's memory usage
    exactly), but with the harvester, the object count doesn't increase over
    time (although memory usage does). This might mean that we are dealing
    with two separate problems, one on each component: uncollected object
    count growth on the director, and something else on the harvester. Or
    the harvester may have an object count growth problem as well, just in
    a C module, in a place not visible to gc.get_objects()?

    -I have some code that looks through gc.get_objects() to find the 10
    biggest lists and dicts, and I see no telltale signs there (nothing
    growing out of control).

    -Running the harvester on python 2.4 (and 2.5, same results) compiled
    with COUNT_ALLOCS reported some interesting results with lists and dicts
    that may explain some of the memory usage:
    t = 0:
    list alloced: 19167, freed: 8231, max in use: 10937
    dict alloced: 47943, freed: 34108, max in use: 13837
    t = 120 seconds (1st run after being fully initialized):
    list alloced: 2394620, freed: 17565, max in use: 2377056
    dict alloced: 2447968, freed: 67999, max in use: 2379969
    t = ~20 hours:
    list alloced: 834032375, freed: 4625810, max in use: 829406570
    dict alloced: 845149422, freed: 15715727, max in use: 829433695

    The numbers for the other types mostly looked normal, but these two kept
    growing after every subsequent call to sys.getcounts(). I have not yet
    run the director with COUNT_ALLOCS, due to a problem dynamically
    loading in the kinterbasdb (Firebird SQL interface) module.

    -Memory usage is constant on both components when x conversations are
    started but never stopped. The code that handles that is just in the
    harvester, and is in pure C (in pyrex)... when a conversation is started
    or stopped, etc, python and pyrex code in both components runs, and
    things like twisted PB are used. This 1MB/hour and 4MB/hour increase is
    happening just by having the active convos stop and restart, but I can't
    see what would cause that kind of growth.

    -Running on Python 2.5 (latest CVS HEAD) didn't change anything
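
The object-count checks in the list above (len(gc.get_objects()) and the scan for the largest lists and dicts) can be taken one step further by comparing per-type counts between two points in time. What follows is only a minimal sketch, written for a current Python 3 rather than the 2.4 used in this thread. The helper names type_census and census_growth are made up for illustration and are not from the poster's code.

```python
import gc
from collections import Counter


def type_census():
    """Count live, gc-tracked objects by type name."""
    return Counter(type(obj).__name__ for obj in gc.get_objects())


def census_growth(before, after, top=10):
    """Return the `top` types whose live count grew most between two censuses."""
    growth = {
        name: after[name] - before[name]
        for name in after
        if after[name] > before[name]
    }
    return sorted(growth.items(), key=lambda item: item[1], reverse=True)[:top]


if __name__ == "__main__":
    before = type_census()
    suspects = [{} for _ in range(1000)]  # stand-in for objects that pile up
    suspects.extend([n] for n in range(1000))
    after = type_census()
    for name, delta in census_growth(before, after):
        print(name, delta)
```

Like gc.get_objects() itself, this only sees objects tracked by the collector. Memory allocated directly in C (for example with malloc in pyrex code) will not show up, which is consistent with the harvester's flat object count.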


    At this point I think I've done all that I can do on my own with this,
    given my meager level of knowledge in this field :). I would be
    interested in retaining some professional help to see if, working
    together, we can trace down some likely culprits. I am also very
    interested in learning more on how to avoid these problems in the
    future, if they are my doing. I doubt we can remove all the memory
    problems, but if we can get this down by another 150 - 200%, that will
    be good enough for me.

    With that said, I would be prepared to pay someone to help me out with
    this problem (and offer tips on improving the code... I am very
    interested in improving both my skills and the code's quality). The
    director and harvester programs can both run on a single box running
    Linux kernel 2.6, python 2.4. There aren't that many dependencies, and I
    have a simple loadtester program that can be used to stress test the
    environment. I've also already written a module for both components that
    logs a bunch of useful memory usage statistics out to disk at whatever
    interval desired.
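
The logging module itself is not shown in the thread. As a rough idea of how this kind of sampling might be done on Linux, here is a sketch that assumes the figures come from /proc/<pid>/status. The function names and the log line format are hypothetical.

```python
import time

FIELDS = ("VmSize", "VmRSS", "VmSwap")  # virtual, resident, swapped (all in kB)


def parse_status(text):
    """Extract the memory fields from the text of /proc/<pid>/status."""
    usage = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in FIELDS and rest.split():
            usage[key] = int(rest.split()[0])  # e.g. "VmRSS:   7890 kB"
    return usage


def sample_memory(pid="self"):
    """Read the current memory usage of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        return parse_status(status.read())


def append_sample(path, pid="self"):
    """Append one 'timestamp VmSize VmRSS VmSwap' line to a log file."""
    usage = sample_memory(pid)
    with open(path, "a") as log:
        values = " ".join(str(usage.get(field, 0)) for field in FIELDS)
        log.write(f"{int(time.time())} {values}\n")
    return usage
```

Calling append_sample from a timer at a fixed interval would give resident and virtual size series of the sort plotted in the original post.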

    I would be looking for an experienced python developer with the
    following knowledge:

    -python internals, python "best practices"
    -twisted framework
    -pyrex (and python C API)
    -previous experience fixing similar problems in python

    If anyone is interested in helping out, or knows someone who might be
    interested, please get back to me with your asking rate (hourly, or by
    the project if you prefer). We can go from there.

    Also, as this project is starting to generate revenue, I will be hiring
    someone full time soon to take over primary development. This person
    doesn't need to be a python god. Someone with 2-3+ years of experience
    with python, subversion, etc. would be good enough, as long as they are
    a geek at heart, intelligent and adaptive, passionate about this kind of
    work and self motivated. Even a bright recent college grad with open
    source experience under his/her belt would be fine. If anyone knows of
    anyone who might love this kind of job, let me know. This is truly a
    _very_ interesting product.

    Robby


    Robby Dermody wrote:
    > Hey guys (thus begins a book of a post :),
    >
    > I'm in the process of writing a commercial VoIP call monitoring and
    > recording application suite in python and pyrex. Basically, this
    > software sits in a VoIP callcenter-type environment (complete with agent
    > phones and VoIP servers), sniffs voice data off of the network, and
    > allows users to listen into calls. It can record calls as well. The
    > project is about a year and 3 months in the making and lately the
    > codebase has stabilized enough to where it can be used by some of our
    > clients. The entire project has about 37,000 lines of python and pyrex
    > code (along with 1-2K lines of unrelated java code).
    >
    > Now, some disjointed rambling about the architecture of this software.
    > This software has two long-running server-type components. One
    > component, the "director" application, is written in pure python and
    > makes use of the twisted, nevow, and kinterbasdb libraries (which I
    > realize link to some C extensions). The other component, the
    > "harvester", is a mixture of python and pyrex, and makes use of the
    > twisted library, along with using the C libs libpcap and glib on the
    > pyrex end. Basically, the director is the "master" component. A single
    > director process interacts with users of the system through a web and/or
    > pygtk client application interface and can coordinate 1 to n harvesters
    > spread about the world. The harvester is the "heavy lifter" component
    > that sniffs the network traffic and sifts out the voice and signalling
    > data. It then updates the director of call status changes, and can
    > provide users of the system access to the data. It records the data to
    > disk as well. The scalability of this thing is really cool: given a
    > single director sitting somewhere coordinating the list of agents,
    > multiple harvesters can be placed anywhere there is voice traffic. A user
    > that logs into the director can end up seeing the activity of all of
    > these separate voice networks presented like a single giant mesh.
    >
    > Overall, I have been very pleased with python and the 3rd party
    > libraries that I use (twisted, nevow, kinterbasdb and pygtk). It is a
    > joy to program with, and I think the python community has done a fine
    > job. However, as I have been running the software lately and profiling
    > its memory usage, the one and only Big Problem I have seen is that of
    > the memory usage. Ideally, the server application(s) should be able to
    > run indefinitely, but from the results I'm seeing I will end up
    > exhausting the memory on a 2 GB machine in 2 to 3 days of heavy load.
    >
    > Now normally I would not raise up an issue like this on this list, but
    > based on the conversations held on this list lately, and the work done
    > by Evan Jones (http://evanjones.ca/python-memory.html), I am led to
    > believe that this memory usage -- while partially due to some probable
    > leaks in my program -- is largely due to the current python gc. I have
    > some graphs I made to show the extent of this memory usage growth:
    >
    > http://public.robbyd.fastmail.fm/iq-graph1.gif
    >
    > http://public.robbyd.fastmail.fm/iq-graph-director-rss.gif
    >
    > http://public.robbyd.fastmail.fm/iq-graph-harv-rss.gif
    >
    > The preceding three diagrams are the result of running the 1 director
    > process and 1 harvester process on the same machine for about 48 hours.
    > This is the most basic configuration of this software. I was running
    > this application through /usr/bin/python (CPython) on a Debian 'testing'
    > box running Linux 2.4 with 2GB of memory and Python version 2.3.5.
    > During that time, I gathered the resident and virtual memory size of
    > each component at 120 second intervals. I then imported this data into
    > MINITAB and did some plots. The first one is a graph of the resident
    > (RSS) and virtual memory usage of the two applications. The second one
    > is a zoomed in graph of the director's resident memory usage (complete
    > with a best fit quadratic), and the 3rd one is a zoomed in graph of the
    > harvester's resident memory usage.
    >
    > To give you an idea of the network load these apps were undergoing
    > during this sampling time, by the time 48 hours had passed, the
    > harvester had gathered and parsed about 900 million packets. During the
    > day there will be 50-70 agents talking. This number goes to 10-30 at night.
    >
    > In the diagrams above, one can see the night-day separation clearly. At
    > night, the memory usage growth seemed to all but stop, but with the
    > increased call volume of the day, it started shooting off again. When I
    > first started gathering this data, I was hoping for a logarithmic curve,
    > but at least after 48 hours, it looks like the usage increase is almost
    > linear. (Although logarithmic may still be the case after it exceeds a
    > gig or two of used memory. :) I'm not sure if this is something that I
    > should expect from the current gc, and when it would stop.
    >
    > Now, as I stated above, I am certain that at least some of this
    > increased memory usage is due to either un-collectable objects in the
    > python code, or memory leaks in the pyrex code (where I make some use of
    > malloc/free). I am working on finding and removing these issues, but
    > from what I've seen with the help of gc UNCOLLECTABLE traces, there are
    > not many un-collectable reference issues at least. Yes, there are some
    > but definitely not enough to justify growth like I am seeing. The pyrex
    > side should not be leaking too much, I'm very good about freeing what I
    > allocate in pyrex/C land. I will be running that linked to a memory leak
    > finding library in the next few days. Past the code reviews I've done,
    > what makes me think that I don't have any *wild* leaks going on at least
    > with the pyrex code is that I am seeing the same type of growth patterns
    > in both apps, and I don't use any pyrex with the director. Yes, the
    > harvester is consuming much more memory, but it also does the majority
    > of the heavy lifting.
    >
    > I am alright with the app not freeing all the memory it can between high
    > and low activity times, but what puzzles me is how the memory usage just
    > keeps on growing and growing. Will it ever stop?
    >
    > What I would like to know is whether others on this list have had similar
    > problems with python's gc in long running, larger python applications.
    > Am I crazy or is this a real problem with python's gc itself? If it's a
    > python gc issue, then it's my opinion that we will need to enhance the
    > gc before python can really gain leverage as a language suitable for
    > "enterprise-class" applications. I have surprised many other programmers
    > that I'm writing an application like this in python/pyrex that works
    > just as well and even more efficiently than the C/C++/Java competitors.
    > The only thing I have left to show is that the app lasts as long between
    > restarts. ;)
    >
    >
    > Robby
     
    Robby Dermody, Nov 4, 2005
    #1

  2. David Rushby

    Robby Dermody wrote:
    > I have not yet run the director with COUNT_ALLOCS, due to a
    > problem dynamically loading in the kinterbasdb (Firebird SQL
    > interface) module.


    What're the specifics of the loading problem with kinterbasdb?

    AFAIK, kinterbasdb itself runs fine in a debug build of Python. The
    egenix mx extensions, which kinterbasdb uses for date/time handling
    unless the user specifies otherwise, do not. I've conferred with MAL
    on this topic and was told that egenix "doesn't support" running the
    extensions in a debug build. I made my own modifications to the mx
    extensions so that they work properly in a debug build; I could post
    the modified version somewhere for download if you're interested.
     
    David Rushby, Nov 4, 2005
    #2

  3. Paul Rubin

    Robby Dermody writes:
    > t = 120 seconds (1st run after being fully initialized):
    > list alloced: 2394620, freed: 17565, max in use: 2377056
    > dict alloced: 2447968, freed: 67999, max in use: 2379969


    This looks like a garden variety memory leak. I think the next thing
    to do is to run Python under gdb and set a breakpoint in the allocator
    that saves a stack trace at every list or dict allocation and then
    continues, or alternatively put some code into the interpreter itself
    to do that. You'll probably find that all these leaked allocations
    are coming from the same place and you're forgetting to DECREF something.
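
The gdb approach is what was available for Python 2.4. On a current Python 3, a similar "where were the live allocations made" question can be asked from inside the interpreter with the standard library's tracemalloc module, which did not exist when this thread was written. The sketch below swaps that module in for the gdb breakpoint. leaky_workload is a hypothetical stand-in for the real code, and tracemalloc only sees memory allocated through Python's own allocators, not raw malloc calls in C code.

```python
import tracemalloc


def leaky_workload():
    """Hypothetical stand-in for code that allocates lists and keeps them."""
    return [[i] for i in range(10000)]


def allocation_growth(workload, limit=5, frames=10):
    """Run `workload` and return the source lines whose allocations grew most."""
    tracemalloc.start(frames)  # record up to `frames` stack frames per allocation
    try:
        before = tracemalloc.take_snapshot()
        kept = workload()  # hold a reference so the allocations stay alive
        after = tracemalloc.take_snapshot()
    finally:
        tracemalloc.stop()
    # StatisticDiff entries, largest change first
    return after.compare_to(before, "lineno")[:limit]


if __name__ == "__main__":
    for stat in allocation_growth(leaky_workload):
        print(stat)
```

If the leaked lists and dicts all come from one place, as suggested above, that place should appear at or near the top of the output.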
     
    Paul Rubin, Nov 5, 2005
    #3
  4. Bryan Olson

    Robby Dermody wrote:
    > [...] However, on a
    > simulated call center environment (constant 50 conversations, switching
    > every 300 seconds) the director component still consumes an average of
    > 1MB more per hour and the harvester is taking an average of 4MB more per
    > hour. With the director, 2/3 of this is resident (1/3 in swap). With the
    > harvester, about 88% of this is resident (~ 12% in swap).


    For a long-uptime server, that's obviously not going to work.
    Memory leaks are, of course, bugs, and bugs call for debugging,
    but given the long history of such bugs in Python and similar
    projects, there's no real chance of eliminating them.

    You might look at how the Apache HTTP server deals with this.
    Apache can stay up arbitrarily long, even when it links in
    modules with slow memory leaks. A parent process forks off
    child processes as needed, and the children handle all the
    client services. Each child is, by default, limited to
    serving 10,000 connections, a number adjustable via the
    MaxRequestsPerChild directive. When a child reaches its
    connection limit, it stops listening for new connection
    requests; when the last of the connections closes, the
    child gracefully dies.
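
The same recycling idea can be sketched in a few lines of Python. This is a simplified, sequential illustration for Unix-like systems only, not how Apache is implemented: the parent forks a child for each batch of requests, and the child exits after a fixed number of them so that anything it leaked goes back to the operating system. handle_request and the limit of 3 are placeholders chosen for the example.

```python
import os

MAX_REQUESTS_PER_CHILD = 3  # hypothetical limit, in the spirit of Apache's directive


def handle_request(request):
    """Placeholder for real work that may leak a little memory on each call."""
    return request.upper()


def serve(requests, max_per_child=MAX_REQUESTS_PER_CHILD):
    """Handle requests in short-lived children; return (child_pid, result) pairs."""
    results = []
    for start in range(0, len(requests), max_per_child):
        batch = requests[start:start + max_per_child]
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            # Child: do the work, report back, then exit so the OS reclaims
            # everything this process allocated, leaked or not.
            status = 1
            try:
                os.close(read_fd)
                with os.fdopen(write_fd, "w") as out:
                    for request in batch:
                        out.write(handle_request(request) + "\n")
                status = 0
            finally:
                os._exit(status)
        # Parent: read the child's output until it closes the pipe, then reap it.
        os.close(write_fd)
        with os.fdopen(read_fd) as incoming:
            lines = incoming.read().splitlines()
        os.waitpid(pid, 0)
        results.extend((pid, line) for line in lines)
    return results
```

For work that fits a pool of workers, the standard library's multiprocessing.Pool accepts a maxtasksperchild argument that retires each worker after a set number of tasks, which is the same technique without the manual fork handling.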


    > [...] I doubt we can remove all the memory
    > problems, but if we can get this down by another 150 - 200%, that will
    > be good enough for me.


    Down by more than 100% ? Wouldn't that require turning
    memory leaks into memory wells?


    --
    --Bryan
     
    Bryan Olson, Nov 5, 2005
    #4
  5. Robby Dermody writes:
    >An update (along with a request for paid help at the end): Over the
    >past week and a half I've improved the memory usage situation by quite
    >a bit. Going to python 2.4, linux kernel 2.6, twisted 2.0 and altering
    >some code


    I don't know if you have a Windows version of this software but if you
    do you may find Python Memory Validator useful.

    http://www.softwareverify.com/pythonMemoryValidator/index.html

    Stephen
    --
    Stephen Kellett
    Object Media Limited http://www.objmedia.demon.co.uk/software.html
    Computer Consultancy, Software Development
    Windows C++, Java, Assembler, Performance Analysis, Troubleshooting
     
    Stephen Kellett, Nov 5, 2005
    #5
