Small Python, Java comparison

Discussion in 'Python' started by Dave Brueck, Jul 11, 2003.

  1. Dave Brueck

    Dave Brueck Guest

    Below is some information I collected from a *small* project in which I wrote
    a Python version of a Java application. I share this info only as a data
    point (rather than trying to say this data "proves" something) to consider
    the next time the "Python makes developers more productive" thread starts up
    again.

    Background
    ==========
    An employee who left our company had written a log processor we use to read
    records from text files (1 record per line), do a bunch of computations on
    them, and load them into various tables in our database. Typically each day
    there are 1000 to 2000 files with a total number of records being around 100
    million. The nature of the processor's work (some of it is
    summarizing data) results in about 1 database insert or update for
    every 20 "raw" log records.

    After this employee left I inherited this chunk of code and the first thing I
    did was create a fairly comprehensive test suite for it (in Python) because
    there was no test suite and in the past we had had many problems with bugs
    and flakiness in the code. As I did this I got pretty familiar with how the
    code worked and noticed lots of places where the language seemed to "get in
    the way" of what needed to be done. During this time I also began hearing
    rumblings about how this processor would need to be extended to add more
    functionality, so I thought I'd start rewriting it in Python on the side just
    to see how it would turn out (the idea being that if I got it done and it was
    good enough then I could simply implement the new functionality in the Python
    version :) ).

    Data
    ====
    The Java version is for the Java 1.4 platform (uses the Sun JVM), and it
    connects to an Oracle 9 database on our LAN via the thin JDBC driver provided
    by Oracle. As far as complexity goes, I'd rate the problem itself as
    non-trivial but not rocket science. :) Implementing it in Java turned out to
    be fairly involved; a lot of the complexity there revolved around efficiently
    maintaining data structures in memory that didn't get committed to the
    database until the end of each separate file. The developer for the Java
    version has about as much experience as a developer as I do, and he has about
    as much experience with Java as I do with Python (IOW, I don't consider that
    to be much of a differentiating factor - although the Java developer has
    *way* more database experience than I do).

    For the Java version, the time from the start of the design to the end of
    the implementation and initial round of bug fixing was 3 weeks. An
    additional 2-3 weeks were spent
    optimizing the code and fixing more bugs. The source code had 4700 lines in 9
    files. When running, it would process an average of 1050 records per second
    (where process = time to parse the file, do calculations, and insert them
    into the database).

    The Python version is for Python 2.1.3 and it connects to the same Oracle
    database using DCOracle2. The implementation is very straightforward, nothing
    fancy. It's also still pretty "pristine" as I haven't had any time to try and
    optimize it yet. :) One of the biggest simplifying factors in the code was
    how easy it was to have dictionaries map dynamically-defined and
    arbitrarily-complex keys (made up of object tuples) to other objects. For
    whatever reason this seemed to be a huge factor in making data management go
    from a difficult problem to almost a no-brainer. If the Python version had
    come first and the Java one second, then some of that approach could have
    made it into the Java version, but in Java I don't find myself thinking about
    problems quite the same way - IOW the Python version's approach wouldn't be
    as obvious in Java or would seem to be too much up-front work (whereas the
    approach that was used was easier up front but in the end became quite
    complex and a limiting factor to adding new functionality IMO).
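
As a rough illustration of the tuple-keyed dictionary idea described above, here is a minimal sketch. The record fields (`customer_id`, `day`, `status`, `bytes_sent`) and the `Summary` class are hypothetical, since the real log format and code are not shown in this post, and the sketch is written for a current Python 3 rather than Python 2.1.3.

```python
from collections import namedtuple

# Hypothetical record layout; the real log format is not shown in the post.
Record = namedtuple("Record", "customer_id day status bytes_sent")


class Summary:
    """Running totals for one (customer_id, day, status) combination."""

    def __init__(self):
        self.hits = 0
        self.bytes_sent = 0

    def add(self, record):
        self.hits += 1
        self.bytes_sent += record.bytes_sent


def summarize(records):
    """Collapse raw records into one Summary object per composite key."""
    summaries = {}
    for rec in records:
        # A tuple of hashable values is itself hashable, so it can be a key.
        key = (rec.customer_id, rec.day, rec.status)
        summary = summaries.get(key)
        if summary is None:
            summary = summaries[key] = Summary()
        summary.add(rec)
    return summaries


records = [
    Record(7, "2003-07-11", 200, 512),
    Record(7, "2003-07-11", 200, 1024),
    Record(7, "2003-07-11", 404, 0),
    Record(9, "2003-07-11", 200, 2048),
]
summaries = summarize(records)
```

Four raw records collapse into three summary objects here, which is the same kind of many-to-one reduction as the roughly 20-to-1 ratio mentioned in the background, only on a toy scale.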

    Because I wasn't working on this full-time, the development was spread out
    over the course of two weeks (10 working days) at an average of just over 2
    hours per day (for a total of not quite 3 full days of work). The source code
    was less than 700 lines in 4 files. Most surprising to me was that it
    processes an average of 1200 records per second! I had assumed that after I
    got it working I'd need to spend time optimizing things to make up for Java
    having a JIT compiler to speed things up, but for now I won't bother.

    Both versions could be improved by splitting the log parsing/summarizing and
    database work into two separate threads (watching the processes I see periods
    of high CPU usage followed by near-idle time while the database churns away).
    Currently the Java version averages 47% CPU utilization and the Python
    version averages 51%.
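
The two-thread split suggested above could look something like the following sketch, which is not taken from either implementation. It uses the standard library `queue.Queue` and `threading.Thread` from a current Python 3; the function names, the comma-separated line format, and `BATCH_SIZE` are assumptions for illustration, and a plain list stands in for the database.

```python
import queue
import threading

BATCH_SIZE = 3      # illustrative only; a real loader would use larger batches
_DONE = object()    # sentinel telling the writer thread to stop


def parse_lines(lines, out_queue):
    """Producer: parse raw lines and hand batches of rows to the writer."""
    batch = []
    for line in lines:
        fields = line.rstrip("\n").split(",")   # hypothetical line format
        batch.append(tuple(fields))
        if len(batch) >= BATCH_SIZE:
            out_queue.put(batch)
            batch = []
    if batch:
        out_queue.put(batch)    # flush the final partial batch
    out_queue.put(_DONE)


def write_batches(in_queue, store):
    """Consumer: stand-in for the database work (appends to a list here)."""
    while True:
        batch = in_queue.get()
        if batch is _DONE:
            break
        store.extend(batch)     # a real version would insert/update here


def process(lines):
    """Run parsing in the calling thread and writing in a second thread."""
    q = queue.Queue(maxsize=10)     # bounded, so parsing cannot run far ahead
    stored = []
    writer = threading.Thread(target=write_batches, args=(q, stored))
    writer.start()
    parse_lines(lines, q)
    writer.join()
    return stored


rows = process(["a,1", "b,2", "c,3", "d,4"])
```

Whether the two halves really overlap in practice depends on the database driver releasing the interpreter lock while it waits on the server, which would need to be checked for the driver in use.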

    Caveats
    =======
    There are a million reasons why this isn't an apples-to-apples comparison or
    why somebody might read this and cry "foul!"; here are a few off the top of
    my head:

    - The second time you write something you can do it better - since development
    is often part exploratory, once you're done you usually have a good idea of
    how to do it better were you to do it again. I didn't write the first
    version, but writing the test suite (and fixing the bugs it uncovered) made
    me familiar with the weaknesses in the initial version.
    - It's proprietary code; nobody else can see the two versions of source - Too
    bad. ;-)
    - I haven't bothered to do a "true" LOC count for both versions - I just did
    "wc -l *.py" and "wc -l *.java" to get line counts - so comments and other
    junk are included in the line totals.
    - My development time didn't include much design time because my design was
    mostly a reaction to the Java implementation.
    - I used the thin Java driver (written in pure Java) and DCOracle2 (uses a
    native driver on Linux) - this may affect performance some.
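
On the line-count caveat above: a somewhat "truer" count than `wc -l` would skip blank lines and whole-line comments. The sketch below is one simple way to do that, assuming single-line comment markers only; the function name and the sample lines are made up for illustration, and it does not handle Python docstrings or Java `/* ... */` blocks.

```python
def count_code_lines(lines, comment_prefix="#"):
    """Count lines that are neither blank nor whole-line comments."""
    count = 0
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith(comment_prefix):
            count += 1
    return count


sample = [
    "# a comment\n",
    "\n",
    "import sys\n",
    "    # indented comment\n",
    "print(sys.argv)  # a trailing comment still counts as code\n",
]
code_lines = count_code_lines(sample)
```

For Java sources the same function could be called with `comment_prefix="//"`, with the limitation about block comments noted above.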

    Conclusion
    ==========
    Like I said before, I'd hesitate to read *too* much into this experience,
    although I will say it's a more concrete and confirming example of what I
    (and many others) have experienced before - that there are some big
    productivity gains to be had by using Python over some other languages. It's
    generally
    hard to measure this sort of thing because it's rare to do a rewrite without
    adding functionality, and the larger the project the more rare this becomes,
    so even though this is a very small example it's still useful. I certainly
    wouldn't have gotten "approval" to do the simple rewrite, but doing it in my
    spare time and having a comprehensive test suite made it possible.

    Anyway, for a very small development cost I ended up with a codebase 15% of
    the size of the original, and an implementation that will be far easier to
    extend (and easier to pass off to somebody else!). In order to get smaller
    and cleaner code I had been willing to take a modest performance hit, but in
    the end the new version was slightly faster too - icing on the cake!
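
For reference, the ratios quoted in this post follow directly from the figures in the Data section. The short calculation below uses only those numbers; the variable names are arbitrary, and the Python line count is an upper bound because the post says "less than 700".

```python
java_loc, python_loc = 4700, 700     # source lines (Python figure is an upper bound)
java_rps, python_rps = 1050, 1200    # average records processed per second

size_ratio = python_loc / java_loc       # about 0.149, i.e. roughly 15% of the size
shrink_factor = java_loc / python_loc    # about 6.7 times fewer lines
speedup = python_rps / java_rps - 1      # about 0.143, i.e. roughly 14% faster
```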

    -Dave
    Dave Brueck, Jul 11, 2003

  2. Dave Brueck

    Peter Hansen Guest

    Dave Brueck wrote:
    >
    > I certainly
    > wouldn't have gotten "approval" to do the simple rewrite, but doing it in my
    > spare time and having a comprehensive test suite made it possible.


    Maybe that will change the next time, after others have seen how short
    a time it took you and how effective the results are.

    -Peter
    Peter Hansen, Jul 12, 2003

  3. Dave Brueck

    GerritM Guest

    "Dave Brueck" <> wrote in message
    news:...
    <...snip...> Java:
    From the start of the design to the end of the implementation and initial
    round of bug fixing took 3 weeks. An additional 2-3 weeks were spent
    optimizing the code and fixing more bugs. The source code had 4700 lines in
    9 files. When running, it would process an average of 1050 records per second
    (where process = time to parse the file, do calculations, and insert them
    into the database).
    <...snip...> Python:
    Because I wasn't working on this full-time, the development was spread out
    over the course of two weeks (10 working days) at an average of just over 2
    hours per day (for a total of not quite 3 full days of work). The source
    code was less than 700 lines in 4 files. Most surprising to me was that it
    processes an average of 1200 records per second! I had assumed that after I
    got it working I'd need to spend time optimizing things to make up for Java
    having a JIT compiler to speed things up, but for now I won't bother.

    Both versions could be improved by splitting the log parsing/summarizing and
    database work into two separate threads (watching the processes I see
    periods of high CPU usage followed by near-idle time while the database
    churns away).
    Currently the Java version averages 47% CPU utilization and the Python
    version averages 51%.
    <...snip...>

    I recently wrote a short article about bloating of software:
    "Exploration of the bloating of software"
    www.extra.research.philips.com/natlab/sysarch/BloatingExploredPaper.pdf
    (long URL may be broken in two by e-mail reader!).
    In it I explain a number of effects of bloating that I have observed,
    one of them being degradation of performance. Repairing this degradation
    in turn causes more bloating....

    I think your small sample point is a clear illustration of the fact that
    appropriate technology is important. Your gain of a factor of 6-7 in LOC
    will translate into comparable gains in maintenance, cost, ease of
    extension, etcetera.

    thanks for sharing your experience, regards Gerrit



    --
    www.extra.research.philips.com/natlab/sysarch/
    GerritM, Jul 12, 2003
