Help with script with performance problems

Discussion in 'Python' started by Dennis Roberts, Nov 23, 2003.

  1. I have a script to parse a dns querylog and generate some statistics.
    For a 750MB file a perl script using the same methods (splits) can
    parse the file in 3 minutes. My python script takes 25 minutes. It
    is enough of a difference that unless I can figure out what I did
    wrong or a better way of doing it I might not be able to use python
    (since most of what I do is parsing various logs). The main reason to
    try python is I had to look at some early scripts I wrote in perl and
    had no idea what the hell I was thinking or what the script even did!
    After some googling and reading Eric Raymond's essay on Python, I jumped
    in :) Here is my script. I am looking for constructive comments -
    please don't bash my newbie code.

    #!/usr/bin/python -u

    import string
    import sys

    clients = {}
    queries = {}
    count = 0

    print "Each dot is 100000 lines..."

    f = sys.stdin

    while 1:

        line = f.readline()

        if count % 100000 == 0:
            sys.stdout.write(".")

        if line:
            splitline = string.split(line)

            try:
                (month, day, time, stype, source, qtype, query, ctype,
                 record) = splitline
            except:
                print "problem splitting line", count
                print line
                break

            try:
                words = string.split(source, '#')
                source = words[0]
            except:
                print "problem splitting source", count
                print line
                break

            if clients.has_key(source):
                clients[source] = clients[source] + 1
            else:
                clients[source] = 1

            if queries.has_key(query):
                queries[query] = queries[query] + 1
            else:
                queries[query] = 1

        else:
            print
            break

        count = count + 1

    f.close()

    print count, "lines processed"

    for numclient, count in clients.items():
        if count > 100000:
            print "%s,%s" % (numclient, count)

    for numquery, count in queries.items():
        if count > 100000:
            print "%s,%s" % (numquery, count)
     
    Dennis Roberts, Nov 23, 2003
    #1

  2. Dennis Roberts

    Ville Vainio Guest

    (Dennis Roberts) writes:

    > is enough of a difference that unless I can figure out what I did
    > wrong or a better way of doing it I might not be able to use python
    > (since most of what I do is parsing various logs). The main reason to


    Isn't parsing logs a batch-oriented thing, where 20 minutes more
    wouldn't matter all that much? Log parsing is the home field of Perl,
    so python probably can't match its performance there, but other
    advantages of Python might make you still want to avoid going back to
    Perl. As long as it's 'efficient enough', who cares?

    > f = sys.stdin


    Have you tried using a normal file instead of stdin? BTW, you can
    iterate over a file easily by "for line in open("mylog.log"):". ISTR
    it's also more efficient than readline()'s, because it caches the
    lines instead of reading them one by one. You can also get the line
    numbers by doing "for linenum, line in enumerate(open("mylog.log")):"
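
    A minimal sketch of that idiom might look like this (the file name
    "mylog.log" is just a placeholder, and the dot-printing mirrors your
    script):

    import sys

    # Iterate over the file object directly (Python 2.3+);
    # "mylog.log" is a placeholder file name.
    for linenum, line in enumerate(open("mylog.log")):
        if linenum % 100000 == 0:
            sys.stdout.write(".")
        fields = line.split()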


    > splitline = string.split(line)


    Do not use 'string' module (it's deprecated), use string methods
    instead: line.split()

    > clients[source] = clients[source] + 1


    clients[source] += 1

    or another way to handle the common 'add 1, might not exist' idiom:


    clients[source] = 1 + clients.get(source,0)

    See http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/66516
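
    A tiny sketch of that counting idiom, with made-up sample addresses:

    # Count occurrences with dict.get(); the addresses are invented data.
    clients = {}
    for source in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
        clients[source] = 1 + clients.get(source, 0)
    print clients  # e.g. {'10.0.0.1': 2, '10.0.0.2': 1}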


    --
    Ville Vainio http://www.students.tut.fi/~vainio24
     
    Ville Vainio, Nov 23, 2003
    #2

  3. Dennis Roberts

    Miki Tebeka Guest

    Hello Dennis,

    A general note: Use the "hotshot" module to find where you spend most of your time.
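
    For instance, a rough sketch of wrapping the parsing code in a function
    and profiling it with hotshot; main() and the "parse.prof" file name are
    placeholders:

    import hotshot, hotshot.stats

    def main():
        pass  # your parsing loop would go here

    prof = hotshot.Profile("parse.prof")  # placeholder output file
    prof.runcall(main)
    prof.close()

    stats = hotshot.stats.load("parse.prof")
    stats.sort_stats("time", "calls")
    stats.print_stats(20)  # show the 20 most expensive entries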

    > splitline = string.split(line)

    My guess is that if you'll use the "re" module things will be much faster.

    import re
    ws_split = re.compile("\s+").split
    ...
    splitline = ws_split(line)
    ...

    HTH.

    Miki
     
    Miki Tebeka, Nov 23, 2003
    #3
  4. Dennis Roberts

    Paul Clinch Guest

    (Miki Tebeka) wrote in message news:<>...
    > Hello Dennis,
    >
    > A general note: Use the "hotshot" module to find where you spend most of your time.
    >
    > > splitline = string.split(line)

    > My guess is that if you'll use the "re" module things will be much faster.
    >
    > import re
    > ws_split = re.compile("\s+").split
    > ...
    > splitline = ws_split(line)
    > ...
    >
    > HTH.
    >
    > Miki



    An alternative in python 2.3 is the timeit module; the following is
    extracted from the docs:
    import timeit

    timer1 = timeit.Timer('unicode("abc")')
    timer2 = timeit.Timer('"abc" + u""')

    # Run three trials
    print timer1.repeat(repeat=3, number=100000)
    print timer2.repeat(repeat=3, number=100000)

    # On my laptop this outputs:
    # [0.36831796169281006, 0.37441694736480713, 0.35304892063140869]
    # [0.17574405670166016, 0.18193507194519043, 0.17565798759460449]

    Regards Paul Clinch
     
    Paul Clinch, Nov 23, 2003
    #4
  5. Ville Vainio <> wrote in message news:<>...
    > > f = sys.stdin

    >
    > Have you tried using a normal file instead of stdin? BTW, you can
    > iterate over a file easily by "for line in open("mylog.log"):". ISTR
    > it's also more efficient than readline()'s, because it caches the
    > lines instead of reading them one by one. You can also get the line
    > numbers by doing "for linenum, line in enumerate(open("mylog.log")):"
    >


    i have a 240207 line sample log file that I test with. The script I
    submitted parsed it in 18 seconds. My perl script parsed it in 4
    seconds.

    The new python script, using a normal file as suggested above, does it
    in 3 seconds!

    Changed "f = sys.stdin" to "f = open('sample', 'r')".

    Thanks Ville!

    Note: I made the other changes one at a time as well - the file open
    change was the only one that made it faster.
     
    Dennis Roberts, Nov 23, 2003
    #5
  6. Dennis Roberts

    Aahz Guest

    In article <>,
    Dennis Roberts <> wrote:
    >
    >I have a script to parse a dns querylog and generate some statistics.
    >For a 750MB file a perl script using the same methods (splits) can
    >parse the file in 3 minutes. My python script takes 25 minutes. It
    >is enough of a difference that unless I can figure out what I did
    >wrong or a better way of doing it I might not be able to use python
    >(since most of what I do is parsing various logs). The main reason to
    >try python is I had to look at some early scripts I wrote in perl and
    >had no idea what the hell I was thinking or what the script even did!
    >After some googling and reading Eric Raymonds essay on python I jumped
    >in:) Here is my script. I am looking for constructive comments -
    >please don't bash my newbie code.


    If you haven't yet, make sure you upgrade to Python 2.3; there are a lot
    of speed enhancements. Also, it allows you to switch to idioms that work
    more like Perl's:

    for line in f:
        fields = line.split()
        ...

    Generally speaking, contrary to what another poster suggested, string
    methods will almost always be faster than regexes (assuming that a
    string method does what you want directly, of course; using multiple
    string methods may or may not be faster than regexes).
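
    For example, a rough way to check this on your own data with the timeit
    module mentioned earlier in the thread (the sample line in the setup
    string is made up):

    import timeit

    # The sample querylog line is invented; substitute one of your own.
    setup = ('import re; ws_split = re.compile(r"\s+").split; '
             'line = "Nov 23 09:15:01 named client 10.0.0.1#1234 '
             'query www.example.com IN A"')

    # Compare str.split() against a precompiled regex split.
    print timeit.Timer("line.split()", setup).repeat(repeat=3, number=100000)
    print timeit.Timer("ws_split(line)", setup).repeat(repeat=3, number=100000)
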
    --
    Aahz () <*> http://www.pythoncraft.com/

    Weinberg's Second Law: If builders built buildings the way programmers wrote
    programs, then the first woodpecker that came along would destroy civilization.
     
    Aahz, Nov 23, 2003
    #6
  7. Dennis Roberts

    Peter Otten Guest

    Dennis Roberts wrote:

    > I have a script to parse a dns querylog and generate some statistics.
    > For a 750MB file a perl script using the same methods (splits) can
    > parse the file in 3 minutes. My python script takes 25 minutes. It
    > is enough of a difference that unless I can figure out what I did
    > wrong or a better way of doing it I might not be able to use python
    > (since most of what I do is parsing various logs). The main reason to
    > try python is I had to look at some early scripts I wrote in perl and
    > had no idea what the hell I was thinking or what the script even did!
    > After some googling and reading Eric Raymonds essay on python I jumped
    > in:) Here is my script. I am looking for constructive comments -
    > please don't bash my newbie code.


    Below is my version of your script. It tries to use more idiomatic Python
    and is about 20% faster on some bogus data - but nowhere near enough to
    close the performance gap you claim against the perl script.
    However, it took 143 seconds to process 10**7 lines generated by

    <makesample.py>
    import itertools, sys
    sample = "%dmonth day time stype source%d#sowhat qtype %dquery ctype record"
    thousand = itertools.cycle(range(1000))
    hundred = itertools.cycle(range(100))

    out = file(sys.argv[1], "w")
    try:
        try:
            count = int(sys.argv[2])
        except IndexError:
            count = 10**7
        for i in range(count):
            print >> out, sample % (i, thousand.next(), hundred.next())
    finally:
        out.close()
    </makesample.py>

    with Python 2.3.2 on my 2.6GHz P4. Would that mean Perl would do it in 17
    seconds? Anyway, the performance problem is more likely your computer :) -
    Python should be fast enough for the purpose.

    Peter

    <parselog.py>
    #!/usr/bin/python -u
    #Warning, not seriously tested
    import sys

    #import time
    #starttime = time.time()

    clients = {}
    queries = {}
    lineNo = -1

    threshold = 100
    pointmod = 100000

    f = file(sys.argv[1])
    try:
        print "Each dot is %d lines..." % pointmod
        for lineNo, line in enumerate(f):
            if lineNo % pointmod == 0:
                sys.stdout.write(".")

            try:
                (month, day, timestr, stype, source, qtype, query, ctype,
                 record) = line.split()
            except ValueError:
                raise Exception("problem splitting line %d\n%s" % (lineNo, line))

            source = source.split('#', 1)[0]

            clients[source] = clients.get(source, 0) + 1
            queries[query] = queries.get(query, 0) + 1
    finally:
        f.close()

    print
    print lineNo+1, "lines processed"

    for numclient, count in clients.iteritems():
        if count > threshold:
            print "%s,%s" % (numclient, count)

    for numquery, count in queries.iteritems():
        if count > threshold:
            print "%s,%s" % (numquery, count)

    #print "time:", time.time() - starttime
    </parselog.py>
     
    Peter Otten, Nov 23, 2003
    #7
  8. Dennis Roberts

    Peter Otten Guest

    Peter Otten wrote:

    > However, it took 143 seconds to process 10**7 lines generated by


    I just downloaded psycho, oops, keep misspelling the name :) and it brings
    down the time to 92 seconds - almost for free. I must say I'm impressed,
    the psycologist(s) did an excellent job.

    Peter

    #!/usr/bin/python -u
    import psyco, sys
    psyco.full()

    def main():
        clients = {}
        queries = {}
        lineNo = -1

        threshold = 100
        pointmod = 100000

        f = file(sys.argv[1])
        try:
            print "Each dot is %d lines..." % pointmod
            for lineNo, line in enumerate(f):
                if lineNo % pointmod == 0:
                    sys.stdout.write(".")

                try:
                    (month, day, timestr, stype, source, qtype, query, ctype,
                     record) = line.split()
                except ValueError:
                    raise Exception("problem splitting line %d\n%s" % (lineNo, line))

                source = source.split('#', 1)[0]

                clients[source] = clients.get(source, 0) + 1
                queries[query] = queries.get(query, 0) + 1
        finally:
            f.close()

        print
        print lineNo+1, "lines processed"

        for numclient, count in clients.iteritems():
            if count > threshold:
                print "%s,%s" % (numclient, count)

        for numquery, count in queries.iteritems():
            if count > threshold:
                print "%s,%s" % (numquery, count)

    import time
    starttime = time.time()
    main()
    print "time:", time.time() - starttime
     
    Peter Otten, Nov 23, 2003
    #8
