performance problem with time.strptime()

  • Thread starter Nils Rüttershoff
  • Start date
N

Nils Rüttershoff

Hi everyone,

In my free time I translate scripts from open source projects or write
my own, to train my python skills. ATM I convert the aplogmerge.pl from
awstats. It merges multiple apache logfiles and sort the output by the
timestamps of each line. My first version of this script hasn't a good
performance, so I started profiling. It turned out that the script spend
much time for converting the timestamps of the line into a struct_time
object. Here a code example (Python 2.6.2 on Ubuntu 7.10):

Rec = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2})\s\+\d{4}\].*")
Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"'

def strptime():
m = Rec.match(Line)
if m:
date_string = m.group(1)
# date_string example: '02/Jul/2009:01:50:26'
return time.strptime(date_string, "%d/%b/%Y:%H:%M:%S")

with timeit this functions takes approximate 125 sec and 29.004.081
function calls (I've configured timeit with 1.000.000 runs). A look at
the output of cProfile told me that more than the half time is spent in
locale.py:

1000002 11.712 0.000 19.592 0.000 locale.py:316(normalize)

1000002 3.639 0.000 23.231 0.000 locale.py:382(_parse_localename)

1000002 5.162 0.000 30.271 0.000 locale.py:481(getlocale)



I studied the time documentation and thought that I had to set TZ in os
environ:

os.environ['TZ'] = 'Europe/Berlin'

time.set()


but that had no effect. I don't know why time.strptime() looks every
time for my locale. Maybe it's a problem with my OS locale... However.
I've introduced a work around, which works perfectly for my problem. For
time comparison I could use any sort of time representation and so I
convert to epoch:

# needed to translate month str to dig repr

Shortmonth = {'Jan' : '01',

'Feb' : '02',

'Mar' : '03',

'Apr' : '04',

'May' : '05',

'Jun' : '06',

'Jul' : '07',

'Aug' : '08',

'Sep' : '09',

'Oct' : '10',

'Nov' : '11',

'Dec' : '12'}

Rec = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(?P<day>\d{2})/(?P<month>\w+)/(?P<year>\d{4}):(?P<hour>\d{2}):(?P<min>\d{2}):(?P<sec>\d{2})\s\+\d{4}\].*")

Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"'

def epoch():

m = Rec.match(Line)

if m:

result = m.groupdict()

date_tuple = (result["year"], Shortmonth[result["month"]], result["day"], result["hour"], result["min"], result["sec"], -1, -1, -1)

date_tuple = map(int,date_tuple)

return time.mktime(date_tuple)


with this workaround I had a speed up to 4 times; it tooks only 31 sec
with only 5.000.009 function calls. Maybe this helps some of you, who
had the similar problems with time conversion....

....But one big question remains: Why time.strptime() checks everytime
the locale? had I missed something or could I have a problem with my OS
locale?

With python 3.1 there is no difference, unless that time.strptime() took
approx 12 sec longer... :(

regards, Nils



Here a complete test script:

#!/opt/python/2.6.2/bin/python

import time

import timeit

import cProfile

import re

# needed to tranlate month str to dig repr

Shortmonth = {'Jan' : '01',

'Feb' : '02',

'Mar' : '03',

'Apr' : '04',

'May' : '05',

'Jun' : '06',

'Jul' : '07',

'Aug' : '08',

'Sep' : '09',

'Oct' : '10',

'Nov' : '11',

'Dec' : '12'}

Rec1 = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(?P<day>\d{2})/(?P<month>\w+)/(?P<year>\d{4}):(?P<hour>\d{2}):(?P<min>\d{2}):(?P<sec>\d{2})\s\+\d{4}\].*")

Rec2 = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2})\s\+\d{4}\].*")

Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"'

def epoch():

m = Rec1.match(Line)

if m:

result = m.groupdict()

date_tuple = (result["year"], Shortmonth[result["month"]], result["day"], result["hour"], result["min"], result["sec"], -1, -1, -1)

date_tuple = map(int,date_tuple)

return time.mktime(date_tuple)

def strptime():

m = Rec2.match(Line)

if m:

date_string = m.group(1)

return time.strptime(date_string, "%d/%b/%Y:%H:%M:%S")

if __name__ == "__main__":

t1 = timeit.Timer("epoch()","from __main__ import epoch")

t2 = timeit.Timer("strptime()", "from __main__ import strptime")

cProfile.run("t1.timeit();print")

print ""

cProfile.run("t2.timeit();print")
 
C

Casey Webster

Rec = re.compile(r"^\d{1,3}.\d{1,3}.\d{1,3}.\d{1,3}\s-\s\d+\s\[(\d{2}/\w+/\d{4}:\d{2}:\d{2}:\d{2})\s\+\d{4}\].*")
Line = '1.2.3.4 - 4459 [02/Jul/2009:01:50:26 +0200] "GET /foo HTTP/1.0" 200 - "-" "www.example.org" "-" "-" "-"'

I'm not sure how much it will help but if you are only using the regex
to get the date/time group element, it might be faster to replace the
regex with:
date_string = Line.split()[3][1:-1]
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,534
Members
45,007
Latest member
obedient dusk

Latest Threads

Top