Parsing apache log files

Jim Richardson · Feb 20, 2004

I am pulling apart some big apache logs (800-1000MB) for some analysis,
and stuffing it into a MySQL database. Most of it goes ok, despite my
meager coding abilities. But every so often I run across "borken" bits
of data, like user agent strings that include "'/\ and such, although
they are escaped by apache in writing the log, they break up my somewhat
clunky splits.

A typical (good) line looks like this:

111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] "GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - "http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

which I can split fine by splitting on the " first, then splitting each
bit on the appropriate thing, mostly spaces. But occasionally I get
something like

11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] "GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 "http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\"]"

note the [ru\" at the end.

I am looking for a way to strip out the IP, day, time, requested url,
referrer, bytes, status, and user agent, and what I have, though a bit
crufty, works 99.99% of the time, but then something like this shows up.

I have a couple of approaches. One is to reject the bad entries, save
them to a file, then enter them manually. The problem is that with 10
million entries and about 1 in 1000 being bad, that is roughly 10,000
lines to fix by hand.

Although as I write this, I think maybe I can use the \ to warn me, and
behave accordingly? hm, I'll have to try that.

In the meantime, is there some obvious method, or module that I have
missed ?
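
A minimal sketch of the backslash idea mentioned above, not a tested production parser: treat a backslash plus the character after it as part of the field, so an escaped quote never ends the field. The names `QUOTED` and `quoted_fields` are made up for illustration.

```python
import re

# A double-quoted field: any run of characters that are neither a quote nor
# a backslash, or a backslash followed by any single (escaped) character.
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

def quoted_fields(line):
    """Return the contents of each double-quoted field, escapes left intact."""
    return QUOTED.findall(line)

# The problem line from the post above.
line = (r'11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] '
        r'"GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 '
        r'"http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" '
        r'"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\"]"')

request, referrer, agent = quoted_fields(line)
print(agent)
```

The unquoted parts (IP, timestamp, status, bytes) would still need their own split; this only shows that the three quoted fields can be pulled out without tripping over the `\"`.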

Josiah Carlson · Feb 20, 2004

Jim Richardson said:
In the meantime, is there some obvious method, or module that I have
missed ?

I use a regular expression:

    import re

    # Raw strings keep the backslashes intact; note the escaped \] after
    # the timezone digits.
    rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) - - \[([^\[\]:]+):'
                      r'(\d+:\d+:\d+) -(\d\d\d\d)\] ("[^"]*") '
                      r'(\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

    a = rexp.match(line)
    if a is not None:
        a.group(1)  # IP address
        a.group(2)  # day/month/year
        a.group(3)  # time of day
        a.group(4)  # timezone
        a.group(5)  # request
        a.group(6)  # code 200 for success, 404 for not found, etc.
        a.group(7)  # bytes transferred
        a.group(8)  # referrer
        a.group(9)  # browser
    else:
        pass  # this line did not match

That should work for most any line you get, but you may want to run it
over a few megs of your logs just to check and see if that else
statement is ever caught for a non-empty line.

- Josiah
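
The check suggested above could be sketched roughly like this. It is only an illustration: `LOG_RE` and `audit` are names invented here, and the pattern is the one from this post with the closing bracket written as `\]`.

```python
import re

# The pattern from the post, with the closing bracket escaped as \] .
LOG_RE = re.compile(r'(\d+\.\d+\.\d+\.\d+) - - \[([^\[\]:]+):'
                    r'(\d+:\d+:\d+) -(\d\d\d\d)\] ("[^"]*") '
                    r'(\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

def audit(lines):
    """Return (matched_count, rejects) for an iterable of log lines."""
    matched = 0
    rejects = []
    for line in lines:
        if not line.strip():
            continue  # blank lines are not interesting
        if LOG_RE.match(line):
            matched += 1
        else:
            rejects.append(line)  # keep for later inspection
    return matched, rejects

sample = [
    '111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] '
    '"GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - '
    '"http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"',
    '',
    'not a log line',
]
matched, rejects = audit(sample)
print(matched, len(rejects))
```

Since a file object is an iterable of lines, a slice of a real log could be passed in the same way, and the rejects written out to a separate file for a closer look.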

Paul McGuire · Feb 20, 2004

Jim Richardson said:
I am pulling apart some big apache logs (800-1000MB) for some analysis,
and stuffing it into a MySQL database. Most of it goes ok, despite my
meager coding abilities. But every so often I run across "borken" bits
of data, like user agent strings that include "'/\ and such, although
they are escaped by apache in writing the log, they break up my somewhat
clunky splits.

The pyparsing examples directory includes an HTTP server log parser. Using
your data, there was one minor error where the bytesSent field in the first
line was just a dash instead of an integer. After correcting that, I ran it
against your test lines and got this output:

fields.numBytesSent = -
fields.timestamp = ['16/Feb/2004:04:09:49', '-0800']
fields.clientSfw = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
fields.referrer = http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512
fields.cmd = ['GET', '/ads/redirectads/336x280redirect.htm', 'HTTP/1.1']
fields.ipAddr = 111.111.111.11
fields.statusCode = 304

fields.numBytesSent = 541
fields.timestamp = ['16/Feb/2004:10:35:12', '-0800']
fields.clientSfw = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru
fields.referrer = http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56
fields.cmd = ['GET', '/ads/redirectads/468x60redirect.htm', 'HTTP/1.1']
fields.ipAddr = 11.111.11.111
fields.statusCode = 200

Download pyparsing at http://pyparsing.sourceforge.net.

Here's the change you'll have to make to the example:

Change:

    integer.setResultsName("statusCode") +
    integer.setResultsName("numBytesSent") +

to:

    (integer | "-").setResultsName("statusCode") +
    (integer | "-").setResultsName("numBytesSent") +

-- Paul

Jim Richardson · Feb 20, 2004

Josiah Carlson said:
I use a regular expression:

<snip>

Thanks, although reading that re makes my brain hurt, and I don't
think it handles the case where the dashes are something else (the dash
is a placeholder for some data that wasn't there on this request:
bytelength, referrer, something). But I'll look into it; thanks for the
example.

Jim Richardson · Feb 20, 2004

Paul McGuire said:
pyparsing examples directory includes an HTTP server log parser.

<snip>

now *this* looks interesting. Thanks a lot!

Josiah Carlson · Feb 20, 2004

Jim Richardson said:
thanks, although reading that re makes my brain hurt, and I don't
think it handles the case where the dashes are something else (the dash
is a place holder for some data that wasn't there on this request,
bytelength, referrer, something) but I'll look into it, thanks for the
example.

It depends on which dash you were talking about. The dash immediately
after the response code is the number of bytes sent, and is handled by
the regular expression.

Unless you use identd checks, the first '-' will always be there, though
the second '-' is the identity of the client given through http auth,
which may or may not be important to you.

Modifying the regular expression:

    import re

    # Groups 2 and 3 now capture the identd and http auth fields
    # instead of requiring a literal '-'.
    rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) (-|\w*) (-|\w*) '
                      r'\[([^\[\]:]+):(\d+:\d+:\d+) -(\d\d\d\d)\] '
                      r'("[^"]*") (\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

    a = rexp.match(line)
    if a is not None:
        a.group(1)   # IP address
        a.group(2)   # identd response (if any)
        a.group(3)   # http auth user
        a.group(4)   # day/month/year
        a.group(5)   # time of day
        a.group(6)   # timezone
        a.group(7)   # request
        a.group(8)   # code 200 for success, 404 for not found, etc.
        a.group(9)   # bytes transferred
        a.group(10)  # referrer
        a.group(11)  # browser
    else:
        pass  # this line did not match

There you go.
- Josiah
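
Since the fields are headed for a database, here is a hedged sketch of turning the captured strings into typed values. `to_row` is a hypothetical helper, not part of anything posted above. It assumes the timezone group holds only the four digits after the literal `-`, as in the pattern above, and that month names are in English (`%b` follows the current locale).

```python
from datetime import datetime

def to_row(day, time_of_day, tz, status, nbytes):
    """Convert captured strings into values suitable for a database row."""
    # tz is the four digits captured after the literal '-', e.g. '0800',
    # so the sign is put back before parsing.
    stamp = datetime.strptime(f'{day}:{time_of_day} -{tz}',
                              '%d/%b/%Y:%H:%M:%S %z')
    # A '-' in the bytes field means no bytes were sent; store it as None.
    size = None if nbytes == '-' else int(nbytes)
    return stamp, int(status), size

# Values as they would come out of groups 4, 5, 6, 8 and 9.
print(to_row('16/Feb/2004', '04:09:49', '0800', '304', '-'))
```

Keeping the dash as `None` rather than 0 preserves the difference between "nothing sent" and "field missing"; whether that matters depends on the analysis.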

Jim Richardson · Feb 21, 2004

It depends on which dash you were talking about. The dash immediately
after the response code is the number of bytes sent, and is handled by
the regular expression.

Unless you use identd checks, the first '-' will always be there, though
the second '-' is the identity of the client given through http auth,
which may or may not be important to you.

<snip>

It was the http auth, which for some reason shows up from time to time;
maybe a misconfigured router/proxy, I don't know. But this works, although
my brain is still parsing the regexp. Thank you very much for your
help.

Josiah Carlson · Feb 21, 2004

Jim Richardson said:
It was the http auth, which for some reason, show up from time to time,
may be a misconfigured router/proxy don't know. But this works, although
my brain is still parsing the regexp. Thank you very much for your
help.

It is relatively easy to build the regular expression by hand; I did.
But I agree, it is a bit dense if you didn't write it. I need to pull
out the python docs every time I see a regular expression that I didn't
write.

- Josiah
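
For anyone finding the pattern dense, one option is the `re.VERBOSE` flag with named groups. The sketch below is a rewrite of the modified expression from this thread, not a different parser; the group names (`ip`, `ident`, `user`, and so on) are chosen here for illustration.

```python
import re

# In VERBOSE mode, whitespace in the pattern is ignored and '#' starts a
# comment, so literal spaces are written as [ ].
LOG_RE = re.compile(r'''
    (?P<ip>\d+\.\d+\.\d+\.\d+)[ ]     # client IP address
    (?P<ident>-|\w*)[ ]               # identd response, usually a dash
    (?P<user>-|\w*)[ ]                # http auth user, or a dash
    \[(?P<day>[^\[\]:]+):             # day/month/year
    (?P<time>\d+:\d+:\d+)[ ]          # time of day
    -(?P<tz>\d\d\d\d)\][ ]            # timezone digits after the minus sign
    (?P<request>"[^"]*")[ ]           # request line, quotes included
    (?P<status>\d+)[ ]                # status code
    (?P<bytes>-|\d+)[ ]               # bytes sent, or a dash
    (?P<referrer>"[^"]*")[ ]          # referrer, quotes included
    (?P<agent>".*")\s*\Z              # user agent, quotes included
''', re.VERBOSE)

line = (r'11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] '
        r'"GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 '
        r'"http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" '
        r'"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\"]"')

m = LOG_RE.match(line)
if m is not None:
    print(m.group('ip'), m.group('status'), m.group('agent'))
```

Fields can then be read with `m.group('status')` or `m.groupdict()` instead of counting parentheses.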

Jim Richardson · Feb 21, 2004

It is relatively easy to generate the regular expression by hand, I did.
But I agree, it is a bit dense if you didn't do it. I need to pull
out the python docs every time I see a regular expression that I didn't
write.

- Josiah

I can parse it if I think hard about what it does. I guess that means
that the python interp is smarter than me.

Thanks again.

Jim Richardson · Feb 22, 2004

I can parse it if I think hard about what it does. I guess that means
that the python interp is smarter than me. Thanks again.

Oh, and there was a bug! I found a bug! woot!

(hey, I'm not very good at this, It's cool when I fix a bug)
