Parsing apache log files


Jim Richardson

I am pulling apart some big Apache logs (800-1000MB) for some analysis,
and stuffing the data into a MySQL database. Most of it goes OK, despite my
meager coding abilities. But every so often I run across "borken" bits
of data, like user agent strings that include "'/\ and such. Although
they are escaped by Apache when it writes the log, they break my somewhat
clunky splits.

A typical (good) line looks like this:


111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] "GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - "http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

which I can split fine, by splitting on the " first, then splitting each
bit up on the appropriate thing, mostly spaces. But occasionally I get
something like


11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] "GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 "http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\"]"

Note the [ru\" at the end.
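To make the failure concrete, here is a small sketch of the quote-splitting approach described above. It is only a guess at the shape of the splitting code, not the actual script: a good line splits into 7 pieces on ", while the escaped quote in the second line produces an 8th piece and shifts the user agent.

```python
# The two sample lines from this post. The second has \" inside the user agent.
good = ('111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] '
        '"GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - '
        '"http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"')
bad = ('11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] '
       '"GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 '
       '"http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" '
       '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\\"]"')

# Naive approach: split on every double quote, escaped or not.
good_parts = good.split('"')
bad_parts = bad.split('"')

print(len(good_parts), len(bad_parts))
# The user agent of the bad line is cut in two at the escaped quote.
print(bad_parts[5:])
```

A cheap sanity check that falls out of this: any line that does not split into exactly 7 pieces can be set aside for special handling.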


I am looking for a way to strip out the IP, day, time, requested url,
referrer, bytes, status, and user agent, and what I have, though a bit
crufty, works 99.99% of the time, but then something like this shows up.

I have a couple of approaches. One is to reject the bad entries, save
them to a file, then enter them manually. The problem is that with
10 million entries, and about 1 in 1000 being bad, that is around
10,000 lines to fix by hand...


Although as I write this, I think maybe I can use the \ to warn me, and
behave accordingly? hm, I'll have to try that.
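One possible way to act on the backslash idea, offered as a sketch rather than a tested solution: instead of splitting on every ", match each quoted field with a pattern that treats a backslash plus the character after it as part of the field. The names QUOTED and quoted_fields are made up for illustration.

```python
import re

# A quoted field: an opening ", then any mix of
#   - a character that is neither " nor backslash, or
#   - a backslash followed by any one character (so \" stays inside),
# then the closing ".
QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')

def quoted_fields(line):
    """Return the contents of each double-quoted field, honoring \\" escapes."""
    return QUOTED.findall(line)

line = ('11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] '
        '"GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 '
        '"http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" '
        '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\\"]"')

request, referrer, agent = quoted_fields(line)
print(agent)
```

The unquoted parts (IP, timestamp, status, bytes) still have to be picked up separately, so this only replaces the first split on ".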

In the meantime, is there some obvious method, or module that I have
missed ?
 

Josiah Carlson

In the meantime, is there some obvious method, or module that I have
missed ?

I use a regular expression:
import re
rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) - - \[([^\[\]:]+):'
                  r'(\d+:\d+:\d+) -(\d\d\d\d)\] ("[^"]*") '
                  r'(\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

a = rexp.match(line)
if a is not None:
    a.group(1) #IP address
    a.group(2) #day/month/year
    a.group(3) #time of day
    a.group(4) #timezone
    a.group(5) #request
    a.group(6) #code 200 for success, 404 for not found, etc.
    a.group(7) #bytes transferred
    a.group(8) #referrer
    a.group(9) #browser
else:
    #this line did not match.
    pass

That should work for most any line you get, but you may want to run it
over a few megs of your logs just to check whether that else branch is
ever reached for a non-empty line.
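A rough sketch of that check, for what it is worth. It uses the pattern above with the timezone group written as (\d\d\d\d)\], and the helper name check_lines is hypothetical. In real use you would pass it an open log file instead of the small sample list.

```python
import re

# Same pattern as above, with the timezone group closed before the bracket.
rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) - - \[([^\[\]:]+):'
                  r'(\d+:\d+:\d+) -(\d\d\d\d)\] ("[^"]*") '
                  r'(\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

def check_lines(lines):
    """Return (number of lines matched, list of non-empty lines that did not match)."""
    matched = 0
    rejects = []
    for line in lines:
        if not line.strip():
            continue  # blank lines are not interesting
        if rexp.match(line):
            matched += 1
        else:
            rejects.append(line)
    return matched, rejects

sample = [
    '111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] '
    '"GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - '
    '"http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"\n',
    '\n',
    'this is not a log line\n',
]
matched, rejects = check_lines(sample)
print(matched, len(rejects))
```

Writing the rejects to a separate file gives a short list to eyeball, which shows quickly whether the pattern needs loosening.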

- Josiah
 

Paul McGuire

Jim Richardson said:
I am pulling apart some big Apache logs (800-1000MB) for some analysis,
and stuffing the data into a MySQL database. Most of it goes OK, despite my
meager coding abilities. But every so often I run across "borken" bits
of data, like user agent strings that include "'/\ and such. Although
they are escaped by Apache when it writes the log, they break my somewhat
clunky splits.
The pyparsing examples directory includes an HTTP server log parser. Using
your data, there was one minor error: the bytesSent field in the first line
was just a dash instead of an integer. After correcting for that, I ran it
against your test lines and got this output:

fields.numBytesSent = -
fields.timestamp = ['16/Feb/2004:04:09:49', '-0800']
fields.clientSfw = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
fields.referrer = http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512
fields.cmd = ['GET', '/ads/redirectads/336x280redirect.htm', 'HTTP/1.1']
fields.ipAddr = 111.111.111.11
fields.statusCode = 304

fields.numBytesSent = 541
fields.timestamp = ['16/Feb/2004:10:35:12', '-0800']
fields.clientSfw = Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru
fields.referrer = http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56
fields.cmd = ['GET', '/ads/redirectads/468x60redirect.htm', 'HTTP/1.1']
fields.ipAddr = 11.111.11.111
fields.statusCode = 200

Download pyparsing at http://pyparsing.sourceforge.net.

Here's the change you'll have to make to the example:

Change:
integer.setResultsName("statusCode") +
integer.setResultsName("numBytesSent") +
to:
(integer | "-").setResultsName("statusCode") +
(integer | "-").setResultsName("numBytesSent") +

-- Paul
 

Jim Richardson

thanks, although reading that re makes my brain hurt! :), and I don't
think it handles the case where the dashes are something else (the dash
is a place holder for some data that wasn't there on this request,
bytelength, referrer, something) but I'll look into it, thanks for the
example.
 

Jim Richardson

Paul McGuire said:
pyparsing examples directory includes an HTTP server log parser.

<snip>

now *this* looks interesting. Thanks a lot!
 

Josiah Carlson

thanks, although reading that re makes my brain hurt! :), and I don't
think it handles the case where the dashes are something else (the dash
is a place holder for some data that wasn't there on this request,
bytelength, referrer, something) but I'll look into it, thanks for the
example.

It depends on which dash you were talking about. The dash immediately
after the response code is the number of bytes sent, and is handled by
the regular expression.

Unless you use identd checks, the first '-' will always be there, though
the second '-' is the identity of the client given through http auth,
which may or may not be important to you.

Modifying the regular expression:
import re
rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) (-|\w*) (-|\w*) '
                  r'\[([^\[\]:]+):(\d+:\d+:\d+) -(\d\d\d\d)\] '
                  r'("[^"]*") (\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

a = rexp.match(line)
if a is not None:
    a.group(1) #IP address
    a.group(2) #identd response (if any)
    a.group(3) #http auth user
    a.group(4) #day/month/year
    a.group(5) #time of day
    a.group(6) #timezone
    a.group(7) #request
    a.group(8) #code 200 for success, 404 for not found, etc.
    a.group(9) #bytes transferred
    a.group(10) #referrer
    a.group(11) #browser
else:
    #this line did not match.
    pass
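
To get from those groups to something a database insert can use, a sketch along these lines might help. It assumes the timezone group is written as (\d\d\d\d)\]. The function parse_line, the FIELDS key names, and the sample line with user "jim" are all made up for illustration.

```python
import re

rexp = re.compile(r'(\d+\.\d+\.\d+\.\d+) (-|\w*) (-|\w*) '
                  r'\[([^\[\]:]+):(\d+:\d+:\d+) -(\d\d\d\d)\] '
                  r'("[^"]*") (\d+) (-|\d+) ("[^"]*") (".*")\s*\Z')

# One key per group, in group order (names are illustrative only).
FIELDS = ('ip', 'ident', 'user', 'day', 'time', 'tz',
          'request', 'status', 'bytes', 'referrer', 'agent')

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = rexp.match(line)
    if m is None:
        return None
    rec = dict(zip(FIELDS, m.groups()))
    rec['status'] = int(rec['status'])
    # A dash means no byte count was logged; store that as None.
    rec['bytes'] = None if rec['bytes'] == '-' else int(rec['bytes'])
    # The three quoted groups still carry their surrounding quotes.
    for key in ('request', 'referrer', 'agent'):
        rec[key] = rec[key][1:-1]
    return rec

line = ('11.111.11.111 - jim [16/Feb/2004:10:35:12 -0800] '
        '"GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 '
        '"-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\\"]"')
rec = parse_line(line)
print(rec['user'], rec['bytes'], rec['agent'])
```

Note that the pattern only accepts timezones written with a leading minus, which fits logs like these but would reject a server logging a + offset.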


There you go.
- Josiah
 

Jim Richardson

It depends on which dash you were talking about. The dash immediately
after the response code is the number of bytes sent, and is handled by
the regular expression.

Unless you use identd checks, the first '-' will always be there, though
the second '-' is the identity of the client given through http auth,
which may or may not be important to you.

<snip>

It was the HTTP auth field, which for some reason shows up from time to
time. It may be a misconfigured router/proxy, I don't know. But this works,
although my brain is still parsing the regexp :) Thank you very much for
your help.
 

Josiah Carlson

It was the HTTP auth field, which for some reason shows up from time to
time. It may be a misconfigured router/proxy, I don't know. But this works,
although my brain is still parsing the regexp :) Thank you very much for
your help.

It is relatively easy to generate the regular expression by hand, I did.
But I agree, it is a bit dense if you didn't do it. I need to pull
out the python docs every time I see a regular expression that I didn't
write.

- Josiah
 

Jim Richardson

It is relatively easy to generate the regular expression by hand, I did.
But I agree, it is a bit dense if you didn't do it. I need to pull
out the python docs every time I see a regular expression that I didn't
write.

- Josiah


I can parse it if I think hard about what it does :) I guess that means
that the python interp is smarter than me :) Thanks again.
 

Jim Richardson

I can parse it if I think hard about what it does :) I guess that means
that the python interp is smarter than me :) Thanks again.


Oh, and there was a bug! I found a bug! woot!

(hey, I'm not very good at this, It's cool when I fix a bug)
 
