Jim Richardson
I am pulling apart some big Apache logs (800-1000MB) for some analysis,
and stuffing them into a MySQL database. Most of it goes ok, despite my
meager coding abilities. But every so often I run across "borken" bits
of data, like user agent strings that include "'/\ and such. Although
they are escaped by Apache in writing the log, they break up my somewhat
clunky splits.
A typical (good) line looks like this:
111.111.111.11 - - [16/Feb/2004:04:09:49 -0800] "GET /ads/redirectads/336x280redirect.htm HTTP/1.1" 304 - "http://www.foobarp.org/theme_detail.php?type=vs&cat=0&mid=27512" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"
which I can split fine, by splitting on the " first, then splitting each
bit up on the appropriate thing, mostly spaces. But occasionally I get
something like
11.111.11.111 - - [16/Feb/2004:10:35:12 -0800] "GET /ads/redirectads/468x60redirect.htm HTTP/1.1" 200 541 "http://11.11.111.11/adframe.php?n=ad1f311a&what=zone:56" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1) Opera 7.20 [ru\"]"
Note the [ru\" at the end.
I am looking for a way to pull out the IP, day, time, requested URL,
referrer, bytes, status, and user agent. What I have, though a bit
crufty, works 99.99% of the time, but then something like this shows up.
I have a couple of approaches. One is to reject the bad entries, save
them to a file, then manually enter them. Problem is, with 10 million
entries, and about 1 in 1000 being bad...
Although as I write this, I think maybe I can use the \ to warn me, and
behave accordingly? hm, I'll have to try that.
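A rough sketch of that backslash idea, assuming Python (the post does not say which language the splitting code is in) and assuming every line follows the layout of the two samples above. The names QUOTED, LINE_RE and parse_line are made up for illustration. Instead of splitting on ", a regular expression treats a backslash plus the next character as part of the quoted field, so an escaped \" no longer ends it:

```python
import re

# A quoted field: any run of characters that are neither a quote nor a
# backslash, or a backslash followed by any one character (so \" stays inside).
QUOTED = r'"((?:[^"\\]|\\.)*)"'

LINE_RE = re.compile(
    r'(\S+) \S+ \S+ '             # client IP, then two fields we ignore
    r'\[([^:\]]+):([^\]]+)\] '    # [day:time zone]
    + QUOTED + ' '                # request line, e.g. GET /path HTTP/1.1
    + r'(\d{3}) (\d+|-) '         # status, bytes ("-" when nothing was sent)
    + QUOTED + ' ' + QUOTED       # referrer, user agent
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not fit the pattern."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ip, day, time, request, status, size, referrer, agent = m.groups()
    parts = request.split()
    # The requested URL is the second word of the request line when present.
    url = parts[1] if len(parts) >= 2 else request
    return {
        "ip": ip,
        "day": day,
        "time": time,
        "url": url,
        "status": int(status),
        "bytes": None if size == "-" else int(size),
        "referrer": referrer,
        "agent": agent,  # escapes such as \" are left exactly as logged
    }
```

Lines for which parse_line returns None could still be written to a reject file, which should leave far fewer to fix by hand. The values would still need proper quoting (or query parameters) when they go into MySQL.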
In the meantime, is there some obvious method, or module that I have
missed ?