Trouble with quotes

  • Thread starter Stephen Nelson-Smith

Stephen Nelson-Smith

Hi,

I've written some (primitive) code to parse some Apache logfiles and
establish whether Apache has appended a session cookie to the end.
We're finding that some browsers don't send one, and in that case
Apache doesn't append a "-"; it just omits the field.

It's working fine, but for an edge case:

Couldn't match 192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] "GET
http://sekrit.com/node/175523 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:31:15 +0100] "GET
http://sekrit.com/node/175521 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:32:07 +0100] "GET
http://sekrit.com/node/175520 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:32:33 +0100] "GET
http://sekrit.com/node/175522 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [24/Feb/2010:20:33:01 +0100] "GET
http://sekrit.com/node/175527 HTTP/1.1" 200 -
"http://sekrit.com/search/results/"3%2B2%20course"" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"
Couldn't match 192.168.1.107 - - [25/Feb/2010:17:01:54 +0100] "GET
http://sekrit.com/search/results/ HTTP/1.0" 200 -
"http://sekrit.com/search/results/"guideline%20grids"&page=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)"
Couldn't match 192.168.1.107 - - [25/Feb/2010:17:02:15 +0100] "GET
http://sekrit.com/search/results/ HTTP/1.0" 200 -
"http://sekrit.com/search/results/"guideline%20grids"&page=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR
1.1.4322; .NET CLR 2.0.50727)"

If there are embedded double quotes (" ") inside the referrer string,
my regex breaks.

Here's the code:

#!/usr/bin/env python
import re

# One long regex; Python joins the adjacent raw strings into a single pattern.
pattern = (
    r'(?P<ForwardedFor>^(-|[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}'
    r'(, [0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})*){1}) '
    r'(?P<RemoteLogname>(\S*)) (?P<RemoteUser>(\S*)) '
    r'(?P<Timestamp>(\[[^\]]+\])) '
    r'(?P<FirstLineOfRequest>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<Status>(\S*)) (?P<Size>(\S*)) '
    r'(?P<Referrer>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?) '
    r'(?P<UserAgent>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
    r'( )?(?P<SiteIntelligenceCookie>(\"([^"\\]*(?:\\.[^"\\]*)*)\")?)'
)

regex = re.compile(pattern)

lines = 0
no_cookies = 0
unmatched = 0

for line in open('/home/stephen/scratch/test-data.txt'):
    lines += 1
    line = line.strip()
    match = regex.match(line)

    if match:
        data = match.groupdict()
        # An empty cookie group means the trailing field was omitted.
        if data['SiteIntelligenceCookie'] == '':
            no_cookies += 1
    else:
        print("Couldn't match %s" % (line,))
        unmatched += 1

print("I analysed %s lines." % (lines,))
print("There were %s lines with missing Site Intelligence cookies." % (no_cookies,))
print("I was unable to process %s lines." % (unmatched,))

How can I make the regex a bit more resilient so it doesn't break when
" " is embedded?
 

Martin P. Hellwig

I didn't try to mentally parse the regex pattern (I like to keep
reasonably sane). However from the sounds of it the script barfs when
there is a quoted part in the second URL part. So how about doing a
simple string.replace('/"','') & string.replace('" ','') before doing
your re foo?
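
One caveat with a blanket replace of '" ': it would also remove the closing quote and space after every quoted field, so the field boundaries would be lost. An alternative is to leave the line alone and loosen the quoted-field part of the regex instead. The following is only a sketch, assuming Python 3 and assuming that a closing quote is always followed by whitespace or the end of the line; any other double quote is treated as part of the field. The names QUOTED, IP, LOG_LINE, SAMPLE and has_cookie are made up for this illustration, and the cookie value "abc123" is hypothetical.

```python
import re

# A quoted field: the closing quote must be followed by whitespace or the
# end of the line; any other double quote counts as field content.
QUOTED = r'"(?:[^"\\]|\\.|"(?!\s|$))*"'
IP = r'\d{1,3}(?:\.\d{1,3}){3}'

LOG_LINE = re.compile(''.join([
    r'^(?P<ForwardedFor>-|', IP, r'(?:, ', IP, r')*) ',
    r'(?P<RemoteLogname>\S*) (?P<RemoteUser>\S*) ',
    r'(?P<Timestamp>\[[^\]]+\]) ',
    r'(?P<FirstLineOfRequest>(?:', QUOTED, r')?) ',
    r'(?P<Status>\S*) (?P<Size>\S*) ',
    r'(?P<Referrer>(?:', QUOTED, r')?) ',
    r'(?P<UserAgent>(?:', QUOTED, r')?)',
    # The trailing cookie field is optional as a whole.
    r'(?: (?P<SiteIntelligenceCookie>', QUOTED, r'))?$',
]))

# One of the failing lines from the thread, joined back onto a single line.
SAMPLE = (
    '192.168.1.107 - - [24/Feb/2010:20:30:44 +0100] '
    '"GET http://sekrit.com/node/175523 HTTP/1.1" 200 - '
    '"http://sekrit.com/search/results/"3%2B2%20course"" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; GTB6.4)"'
)


def has_cookie(line):
    """Return None if the line does not parse, else True/False."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    return match.group('SiteIntelligenceCookie') is not None


print(has_cookie(SAMPLE))                # line without a trailing cookie field
print(has_cookie(SAMPLE + ' "abc123"'))  # same line with a hypothetical cookie
```

In this sketch a missing cookie shows up as None for the SiteIntelligenceCookie group rather than as an empty string, so the check in the original loop would need adjusting. A field that itself contains a quote followed by a space would still confuse it.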
 
