Logfile analysing with pyparsing

A

Andi Clemens

Hi,

we had some problems in the last weeks with our mailserver.
Some messages were not delivered and we wanted to know why.
But looking through the logfile is a time consuming process.
So I wanted to write a parser to analyse the logs and parse them as XML.

But I have never written a parser before and know I'm sitting in front
of the logfile trying to write the grammar for pyparsing.

First of all I need to know if it is possible to parse that kind of info
into XML.
Here is an excerpt of the logfile lines I'm interested in:

Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
message-id=<[email protected]>
Sep 18 04:15:22 mailrelay spamd[1364]: spamd: processing message
<[email protected]> for nobody:65534
Sep 18 04:15:25 mailrelay spamd[1364]: spamd: result: Y 15 -
BAYES_99,DATE_IN_PAST_03_06,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,DNS_FROM_RFC_WHOIS,FORGED_MUA_OUTLOOK,SPF_SOFTFAIL
scantime=3.1,size=8086,user=nobody,uid=65534,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=55277,mid=<[email protected]>,bayes=1,autolearn=no

Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
to=<[email protected]>, relay=10.49.0.7[10.49.0.7],
delay=1, status=sent (250 2.6.0
<[email protected]> Queued mail for delivery)

They are filtered by "message-id", so all these lines above have
something to do with the message
"(e-mail address removed)".

The original logfile is about 25 MB big, so I can't post all of the
lines of course ;-)

Looking at these lines I realized that there are "Queue IDs":
755387301
DA1431965E
EF90720AD

Filtering the log for these IDs results in the following lines:

Sep 18 02:15:11 mailrelay postfix/smtpd[10841]: 755387301:
client=unknown[194.25.242.123]
Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
message-id=<[email protected]>
Sep 18 04:15:22 mailrelay postfix/qmgr[11082]: 755387301:
from=<[email protected]>, size=8152, nrcpt=7 (queue active)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: 755387301: removed

Sep 18 04:15:25 mailrelay postfix/pickup[13175]: DA1431965E: uid=65534
from=<nobody>
Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
message-id=<[email protected]>
Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: DA1431965E:
from=<[email protected]>, size=11074, nrcpt=1 (queue active)
Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
to=<[email protected]>, relay=localhost[127.0.0.1],
delay=1, status=sent (250 Ok: queued as EF90720AD)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: DA1431965E: removed

Sep 18 04:15:25 mailrelay postfix/smtpd[11704]: EF90720AD:
client=localhost[127.0.0.1]
Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
to=<[email protected]>, relay=localhost[127.0.0.1],
delay=1, status=sent (250 Ok: queued as EF90720AD)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD:
from=<[email protected]>, size=11263, nrcpt=1 (queue active)
Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
to=<[email protected]>, relay=10.49.0.7[10.49.0.7],
delay=1, status=sent (250 2.6.0
<[email protected]> Queued mail for delivery)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD: removed

All this work is done with command line and grep...

Is it possible to parse this big logfile only ONCE and extract all this
info into XML?

Like this:

<message id="(e-mail address removed)">
<timestamp>Sep 18 04:15:26</timestamp>
<from>[email protected]</from>
<to>[email protected]</to>
<to>[email protected]</to>
<to>[email protected]</to>
<to>[email protected]</to>
<queueID>EF90720AD</queueID>
<queueID>DA1431965E</queueID>
<queueID>755387301</queueID>
<spamd>
<score>15</score>
<filtered>yes</filtered>
<sendto>[email protected]</sendto>
</spamd>
</message>

The goal of this is to provide a web interface were we can see if the
messages were filtered as spam (or deleted by our virus scanner).

Is it possible? Or do I have to scan / parse the file more than once?

Andi
 
N

Nick Vatamaniuc

You can parse it just once, you just have to setup your data structure
(the structure of your XML schema) and fill it up as you parse.

For example, you can represent you data structure as a dictionaries in
Python:

message={
MID : {
' timestamp' : TIMESTAMP,
'from':FROM,
'tolist':[TO1, TO2, ... ],
'qids': [QID1, QID2, ...],
'spam_score':S_SCORE,
'spam_filtered':Y/N,
'spam_sentdo':S_SENDTO,
}
}


Note: I inferred this from your XML.

Then as you parse each record(line?) identify what token is what and
fill in the corresponding data structure. So if you pass by a line that
has the message_id, queue_id and a send_to fields, check if the
corresponding message_id dictionary has been created (if not do so) and
fill in what you just found out. Later on you might see another record
that will help you fill in other information regarding this particular
message_id (such as a 'timestamp' or a 'from' field).

Then translating the data into XML should be fairly easy.

Note: I am not familiar with the syntax of the mail log so I presented
a general idea only. My assumptions about the syntax might have been
wrong.

Hope this helps,

Nick Vatamaniuc




Andi said:
Hi,

we had some problems in the last weeks with our mailserver.
Some messages were not delivered and we wanted to know why.
But looking through the logfile is a time consuming process.
So I wanted to write a parser to analyse the logs and parse them as XML.

But I have never written a parser before and know I'm sitting in front
of the logfile trying to write the grammar for pyparsing.

First of all I need to know if it is possible to parse that kind of info
into XML.
Here is an excerpt of the logfile lines I'm interested in:

Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
message-id=<[email protected]>
Sep 18 04:15:22 mailrelay spamd[1364]: spamd: processing message
<[email protected]> for nobody:65534
Sep 18 04:15:25 mailrelay spamd[1364]: spamd: result: Y 15 -
BAYES_99,DATE_IN_PAST_03_06,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,DNS_FROM_RFC_WHOIS,FORGED_MUA_OUTLOOK,SPF_SOFTFAIL
scantime=3.1,size=8086,user=nobody,uid=65534,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=55277,mid=<[email protected]>,bayes=1,autolearn=no

Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
to=<[email protected]>, relay=10.49.0.7[10.49.0.7],
delay=1, status=sent (250 2.6.0
<[email protected]> Queued mail for delivery)

They are filtered by "message-id", so all these lines above have
something to do with the message
"(e-mail address removed)".

The original logfile is about 25 MB big, so I can't post all of the
lines of course ;-)

Looking at these lines I realized that there are "Queue IDs":
755387301
DA1431965E
EF90720AD

Filtering the log for these IDs results in the following lines:

Sep 18 02:15:11 mailrelay postfix/smtpd[10841]: 755387301:
client=unknown[194.25.242.123]
Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
message-id=<[email protected]>
Sep 18 04:15:22 mailrelay postfix/qmgr[11082]: 755387301:
from=<[email protected]>, size=8152, nrcpt=7 (queue active)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: 755387301: removed

Sep 18 04:15:25 mailrelay postfix/pickup[13175]: DA1431965E: uid=65534
from=<nobody>
Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
message-id=<[email protected]>
Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: DA1431965E:
from=<[email protected]>, size=11074, nrcpt=1 (queue active)
Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
to=<[email protected]>, relay=localhost[127.0.0.1],
delay=1, status=sent (250 Ok: queued as EF90720AD)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: DA1431965E: removed

Sep 18 04:15:25 mailrelay postfix/smtpd[11704]: EF90720AD:
client=localhost[127.0.0.1]
Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
to=<[email protected]>, relay=localhost[127.0.0.1],
delay=1, status=sent (250 Ok: queued as EF90720AD)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD:
from=<[email protected]>, size=11263, nrcpt=1 (queue active)
Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
to=<[email protected]>, relay=10.49.0.7[10.49.0.7],
delay=1, status=sent (250 2.6.0
<[email protected]> Queued mail for delivery)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD: removed

All this work is done with command line and grep...

Is it possible to parse this big logfile only ONCE and extract all this
info into XML?

Like this:

<message id="(e-mail address removed)">
<timestamp>Sep 18 04:15:26</timestamp>
<from>[email protected]</from>
<to>[email protected]</to>
<to>[email protected]</to>
<to>[email protected]</to>
<to>[email protected]</to>
<queueID>EF90720AD</queueID>
<queueID>DA1431965E</queueID>
<queueID>755387301</queueID>
<spamd>
<score>15</score>
<filtered>yes</filtered>
<sendto>[email protected]</sendto>
</spamd>
</message>

The goal of this is to provide a web interface were we can see if the
messages were filtered as spam (or deleted by our virus scanner).

Is it possible? Or do I have to scan / parse the file more than once?

Andi
 
P

Paul McGuire

Andi Clemens said:
Hi,

we had some problems in the last weeks with our mailserver.
Some messages were not delivered and we wanted to know why.
But looking through the logfile is a time consuming process.
So I wanted to write a parser to analyse the logs and parse them as XML.
<snip>

Andi -

Well, pyparsing does have *some* XML connection, but I don't think it will
be as direct as you might like. I have attached below a pyparsing program
that will probably parse 90% of your log messages, and give you some pretty
easy-to-access data fields which you can then use to create your own Python
data structures, such as dict keyed by queue id, dict keyed by message-id,
etc., and then navigate through them to generate your XML.

-- Paul

logdata = """\
Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
message-id=<[email protected]>
Sep 18 04:15:22 mailrelay spamd[1364]: spamd: processing message
<[email protected]> for nobody:65534
Sep 18 04:15:25 mailrelay spamd[1364]: spamd: result: Y 15 -
BAYES_99,DATE_IN_PAST_03_06,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,DNS_FROM_RFC_WHOIS,FORGED_MUA_OUTLOOK,SPF_SOFTFAIL
scantime=3.1,size=8086,user=nobody,uid=65534,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=55277,mid=<[email protected]>,bayes=1,autolearn=no
Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
to=<[email protected]>, relay=10.49.0.7[10.49.0.7],
delay=1, status=sent (250 2.6.0
<[email protected]> Queued mail for delivery)

Sep 18 02:15:11 mailrelay postfix/smtpd[10841]: 755387301:
client=unknown[194.25.242.123]
Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
message-id=<[email protected]>
Sep 18 04:15:22 mailrelay postfix/qmgr[11082]: 755387301:
from=<[email protected]>, size=8152, nrcpt=7 (queue active)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
to=<[email protected]>, relay=procmail, delay=14, status=sent (filter)
Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: 755387301: removed

Sep 18 04:15:25 mailrelay postfix/pickup[13175]: DA1431965E: uid=65534
from=<nobody>
Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
message-id=<[email protected]>
Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: DA1431965E:
from=<[email protected]>, size=11074, nrcpt=1 (queue active)
Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
to=<[email protected]>, relay=localhost[127.0.0.1],
delay=1, status=sent (250 Ok: queued as EF90720AD)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: DA1431965E: removed

Sep 18 04:15:25 mailrelay postfix/smtpd[11704]: EF90720AD:
client=localhost[127.0.0.1]
Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
message-id=<[email protected]>
Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
to=<[email protected]>, relay=localhost[127.0.0.1],
delay=1, status=sent (250 Ok: queued as EF90720AD)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD:
from=<[email protected]>, size=11263, nrcpt=1 (queue active)
Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
to=<[email protected]>, relay=10.49.0.7[10.49.0.7],
delay=1, status=sent (250 2.6.0
<[email protected]> Queued mail for delivery)
Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD: removed
""".split('\n')

from pyparsing import *

month = oneOf("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec")
dayOfMonth = Word(nums,max=2)
timeOfDay = Combine(Word(nums,exact=2)+":"+
Word(nums,exact=2)+":"+Word(nums,exact=2))
timeStamp = month + dayOfMonth + timeOfDay
# may need to expand this if log contains other entries in this field
source = Literal("mailrelay")
emailAddr = QuotedString("<",endQuoteChar=">")
ipAddr = Combine(Word(nums)+"."+Word(nums)+"."+\
Word(nums)+"."+Word(nums))
ipRef = ( "localhost" | ipAddr ) + "[" + ipAddr + "]"

command = Combine(Word(alphas) + Optional("/" + Word(alphas)))
pid = "[" + Word(nums) + "]"
queueId = Word(hexnums)
integer = Word(nums)
msgValue = ( integer | emailAddr | ipRef | Word(alphas) ) + \
Optional( QuotedString("(",endQuoteChar=")") )
nvList = Dict(delimitedList( Group( Word(alphas+"-") +
Suppress("=") + msgValue ) ))
msgBody = "removed" | nvList
spamdMsg = "spamd:" + restOfLine
regularMsg = queueId.setResultsName("queueId") + ":" + \
msgBody.setResultsName("body")
logMessage = timeStamp + source + command.setResultsName("command") +\
pid.setResultsName("pid") + ":" + (spamdMsg | regularMsg)

# parse each line in log
for log in logdata:
if log:
results = logMessage.parseString(log)
print results.dump()
for fieldName in "message-id queueId from to".split():
print fieldName,":",
try:
print results[fieldName]
except KeyError,ke:
print


Prints out (excerpt):
- body: ['message-id', '(e-mail address removed)']
- command: postfix/cleanup
- message-id: (e-mail address removed)
- pid: ['[', '13057', ']']
- queueId: EF90720AD
['Sep', '18', '04:15:26', 'mailrelay', 'postfix/cleanup', '[', '13057', ']',
':', 'EF90720AD', ':', ['message-id',
'(e-mail address removed)']]
message-id : (e-mail address removed)
queueId : EF90720AD
from :
to :
- body: ['to', '(e-mail address removed)']
- command: postfix/smtp
- pid: ['[', '10879', ']']
- queueId: EF90720AD
- relay: 10
- to: (e-mail address removed)
['Sep', '18', '04:15:26', 'mailrelay', 'postfix/smtp', '[', '10879', ']',
':', 'EF90720AD', ':', ['to', '(e-mail address removed)'],
['relay', '10']]
message-id :
queueId : EF90720AD
from :
to : (e-mail address removed)
- body: ['client', 'unknown']
- client: unknown
- command: postfix/smtpd
- pid: ['[', '10841', ']']
- queueId: 755387301
['Sep', '18', '02:15:11', 'mailrelay', 'postfix/smtpd', '[', '10841', ']',
':', '755387301', ':', ['client', 'unknown']]
message-id :
queueId : 755387301
from :
to :
- body: ['message-id', '(e-mail address removed)']
- command: postfix/cleanup
- message-id: (e-mail address removed)
- pid: ['[', '12103', ']']
- queueId: 755387301
['Sep', '18', '04:15:22', 'mailrelay', 'postfix/cleanup', '[', '12103', ']',
':', '755387301', ':', ['message-id',
'(e-mail address removed)']]
message-id : (e-mail address removed)
queueId : 755387301
from :
to :
- body: ['from', '(e-mail address removed)']
- command: postfix/qmgr
- from: (e-mail address removed)
- nrcpt: ['7', 'queue active']
- pid: ['[', '11082', ']']
- queueId: 755387301
- size: 8152
['Sep', '18', '04:15:22', 'mailrelay', 'postfix/qmgr', '[', '11082', ']',
':', '755387301', ':', ['from', '(e-mail address removed)'], ['size', '8152'],
['nrcpt', '7', 'queue active']]
message-id :
queueId : 755387301
from : (e-mail address removed)
to :
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Staff online

Members online

Forum statistics

Threads
473,767
Messages
2,569,571
Members
45,045
Latest member
DRCM

Latest Threads

Top