Logfile analysing with pyparsing

Discussion in 'Python' started by Andi Clemens, Sep 26, 2006.

  1. Andi Clemens

    Andi Clemens Guest

    Hi,

    we had some problems in the last weeks with our mailserver.
    Some messages were not delivered and we wanted to know why.
    But looking through the logfile is a time consuming process.
    So I wanted to write a parser to analyse the logs and parse them as XML.

    But I have never written a parser before and know I'm sitting in front
    of the logfile trying to write the grammar for pyparsing.

    First of all I need to know if it is possible to parse that kind of info
    into XML.
    Here is an excerpt of the logfile lines I'm interested in:

    Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
    message-id=<>
    Sep 18 04:15:22 mailrelay spamd[1364]: spamd: processing message
    <> for nobody:65534
    Sep 18 04:15:25 mailrelay spamd[1364]: spamd: result: Y 15 -
    BAYES_99,DATE_IN_PAST_03_06,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,DNS_FROM_RFC_WHOIS,FORGED_MUA_OUTLOOK,SPF_SOFTFAIL
    scantime=3.1,size=8086,user=nobody,uid=65534,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=55277,mid=<>,bayes=1,autolearn=no

    Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
    message-id=<>
    Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
    message-id=<>
    Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
    to=<>, relay=10.49.0.7[10.49.0.7],
    delay=1, status=sent (250 2.6.0
    <> Queued mail for delivery)

    They are filtered by "message-id", so all these lines above have
    something to do with the message
    "".

    The original logfile is about 25 MB big, so I can't post all of the
    lines of course ;-)

    Looking at these lines I realized that there are "Queue IDs":
    755387301
    DA1431965E
    EF90720AD

    Filtering the log for these IDs results in the following lines:

    Sep 18 02:15:11 mailrelay postfix/smtpd[10841]: 755387301:
    client=unknown[194.25.242.123]
    Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
    message-id=<>
    Sep 18 04:15:22 mailrelay postfix/qmgr[11082]: 755387301:
    from=<>, size=8152, nrcpt=7 (queue active)
    Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    to=<>, relay=procmail, delay=14, status=sent (filter)
    Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    to=<>, relay=procmail, delay=14, status=sent (filter)
    Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    to=<>, relay=procmail, delay=14, status=sent (filter)
    Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    to=<>, relay=procmail, delay=14, status=sent (filter)
    Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: 755387301: removed

    Sep 18 04:15:25 mailrelay postfix/pickup[13175]: DA1431965E: uid=65534
    from=<nobody>
    Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
    message-id=<>
    Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: DA1431965E:
    from=<>, size=11074, nrcpt=1 (queue active)
    Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
    to=<>, relay=localhost[127.0.0.1],
    delay=1, status=sent (250 Ok: queued as EF90720AD)
    Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: DA1431965E: removed

    Sep 18 04:15:25 mailrelay postfix/smtpd[11704]: EF90720AD:
    client=localhost[127.0.0.1]
    Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
    message-id=<>
    Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
    to=<>, relay=localhost[127.0.0.1],
    delay=1, status=sent (250 Ok: queued as EF90720AD)
    Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD:
    from=<>, size=11263, nrcpt=1 (queue active)
    Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
    to=<>, relay=10.49.0.7[10.49.0.7],
    delay=1, status=sent (250 2.6.0
    <> Queued mail for delivery)
    Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD: removed

    All this work is done with command line and grep...

    Is it possible to parse this big logfile only ONCE and extract all this
    info into XML?

    Like this:

    <message id="">
    <timestamp>Sep 18 04:15:26</timestamp>
    <from></from>
    <to></to>
    <to></to>
    <to></to>
    <to></to>
    <queueID>EF90720AD</queueID>
    <queueID>DA1431965E</queueID>
    <queueID>755387301</queueID>
    <spamd>
    <score>15</score>
    <filtered>yes</filtered>
    <sendto></sendto>
    </spamd>
    </message>

    The goal of this is to provide a web interface were we can see if the
    messages were filtered as spam (or deleted by our virus scanner).

    Is it possible? Or do I have to scan / parse the file more than once?

    Andi

    --
    Mozilla Thunderbird 1.5.0.7
    Arch Linux
     
    Andi Clemens, Sep 26, 2006
    #1
    1. Advertising

  2. You can parse it just once, you just have to setup your data structure
    (the structure of your XML schema) and fill it up as you parse.

    For example, you can represent you data structure as a dictionaries in
    Python:

    message={
    MID : {
    ' timestamp' : TIMESTAMP,
    'from':FROM,
    'tolist':[TO1, TO2, ... ],
    'qids': [QID1, QID2, ...],
    'spam_score':S_SCORE,
    'spam_filtered':Y/N,
    'spam_sentdo':S_SENDTO,
    }
    }


    Note: I inferred this from your XML.

    Then as you parse each record(line?) identify what token is what and
    fill in the corresponding data structure. So if you pass by a line that
    has the message_id, queue_id and a send_to fields, check if the
    corresponding message_id dictionary has been created (if not do so) and
    fill in what you just found out. Later on you might see another record
    that will help you fill in other information regarding this particular
    message_id (such as a 'timestamp' or a 'from' field).

    Then translating the data into XML should be fairly easy.

    Note: I am not familiar with the syntax of the mail log so I presented
    a general idea only. My assumptions about the syntax might have been
    wrong.

    Hope this helps,

    Nick Vatamaniuc




    Andi Clemens wrote:
    > Hi,
    >
    > we had some problems in the last weeks with our mailserver.
    > Some messages were not delivered and we wanted to know why.
    > But looking through the logfile is a time consuming process.
    > So I wanted to write a parser to analyse the logs and parse them as XML.
    >
    > But I have never written a parser before and know I'm sitting in front
    > of the logfile trying to write the grammar for pyparsing.
    >
    > First of all I need to know if it is possible to parse that kind of info
    > into XML.
    > Here is an excerpt of the logfile lines I'm interested in:
    >
    > Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
    > message-id=<>
    > Sep 18 04:15:22 mailrelay spamd[1364]: spamd: processing message
    > <> for nobody:65534
    > Sep 18 04:15:25 mailrelay spamd[1364]: spamd: result: Y 15 -
    > BAYES_99,DATE_IN_PAST_03_06,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,DNS_FROM_RFC_WHOIS,FORGED_MUA_OUTLOOK,SPF_SOFTFAIL
    > scantime=3.1,size=8086,user=nobody,uid=65534,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=55277,mid=<>,bayes=1,autolearn=no
    >
    > Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
    > message-id=<>
    > Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
    > message-id=<>
    > Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
    > to=<>, relay=10.49.0.7[10.49.0.7],
    > delay=1, status=sent (250 2.6.0
    > <> Queued mail for delivery)
    >
    > They are filtered by "message-id", so all these lines above have
    > something to do with the message
    > "".
    >
    > The original logfile is about 25 MB big, so I can't post all of the
    > lines of course ;-)
    >
    > Looking at these lines I realized that there are "Queue IDs":
    > 755387301
    > DA1431965E
    > EF90720AD
    >
    > Filtering the log for these IDs results in the following lines:
    >
    > Sep 18 02:15:11 mailrelay postfix/smtpd[10841]: 755387301:
    > client=unknown[194.25.242.123]
    > Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
    > message-id=<>
    > Sep 18 04:15:22 mailrelay postfix/qmgr[11082]: 755387301:
    > from=<>, size=8152, nrcpt=7 (queue active)
    > Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    > to=<>, relay=procmail, delay=14, status=sent (filter)
    > Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    > to=<>, relay=procmail, delay=14, status=sent (filter)
    > Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    > to=<>, relay=procmail, delay=14, status=sent (filter)
    > Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    > to=<>, relay=procmail, delay=14, status=sent (filter)
    > Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: 755387301: removed
    >
    > Sep 18 04:15:25 mailrelay postfix/pickup[13175]: DA1431965E: uid=65534
    > from=<nobody>
    > Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
    > message-id=<>
    > Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: DA1431965E:
    > from=<>, size=11074, nrcpt=1 (queue active)
    > Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
    > to=<>, relay=localhost[127.0.0.1],
    > delay=1, status=sent (250 Ok: queued as EF90720AD)
    > Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: DA1431965E: removed
    >
    > Sep 18 04:15:25 mailrelay postfix/smtpd[11704]: EF90720AD:
    > client=localhost[127.0.0.1]
    > Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
    > message-id=<>
    > Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
    > to=<>, relay=localhost[127.0.0.1],
    > delay=1, status=sent (250 Ok: queued as EF90720AD)
    > Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD:
    > from=<>, size=11263, nrcpt=1 (queue active)
    > Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
    > to=<>, relay=10.49.0.7[10.49.0.7],
    > delay=1, status=sent (250 2.6.0
    > <> Queued mail for delivery)
    > Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD: removed
    >
    > All this work is done with command line and grep...
    >
    > Is it possible to parse this big logfile only ONCE and extract all this
    > info into XML?
    >
    > Like this:
    >
    > <message id="">
    > <timestamp>Sep 18 04:15:26</timestamp>
    > <from></from>
    > <to></to>
    > <to></to>
    > <to></to>
    > <to></to>
    > <queueID>EF90720AD</queueID>
    > <queueID>DA1431965E</queueID>
    > <queueID>755387301</queueID>
    > <spamd>
    > <score>15</score>
    > <filtered>yes</filtered>
    > <sendto></sendto>
    > </spamd>
    > </message>
    >
    > The goal of this is to provide a web interface were we can see if the
    > messages were filtered as spam (or deleted by our virus scanner).
    >
    > Is it possible? Or do I have to scan / parse the file more than once?
    >
    > Andi
    >
    > --
    > Mozilla Thunderbird 1.5.0.7
    > Arch Linux
     
    Nick Vatamaniuc, Sep 26, 2006
    #2
    1. Advertising

  3. Andi Clemens

    Paul McGuire Guest

    "Andi Clemens" <> wrote in message
    news:efadbv$gq7$...
    > Hi,
    >
    > we had some problems in the last weeks with our mailserver.
    > Some messages were not delivered and we wanted to know why.
    > But looking through the logfile is a time consuming process.
    > So I wanted to write a parser to analyse the logs and parse them as XML.
    >

    <snip>

    Andi -

    Well, pyparsing does have *some* XML connection, but I don't think it will
    be as direct as you might like. I have attached below a pyparsing program
    that will probably parse 90% of your log messages, and give you some pretty
    easy-to-access data fields which you can then use to create your own Python
    data structures, such as dict keyed by queue id, dict keyed by message-id,
    etc., and then navigate through them to generate your XML.

    -- Paul

    logdata = """\
    Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
    message-id=<>
    Sep 18 04:15:22 mailrelay spamd[1364]: spamd: processing message
    <> for nobody:65534
    Sep 18 04:15:25 mailrelay spamd[1364]: spamd: result: Y 15 -
    BAYES_99,DATE_IN_PAST_03_06,DNS_FROM_RFC_ABUSE,DNS_FROM_RFC_DSN,DNS_FROM_RFC_POST,DNS_FROM_RFC_WHOIS,FORGED_MUA_OUTLOOK,SPF_SOFTFAIL
    scantime=3.1,size=8086,user=nobody,uid=65534,required_score=5.0,rhost=localhost,raddr=127.0.0.1,rport=55277,mid=<>,bayes=1,autolearn=no
    Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
    message-id=<>
    Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
    message-id=<>
    Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
    to=<>, relay=10.49.0.7[10.49.0.7],
    delay=1, status=sent (250 2.6.0
    <> Queued mail for delivery)

    Sep 18 02:15:11 mailrelay postfix/smtpd[10841]: 755387301:
    client=unknown[194.25.242.123]
    Sep 18 04:15:22 mailrelay postfix/cleanup[12103]: 755387301:
    message-id=<>
    Sep 18 04:15:22 mailrelay postfix/qmgr[11082]: 755387301:
    from=<>, size=8152, nrcpt=7 (queue active)
    Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    to=<>, relay=procmail, delay=14, status=sent (filter)
    Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    to=<>, relay=procmail, delay=14, status=sent (filter)
    Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    to=<>, relay=procmail, delay=14, status=sent (filter)
    Sep 18 04:15:25 mailrelay postfix/pipe[11659]: 755387301:
    to=<>, relay=procmail, delay=14, status=sent (filter)
    Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: 755387301: removed

    Sep 18 04:15:25 mailrelay postfix/pickup[13175]: DA1431965E: uid=65534
    from=<nobody>
    Sep 18 04:15:25 mailrelay postfix/cleanup[12074]: DA1431965E:
    message-id=<>
    Sep 18 04:15:25 mailrelay postfix/qmgr[11082]: DA1431965E:
    from=<>, size=11074, nrcpt=1 (queue active)
    Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
    to=<>, relay=localhost[127.0.0.1],
    delay=1, status=sent (250 Ok: queued as EF90720AD)
    Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: DA1431965E: removed

    Sep 18 04:15:25 mailrelay postfix/smtpd[11704]: EF90720AD:
    client=localhost[127.0.0.1]
    Sep 18 04:15:26 mailrelay postfix/cleanup[13057]: EF90720AD:
    message-id=<>
    Sep 18 04:15:26 mailrelay postfix/smtp[11703]: DA1431965E:
    to=<>, relay=localhost[127.0.0.1],
    delay=1, status=sent (250 Ok: queued as EF90720AD)
    Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD:
    from=<>, size=11263, nrcpt=1 (queue active)
    Sep 18 04:15:26 mailrelay postfix/smtp[10879]: EF90720AD:
    to=<>, relay=10.49.0.7[10.49.0.7],
    delay=1, status=sent (250 2.6.0
    <> Queued mail for delivery)
    Sep 18 04:15:26 mailrelay postfix/qmgr[11082]: EF90720AD: removed
    """.split('\n')

    from pyparsing import *

    month = oneOf("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec")
    dayOfMonth = Word(nums,max=2)
    timeOfDay = Combine(Word(nums,exact=2)+":"+
    Word(nums,exact=2)+":"+Word(nums,exact=2))
    timeStamp = month + dayOfMonth + timeOfDay
    # may need to expand this if log contains other entries in this field
    source = Literal("mailrelay")
    emailAddr = QuotedString("<",endQuoteChar=">")
    ipAddr = Combine(Word(nums)+"."+Word(nums)+"."+\
    Word(nums)+"."+Word(nums))
    ipRef = ( "localhost" | ipAddr ) + "[" + ipAddr + "]"

    command = Combine(Word(alphas) + Optional("/" + Word(alphas)))
    pid = "[" + Word(nums) + "]"
    queueId = Word(hexnums)
    integer = Word(nums)
    msgValue = ( integer | emailAddr | ipRef | Word(alphas) ) + \
    Optional( QuotedString("(",endQuoteChar=")") )
    nvList = Dict(delimitedList( Group( Word(alphas+"-") +
    Suppress("=") + msgValue ) ))
    msgBody = "removed" | nvList
    spamdMsg = "spamd:" + restOfLine
    regularMsg = queueId.setResultsName("queueId") + ":" + \
    msgBody.setResultsName("body")
    logMessage = timeStamp + source + command.setResultsName("command") +\
    pid.setResultsName("pid") + ":" + (spamdMsg | regularMsg)

    # parse each line in log
    for log in logdata:
    if log:
    results = logMessage.parseString(log)
    print results.dump()
    for fieldName in "message-id queueId from to".split():
    print fieldName,":",
    try:
    print results[fieldName]
    except KeyError,ke:
    print


    Prints out (excerpt):
    - body: ['message-id', '']
    - command: postfix/cleanup
    - message-id:
    - pid: ['[', '13057', ']']
    - queueId: EF90720AD
    ['Sep', '18', '04:15:26', 'mailrelay', 'postfix/cleanup', '[', '13057', ']',
    ':', 'EF90720AD', ':', ['message-id',
    '']]
    message-id :
    queueId : EF90720AD
    from :
    to :
    - body: ['to', '']
    - command: postfix/smtp
    - pid: ['[', '10879', ']']
    - queueId: EF90720AD
    - relay: 10
    - to:
    ['Sep', '18', '04:15:26', 'mailrelay', 'postfix/smtp', '[', '10879', ']',
    ':', 'EF90720AD', ':', ['to', ''],
    ['relay', '10']]
    message-id :
    queueId : EF90720AD
    from :
    to :
    - body: ['client', 'unknown']
    - client: unknown
    - command: postfix/smtpd
    - pid: ['[', '10841', ']']
    - queueId: 755387301
    ['Sep', '18', '02:15:11', 'mailrelay', 'postfix/smtpd', '[', '10841', ']',
    ':', '755387301', ':', ['client', 'unknown']]
    message-id :
    queueId : 755387301
    from :
    to :
    - body: ['message-id', '']
    - command: postfix/cleanup
    - message-id:
    - pid: ['[', '12103', ']']
    - queueId: 755387301
    ['Sep', '18', '04:15:22', 'mailrelay', 'postfix/cleanup', '[', '12103', ']',
    ':', '755387301', ':', ['message-id',
    '']]
    message-id :
    queueId : 755387301
    from :
    to :
    - body: ['from', '']
    - command: postfix/qmgr
    - from:
    - nrcpt: ['7', 'queue active']
    - pid: ['[', '11082', ']']
    - queueId: 755387301
    - size: 8152
    ['Sep', '18', '04:15:22', 'mailrelay', 'postfix/qmgr', '[', '11082', ']',
    ':', '755387301', ':', ['from', ''], ['size', '8152'],
    ['nrcpt', '7', 'queue active']]
    message-id :
    queueId : 755387301
    from :
    to :
     
    Paul McGuire, Sep 26, 2006
    #3
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Bob Rivers

    Analysing Xrunhprof data

    Bob Rivers, Aug 2, 2004, in forum: Java
    Replies:
    1
    Views:
    2,951
    Tris Orendorff
    Aug 2, 2004
  2. Xavier Seneque
    Replies:
    0
    Views:
    631
    Xavier Seneque
    Jul 9, 2005
  3. jmdeschamps
    Replies:
    4
    Views:
    445
    jmdeschamps
    Jan 19, 2004
  4. Replies:
    2
    Views:
    385
    Howard
    Jun 22, 2006
  5. Arjen van Schie

    code analysing tools for Ruby

    Arjen van Schie, Mar 28, 2007, in forum: Ruby
    Replies:
    1
    Views:
    191
    Zev Blut
    Mar 29, 2007
Loading...

Share This Page