Searching through two logfiles in parallel?

Discussion in 'Python' started by Victor Hooi, Jan 7, 2013.

  1. Victor Hooi

    Victor Hooi Guest

    Hi,

    I'm trying to compare two logfiles in Python.

    One logfile will have lines recording the message being sent:

    05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9

    the other logfile has lines recording the message being received:

    05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9

    The goal is to compare the time stamp between the two - we can safely assume the timestamp on the message being received is later than the timestamp on transmission.

    If it were a direct line-by-line comparison, I could probably use itertools.izip(), right?
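
    (For the hypothetical direct line-by-line case, that pairing really is a one-liner - itertools.izip on Python 2, or the built-in lazy zip on Python 3. A minimal sketch, with sample lines taken from this post:)

```python
# Hypothetical direct line-by-line pairing: itertools.izip on Python 2,
# the built-in (lazy) zip on Python 3.  Sample lines are from the post above.
sent_lines = ["05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9"]
recv_lines = ["05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9"]

# One (sent, received) tuple per line position.
pairs = list(zip(sent_lines, recv_lines))
```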

    However, it's not a direct line-by-line comparison of the two files - the lines I'm looking for are interspersed among other loglines, and the time difference between sending/receiving is quite variable.

    So the idea is to iterate through the sending logfile - then iterate through the receiving logfile from that timestamp forwards, looking for the matching pair. Obviously I want to minimise the amount of back-forth through the file.

    Also, there is a chance that certain messages could get lost - so I assume there's a threshold after which I want to give up searching for the matching received message, and then just try to resync to the next sent message.

    Is there a Pythonic way, or some kind of idiom that I can use to approach this problem?

    Cheers,
    Victor
    Victor Hooi, Jan 7, 2013
    #1

  2. On 7 January 2013 22:10, Victor Hooi <> wrote:
    > Hi,
    >
    > I'm trying to compare two logfiles in Python.
    >
    > [SNIP]
    >
    > Is there a Pythonic way, or some kind of idiom that I can use to approach this problem?


    Assuming that you can impose a maximum time between the send and
    receive timestamps, something like the following might work
    (untested):

    def find_matching(logfile1, logfile2, maxdelta):
        buf = {}
        logfile2 = iter(logfile2)
        for msg1 in logfile1:
            if msg1.key in buf:
                yield msg1, buf.pop(msg1.key)
                continue
            maxtime = msg1.time + maxdelta
            for msg2 in logfile2:
                if msg2.key == msg1.key:
                    yield msg1, msg2
                    break
                buf[msg2.key] = msg2
                if msg2.time > maxtime:
                    break
            else:
                yield msg1, 'No match'


    Oscar
    Oscar Benjamin, Jan 7, 2013
    #2

  3. Victor Hooi

    Victor Hooi Guest

    Hi Oscar,

    Thanks for the quick reply =).

    I'm trying to understand your code properly, and it seems like for each line in logfile1, we loop through all of logfile2?

    The idea was that it would remember its position in logfile2 as well - since we can assume that the loglines are in chronological order, we only need to search forwards in logfile2 each time, not from the beginning each time.

    So for example - logfile1:

    05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9
    05:00:08 Message sent - Value A: 3.3, Value B: 4.3, Value C: 2.3
    05:00:14 Message sent - Value A: 1.0, Value B: 0.4, Value C: 5.4

    logfile2:

    05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9
    05:00:12 Message received - Value A: 3.3, Value B: 4.3, Value C: 2.3
    05:00:15 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4

    The idea is that I'd iterate through logfile 1 - I'd get the 05:00:06 logline - I'd search through logfile2 and find the 05:00:09 logline.

    Then, back in logfile1 I'd find the next logline at 05:00:08. Then in logfile2, instead of searching back from the beginning, I'd start from the next line, which happens to be 05:00:12.

    In reality, I'd need to handle missing messages in logfile2, but that's the general idea.

    Does that make sense? (There's also a chance I've misunderstood your buf code and it does do this - in that case, I apologise - is there any chance you could explain it please?)

    Cheers,
    Victor

    On Tuesday, 8 January 2013 09:58:36 UTC+11, Oscar Benjamin wrote:
    > On 7 January 2013 22:10, Victor Hooi <> wrote:
    > > [SNIP]
    >
    > Assuming that you can impose a maximum time between the send and
    > receive timestamps, something like the following might work
    > (untested):
    >
    > [SNIP]
    >
    > Oscar
    Victor Hooi, Jan 7, 2013
    #3
  5. On 7 January 2013 23:41, Victor Hooi <> wrote:
    > Hi Oscar,
    >
    > Thanks for the quick reply =).
    >
    > I'm trying to understand your code properly, and it seems like for each line in logfile1, we loop through all of logfile2?


    No we don't. It iterates once through both files but keeps a buffer of
    lines that are within maxdelta time of the current message.

    The important line is the one that calls iter(logfile2). Because
    logfile2 is replaced by an iterator, our place in it is remembered
    when we break out of the inner for loop and later resume iterating.
    If you can follow the interactive session below it should make sense:

    >>> a = [1,2,3,4,5]
    >>> for x in a:
    ...     print x,
    ...
    1 2 3 4 5
    >>> for x in a:
    ...     print x,
    ...
    1 2 3 4 5
    >>> it = iter(a)
    >>> next(it)
    1
    >>> for x in it:
    ...     print x,
    ...
    2 3 4 5
    >>> next(it)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    StopIteration
    >>> for x in it:
    ...     print x,
    ...
    >>> it = iter(a)
    >>> for x in it:
    ...     print x,
    ...     if x == 2: break
    ...
    1 2
    >>> for x in it:
    ...     print x,
    ...
    3 4 5


    I'll repeat the code (with a slight fix):


    def find_matching(logfile1, logfile2, maxdelta):
        buf = {}
        logfile2 = iter(logfile2)
        for msg1 in logfile1:
            if msg1.key in buf:
                yield msg1, buf.pop(msg1.key)
                continue
            maxtime = msg1.time + maxdelta
            for msg2 in logfile2:
                if msg2.key == msg1.key:
                    yield msg1, msg2
                    break
                buf[msg2.key] = msg2
                if msg2.time > maxtime:
                    yield msg1, 'No match'
                    break
            else:
                yield msg1, 'No match'
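
    To see the generator in action, here is a hypothetical harness. The Message type, the use of plain integers for times, and the sample data are my assumptions; the generator itself only relies on the .key and .time attributes:

```python
from collections import namedtuple

# Hypothetical message type: the generator only needs .key and .time.
Message = namedtuple('Message', ['time', 'key'])

def find_matching(logfile1, logfile2, maxdelta):
    # The fixed generator, repeated so this sketch is self-contained.
    buf = {}
    logfile2 = iter(logfile2)
    for msg1 in logfile1:
        if msg1.key in buf:
            yield msg1, buf.pop(msg1.key)
            continue
        maxtime = msg1.time + maxdelta
        for msg2 in logfile2:
            if msg2.key == msg1.key:
                yield msg1, msg2
                break
            buf[msg2.key] = msg2
            if msg2.time > maxtime:
                yield msg1, 'No match'
                break
        else:
            yield msg1, 'No match'

# Plain-int times for brevity; message 'B' is sent but never received.
sent = [Message(6, 'A'), Message(8, 'B'), Message(14, 'C')]
recv = [Message(9, 'A'), Message(15, 'C')]
pairs = list(find_matching(sent, recv, maxdelta=5))
```

    Here 'B' pairs with 'No match' as soon as the receive log passes time 8 + maxdelta, and the out-of-order 'C' receive that triggered the timeout is buffered and matched when 'C' is later sent.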


    Oscar


    Oscar Benjamin, Jan 8, 2013
    #5
  6. darnold

    darnold Guest

    I don't think in iterators (yet), so this is a bit wordy.
    Same basic idea, though: for each message (set of parameters), build a
    list of transactions consisting of matching send/receive times.

    mildly tested:


    from datetime import datetime, timedelta

    sendData = '''\
    05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9
    05:00:08 Message sent - Value A: 3.3, Value B: 4.3, Value C: 2.3
    05:00:10 Message sent - Value A: 3.0, Value B: 0.4, Value C: 5.4 #orphan
    05:00:14 Message sent - Value A: 1.0, Value B: 0.4, Value C: 5.4
    07:00:14 Message sent - Value A: 1.0, Value B: 0.4, Value C: 5.4
    '''

    receiveData = '''\
    05:00:09 Message received - Value A: 5.6, Value B: 6.2, Value C: 9.9
    05:00:12 Message received - Value A: 3.3, Value B: 4.3, Value C: 2.3
    05:00:15 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4
    07:00:18 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4
    07:00:30 Message received - Value A: 1.0, Value B: 0.4, Value C: 5.4 #orphan
    07:00:30 Message received - Value A: 17.0, Value B: 0.4, Value C: 5.4 #orphan
    '''

    def parse(line):
        timestamp, rest = line.split(' Message ')
        action, params = rest.split(' - ')
        params = params.split('#')[0]
        return timestamp.strip(), params.strip()

    def isMatch(sendTime, receiveTime, maxDelta):
        if sendTime is None:
            return False

        sendDT = datetime.strptime(sendTime, '%H:%M:%S')
        receiveDT = datetime.strptime(receiveTime, '%H:%M:%S')
        return receiveDT - sendDT <= maxDelta

    results = {}

    for line in sendData.split('\n'):
        if not line.strip():
            continue

        timestamp, params = parse(line)
        if params not in results:
            results[params] = [{'sendTime': timestamp, 'receiveTime': None}]
        else:
            results[params].append({'sendTime': timestamp, 'receiveTime': None})

    for line in receiveData.split('\n'):
        if not line.strip():
            continue

        timestamp, params = parse(line)
        if params not in results:
            results[params] = [{'sendTime': None, 'receiveTime': timestamp}]
        else:
            for tranNum, transaction in enumerate(results[params]):
                if isMatch(transaction['sendTime'], timestamp, timedelta(seconds=5)):
                    results[params][tranNum]['receiveTime'] = timestamp
                    break
            else:
                results[params].append({'sendTime': None, 'receiveTime': timestamp})

    for params in sorted(results):
        print params
        for transaction in results[params]:
            print '\t%s' % transaction


    >>> ================================ RESTART ================================
    >>>

    Value A: 1.0, Value B: 0.4, Value C: 5.4
    {'sendTime': '05:00:14', 'receiveTime': '05:00:15'}
    {'sendTime': '07:00:14', 'receiveTime': '07:00:18'}
    {'sendTime': None, 'receiveTime': '07:00:30'}
    Value A: 17.0, Value B: 0.4, Value C: 5.4
    {'sendTime': None, 'receiveTime': '07:00:30'}
    Value A: 3.0, Value B: 0.4, Value C: 5.4
    {'sendTime': '05:00:10', 'receiveTime': None}
    Value A: 3.3, Value B: 4.3, Value C: 2.3
    {'sendTime': '05:00:08', 'receiveTime': '05:00:12'}
    Value A: 5.6, Value B: 6.2, Value C: 9.9
    {'sendTime': '05:00:06', 'receiveTime': '05:00:09'}
    >>>


    HTH,
    Don
    darnold, Jan 8, 2013
    #6
  7. On 8 January 2013 19:16, darnold <> wrote:
    > I don't think in iterators (yet), so this is a bit wordy.
    > Same basic idea, though: for each message (set of parameters), build a
    > list of transactions consisting of matching send/receive times.


    The advantage of an iterator based solution is that we can avoid
    loading all of both log files into memory.
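
    For example, a lazy parser keeps that property end to end; this is a hypothetical sketch (the Message type, the regex, and the field names are my assumptions, not from the thread):

```python
import re
from collections import namedtuple
from datetime import datetime

# Hypothetical adapter: turn raw loglines into objects with the .time
# and .key attributes that an iterator-based matcher expects.
Message = namedtuple('Message', ['time', 'key'])
LINE = re.compile(r'^(\d\d:\d\d:\d\d) Message (?:sent|received) - (.+)$')

def parse_messages(lines):
    """Lazily yield one Message per matching line, skipping other loglines."""
    for line in lines:
        m = LINE.match(line)
        if m:
            yield Message(time=datetime.strptime(m.group(1), '%H:%M:%S'),
                          key=m.group(2).strip())

msgs = list(parse_messages([
    "05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9",
    "05:00:07 Unrelated logline",
]))
```

    Because parse_messages is a generator over any iterable of lines, passing it open file objects streams both logs; only the buffered out-of-order messages ever sit in memory.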

    [SNIP]
    >
    > results = {}
    >
    > for line in sendData.split('\n'):
    >     if not line.strip():
    >         continue
    >
    >     timestamp, params = parse(line)
    >     if params not in results:
    >         results[params] = [{'sendTime': timestamp, 'receiveTime': None}]
    >     else:
    >         results[params].append({'sendTime': timestamp, 'receiveTime': None})

    [SNIP]

    This kind of logic is made a little easier (and more efficient) if you
    use a collections.defaultdict instead of a dict since it saves needing
    to check if the key is in the dict yet. Example:

    >>> import collections
    >>> results = collections.defaultdict(list)
    >>> results
    defaultdict(<type 'list'>, {})
    >>> results['asd'].append(1)
    >>> results
    defaultdict(<type 'list'>, {'asd': [1]})
    >>> results['asd'].append(2)
    >>> results
    defaultdict(<type 'list'>, {'asd': [1, 2]})
    >>> results['qwe'].append(3)
    >>> results
    defaultdict(<type 'list'>, {'qwe': [3], 'asd': [1, 2]})
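
    Applied to the send-side loop quoted above, the membership test disappears entirely (a sketch reusing the same dict-of-lists shape; inlining the parsing is my simplification):

```python
from collections import defaultdict

# With defaultdict(list), a missing key starts as an empty list,
# so the "if params not in results" branch is no longer needed.
results = defaultdict(list)

sendLines = [
    "05:00:06 Message sent - Value A: 5.6, Value B: 6.2, Value C: 9.9",
    "05:00:08 Message sent - Value A: 3.3, Value B: 4.3, Value C: 2.3",
]

for line in sendLines:
    timestamp, rest = line.split(' Message ')
    action, params = rest.split(' - ')
    results[params.strip()].append({'sendTime': timestamp.strip(),
                                    'receiveTime': None})
```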


    Oscar
    Oscar Benjamin, Jan 8, 2013
    #7
