Parse ASCII log; sort and keep most recent entries

Nova's Taylor

Hi folks,

I am a newbie to Python and am hoping that someone can get me started
on a log parser that I am trying to write.

The log is an ASCII file that contains a process identifier (PID),
username, date, and time field like this:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 02AUG03 14:11:20
23 jonesjimbo 07AUG03 15:25:00
2348 williamstim 17AUG03 9:13:55
748 jonesjimbo 13OCT03 14:10:05
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

I want to read in and sort the file so the new list contains only
the most recent entry per PID (PIDs get reused often). In my example,
the new list would be:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 17AUG03 9:13:55
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

So I need to sort by PID and date + time, then keep the most recent.

Any help would be appreciated!

Taylor

(e-mail address removed)
 
Peter Hansen

Nova's Taylor said:
I am a newbie to Python and am hoping that someone can get me started
on a log parser that I am trying to write.

I want to read in and sort the file so the new list contains only
the most recent entry per PID (PIDs get reused often). In my example,
the new list would be:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 17AUG03 9:13:55
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

So I need to sort by PID and date + time, then keep the most recent.

I think you are specifying the implementation of the solution
a bit, rather than just the requirements. Do you really need
the resulting list to be sorted by PID and date/time, or was
that just part of how you thought you'd write it?

If you don't care about the sorting part, but just want the
output to be a list of unique PIDs, you could just do the
following instead, taking advantage of how Python dictionaries
have unique keys. Note that this assumes that the contents
of the file were originally in order by date (i.e. more recent
items come later).

1. Create empty dict: "d = {}"
2. Read data line by line: "for line in infile.readlines()"
3. Split so the PID is separate: "pid = line.split()[0]"
4. Store entire line in dictionary using PID as key: "d[pid] = line"

When you're done, the dict will contain only the most recent
line with a given PID, though in "arbitrary" (effectively
random) order. If you don't care about the order of the final
result, just open a file and with one line the reduced data
is written out:

newfile.write(''.join(d.values()))
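Put together, those four steps look like this (a minimal sketch; the sample lines are from the original post, and the file handling is replaced by an in-memory list for illustration):

```python
# Keep only the most recent line per PID, assuming the log is
# already in chronological order (later lines are more recent).
lines = [
    "1234 williamstim 01AUG03 7:44:31\n",
    "2348 williamstim 02AUG03 14:11:20\n",
    "2348 williamstim 17AUG03 9:13:55\n",
]

d = {}
for line in lines:
    pid = line.split()[0]   # PID is the first whitespace-separated field
    d[pid] = line           # later lines overwrite earlier ones

# d now holds exactly one line per PID: the last one seen.
```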

-Peter
 
Larry Bates

Here's a quick solution.

Larry Bates
Syscon, Inc.


def cmpfunc(x, y):
    xdate = x[0]
    xtime = x[1]
    ydate = y[0]
    ytime = y[1]
    if xdate == ydate:
        #
        # If the two dates are equal, I must check the times
        #
        if xtime > ytime: return 1
        elif xtime == ytime: return 0
        else: return -1
    elif xdate > ydate: return 1
    return -1

fp = file(yourlogfilepath, 'r')
lines = fp.readlines()
fp.close()
list = []
months = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}

logdict = {}

for line in lines:
    if not line.strip(): break
    print line
    #
    # split() handles any run of whitespace and strips the newline
    #
    pid, name, date, time = line.split()
    #
    # Must zero pad time for proper comparison
    #
    stime = time.zfill(8)
    #
    # Must reformat the date as YYMMDD
    #
    sdate = date[-2:] + months[date[2:5]] + date[:2]
    list.append((sdate, stime, pid, name, date, time))

list.sort(cmpfunc)
list.reverse()

for sdate, stime, pid, name, date, time in list:
    if logdict.has_key(pid): continue
    logdict[pid] = (pid, name, date, time)

for key in logdict.keys():
    pid, name, date, time = logdict[key]
    print pid, name, date, time
 
David Fisher

Hi folks,

I am a newbie to Python and am hoping that someone can get me started
on a log parser that I am trying to write.

The log is an ASCII file that contains a process identifier (PID),
username, date, and time field like this:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 02AUG03 14:11:20
23 jonesjimbo 07AUG03 15:25:00
2348 williamstim 17AUG03 9:13:55
748 jonesjimbo 13OCT03 14:10:05
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

I want to read in and sort the file so the new list contains only
the most recent entry per PID (PIDs get reused often). In my example,
the new list would be:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 17AUG03 9:13:55
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

So I need to sort by PID and date + time, then keep the most recent.

Any help would be appreciated!

Taylor

(e-mail address removed)
#!/usr/bin/env python
#
# I'm expecting the log file to be in chronological order
# so later entries are later in time;
# using the dict, later entries overwrite earlier ones.
# make a script and use this like
#   logparse.py mylogfile.log > newlogfile.log
#
import fileinput

piddict = {}
for line in fileinput.input():
    pid, username, date, time = line.split()
    piddict[pid] = (username, date, time)

pidlist = piddict.keys()
pidlist.sort()
for pid in pidlist:
    username, date, time = piddict[pid]
    print pid, username, date, time
# tada!
 
Christos TZOTZIOY Georgiou

[snip]
If you don't care about the order of the final
result, just open a file and with one line the reduced data
is written out:

newfile.write(''.join(d.values()))

or

newfile.writelines(d.values()) # 1.5.2 and later

or

newfile.writelines(d.itervalues()) # 2.2 and later
 
Terry Reedy

Nova's Taylor said:
The log is an ASCII file that contains a process identifier (PID),
username, date, and time field like this:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 02AUG03 14:11:20
23 jonesjimbo 07AUG03 15:25:00
2348 williamstim 17AUG03 9:13:55
748 jonesjimbo 13OCT03 14:10:05
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

If you can get the log writer to write fixed length records with everything
lined up nicely, it would be easier to read the log by eye (with fixed
pitch font, which my newsreader doesn't use). It is also then trivial to
slice a field out of the middle of the line.

If one wants/needs to sort records by date, life is also easier if you can
get the record writer to print dates in sortable format: YYYYMMDD. (I
learned this 25 years ago.)
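For example, once dates are rewritten in YYYYMMDD form, plain string comparison puts them in chronological order (a sketch, not from the original post; it assumes all years fall in 2000-2099):

```python
# Convert a DDMONYY date such as '01AUG03' into sortable 'YYYYMMDD' form.
MONTHS = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}

def sortable(date):
    day, mon, yy = date[:2], date[2:5], date[5:]
    return '20' + yy + MONTHS[mon] + day   # assumes years 2000-2099

dates = ['14OCT03', '01AUG03', '07AUG03']
dates.sort(key=sortable)
# dates is now in chronological order: 01AUG03, 07AUG03, 14OCT03
```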
I want to read in and sort the file so the new list contains only
the most recent entry per PID (PIDs get reused often).

If these are *nix process ids, this does not make obvious sense. Since
pids are arbitrary, why delete a recent record because its PID got reused
while keeping an old record because its PID happened not to? I could
better imagine keeping all records since a certain date or the last n
records (the latter is trivial with fixed-length records).
In my example, the new list would be:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 17AUG03 9:13:55
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

So I need to sort by PID and date + time, then keep the most recent.

That is one possibility: you have to form a list of (key, line) pairs, where
key is extracted from the line.
Any help would be appreciated!

Alternative: instead of sort then filter duplicates, filter duplicates and
then sort the reduced list. Assuming records are in date order from
earlier to later, insert them into a dict with PID as key and entire record
as value, and later records will replace earlier records with same key
(PID). Then resort d.values() by date. Variation: if you cannot get dates
stored properly for easy sorting, store line numbers with records so you
can sort by line number instead of fiddling with nasty dates. Something
like (incomplete and untested):

d = {}
for pair in enumerate(file('whatever')):
    d[getpid(pair[1])] = pair   # getpid might be inline expression
uniqs = d.values()
uniqs.sort()
new = [pair[1] for pair in uniqs]
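A completed version of that sketch (Terry's was incomplete and untested; here getpid is written inline and the file is replaced by a sample list so it runs as-is):

```python
# Filter duplicate PIDs keeping the last occurrence, then restore the
# original line order by sorting on the stored line numbers.
lines = [
    "1234 williamstim 01AUG03 7:44:31\n",
    "23 jonesjimbo 07AUG03 15:25:00\n",
    "23 jonesjimbo 14OCT03 23:01:23\n",
]

d = {}
for lineno, line in enumerate(lines):
    d[line.split()[0]] = (lineno, line)   # later records replace earlier ones

uniqs = sorted(d.values())                # tuples sort by line number first
new = [line for lineno, line in uniqs]
```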

Terry J. Reedy
 
Nova's Taylor

Wow - thanks for all of your great suggestions. I did neglect to
mention that the log file is appended to over time, so the values are
already in a time-sequenced sort going in, thus allowing the use of a
dictionary as suggested by David and others. This is what I wound up
using:

sourceFile = open(r'C:\_sandbox\SASAdmin\Python\ServerAdmin\SignOnLog.txt')

# output file for testing only
logFile = open(r'C:\_sandbox\SASAdmin\Python\ServerAdmin\test.txt', 'w')

piddict = {}
for line in sourceFile:
    pid, username, date, time = line.split()
    piddict[pid] = (username, date, time)

pidlist = piddict.keys()
pidlist.sort()
for pid in pidlist:
    username, date, time = piddict[pid]
    # next line seems amateurish, but that is what I am!
    logFile.write(pid + " " + username + " " + date + " " + time + "\n")

More background:

I will next merge this log file with the process identifiers running on a
server, so I can identify "who-started-what-process-when." In Perl I
do it this way:


$pattern = "sas";  ## name of application I am searching for

# Use PSLIST.EXE to list processes on the server
open(PIDLIST, "pslist |") or die "Can not run the PSLIST program: $!\n";

while (<PIDLIST>)
{
    $output .= $_;
    if (/$pattern/i)
    {
        ## collect pids that match pattern into an array, splitting on whitespace
        @taskList = split(/\s+/, $_);

        ## Check each value in the server task list against each row in the log file
        foreach $proc_val (@fl)
        {
            chomp($proc_val);  ## Remove newline characters at the end.
            @log = split(/\s+/, $proc_val);

            if ($log[0] eq $taskList[1])
            {
                # print ">>>>No matches in log files!!<<<<<<<<<<<\n";  # debug
                print "$taskList[0] $log[0] $log[1] $log[2] $taskList[5] $log[3] $taskList[8]\n";
                $foundIt = 1;
            }
        }
    }
}
close(PIDLIST);


So now it's more reading to see how to do this in Python!
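A rough Python equivalent of that Perl loop might look like this (a hedged sketch only: the pslist output format here, name first and PID second, is assumed from the Perl indexes, not verified, and the output is an in-memory sample rather than a live pslist run):

```python
# Scan pslist-style output for a pattern, then match each hit's PID
# against the sign-on log lines.
import re

pslist_output = """\
sas 2348 8 ...
notepad 999 4 ...
"""

log_lines = [
    "2348 williamstim 17AUG03 9:13:55",
    "748 jonesjimbo 14OCT03 23:59:59",
]

pattern = re.compile('sas', re.I)
matches = []
for task_line in pslist_output.splitlines():
    if not pattern.search(task_line):
        continue
    task = task_line.split()        # assumed: [name, pid, ...]
    for log_line in log_lines:
        log = log_line.split()      # [pid, username, date, time]
        if log[0] == task[1]:
            matches.append((task[0], log[0], log[1], log[2], log[3]))
```

To run it against a live server you would replace pslist_output with the captured output of the pslist command.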
Thanks again for all your help!

Taylor
 
Peter Hansen

Nova's Taylor said:
This is what I wound up using:

Could I suggest part of my suggestion again? See below:
piddict = {}
for line in sourceFile:
    pid, username, date, time = line.split()
    piddict[pid] = (username, date, time)

Here you are splitting the whole thing, and storing a Python
tuple rather than the original "line" contents...
pidlist = piddict.keys()
pidlist.sort()
for pid in pidlist:
    username, date, time = piddict[pid]
    # next line seems amateurish, but that is what I am!
    logFile.write(pid + " " + username + " " + date + " " + time + "\n")

Here you are writing out something that is exactly equal
(if I read this all correctly) to the original line, but
having to split the tuple and append lots of strings together
again with spaces, the newline, etc.

Why not just store the original line and use it at the end:

for line in sourceFile:
    pid, _ = line.split(' ', 1)
    piddict[pid] = line

and later, use writelines as Christos suggested, without
even needing a loop:

logFile.writelines(piddict.values())

The difference in the writing part is that you are sorting by
pid, though I'm not clear why or if it's required. If it is,
you could still loop, but more simply:

for pid in pidlist:
    logFile.write(piddict[pid])

No splitting, no concatenating...

-Peter
 
