Parse ASCII log; sort and keep most recent entries

Nova's Taylor

Hi folks,

I am a newbie to Python and am hoping that someone can get me started
on a log parser that I am trying to write.

The log is an ASCII file that contains a process identifier (PID),
username, date, and time field like this:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 02AUG03 14:11:20
23 jonesjimbo 07AUG03 15:25:00
2348 williamstim 17AUG03 9:13:55
748 jonesjimbo 13OCT03 14:10:05
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

I want to read in and sort the file so the new list contains only
the most recent entry per PID (PIDs get reused often). In my example,
the new list would be:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 17AUG03 9:13:55
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

So I need to sort by PID and date + time, then keep the most recent.

Any help would be appreciated!

Taylor

(e-mail address removed)
 
Peter Hansen

Nova's Taylor said:
I am a newbie to Python and am hoping that someone can get me started
on a log parser that I am trying to write.

I want to read in and sort the file so the new list contains only
the most recent entry per PID (PIDs get reused often). In my example,
the new list would be:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 17AUG03 9:13:55
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

So I need to sort by PID and date + time, then keep the most recent.

I think you are specifying the implementation of the solution
a bit, rather than just the requirements. Do you really need
the resulting list to be sorted by PID and date/time, or was
that just part of how you thought you'd write it?

If you don't care about the sorting part, but just want the
output to be a list of unique PIDs, you could just do the
following instead, taking advantage of how Python dictionaries
have unique keys. Note that this assumes that the contents
of the file were originally in order by date (i.e. more recent
items come later).

1. Create empty dict: "d = {}"
2. Read data line by line: "for line in infile.readlines()"
3. Split so the PID is separate: "pid = line.split()[0]"
4. Store entire line in dictionary using PID as key: "d[pid] = line"

When you're done, the dict will contain only the most recent
line with a given PID, though in "arbitrary" (effectively
random) order. If you don't care about the order of the final
result, just open a file and with one line the reduced data
is written out:

newfile.write(''.join(d.values()))
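Put together, those four steps look like this (a minimal sketch; the sample lines are from the original post, and the file handling is replaced by an in-memory list for illustration):

```python
# Keep only the most recent line per PID, assuming the log is
# already in chronological order (later lines are more recent).
lines = [
    "1234 williamstim 01AUG03 7:44:31\n",
    "2348 williamstim 02AUG03 14:11:20\n",
    "2348 williamstim 17AUG03 9:13:55\n",
]

d = {}
for line in lines:
    pid = line.split()[0]   # PID is the first whitespace-separated field
    d[pid] = line           # later lines overwrite earlier ones

# d now holds exactly one line per PID: the last one seen.
```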

-Peter
 
Larry Bates

Here's a quick solution.

Larry Bates
Syscon, Inc.


def cmpfunc(x, y):
    xdate = x[0]
    xtime = x[1]
    ydate = y[0]
    ytime = y[1]
    if xdate == ydate:
        #
        # If the two dates are equal, I must check the times
        #
        if xtime > ytime: return 1
        elif xtime == ytime: return 0
        else: return -1
    elif xdate > ydate: return 1
    return -1

fp = file(yourlogfilepath, 'r')
lines = fp.readlines()
fp.close()
list = []
months = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}

logdict = {}

for line in lines:
    if not line.strip(): break
    print line
    #
    # split() handles any run of whitespace and strips the newline
    #
    pid, name, date, time = line.split()
    #
    # Must zero pad time for proper comparison
    #
    stime = time.zfill(8)
    #
    # Must reformat the date as YYMMDD
    #
    sdate = date[-2:] + months[date[2:5]] + date[:2]
    list.append((sdate, stime, pid, name, date, time))

list.sort(cmpfunc)
list.reverse()

for sdate, stime, pid, name, date, time in list:
    if logdict.has_key(pid): continue
    logdict[pid] = (pid, name, date, time)

for key in logdict.keys():
    pid, name, date, time = logdict[key]
    print pid, name, date, time
 
David Fisher

Hi folks,

I am a newbie to Python and am hoping that someone can get me started
on a log parser that I am trying to write.

The log is an ASCII file that contains a process identifier (PID),
username, date, and time field like this:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 02AUG03 14:11:20
23 jonesjimbo 07AUG03 15:25:00
2348 williamstim 17AUG03 9:13:55
748 jonesjimbo 13OCT03 14:10:05
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

I want to read in and sort the file so the new list contains only
the most recent entry per PID (PIDs get reused often). In my example,
the new list would be:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 17AUG03 9:13:55
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

So I need to sort by PID and date + time, then keep the most recent.

Any help would be appreciated!

Taylor

(e-mail address removed)
#!/usr/bin/env python
#
# I'm expecting the log file to be in chronological order
# so later entries are later in time;
# using the dict, later entries overwrite earlier ones.
# make a script and use this like
#   logparse.py mylogfile.log > newlogfile.log
#
import fileinput

piddict = {}
for line in fileinput.input():
    pid, username, date, time = line.split()
    piddict[pid] = (username, date, time)

pidlist = piddict.keys()
pidlist.sort()
for pid in pidlist:
    username, date, time = piddict[pid]
    print pid, username, date, time
# tada!
 
Christos TZOTZIOY Georgiou

[snip]
If you don't care about the order of the final
result, just open a file and with one line the reduced data
is written out:

newfile.write(''.join(d.values()))

or

newfile.writelines(d.values()) # 1.5.2 and later

or

newfile.writelines(d.itervalues()) # 2.2 and later
 
Terry Reedy

Nova's Taylor said:
The log is an ASCII file that contains a process identifier (PID),
username, date, and time field like this:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 02AUG03 14:11:20
23 jonesjimbo 07AUG03 15:25:00
2348 williamstim 17AUG03 9:13:55
748 jonesjimbo 13OCT03 14:10:05
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

If you can get the log writer to write fixed length records with everything
lined up nicely, it would be easier to read the log by eye (with fixed
pitch font, which my newsreader doesn't use). It is also then trivial to
slice a field out of the middle of the line.

If one wants/needs to sort records by date, life is also easier if you can
get the record writer to print dates in sortable format: YYYYMMDD. (I
learned this 25 years ago.)
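For example, once dates are rewritten in YYYYMMDD form, plain string comparison puts them in chronological order (a sketch, not from the original post; it assumes all years fall in 2000-2099):

```python
# Convert a DDMONYY date such as '01AUG03' into sortable 'YYYYMMDD' form.
MONTHS = {'JAN': '01', 'FEB': '02', 'MAR': '03', 'APR': '04',
          'MAY': '05', 'JUN': '06', 'JUL': '07', 'AUG': '08',
          'SEP': '09', 'OCT': '10', 'NOV': '11', 'DEC': '12'}

def sortable(date):
    day, mon, yy = date[:2], date[2:5], date[5:]
    return '20' + yy + MONTHS[mon] + day   # assumes years 2000-2099

dates = ['14OCT03', '01AUG03', '07AUG03']
dates.sort(key=sortable)
# dates is now in chronological order: 01AUG03, 07AUG03, 14OCT03
```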
I want to read in and sort the file so the new list contains only
the most recent entry per PID (PIDs get reused often).

If these are *nix process ids, this does not make obvious sense. Since
pids are arbitrary, why delete a recent record because its PID got reused
while keeping an old record because its PID happened not to? I could
better imagine keeping all records since a certain date or the last n
records (the latter is trivial with fixed-length records).
In my example, the new list would be:

1234 williamstim 01AUG03 7:44:31
2348 williamstim 17AUG03 9:13:55
23 jonesjimbo 14OCT03 23:01:23
748 jonesjimbo 14OCT03 23:59:59

So I need to sort by PID and date + time, then keep the most recent.

That is one possibility: you have to form a list of (key, line) pairs, where
key is extracted from the line.
Any help would be appreciated!

Alternative: instead of sort then filter duplicates, filter duplicates and
then sort the reduced list. Assuming records are in date order from
earlier to later, insert them into a dict with PID as key and entire record
as value, and later records will replace earlier records with same key
(PID). Then resort d.values() by date. Variation: if you cannot get dates
stored properly for easy sorting, store line numbers with records so you
can sort by line number instead of fiddling with nasty dates. Something
like (incomplete and untested):

d = {}
for pair in enumerate(file('whatever')):
    d[getpid(pair[1])] = pair   # getpid might be inline expression
uniqs = d.values()
uniqs.sort()
new = [pair[1] for pair in uniqs]
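A completed version of that sketch (Terry's was incomplete and untested; here getpid is written inline and the file is replaced by a sample list so it runs as-is):

```python
# Filter duplicate PIDs keeping the last occurrence, then restore the
# original line order by sorting on the stored line numbers.
lines = [
    "1234 williamstim 01AUG03 7:44:31\n",
    "23 jonesjimbo 07AUG03 15:25:00\n",
    "23 jonesjimbo 14OCT03 23:01:23\n",
]

d = {}
for lineno, line in enumerate(lines):
    d[line.split()[0]] = (lineno, line)   # later records replace earlier ones

uniqs = sorted(d.values())                # tuples sort by line number first
new = [line for lineno, line in uniqs]
```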

Terry J. Reedy
 
Nova's Taylor

Wow - thanks for all of your great suggestions. I did neglect to
mention that the log file is appended to over time, so the values are
already in a time-sequenced sort going in, thus allowing the use of a
dictionary as suggested by David and others. This is what I wound up
using:

sourceFile = open(r'C:\_sandbox\SASAdmin\Python\ServerAdmin\SignOnLog.txt')

# output file for testing only
logFile = open(r'C:\_sandbox\SASAdmin\Python\ServerAdmin\test.txt', 'w')

piddict = {}
for line in sourceFile:
    pid, username, date, time = line.split()
    piddict[pid] = (username, date, time)

pidlist = piddict.keys()
pidlist.sort()
for pid in pidlist:
    username, date, time = piddict[pid]
    # next line seems amateurish, but that is what I am!
    logFile.write(pid + " " + username + " " + date + " " + time + "\n")

More background:

I will next merge this log file with the process identifiers running on a
server, so I can identify "who-started-what-process-when." In Perl I
do it this way:


$pattern = "sas";  ## name of application I am searching for

# Use PSLIST.EXE to list processes on the server
open(PIDLIST, "pslist |") or die "Can not run the PSLIST program: $!\n";

while (<PIDLIST>)
{
    $output .= $_;
    if (/$pattern/i)
    {
        ## collect pids that match pattern into an array, splitting on whitespace
        @taskList = split(/\s+/, $_);

        ## Check each value in the server task list against each row in the log file
        foreach $proc_val (@fl)
        {
            chomp($proc_val);  ## Remove newline characters at the end.
            @log = split(/\s+/, $proc_val);

            if ($log[0] eq $taskList[1])
            {
                # print ">>>>No matches in log files!!<<<<<<<<<<<\n";  # debug
                print "$taskList[0] $log[0] $log[1] $log[2] $taskList[5] $log[3] $taskList[8]\n";
                $foundIt = 1;
            }
        }
    }
}
close(PIDLIST);


So now it's more reading to see how to do this in Python!
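A rough Python equivalent of that Perl loop might look like this (a hedged sketch only: the pslist output format here, name first and PID second, is assumed from the Perl indexes, not verified, and the output is an in-memory sample rather than a live pslist run):

```python
# Scan pslist-style output for a pattern, then match each hit's PID
# against the sign-on log lines.
import re

pslist_output = """\
sas 2348 8 ...
notepad 999 4 ...
"""

log_lines = [
    "2348 williamstim 17AUG03 9:13:55",
    "748 jonesjimbo 14OCT03 23:59:59",
]

pattern = re.compile('sas', re.I)
matches = []
for task_line in pslist_output.splitlines():
    if not pattern.search(task_line):
        continue
    task = task_line.split()        # assumed: [name, pid, ...]
    for log_line in log_lines:
        log = log_line.split()      # [pid, username, date, time]
        if log[0] == task[1]:
            matches.append((task[0], log[0], log[1], log[2], log[3]))
```

To run it against a live server you would replace pslist_output with the captured output of the pslist command.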
Thanks again for all your help!

Taylor
 
Peter Hansen

Nova's Taylor said:
This is what I wound up using:

Could I suggest part of my suggestion again? See below:
piddict = {}
for line in sourceFile:
    pid, username, date, time = line.split()
    piddict[pid] = (username, date, time)

Here you are splitting the whole thing, and storing a Python
tuple rather than the original "line" contents...
pidlist = piddict.keys()
pidlist.sort()
for pid in pidlist:
    username, date, time = piddict[pid]
    # next line seems amateurish, but that is what I am!
    logFile.write(pid + " " + username + " " + date + " " + time + "\n")

Here you are writing out something that is exactly equal
(if I read this all correctly) to the original line, but
having to split the tuple and append lots of strings together
again with spaces, the newline, etc.

Why not just store the original line and use it at the end:

for line in sourceFile:
    pid, _ = line.split(' ', 1)
    piddict[pid] = line

and later, use writelines as Christos suggested, without
even needing a loop:

logFile.writelines(piddict.values())

The difference in the writing part is that you are sorting by
pid, though I'm not clear why or if it's required. If it is,
you could still loop, but more simply:

for pid in pidlist:
    logFile.write(piddict[pid])

No splitting, no concatenating...

-Peter
 
