How to read such a file and summarize the data?

huisky

Say I have the following log file, which records code usage.
I want to read this file and summarize how much total CPU time
was consumed by each user.
Is Python able to do this, or at least make it easy to achieve? Can anybody
give me some hints? I'd appreciate it very much!


Example log file.
**************************************************************************************
LSTC license server version 224 started at Sun Dec 6 18:56:48 2009
using configuration file /usr/local/lstc/server_data
xyz (e-mail address removed) LS-DYNA_971 NCPU=1 started Sun Dec 6 18:57:40
(e-mail address removed) completed Sun Dec 6 19:42:55
xyz (e-mail address removed) LS-DYNA_971 NCPU=2 started Sun Dec 6 20:17:02
(e-mail address removed) completed Sun Dec 6 20:26:03
xyz (e-mail address removed) LS-DYNA_971 NCPU=1 started Sun Dec 6 21:01:17
(e-mail address removed) completed Sun Dec 6 21:01:28
tanhoi (e-mail address removed) LS-DYNA_971 NCPU=1 started Mon Dec 7 09:31:00
(e-mail address removed) presumed dead Mon Dec 7 10:36:48
sabril (e-mail address removed) LS-DYNA_971 NCPU=2 started Mon Dec 7 13:14:47
(e-mail address removed) completed Mon Dec 7 13:24:07
sabril (e-mail address removed) LS-DYNA_971 NCPU=2 started Mon Dec 7 14:21:34
sabril (e-mail address removed) LS-DYNA_971 NCPU=2 started Mon Dec 7 14:28:42
(e-mail address removed) killed Mon Dec 7 14:31:48
(e-mail address removed) killed Mon Dec 7 14:32:06
 
MRAB

Say I have the following log file, which records code usage.
I want to read this file and summarize how much total CPU time
was consumed by each user.
Is Python able to do this, or at least make it easy to achieve? Can anybody
give me some hints? I'd appreciate it very much!


[example log file snipped]

Here's how I would probably do it:

Use two dicts: one for the start time of each user, and the other for the
total elapsed time of each user.

For each line, extract the user name, the date/time, and whether the user is
starting, finishing, etc.

When a user starts, save the info in the first dict, and when a user
finishes, calculate the elapsed time and add it to the total for that
user.

The date/time can be parsed by the time or datetime module.
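A minimal sketch of that plan in modern Python, keyed on the per-job token rather than the user so that overlapping jobs from one user still pair up correctly. The `101@host` tokens are made up (the real values were scrubbed from the post as "(e-mail address removed)"), the year is assumed since the log timestamps omit it, and "presumed dead" lines are simply ignored here:

```python
from datetime import datetime

YEAR = 2009  # assumed: the per-job timestamps in the log carry no year

def parse_when(words):
    """Turn ['Sun', 'Dec', '6', '18:57:40'] into a datetime."""
    return datetime.strptime(" ".join(words) + f" {YEAR}",
                             "%a %b %d %H:%M:%S %Y")

def summarize(lines):
    starts = {}   # job token (e.g. '101@host') -> (user, start datetime)
    totals = {}   # user -> total elapsed seconds
    for line in lines:
        ls = line.split()
        if len(ls) >= 9 and ls[4] == "started":
            # user token LS-DYNA_971 NCPU=n started Day Mon DD HH:MM:SS
            starts[ls[1]] = (ls[0], parse_when(ls[5:9]))
        elif len(ls) >= 6 and ls[1] in ("completed", "killed"):
            # token completed|killed Day Mon DD HH:MM:SS
            user, began = starts.pop(ls[0])
            elapsed = (parse_when(ls[2:6]) - began).total_seconds()
            totals[user] = totals.get(user, 0.0) + elapsed
    return totals

log = [
    "xyz 101@host LS-DYNA_971 NCPU=1 started Sun Dec 6 18:57:40",
    "101@host completed Sun Dec 6 19:42:55",
    "xyz 102@host LS-DYNA_971 NCPU=2 started Sun Dec 6 20:17:02",
    "102@host completed Sun Dec 6 20:26:03",
]
print(summarize(log))  # {'xyz': 3256.0} -- 45:15 plus 9:01, in seconds
```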
 
Tim Harig

I want to read this file and summarize how much total CPU time
was consumed by each user.
Is Python able to do this, or at least make it easy to achieve? Can anybody
give me some hints? I'd appreciate it very much!

The question is whether the information you want is available in the data.
[example log file snipped]

I see starts, completes, kills, and presumed deads. The question is whether
the starts can be matched to the completes and kills, either from the numbers
before the @ or from a combination of the address and NCPU. You will need to
decide whether or not you want to count the presumed deads in your
calculations.

Assuming that the starts and stops can be correlated, it is a simple matter
of finding the pairs and using the datetime module to find the difference
in time between them.
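For instance, a single matched pair could be differenced like this (a year has to be supplied, since the log lines omit one):

```python
from datetime import datetime

fmt = "%a %b %d %H:%M:%S %Y"
start = datetime.strptime("Sun Dec 6 18:57:40 2009", fmt)
stop = datetime.strptime("Sun Dec 6 19:42:55 2009", fmt)
print((stop - start).total_seconds())  # 2715.0 seconds elapsed
```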
 
Tim Harig

When a user starts, save the info in the first dict, and when a user
finishes, calculate the elapsed time and add it to the total for that
user.

Perhaps you know more about the structure of this data. It seems to me
that a user might have more than a single job(?) running at once. I
therefore made the assumption that each start must be matched to its
corresponding stop.
 
Steve Holden

Say I have the following log file, which records code usage.
I want to read this file and summarize how much total CPU time
was consumed by each user.
Is Python able to do this, or at least make it easy to achieve? Can anybody
give me some hints? I'd appreciate it very much!


I'm assuming the following (unquoted) data is in file "data.txt":
[log data as quoted above]

The line wrapping being wrong shouldn't affect the logic.

$ cat data.py
lines = open("data.txt").readlines()
from collections import defaultdict
c = defaultdict(int)
for line in lines:
    ls = line.split()
    if len(ls) > 3 and ls[3].startswith("NCPU="):
        amt = int(ls[3][5:])
        c[ls[0]] += amt
for key, value in c.items():
    print key, ":", value


$ python data.py
xyz : 4
tanhoi : 1
sabril : 6

regards
Steve
 
MRAB

Perhaps you know more about the structure of this data. It seems to me
that a user might have more than a single job(?) running at once. I
therefore made the assumption that each start must be matched to its
corresponding stop.

I did make certain assumptions. It's up to the OP to adapt it to the
actual problem accordingly. :)
 
Martin Gregorie

Say I have the following log file, which records code usage. I want to
read this file and summarize how much total CPU time was consumed by
each user.
Two points you should think about:

- I don't think you can extract CPU time from this log: you can get
each run's elapsed time and the number of CPUs it used,
but you can't calculate CPU time from those values since you don't
know how long the process spent waiting for I/O etc.

- is the first (numeric) part of the first field on the line a process id?
If it is, you can match start and stop messages on the value of the
first field provided that this value can never be shared by two
processes that are both running. If you can get simultaneous
duplicates, then you're out of luck because you'll never be able to
match up start and stop lines.

Is Python able to do this, or at least make it easy to achieve? Can anybody
give me some hints? I'd appreciate it very much!
Sure. There are two possible approaches:
- sort the log on the first two fields and then process it with Python
knowing that start and stop lines will be adjacent

- use the first field as the key to a dict and put the start time
and CPU count in that entry. When a matching stop line is found,
you retrieve the entry, calculate and output (or total) the
usage figure for that run, and delete the entry.
 
Terry Reedy

$ cat data.py
lines = open("data.txt").readlines()

Since you iterate through the file just once, there is no reason I can
think of to make a complete in-memory copy. That would be a problem with
a multi-gigabyte log file ;=). In 3.x at least, open files are line
iterators and one would just need

lines = open("data.txt")
 
Steve Holden

Since you iterate through the file just once, there is no reason I can
think of to make a complete in-memory copy. That would be a problem with
a multi-gigabyte log file ;=). In 3.x at least, open files are line
iterators and one would just need
You are indeed perfectly correct, thank you. Probably old ingrained
habits showing through. Open files have been line iterators since 2.2, I
believe.

regards
Steve
 
huisky

Thank you, Martin. You are right.
But the elapsed time is also okay for me, and I would like to assume
that the total CPU time equals the number of CPUs multiplied by the
elapsed time. As for the number you mentioned, it is the process id,
so it will be no problem to identify each job.

Huisky
 
huisky

The number before the @ is the process id on the Linux server, and it is
unique.
So I do NOT think distinguishing each job's start and end times is
difficult in this case.

Huisky
 
huisky

Thank you very much for your reply.
I think your code just counts the total NCPU used by each user;
it does NOT show how much time each user consumed.

Huisky
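Right: the snippet only totals NCPU counts. Getting time in as well means matching each start to its stop, as MRAB and Tim Harig described. A sketch under the assumptions already aired in the thread: the scrubbed token is a unique `pid@host` (the `101@host` value below is made up), the year is known, "presumed dead" runs are skipped, and "CPU time" is taken as NCPU multiplied by elapsed time:

```python
from collections import defaultdict
from datetime import datetime

YEAR = 2009  # assumed: the log's timestamps omit the year

def when(words):
    # words like ['Sun', 'Dec', '6', '18:57:40']
    return datetime.strptime(" ".join(words) + f" {YEAR}",
                             "%a %b %d %H:%M:%S %Y")

def cpu_seconds(lines):
    running = {}                  # pid token -> (user, ncpu, start datetime)
    totals = defaultdict(float)   # user -> NCPU-weighted elapsed seconds
    for line in lines:
        ls = line.split()
        if len(ls) >= 9 and ls[4] == "started":
            # user pid@host LS-DYNA_971 NCPU=n started <timestamp>
            running[ls[1]] = (ls[0], int(ls[3][5:]), when(ls[5:9]))
        elif len(ls) >= 6 and ls[1] in ("completed", "killed"):
            # pid@host completed|killed <timestamp>
            user, ncpu, began = running.pop(ls[0])
            totals[user] += ncpu * (when(ls[2:6]) - began).total_seconds()
    return dict(totals)

sample = [
    "xyz 101@host LS-DYNA_971 NCPU=2 started Sun Dec 6 20:17:02",
    "101@host completed Sun Dec 6 20:26:03",
]
print(cpu_seconds(sample))  # {'xyz': 1082.0} -- 2 CPUs for 541 seconds
```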

 
