python vs awk for simple sysadmin tasks


Matthew Thorley

My friend sent me an email asking this:
> I'm attempting to decide which scripting language I should master and
> was wondering if it's possible to do
> these unixy awkish commands in Python:
>
> How to find the amount of disk space a user is taking up:
>
> find / -user rprice -fstype nfs ! -name /dev/\* -ls | awk '{sum+=$7};\
> {print "User rprice total disk use = " sum}'
>
> How to find the average size of a file for a user:
>
> find / -user rprice -fstype nfs ! -name /dev/\* -ls | awk '{sum+=$7};\
> {print "The ave size of file for rprice is = " sum/NR}'

I wasn't able to give him an affirmative answer because I've never used
Python for things like this. I just spent the last while looking on
Google and haven't found an answer yet. I was hoping someone out there
might have some thoughts?

thanks much
-matthew
 

Matthew Thorley

Steve said:
Individually, no. Combined, yes.

So I went ahead and combined them and added a little extra; here's the script:

#!/usr/bin/python

import os
from sys import argv, exit


class userFileStats:
    # stat() tuple indices used below: 4 = st_uid, 6 = st_size

    def __init__(self):
        self.path = ''
        self.uid = ''
        self.oserrors = 0
        self.totalDirs = 0
        self.totalFiles = 0
        self.totalFileSize = 0

        self.totalUserDirs = 0
        self.totalUserFiles = 0
        self.totalUserFileSize = 0
        # sentinel size guaranteed larger than any real file
        self.smallestUserFile = [100**100, 'name']
        self.largestUserFile = [0, 'name']

    def walkPath(self, path, uid):
        self.path = path
        self.uid = int(uid)
        os.path.walk(path, self.tallyFiles, uid)

    def tallyFiles(self, uid, dir, names):
        self.totalDirs = self.totalDirs + 1
        self.totalFiles = self.totalFiles + len(names)

        if os.stat(dir)[4] == self.uid:
            self.totalUserDirs = self.totalUserDirs + 1

        for name in names:
            try:
                stat = os.stat(dir + '/' + name)
            except OSError:
                self.oserrors = self.oserrors + 1
                continue    # 'break' here would skip the rest of the dir

            self.totalFileSize = self.totalFileSize + stat[6]
            if stat[4] == self.uid:
                self.totalUserFiles = self.totalUserFiles + 1
                self.totalUserFileSize = self.totalUserFileSize + stat[6]

                if stat[6] < self.smallestUserFile[0]:
                    self.smallestUserFile[0] = stat[6]
                    self.smallestUserFile[1] = dir + '/' + name

                if stat[6] > self.largestUserFile[0]:
                    self.largestUserFile[0] = stat[6]
                    self.largestUserFile[1] = dir + '/' + name

    def printResults(self):
        # sizes are accumulated in bytes and reported in KB
        print "Results for path %s" % self.path
        print "    Searched %s dirs" % self.totalDirs
        print "    Searched %s files" % self.totalFiles
        print "    Total disk use for all files = %s KB" \
            % (self.totalFileSize / 1024)
        print "    %s files returned errors" % self.oserrors
        print "Results for user %s" % self.uid
        print "    User owns %s dirs" % self.totalUserDirs
        print "    User owns %s files" % self.totalUserFiles
        print "    Total disk use for user = %s KB" \
            % (self.totalUserFileSize / 1024)
        print "    User's smallest file %s is %s KB" \
            % (self.smallestUserFile[1], self.smallestUserFile[0] / 1024)
        print "    User's largest file %s is %s KB" \
            % (self.largestUserFile[1], self.largestUserFile[0] / 1024)
        if self.totalUserFiles:
            print "    Average user file size = %s KB" \
                % ((self.totalUserFileSize / self.totalUserFiles) / 1024)


if __name__ == '__main__':
    if len(argv) == 2:
        user = argv[1]
        path = os.getcwd()
    elif len(argv) == 3:
        user = argv[1]
        path = argv[2]
    else:
        print 'Usage: userFileStats.py uid [path]'
        exit(1)

    stats = userFileStats()
    stats.walkPath(path, user)
    stats.printResults()

It is A LOT longer than the one liners (obviously) but it has way more
functionality. With a little tweaking you could easily do all sorts of
other useful things. I'm sure utils like this already exist out there
whether written in python or not.

Another question: the example my friend gave me takes the user name as
an argument, not the uid. Does anyone know how to convert usernames to
uids and vice versa in Python? Please also comment on the script; any
thoughts on simplification?

thanks
-matthew
 

Matthew Thorley

I should have included this in the last post. The script gives output
that looks like this:

Results for path /linshr/hope
    Searched 694 dirs
    Searched 10455 files
    Total disk use for all files = 4794176 KB
    6 files returned errors
Results for user 1000
    User owns 692 dirs
    User owns 10389 files
    Total disk use for user = 4791474 KB
    User's smallest file /linshr/hope/.fonts.cache-1 is 0 KB
    User's largest file /linshr/hope/vmw/tgz/winXPPro_vmware.tgz is 2244111 KB
    Average user file size = 461 KB

-matthew
 

Steve Lamb

I wasn't able to give him an affirmative answer because I've never used
Python for things like this. I just spent the last while looking on
Google and haven't found an answer yet. I was hoping someone out there
might have some thoughts?

What would be better is defining the end result rather than cramming out
shell script that we've got to decipher. But, with that said, to me it would
be a simple matter of os.path.walk() with a call to an appropriate function
which does the calculations as needed.
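Something like this minimal sketch is what I have in mind (untested;
'/some/path' and the totals dict are just placeholders to fill in):

import os

# visit() is called by os.path.walk() once per directory, as
# visit(arg, dirname, names); names lists everything in dirname,
# including subdirectory entries.
def visit(totals, dirname, names):
    for name in names:
        try:
            size = os.stat(os.path.join(dirname, name))[6]   # st_size
        except OSError:
            continue   # unreadable entry, skip it
        totals['files'] = totals['files'] + 1
        totals['bytes'] = totals['bytes'] + size

totals = {'files': 0, 'bytes': 0}
os.path.walk('/some/path', visit, totals)
print "%(files)d files, %(bytes)d bytes" % totals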
 

Roy Smith

Matthew Thorley said:
My friend sent me an email asking this:


I wasn't able to give him an affirmative answer because I've never used
Python for things like this. I just spent the last while looking on
Google and haven't found an answer yet. I was hoping someone out there
might have some thoughts?

thanks much
-matthew

Neither of these is really a task well suited to Python at all.

I'm sure you could replicate this functionality in Python using things
like os.walk() and os.stat(), but why bother? The result would be no
better than the quick one-liners you've got above. Even if you wanted to
replace the awk part with Python, the idea of trying to replicate the
find functionality is just absurd.

I'm sure you could replicate them in Perl too, but the same comment
applies. find is an essential Unix tool. If you're going to be doing
Unix sysadmin work, you really should figure out how find works.
 

Steve Lamb

Neither of these is really a task well suited to Python at all.

Individually, no. Combined, yes.
I'm sure you could replicate this functionality in Python using things
like os.walk() and os.stat(), but why bother? The result would be no
better than the quick one-liners you've got above.

Not true. The above one liners are two passes over the same data. With
an appropriate script you could make one pass and get both results. Sure you
could do that in shell but I'm of the opinion that anything other than one
liners should never be done in shell.
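
To make that concrete, a rough single-pass sketch (untested; the uid and
starting path are made up, and it uses os.walk() from 2.3 rather than
os.path.walk()):

import os

uid = 1000               # hypothetical uid to report on
top = '/some/path'       # hypothetical starting point

total = 0
count = 0
for dirpath, dirnames, filenames in os.walk(top):
    for name in filenames:
        try:
            st = os.stat(os.path.join(dirpath, name))
        except OSError:
            continue
        if st[4] == uid:                # st_uid
            total = total + st[6]       # st_size
            count = count + 1

# both the total and the average fall out of the same walk
print "User total disk use =", total, "bytes"
if count:
    print "Average file size =", total / count, "bytes"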
 

Roy Smith

Steve Lamb said:
Not true. The above one liners are two passes over the same data. With
an appropriate script you could make one pass and get both results.

You may be right that a Python script would be faster. The shell pipe
does make two passes over the data, not to mention all the pipe
overhead, and the binary -> ASCII -> binary double conversion.

But does it matter? Probably not. Groveling your way through a whole
file system is pretty inefficient any way you do it. It's extremely
rare to find a sysadmin task where this kind of efficiency tweaking
matters. As long as the overall process remains O(n), don't sweat it.
Sure you could do that in shell but I'm of the opinion that anything
other than one liners should never be done in shell.

To a certain extent, you're right, but the two examples given really
were effectively one liners.
 

Donn Cave

Steve Lamb said:
Individually, no. Combined, yes.


Not true. The above one liners are two passes over the same data. With
an appropriate script you could make one pass and get both results. Sure you
could do that in shell but I'm of the opinion that anything other than one
liners should never be done in shell.

I guess you're already conceding that your own point isn't
very relevant, but just in case this isn't clear: if the
intent was actually to do both tasks at the same time (which
isn't clear), the END clause could easily print the sum and
then the average. (The example erroneously fails to label
the END clause; written as awk '{sum+=$7} END {print sum; print sum/NR}'
the second action runs once after all the input instead of for
every line, but it should be fairly easy to see what was
intended.)

Awk is a nice language for its intended role - concise,
readable, efficient - and I use it a lot for things like
this, or somewhat more elaborate programs, because I believe
it's easier to deal with for my colleagues who aren't familiar
with Python (or awk, really). It's also supported by the
UNIX platforms we use, as long as I avoid gawk-isms, while
Python will never be really reliably present until it can
stabilize enough that a platform vendor isn't signing on
for a big headache by trying to support it. (Wave and say
hi, Red Hat.)

However, it's inadequate for complex programming - can't
store arrays in arrays, for example.

Donn Cave, (e-mail address removed)
 

Steve Lamb

You may be right that a Python script would be faster. The shell pipe
does make two passes over the data, not to mention all the pipe
overhead, and the binary -> ASCII -> binary double conversion.
But does it matter? Probably not. Groveling your way through a whole
file system is pretty inefficient any way you do it. It's extremely
rare to find a sysadmin task where this kind of efficiency tweaking
matters. As long as the overall process remains O(n), don't sweat it.

I'm sorry, but when I look at things like this I look at the case where
such things would be used a couple hundred thousand times. Small
inefficiencies like multiple stat() passes and tons of system() calls pile up
fast and can balloon a run time from a manageable "few hours" to well over a
day.
To a certain extent, you're right, but the two examples given really
were effectively one liners.

Yes, they were. But combined it is no longer a one liner since at that
point one is storing the count value and doing something with it. ;)
 

Steve Lamb

It is A LOT longer than the one liners (obviously) but it has way more
functionality. With a little tweaking you could easily do all sorts of
other useful things. I'm sure utils like this already exist out there
whether written in python or not.

Also it can be made part of a larger project with relative ease. :)
Another question: the example my friend gave me takes the user name as
an argument, not the uid. Does anyone know how to convert usernames to
uids and vice versa in Python? Please also comment on the script; any
thoughts on simplification?

I'd just do a quick pass over the passwd file, but then many times I'm
blissfully unaware of things already coded to do the work I'm after. I mean,
my first stab at iterating over the file system didn't use os.path.walk(). :)
 

David M. Cooke

Steve Lamb said:
I'd just do a quick pass over the passwd file ...

That won't work (for all uids) if a network-based database (like NIS)
is used. You want the pwd module.
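
For example (untested; rprice and 1000 are just the name and uid from
earlier in the thread):

import pwd

print pwd.getpwnam('rprice')[2]   # username -> uid (the pw_uid field)
print pwd.getpwuid(1000)[0]       # uid -> username (the pw_name field)

Since pwd goes through the C library's normal account lookup, NIS and
other network databases are consulted automatically.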
 

Pete Forman

I recently rewrote a short shell script in Python. The Python version was
about 30 times faster, and I find myself reusing parts of it for other
tasks.

That said, I still would agree with others in this thread that one
liners are useful. It is a good idea to be familiar with awk, find,
grep, sed, xargs, etc.
 

Steve Lamb

That said, I still would agree with others in this thread that one
liners are useful. It is a good idea to be familiar with awk, find,
grep, sed, xargs, etc.

Then you, like some others, would have missed my point. I never said that
one liners aren't useful. I never said one should not know the standard tools
available on virtually all unix systems. I said, quite clearly, that I felt
*anything larger than a one liner* should not be done in shell.

That means one liners are cool in shell. They serve a purpose.
 

William Park

Steve Lamb said:
Then you, like some others, would have missed my point. I never
said that one liners aren't useful. I never said one should not
know the standard tools available on virtually all unix systems.
I said, quite clearly, that I felt *anything larger than a one
liner* should not be done in shell.

That means one liners are cool in shell. They serve a purpose.

I realize that this is a Python list, but a dose of reality is needed
here. This is the typical view of a salary/wage recipient who would do
anything to waste time and money. How long does it take to type out a
simple 2-line shell/awk script? And how long to do it in Python?

"Right tool for the right job" is the key insight here. Just as Python
evolves, other software evolves as well. For array/list and
glob/regexp work, I mainly use Shell now. Shell can't do nesting (but I
don't really need it to). For more complicated stuff, usually involving
heavy dictionary use, I use Python. Awk falls in between, usually for
floating point and table parsing.

For the OP: learn both Awk and Python. But keep in mind, shell and editor
are the 2 most important tools/skills. Neither Awk nor Python will do
you any good if you can't type. :)
 

Steve Lamb

I realize that this is a Python list, but a dose of reality is needed
here. This is the typical view of a salary/wage recipient who would do
anything to waste time and money. How long does it take to type out a
simple 2-line shell/awk script? And how long to do it in Python?

No, that is the view of someone who wants to get the job done and save
time/money.

How long does it take to write out a simple 2 line shell/awk script? Not
that long. How long does it take to do it in Python? Maybe a dozen or so
minutes longer. Yay, I save a few minutes!

How long does it take me to modify that script a few weeks later when my
boss asks me, "Y'know, I could really use this and this." In shell, quite a
few minutes, maybe a half hour. Python? Maybe 10 minutes tops.

How long does it take me to modify it a month or two after that when my
boss tells me, "We need to add in this feature, exclusions are nice, and what
about this?" Shell, now pushing a good 30-40m. Python, maybe 15m.

How long does it take me to rewrite it into a decent language when my boss
wonders why it is taking so long and the SAs are bitching at me because of the
runtimes of my shell script? In shell, an hour or two. In Python, oh, wait,
I don't have to.

Not gonna happen, ya say? Piffle. The only difference in the above
example was that I wasn't the one who made the choice to write the tool in
shell in the first place, and the language I rewrote it in (to include the
exclusions they wanted, which shell seemed incapable of doing) was Perl
instead of Python. I had the RCS revisions all the way back to the time when
it was a wee little shell script barely 3-4 lines long.

Now, if you like flexing your SA muscle and showing off how cool you are
by being able to whip out small shell scripts to do basic things, that's cool.
Go for it. But the reality check is that more time is saved in the long run
by spending the few extra minutes to do it *properly* instead of doing it
*quickly* because in the long run maintenance is easier and speed is
increased. It is amazing how much shell bogs down when you're running it over
several hundred thousand directories. :p
"Right tool for right job" is the key insight here. Just as Python
evolves, other software evolves as well.

What you're missing is that the job often evolves as well so it is better
to use a tool which has a broader scope and can evolve with the job more so
than the quick 'n dirty, get-it-done-now-and-to-hell-with-maintainability,
ultra-specialized, right-tool-for-the-job tool.

Do I use one-liners? All the time. When I need to delete a subset of
files or need to parse a particular string out of a log-file in a one-off
manner. I can chain the tools just as effectively as the next guy. But
anything more than that and out comes Python (previously Perl) because I
*knew* as soon as I needed a tool more than once I would also find more uses
for it, would have more uses of it requested of me and I'd have to maintain it
for months, sometimes years to come.
 

William Park

Steve Lamb said:
....
How long does it take me to modify that script a few weeks later
when my boss asks me, "Y'know, I could really use this and this."
In shell, quite a few minutes, maybe a half hour. Python? Maybe
10 minutes tops.

How long does it take me to modify it a month or two after that
when my boss tells me, "We need to add in this feature, exclusions
are nice, and what about this?" Shell, now pushing a good 30-40m.
Python, maybe 15m.

Too many maybes, and you're timing hypothetical tasks which you hope
would take that long.

Will you insist on supporting/modifying your old Python code, even if
it's cheaper and faster to throw it away and write new code from
scratch? Because that's what Shell/Awk is for... write, throw away,
write, throw away, ...
....
It is amazing how much shell bogs down when you're running it over
several hundred thousand directories. :p

Same would be true for Python here.

What you're missing is that the job often evolves as well so it is
better to use a tool which has a broader scope and can evolve with
the job more so than the quick 'n dirty,
get-it-done-now-and-to-hell-with-maintainability,
ultra-specialized, right-tool-for-the-job tool.

Do I use one-liners? All the time. When I need to delete a
subset of files or need to parse a particular string out of a
log-file in a one-off manner. I can chain the tools just as
effectively as the next guy. But anything more than that and out
comes Python (previously Perl) because I *knew* as soon as I
needed a tool more than once I would also find more uses for it,
would have more uses of it requested of me and I'd have to
maintain it for months, sometimes years to come.

"If hammer is all you have, then everything becomes a nail". :)
 

Steve Lamb

Too many maybes, and you're timing hypothetical tasks which you hope
would take that long.

No, not too many maybes. That was based on a real-life experience which
is but a single example of many cases over the past several years of working
in my field (ISP/Web Hosting).
Because that's what Shell/Awk is for... write, throw away, write, throw
away, ...

And you don't consider this wasteful. Ooook then.
Same would be true for Python here.

Not as much as shell. With shell you have thousands upon thousands of
fork/execs getting in the way. In the case I am thinking of, rewriting the
shell script in Perl and doing all the logic processing internally cut the
run time from 7-8 hours down to *2*. I have seen similar performance
numbers from Python when compared to shell, though not on that scale. I mean,
the same could be said for C or Assembly. Yes, iterating over large sets of
data is going to increase runtimes. However, there are inefficiencies
present in shell but not in a scripting language which make it take
exceptionally longer to run.
"If hammer is all you have, then everything becomes a nail". :)

Fortunately I don't have just a hammer then, do I? I restate: I use
shell for one liners. Beyond that, it has been my experience, practice and
recommendation that far more time is saved in the long run by using the
proper tool: a proper scripting language, and not a cobbled-together
pseudo-language with severe performance issues.
 

William Park

Steve Lamb said:
Not as much as shell. With shell you have thousands upon
thousands of fork/execs getting in the way. In the case I am
thinking of, rewriting the shell script in Perl and doing all the
logic processing internally cut the run time from 7-8 hours
down to *2*. I have seen similar performance numbers from Python
when compared to shell, though not on that scale. I mean, the same
could be said for C or Assembly. Yes, iterating over large sets
of data is going to increase runtimes. However, there are
inefficiencies present in shell but not in a scripting language
which make it take exceptionally longer to run.

4x faster? Not very impressive. I suspect that it's a poor-quality shell
script to begin with. Would you post this script, so that others can
correct your misguided perception?
 

Steve Lamb

4x faster? Not very impressive. I suspect that it's a poor-quality shell
script to begin with. Would you post this script, so that others can
correct your misguided perception?

No.

1: It was an internal script for statistics gathering and I did not have
permission to expose that code to the public.

2: Even if I did I no longer work there.

The gist of it, though, was that it was a disk usage script which tabulated
usage for a few hundred thousand customers. It had to go through several
slices (it wasn't a single large directory), find the customers in each of
those slices, calculate their disk usage, and create a log of it.

The Perl recode came about when management wanted some exclusions put in,
and the shell script was breaking at that point. They also wanted a lower
run-time if possible. So I spent an hour or two, most of it on the recursive
walk across the file system (thank dog Python has os.path.walk!), rewriting
it in Perl. The stat calls were not reduced; we still had to do a stat on
every file to get the size, as before. However, we were no longer going
through the overhead of constantly opening and closing pipes to/from du, as
well as the numerous exec calls.

4x faster measured in hours based pretty much on building up and tearing
down those pipes and executing the same program over and over is rather
impressive given how miniscule those operations are in general is impressive.
 

Steve Lamb

4x faster measured in hours based pretty much on building up and tearing
down those pipes and executing the same program over and over is rather
impressive given how miniscule those operations are in general is impressive.

Bleh, and the redundancy in that sentence was provided by the human at the
keyboard losing track of where he was during his context switches. Kids,
don't newsgroup post and browse at the same time. >.<
 
