python vs awk for simple sysadmin tasks

Discussion in 'Python' started by Matthew Thorley, Jun 3, 2004.

  1. My friend sent me an email asking this:
    I wasn't able to give him an affirmative answer because I've never used
    python for things like this. I just spent the last while looking on
    google and haven't found an answer yet. I was hoping someone out there
    might have some thoughts?

    thanks much
    -matthew
     
    Matthew Thorley, Jun 3, 2004
    #1

  2. So I went ahead and combined them and added a little extra; here's the script:

    #!/usr/bin/python

    import os
    from sys import argv, exit

    class userFileStats:
        def __init__(self):
            self.path = ''
            self.uid = ''
            self.userName = ''
            self.oserrors = 0
            self.totalDirs = 0
            self.totalFiles = 0
            self.totalFileSize = 0

            self.totalUserDirs = 0
            self.totalUserFiles = 0
            self.totalUserFileSize = 0
            self.smallestUserFile = [100**100, 'name']  # sentinel larger than any real file
            self.largestUserFile = [0, 'name']

        def walkPath(self, path, uid):
            self.path = path
            self.uid = int(uid)
            os.path.walk(path, self.tallyFiles, uid)

        def tallyFiles(self, uid, dir, names):
            self.totalDirs = self.totalDirs + 1
            self.totalFiles = self.totalFiles + len(names)

            if os.stat(dir)[4] == self.uid:  # stat[4] is st_uid
                self.totalUserDirs = self.totalUserDirs + 1
            for name in names:
                try:
                    stat = os.stat(dir + '/' + name)
                except OSError:
                    self.oserrors = self.oserrors + 1
                    continue  # a 'break' here would skip the rest of the directory

                self.totalFileSize = self.totalFileSize + stat[6]  # stat[6] is st_size
                if stat[4] == self.uid:
                    self.totalUserFiles = self.totalUserFiles + 1
                    self.totalUserFileSize = self.totalUserFileSize + stat[6]

                    if stat[6] < self.smallestUserFile[0]:
                        self.smallestUserFile[0] = stat[6]
                        self.smallestUserFile[1] = dir + '/' + name

                    if stat[6] > self.largestUserFile[0]:
                        self.largestUserFile[0] = stat[6]
                        self.largestUserFile[1] = dir + '/' + name

        def printResults(self):
            # sizes are accumulated in bytes, then reported in KB
            print "Results for path %s" % self.path
            print " Searched %s dirs" % self.totalDirs
            print " Searched %s files" % self.totalFiles
            print " Total disk use for all files = %s KB" % (self.totalFileSize / 1024)
            print " %s files returned errors" % self.oserrors
            print "Results for user %s" % self.uid
            print " User owns %s dirs" % self.totalUserDirs
            print " User owns %s files" % self.totalUserFiles
            print " Total disk use for user = %s KB" % (self.totalUserFileSize / 1024)
            print " User's smallest file %s is %s KB" % (self.smallestUserFile[1], self.smallestUserFile[0] / 1024)
            print " User's largest file %s is %s KB" % (self.largestUserFile[1], self.largestUserFile[0] / 1024)
            print " Average user file size = %s KB" % ((self.totalUserFileSize / self.totalUserFiles) / 1024)


    if __name__ == '__main__':
        if len(argv) == 2:
            user = argv[1]
            path = os.getcwd()
        elif len(argv) == 3:
            user = argv[1]
            path = argv[2]
        else:
            print 'Usage: userFileStats.py uid path\n'
            exit(1)

        userFileStats = userFileStats()
        userFileStats.walkPath(path, user)
        userFileStats.printResults()

    It is A LOT longer than the one-liners (obviously), but it has way more
    functionality. With a little tweaking you could easily do all sorts of
    other useful things. I'm sure utils like this already exist out there,
    whether written in python or not.

    Another question. The example my friend gave me takes the user name as
    an argument, not the uid. Does anyone know how to convert usernames to
    uids and vice versa in python? Please also comment on the script, any
    thoughts on simplification?

    thanks
    -matthew
     
    Matthew Thorley, Jun 3, 2004
    #2

  3. I should have included this in the last post. The script gives output
    that looks like this:

    Results for path /linshr/hope
     Searched 694 dirs
     Searched 10455 files
     Total disk use for all files = 4794176 KB
     6 files returned errors
    Results for user 1000
     User owns 692 dirs
     User owns 10389 files
     Total disk use for user = 4791474 KB
     User's smallest file /linshr/hope/.fonts.cache-1 is 0 KB
     User's largest file /linshr/hope/vmw/tgz/winXPPro_vmware.tgz is 2244111 KB
     Average user file size = 461 KB

    -matthew
     
    Matthew Thorley, Jun 3, 2004
    #3
  4. Matthew Thorley

    Steve Lamb Guest

    What would be better is defining the end result rather than cranking out
    shell script that we've got to decipher. But with that said, to me it would be
    a simple matter of os.path.walk() with a call to an appropriate function which
    does the calculations as needed.
     
    Steve Lamb, Jun 3, 2004
    #4
  5. Matthew Thorley

    Roy Smith Guest

    Neither of these is really a task well suited to python at all.

    I'm sure you could replicate this functionality in python using things
    like os.walk() and os.stat(), but why bother? The result would be no
    better than the quick one-liners you've got above. Even if you wanted to
    replace the awk part with python, the idea of trying to replicate the
    find functionality is just absurd.

    I'm sure you could replicate them in perl too, but the same comment
    applies. Find is an essential unix tool. If you're going to be doing
    unix sysadmin work, you really should figure out how find works.
     
    Roy Smith, Jun 3, 2004
    #5
  6. Matthew Thorley

    Steve Lamb Guest

    Individually, no. Combined, yes.
    Not true. The above one-liners make two passes over the same data. With
    an appropriate script you could make one pass and get both results. Sure, you
    could do that in shell, but I'm of the opinion that anything other than one-
    liners should never be done in shell.
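    That single pass can be sketched in Python like so, using os.walk (added in
    Python 2.3); `count_and_size` is a hypothetical name for illustration, not
    code from the thread:

```python
import os

def count_and_size(top):
    # One traversal of the tree yields both numbers at once,
    # where two separate find pipelines would each walk it.
    count = total = 0
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            try:
                total += os.stat(os.path.join(dirpath, name)).st_size
            except OSError:
                continue  # unreadable or vanished file: skip it
            count += 1
    return count, total
```

    From the same two accumulators you also get the average for free
    (`total / count`), which is exactly the point about combining the one-liners.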
     
    Steve Lamb, Jun 3, 2004
    #6
  7. Matthew Thorley

    Roy Smith Guest

    You may be right that a python script would be faster. The shell pipe
    does make two passes over the data, not to mention all the pipe
    overhead, and the binary -> ascii -> binary double conversion.

    But does it matter? Probably not. Groveling your way through a whole
    file system is pretty inefficient any way you do it. It's extremely
    rare to find a sysadmin task where this kind of efficiency tweaking
    matters. As long as the overall process remains O(n), don't sweat it.
    To a certain extent, you're right, but the two examples given really
    were effectively one liners.
     
    Roy Smith, Jun 3, 2004
    #7
  8. Matthew Thorley

    Donn Cave Guest

    I guess you're already conceding that your own point isn't
    very relevant, but just in case this isn't clear, if the
    intent was actually to do both tasks at the same time (which
    isn't clear), the end clause could easily print the sum and
    then the average. (The example erroneously fails to label
    the end clause, but it should be fairly easy to see what was
    intended.)

    Awk is a nice language for its intended role - concise,
    readable, efficient - and I use it a lot for things like
    this, or somewhat more elaborate programs, because I believe
    it's easier to deal with for colleagues who aren't familiar
    with Python (or awk, really). It's also supported by the
    UNIX platforms we use, as long as I avoid gawk-isms, while
    Python will never be really reliably present until it can
    stabilize enough that a platform vendor isn't signing on
    for a big headache by trying to support it. (Wave and say
    hi, Redhat.)

    However, it's inadequate for complex programming - can't
    store arrays in arrays, for example.
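    For contrast, a tiny sketch of the nesting awk lacks - Python containers
    hold other containers directly (the names here are made up for
    illustration):

```python
# A dict of dicts with a list inside: per-directory stats, say.
stats = {'home': {'files': 10, 'sizes': [120, 4096]}}
stats['home']['sizes'].append(88)        # grow the nested list in place
largest = max(stats['home']['sizes'])    # largest size seen: 4096
```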

    Donn Cave,
     
    Donn Cave, Jun 3, 2004
    #8
  9. Matthew Thorley

    Steve Lamb Guest

    I'm sorry, but when I look at things like this I look at the case where
    such things would be used a couple hundred thousand times. Small
    inefficiencies like multiple stat() passes and tons of system() calls pile up
    fast and can balloon a run time from a manageable "few hours" to well over a
    day.
    Yes, they were. But combined it is no longer a one-liner, since at that
    point one is storing the count value and doing something with it. ;)
     
    Steve Lamb, Jun 3, 2004
    #9
  10. Matthew Thorley

    Steve Lamb Guest

    Also it can be made part of a larger project with relative ease. :)
    I'd just do a quick pass over the passwd file, but then many times I'm
    blissfully unaware of things already coded to do the work I'm after. I mean,
    my first stab at iterating over the file system didn't use os.path.walk(). :)
     
    Steve Lamb, Jun 3, 2004
    #10
  11. That won't work (for all uids) if a network-based database (like NIS)
    is used. You want the pwd module.
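    A minimal sketch of that lookup with the pwd module, using 'root' as the
    example account:

```python
import pwd

# Username -> numeric uid
uid = pwd.getpwnam('root').pw_uid
# ...and uid back to username
name = pwd.getpwuid(uid).pw_name
```

    Because pwd goes through the system's account database, the same code works
    whether the entries live in /etc/passwd or in NIS.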
     
    David M. Cooke, Jun 4, 2004
    #11
  12. Matthew Thorley

    Pete Forman Guest

    I recently rewrote a short shell script in python. The latter was
    about 30 times faster and I find myself reusing parts of it for other
    tasks.

    That said, I still would agree with others in this thread that one
    liners are useful. It is a good idea to be familiar with awk, find,
    grep, sed, xargs, etc.
     
    Pete Forman, Jun 4, 2004
    #12
  13. Matthew Thorley

    Steve Lamb Guest

    Then you, like some others, would have missed my point. I never said that
    one liners aren't useful. I never said one should not know the standard tools
    available on virtually all unix systems. I said, quite clearly, that I felt
    *anything larger than a one liner* should not be done in shell.

    That means one liners are cool in shell. They serve a purpose.
     
    Steve Lamb, Jun 4, 2004
    #13
  14. Matthew Thorley

    William Park Guest

    I realize that this is a Python list, but a dose of reality is needed
    here. This is the typical view of a salary/wage recipient who would do
    anything to waste time and money. How long does it take to type out a
    simple 2-line shell/awk script? And how long to do it in Python?

    "Right tool for the right job" is the key insight here. Just as Python
    evolves, other software evolves as well. For array/list and
    glob/regexp, I mainly use Shell now. Shell can't do nesting (but I don't
    really need it to). For more complicated stuff, usually involving heavy
    dictionary use, I use Python. Awk falls in between, usually involving
    floating point and table parsing.

    For the OP: learn both Awk and Python. But keep in mind, shell and editor
    are the 2 most important tools/skills. Neither Awk nor Python will do
    you any good if you can't type. :)
     
    William Park, Jun 4, 2004
    #14
  15. Matthew Thorley

    Steve Lamb Guest

    No, that is the view of someone who wants to get the job done and save
    time/money.

    How long does it take to write out a simple 2 line shell/awk script? Not
    that long. How long does it take to do it in Python? Maybe a dozen or so
    minutes longer. Yay, I save a few minutes!

    How long does it take me to modify that script a few weeks later when my
    boss asks me, "Y'know, I could really use this and this." In shell, quite a
    few minutes, maybe a half hour. Python? Maybe 10 minutes tops.

    How long does it take me to modify it a month or two after that when my
    boss tells me, "We need to add in this feature, exclusions are nice, and what
    about this?" Shell, now pushing a good 30-40m. Python, maybe 15m.

    How long does it take me to rewrite it into a decent language when my boss
    wonders why it is taking so long and the SAs are bitching at me because of the
    runtimes of my shell script? In shell, an hour or two. In Python, oh, wait,
    I don't have to.

    Not gonna happen, ya say? Piffle. The only difference in the above
    example was that I wasn't the one who made the choice to write the tool in
    shell in the first place, and the language I rewrote it in (to include the
    exclusions they wanted, which shell seemed incapable of doing) was Perl
    instead of Python. I had the RCS revisions all the way back to the time when
    it was a wee little shell script barely 3-4 lines long.

    Now, if you like flexing your SA muscle and showing off how cool you are
    by being able to whip out small shell scripts to do basic things, that's cool.
    Go for it. But the reality check is that more time is saved in the long run
    by spending the few extra minutes to do it *properly* instead of doing it
    *quickly* because in the long run maintenance is easier and speed is
    increased. It is amazing how much shell bogs down when you're running it over
    several hundred thousand directories. :p
    What you're missing is that the job often evolves as well so it is better
    to use a tool which has a broader scope and can evolve with the job more so
    than the quick 'n dirty, get-it-done-now-and-to-hell-with-maintainability,
    ultra-specialized, right-tool-for-the-job tool.

    Do I use one-liners? All the time. When I need to delete a subset of
    files or need to parse a particular string out of a log-file in a one-off
    manner. I can chain the tools just as effectively as the next guy. But
    anything more than that and out comes Python (previously Perl) because I
    *knew* as soon as I needed a tool more than once I would also find more uses
    for it, would have more uses of it requested of me and I'd have to maintain it
    for months, sometimes years to come.
     
    Steve Lamb, Jun 4, 2004
    #15
  16. Matthew Thorley

    William Park Guest

    Too many maybes, and you're timing hypothetical tasks which you hope
    would take that long.

    Will you insist on supporting/modifying your old Python code, even if
    it's cheaper and faster to throw it away and write new code from
    scratch? Because that's what Shell/Awk is for... write, throw away,
    write, throw away, ...
    Same would be true for Python here.

    "If hammer is all you have, then everything becomes a nail". :)
     
    William Park, Jun 4, 2004
    #16
  17. Matthew Thorley

    Steve Lamb Guest

    No, not too many maybes. That was based on a real-life experience, which
    is but a single example of many cases over the past several years of working
    in my field (ISP/Web Hosting).
    And you don't consider this wasteful. Ooook then.
    Not as much as shell. With shell you have thousands upon thousands of
    fork/execs getting in the way. In the case I am thinking of, rewriting the
    shell script into Perl and doing all the logic processing internally cut the
    run time down from 7-8 hours to *2*. I have seen similar performance
    numbers from Python when compared to shell, though not on that scale. I mean,
    the same could be said for C, or Assembly. Yes, iterating over large sets of
    data is going to increase runtimes. However, there are inefficiencies present
    in shell, and not present in a scripting language, which make it
    exceptionally longer to run.
    Fortunately I don't have just a hammer then, do I? I restate: I use
    shell for one-liners. Beyond that, it has been my experience, practice and
    recommendation that far more time is saved in the long run by using the
    proper tool. A proper scripting language, and not a cobbled-together
    pseudo-language with severe performance issues.
     
    Steve Lamb, Jun 4, 2004
    #17
  18. Matthew Thorley

    William Park Guest

    4x faster? Not very impressive. I suspect that it's a poor-quality shell
    script to begin with. Would you post this script, so that others can
    correct your misguided perception?
     
    William Park, Jun 5, 2004
    #18
  19. Matthew Thorley

    Steve Lamb Guest

    No.

    1: It was an internal script for statistics gathering, and I did not have
    permission to expose that code to the public.

    2: Even if I did, I no longer work there.

    The gist of it, though, was that it was a disk usage script which tabulated
    usage for a few hundred thousand customers. It had to go through several
    slices (it wasn't a single large directory), find the customers in each of
    those slices, calculate their disk usage and create a log of it.

    The Perl recode came about when management wanted some exclusions put in
    and the shell script was breaking at that point. They also wanted a lower
    run-time if possible. So I spent an hour or two, most of it in the recursion
    walk across the file-system (thank dog Python has os.path.walk!) rewriting it
    in Perl. The stat calls were not reduced, we still had to do a stat on every
    file to get the size as before. However we no longer were going through the
    overhead of constantly opening and closing pipes to/from du as well as the
    numerous exec calls.

    4x faster measured in hours based pretty much on building up and tearing
    down those pipes and executing the same program over and over is rather
    impressive given how miniscule those operations are in general is impressive.
     
    Steve Lamb, Jun 5, 2004
    #19
  20. Matthew Thorley

    Steve Lamb Guest

    Bleh, and the redundancy in that sentence provided by the human at the
    keyboard losing track of where he was during his context switches. Kids,
    don't newsgroup post and browse at the same time. >.<
     
    Steve Lamb, Jun 5, 2004
    #20