Evaluate my first python script, please


P

Pete Emerson

I've written my first python program, and would love suggestions for
improvement.

I'm a perl programmer and used a perl version of this program to guide
me. So in that sense, the python is "perlesque"

This script parses /etc/hosts for hostnames, and based on terms given
on the command line (argv), either prints the list of hostnames that
match all the criteria, or uses ssh to connect to the host if the
number of matches is unique.

I am looking for advice along the lines of "an easier way to do this"
or "a more python way" (I'm sure that's asking for trouble!) or
"people commonly do this instead" or "here's a slick trick" or "oh,
interesting, here's my version to do the same thing".

I am aware that there are performance improvements and error checking
that could be made, such as making sure the file exists and is
readable and precompiling the regular expressions and not calculating
how many sys.argv arguments there are more than once. I'm not hyper
concerned with performance or idiot proofing for this particular
script.

Thanks in advance.

########################################################
#!/usr/bin/python

import sys, fileinput, re, os

filename = '/etc/hosts'

hosts = []

for line in open(filename, 'r'):
match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
if match is None or re.search('^(?:float|localhost)\.', line):
continue
hostname = match.group(1)
count = 0
for arg in sys.argv[1:]:
for section in hostname.split('.'):
if section == arg:
count = count + 1
break
if count == len(sys.argv) - 1:
hosts.append(hostname)

if len(hosts) == 1:
os.system("ssh -A " + hosts[0])
else:
print '\n'.join(hosts)
 
Ad

Advertisements

T

Tim Wintle

I am looking for advice along the lines of "an easier way to do this"
or "a more python way" (I'm sure that's asking for trouble!) or
"people commonly do this instead" or "here's a slick trick" or "oh,
interesting, here's my version to do the same thing".

(1) I would wrap it all in a function

def main():
# your code here

if __name__ == "__main__":
main()

(2) PEP8 (python style guidelines) suggests one import per line

(3) I'd use four spaces as tab width

(4)
I'd change this:
for arg in sys.argv[1:]:
for section in hostname.split('.'):
if section == arg:
count = count + 1
break

to something more like:

for section in hostname.split("."):
if section in sys.argv[1:]:
count += 1

(although as you suggested I'd only calculate sys.argv[1:] once)

.... or you could replace whole section between the for loop and
hosts.append with:

if sorted(hostname.split(".")) == sorted(sys.argv[1:]):
host.append(hostname)


, at a slight loss of clarity - but I think I'd stick with the more
verbose version personally.

Tim
 
M

MRAB

Pete said:
I've written my first python program, and would love suggestions for
improvement.

I'm a perl programmer and used a perl version of this program to guide
me. So in that sense, the python is "perlesque"

This script parses /etc/hosts for hostnames, and based on terms given
on the command line (argv), either prints the list of hostnames that
match all the criteria, or uses ssh to connect to the host if the
number of matches is unique.

I am looking for advice along the lines of "an easier way to do this"
or "a more python way" (I'm sure that's asking for trouble!) or
"people commonly do this instead" or "here's a slick trick" or "oh,
interesting, here's my version to do the same thing".

I am aware that there are performance improvements and error checking
that could be made, such as making sure the file exists and is
readable and precompiling the regular expressions and not calculating
how many sys.argv arguments there are more than once. I'm not hyper
concerned with performance or idiot proofing for this particular
script.

Thanks in advance.

########################################################
#!/usr/bin/python

import sys, fileinput, re, os

filename = '/etc/hosts'

hosts = []

for line in open(filename, 'r'):
match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)

Use 'raw' strings for regular expressions.

'Normal' Python string literals use backslashes in escape sequences (eg,
'\n'), but so do regular expressions, and regular expressions are passed
to the 're' module in strings, so that can lead to a profusion of
backslashes!

You could either escape the backslashes:

match = re.search('\\d+\\.\\d+\\.\\d+\\.\\d+\s+(\\S+)', line)

or use a raw string:

match = re.search(r'\d+\.\d+\.\d+\.\d+\s+(\S+)', line)

A raw string literal is just like a normal string literal, except that
backslashes aren't special.
if match is None or re.search('^(?:float|localhost)\.', line):
continue

It would be more 'Pythonic' to say "not match".

if not match or re.search(r'^(?:float|localhost)\.', line):
hostname = match.group(1)
count = 0
for arg in sys.argv[1:]:
for section in hostname.split('.'):
if section == arg:
count = count + 1

Shorter is:

count += 1
break
if count == len(sys.argv) - 1:
hosts.append(hostname)

if len(hosts) == 1:
os.system("ssh -A " + hosts[0])
else:
print '\n'.join(hosts)

You're splitting the hostname repeatedly. You could split it just once,
outside the outer loop, and maybe make it into a set. You could then
have:

host_parts = set(hostname.split('.'))
count = 0
for arg in sys.argv[1:]:
if arg in host_parts:
count += 1

Incidentally, the re module contains a cache, so the regular expressions
won't be recompiled every time (unless you have a lot of them and the re
module flushes the cache).
 
N

nn

Pete said:
I've written my first python program, and would love suggestions for
improvement.
I'm a perl programmer and used a perl version of this program to guide
me. So in that sense, the python is "perlesque"
This script parses /etc/hosts for hostnames, and based on terms given
on the command line (argv), either prints the list of hostnames that
match all the criteria, or uses ssh to connect to the host if the
number of matches is unique.
I am looking for advice along the lines of "an easier way to do this"
or "a more python way" (I'm sure that's asking for trouble!) or
"people commonly do this instead" or "here's a slick trick" or "oh,
interesting, here's my version to do the same thing".
I am aware that there are performance improvements and error checking
that could be made, such as making sure the file exists and is
readable and precompiling the regular expressions and not calculating
how many sys.argv arguments there are more than once. I'm not hyper
concerned with performance or idiot proofing for this particular
script.
Thanks in advance.

import sys, fileinput, re, os
filename = '/etc/hosts'
hosts = []
for line in open(filename, 'r'):
   match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)

Use 'raw' strings for regular expressions.

'Normal' Python string literals use backslashes in escape sequences (eg,
'\n'), but so do regular expressions, and regular expressions are passed
to the 're' module in strings, so that can lead to a profusion of
backslashes!

You could either escape the backslashes:

        match = re.search('\\d+\\.\\d+\\.\\d+\\.\\d+\s+(\\S+)', line)

or use a raw string:

        match = re.search(r'\d+\.\d+\.\d+\.\d+\s+(\S+)', line)

A raw string literal is just like a normal string literal, except that
backslashes aren't special.
   if match is None or re.search('^(?:float|localhost)\.', line):
continue

It would be more 'Pythonic' to say "not match".

        if not match or re.search(r'^(?:float|localhost)\.', line):
   hostname = match.group(1)
   count = 0
   for arg in sys.argv[1:]:
           for section in hostname.split('.'):
                   if section == arg:
                           count = count + 1

Shorter is:

                                count += 1
                           break
   if count == len(sys.argv) - 1:
           hosts.append(hostname)
if len(hosts) == 1:
   os.system("ssh -A " + hosts[0])
else:
   print '\n'.join(hosts)

You're splitting the hostname repeatedly. You could split it just once,
outside the outer loop, and maybe make it into a set. You could then
have:

         host_parts = set(hostname.split('.'))
        count = 0
        for arg in sys.argv[1:]:
                if arg in host_parts:
                        count += 1

Incidentally, the re module contains a cache, so the regular expressions
won't be recompiled every time (unless you have a lot of them and the re
module flushes the cache).

I think that you really just need a set intersection for the count
part and this should work:

host_parts = set(hostname.split('.'))
count = len(host_parts.intersection(sys.argv[1:]))
 
J

Jonathan Gardner

#!/usr/bin/python

More common:

#!/usr/bin/env python
import sys, fileinput, re, os

filename = '/etc/hosts'

hosts = []

for line in open(filename, 'r'):
       match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
       if match is None or re.search('^(?:float|localhost)\.', line):
continue
       hostname = match.group(1)

In Python, we use REs as a last resort. See "dir("")" or "help("")"
for why we don't use REs if we can avoid them.

The above is better represented with:

try:
ipaddr, hostnames = line.split(None, 1)
except IndexError:
continue

if line.find('float') >=0 or line.find('localhost') >= 0:
continue

       count = 0
       for arg in sys.argv[1:]:
               for section in hostname.split('.'):
                       if section == arg:
                               count = count + 1
                               break
       if count == len(sys.argv) - 1:
               hosts.append(hostname)

You can use the "else" in a "for". as well.

for arg in sys.argv[1:]:
for section in hostname.split('.'):
if section == arg:
break
else: # Run only if for completes naturally
hosts.append(hostname)


It may be clearer to do set arithmetic as well.
if len(hosts) == 1:
       os.system("ssh -A " + hosts[0])

You probably want one of the os.exec* methods instead, since you
aren't going to do anything after this.
else:
       print '\n'.join(hosts)

Rather, the idiom is usually:

for host in hosts:
print host
 
S

sjdevnull

I've written my first python program, and would love suggestions for
improvement.

I'm a perl programmer and used a perl version of this program to guide
me. So in that sense, the python is "perlesque"

This script parses /etc/hosts for hostnames, and based on terms given
on the command line (argv), either prints the list of hostnames that
match all the criteria, or uses ssh to connect to the host if the
number of matches is unique.

I am looking for advice along the lines of "an easier way to do this"
or "a more python way" (I'm sure that's asking for trouble!) or
"people commonly do this instead" or "here's a slick trick" or "oh,
interesting, here's my version to do the same thing".

I am aware that there are performance improvements and error checking
that could be made, such as making sure the file exists and is
readable and precompiling the regular expressions and not calculating
how many sys.argv arguments there are more than once. I'm not hyper
concerned with performance or idiot proofing for this particular
script.

Thanks in advance.

########################################################
#!/usr/bin/python

import sys, fileinput, re, os

'Some people, when confronted with a problem, think "I know, I’ll use
regular expressions." Now they have two problems.' — Jamie Zawinski

Seriously, regexes can be very useful but there's no need for them
here. Simpler is usually better, and easier to understand.
filename = '/etc/hosts'

hosts = []

for line in open(filename, 'r'):
        match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
        if match is None or re.search('^(?:float|localhost)\.', line):
continue
        hostname = match.group(1)

I find this much clearer without regexes:

try:
ip, hostname = line.strip().split(None, 1)
except IndexError:
continue
# I think this is equivalent to your re, but I'm not sure it's what
# you actually meant...
#if line.startswith("float.") or line.startswith("localhost."):
# continue
# I'm going with:
if hostname.startswith("float.") or hostname.startswith("localhost"):
continue

        count = 0
        for arg in sys.argv[1:]:
                for section in hostname.split('.'):
                        if section == arg:
                                count = count + 1
                                break
        if count == len(sys.argv) - 1:
                hosts.append(hostname)

A perfect application of sets.

#initialize at program outset
args = set(sys.argv[1:])
....
hostparts = set(hostname.split("."))
if hostparts & args:
hosts.append(hostname)


Full program:

import sys
import os
filename = '/etc/hosts'

hosts = []
args = set(sys.argv[1:])
for line in open(filename, 'r'):
# Parse line into ip address and hostname, skipping bogus lines
try:
ipaddr, hostname = line.strip().split(None, 1)
except ValueError:
continue
if hostname.startswith("float.") or
hostname.startswith("localhost."):
continue

# Add to hosts if it matches at least one argument
hostparts = set(hostname.split("."))
if hostparts & args:
hosts.append(hostname)

# If there's only one match, ssh to it--otherwise print out the
matches
if len(hosts) == 1:
os.system("ssh -A %s"%hosts[0])
else:
for host in hosts:
print host
 
Ad

Advertisements

P

Pete Emerson

Great responses, thank you all very much. I read Jonathan Gardner's
solution first and investigated sets. It's clearly superior to my
first cut.

I love the comment about regular expressions. In perl, I've reached
for regexes WAY too much. That's a big lesson learned too, and from my
point of view not because of performance (although that's most likely
a bonus) but because of code readability.

Here's the version I have now. It looks quite similar to sjdevnull's.
Further comments appreciated, and thanks again.

#!/usr/bin/env python

import sys, fileinput, os

filename = '/etc/hosts'

hosts = []

search_terms = set(sys.argv[1:])

for line in open(filename, 'r'):
if line.startswith('#'): continue
try:
hostname = line.strip().split('\t')[2] # The host I want is always
at the 3rd tab.
except IndexError:
continue
if hostname.startswith('float.') or
hostname.startswith('localhost.'):
continue
if search_terms <= set(hostname.split('.')): # same as if
search_terms.issubset(hostname.split('.')):
hosts.append(hostname)

if len(hosts) == 1:
os.execl("/usr/bin/ssh", '-A', hosts[0])
else:
for host in hosts:
print host
 
J

Jean-Michel Pichavant

Pete said:
I've written my first python program, and would love suggestions for
improvement.

I'm a perl programmer and used a perl version of this program to guide
me. So in that sense, the python is "perlesque"

This script parses /etc/hosts for hostnames, and based on terms given
on the command line (argv), either prints the list of hostnames that
match all the criteria, or uses ssh to connect to the host if the
number of matches is unique.

I am looking for advice along the lines of "an easier way to do this"
or "a more python way" (I'm sure that's asking for trouble!) or
"people commonly do this instead" or "here's a slick trick" or "oh,
interesting, here's my version to do the same thing".

I am aware that there are performance improvements and error checking
that could be made, such as making sure the file exists and is
readable and precompiling the regular expressions and not calculating
how many sys.argv arguments there are more than once. I'm not hyper
concerned with performance or idiot proofing for this particular
script.

Thanks in advance.

########################################################
#!/usr/bin/python

import sys, fileinput, re, os

filename = '/etc/hosts'

hosts = []

for line in open(filename, 'r'):
match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)
if match is None or re.search('^(?:float|localhost)\.', line):
continue
hostname = match.group(1)
count = 0
for arg in sys.argv[1:]:
for section in hostname.split('.'):
if section == arg:
count = count + 1
break
if count == len(sys.argv) - 1:
hosts.append(hostname)

if len(hosts) == 1:
os.system("ssh -A " + hosts[0])
else:
print '\n'.join(hosts)
You've already been given good advices.
Unlike some of those, I don't think using regexp an issue in any way.
Yes they are not readable, for sure, I 100% agree to that statement, but
you can make them readable very easily.

# not readable
match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line)


# readable
match = re.search('\d+\.\d+\.\d+\.\d+\s+(\S+)', line) # match
'192.168.200.1 (foo123)'

I gladly understand ppl that are not willing to use regexp (or as last
resort), but for a perl guy, that should not be an issue.

JM
 
J

Jean-Michel Pichavant

Duncan said:
The problem with your comment is that just by existing it means people will
be misled into thinking the regex is being used to match strings like
'192.168.200.1 (foo123)'
when in fact the OP is expecting to match strings like
'192.168.200.1 foo123'

i.e. you may mislead some people into thinking the parentheses are part of
what is being matched when they're actually grouping within the regex.

Another problem with either regular expression line is that it makes it
less easy to see that it doesn't do what the OP wanted. /etc/hosts contains
lines with an IP address followed by multiple hostnames. It may also
contain comments either as complete lines or at the end of lines. The code
given will miss hostnames after the first and will include hostnames in
commented out lines.
And tell me how not using regexp will ensure the /etc/hosts processing
is correct ? The non regexp solutions provided in this thread did not
handled what you rightfully pointed out about host list and commented lines.

And FYI, the OP pattern does match '192.168.200.1 (foo123)'
....
Ok that's totally unfair :D You're right I made a mistake. Still the
comment is absolutely required (provided it's correct).

JM
 
P

Pete Emerson

Thanks for your response, further questions inline.

(1) I would wrap it all in a function

def main():
    # your code here

if __name__ == "__main__":
    main()

Is this purely aesthetic reasons, or will I appreciate this when I
write my own modules, or something else?
(2) PEP8 (python style guidelines) suggests one import per line

(3) I'd use four spaces as tab width

(4)
I'd change this:
    for arg in sys.argv[1:]:
        for section in hostname.split('.'):
            if section == arg:
                count = count + 1
                break

to something more like:

    for section in hostname.split("."):
        if section in sys.argv[1:]:
            count += 1

Ah, yes, I like that. It moves towards the set notation I've wound up
with. Definitely more readable to me.
(although as you suggested I'd only calculate sys.argv[1:] once)

... or you could replace whole section between the for loop and
hosts.append with:

    if sorted(hostname.split(".")) == sorted(sys.argv[1:]):
        host.append(hostname)

This doesn't actually work, because I'm not looking for a one to one
mapping of args to sections of hostname, but rather a subset. So
passing in 'prod sfo' would register a match for '001.webapp.prod.sfo'.
 
P

Pete Emerson

Jean-Michel Pichavant said:
And tell me how not using regexp will ensure the /etc/hosts processing
is correct ? The non regexp solutions provided in this thread did not
handled what you rightfully pointed out about host list and commented
lines.

It won't make is automatically correct, but I'd guess that written without
being so dependent on regexes might have made someone point out those
deficiencies sooner. The point being that casual readers of the code won't
take the time to decode the regex, they'll glance over it and assume it
does something or other sensible.

If I was writing that code, I'd read each line, strip off comments and
leading whitespace (so you can use re.match instead of re.search), split on
whitespace and take all but the first field. I might check that the field
I'm ignoring it something like a numeric ip address, but if I did want to
do then I'd include range checking for valid octets so still no regex.

The whole of that I'd wrap in a generator so what you get back is a
sequence of host names.

However that's just me. I'm not averse to regular expressions, I've written
some real mammoths from time to time, but I do avoid them when there are
simpler clearer alternatives.
And FYI, the OP pattern does match '192.168.200.1 (foo123)'
...
Ok that's totally unfair :D You're right I made a mistake.  Still the
comment is absolutely required (provided it's correct).

Yes, the comment would have been good had it been correct. I'd also go for
a named group as that provides additional context within the regex.

Also if there are several similar regular expressions in the code, or if
they get too complex I'd build them up in parts. e.g.

OCTET = r'(?:\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])'
ADDRESS = (OCTET + r'\.') * 3 + OCTET
HOSTNAME = r'[-a-zA-Z0-9]+(?:\.[-a-zA-Z0-9]+)*'
  # could use \S+ but my Linux manual says
  # alphanumeric, dash and dots only
... and so on ...

which provides another way of documenting the intentions of the regex.

BTW, I'm not advocating that here, the above patterns would be overkill,
but in more complex situations thats what I'd do.

All good comments here. The takeaway for my lazy style of regexes
(which makes it harder for non-regex fiends to read, regardless of the
language) is that there are ways to make regexes much more readable to
the untrained eye. Duncan, I like your method of defining sections of
the regex outside the regex itself, even if it's a one time use.
 
Ad

Advertisements

T

Tim Wintle

Thanks for your response, further questions inline.



Is this purely aesthetic reasons, or will I appreciate this when I
write my own modules, or something else?

It's for when you reuse this code.

Consider it's in "mymodule.py" (so run with ./mymodule.py) - if you then
make a "tests.py" (for example) you can "import mymodule" without it
automatically running your code.

re-writing it

def main(args):
#your code

if __name__ == "__main__":
main(sys.argv[1:])

would obviously be more sensible for actually writing tests.

... or you could replace whole section between the for loop and
hosts.append with:

if sorted(hostname.split(".")) == sorted(sys.argv[1:]):
host.append(hostname)

This doesn't actually work, because I'm not looking for a one to one
mapping of args to sections of hostname, but rather a subset. So
passing in 'prod sfo' would register a match for '001.webapp.prod.sfo'.

Ah - good point - I guess the the set intersection technique someone
else mentioned is best in that case.

Tim
 
S

sjdevnull

Thanks for your response, further questions inline.






Is this purely aesthetic reasons, or will I appreciate this when I
write my own modules, or something else?

Suppose the above code is in mymodule.py. By wrapping main() you can:
1. Have another module do:
import mymodule
.... (so some stuff, perhaps munge sys.argv)
mymodule.main()
2. If mymodule has a small function in it, someone else can import it
and call that function
3. You can run pylint, pychecker and other source-code checkers that
need to be able to import your module to check it (I wouldn't be
surprised if recent versions of one or the other of those don't
require imports, and some checkers like pyflakes certainly don't).
4. You can easily have a unit tester call into the module

etc.

+1 on both; it's good to get into the habit of writing standard-
looking Python code.
 
Ad

Advertisements

P

Pete Emerson

Suppose the above code is in mymodule.py.  By wrapping main() you can:
1. Have another module do:
import mymodule
... (so some stuff, perhaps munge sys.argv)
mymodule.main()
2. If mymodule has a small function in it, someone else can import it
and call that function
3. You can run pylint, pychecker and other source-code checkers that
need to be able to import your module to check it (I wouldn't be
surprised if recent versions of one or the other of those don't
require imports, and some checkers like pyflakes certainly don't).
4. You can easily have a unit tester call into the module

etc.


+1 on both; it's good to get into the habit of writing standard-
looking Python code.

Agreed, noted, and appreciated, with the caveat that using spaces
instead of tabs might border on an emacs vs. vi flamewar in some
circles. I personally will use spaces going forward.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top