Comments on my first script?

Discussion in 'Python' started by Phillip B Oldham, Jun 12, 2008.

  1. I'm keen on learning python, with a heavy lean on doing things the
    "pythonic" way, so threw the following script together in a few hours
    as a first-attempt in programming python.

    I'd like the community's thoughts/comments on what I've done;
    improvements I can make, "don'ts" I should be avoiding, etc. I'm not
    so much bothered about the resulting data - for the moment it meets my
    needs. But any comment is welcome!

    #!/usr/bin/env python
    ## Open a file containing a list of domains (1 per line),
    ## request and parse it's whois record and push to a csv
    ## file.

    import subprocess
    import re

    src = open('./domains.txt')

    dest = open('./whois.csv', 'w');

    sep = "|"
    headers = ["Domain","Registrant","Registrant's
    Address","Registrar","Registrant Type","Date Registered","Renewal
    Date","Last Updated","Name Servers"]

    dest.write(sep.join(headers)+"\n")

    def trim( txt ):
    x = []
    for line in txt.split("\n"):
    if line.strip() == "":
    continue
    if line.strip().startswith('WHOIS'):
    continue
    if line.strip().startswith('>>>'):
    continue
    if line.strip().startswith('%'):
    continue
    if line.startswith("--"):
    return ''.join(x)
    x.append(" "+line)
    return "\n".join(x)

    def clean( txt ):
    x = []
    isok = re.compile("^\s?([^:]+): ").match
    for line in txt.split("\n"):
    match = isok(line)
    if not match:
    continue
    x.append(line)
    return "\n".join(x);

    def clean_co_uk( rec ):
    rec = rec.replace('Company number:', 'Company number -')
    rec = rec.replace("\n\n", "\n")
    rec = rec.replace("\n", "")
    rec = rec.replace(": ", ":\n")
    rec = re.sub("([^(][a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
    rec = rec.replace(":\n", ": ")
    rec = re.sub("^[ ]+\n", "", rec)
    return rec

    def clean_net( rec ):
    rec = rec.replace("\n\n", "\n")
    rec = rec.replace("\n", "")
    rec = rec.replace(": ", ":\n")
    rec = re.sub("([a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
    rec = rec.replace(":\n", ": ")
    return rec

    def clean_info( rec ):
    x = []
    for line in rec.split("\n"):
    x.append(re.sub("^([^:]+):", "\g<0> ", line))
    return "\n".join(x)

    def record(domain, record):
    details = ['','','','','','','','','']
    for k, v in record.items():
    try:
    details[0] = domain.lower()
    result = {
    "registrant": lambda: 1,
    "registrant name": lambda: 1,
    "registrant type": lambda: 4,
    "registrant's address": lambda: 2,
    "registrant address1": lambda: 2,
    "registrar": lambda: 3,
    "sponsoring registrar": lambda: 3,
    "registered on": lambda: 5,
    "registered": lambda: 5,
    "domain registeration date": lambda: 5,
    "renewal date": lambda: 6,
    "last updated": lambda: 7,
    "domain last updated date": lambda: 7,
    "name servers": lambda: 8,
    "name server": lambda: 8,
    "nameservers": lambda: 8,
    "updated date": lambda: 7,
    "creation date": lambda: 5,
    "expiration date": lambda: 6,
    "domain expiration date": lambda: 6,
    "administrative contact": lambda: 2
    }[k.lower()]()
    if v != '':
    details[result] = v
    except:
    continue

    dest.write(sep.join(details)+"\n")

    ## Loop through domains
    for domain in src:

    domain = domain.strip()

    if domain == '':
    continue

    rec = subprocess.Popen(["whois",domain],
    stdout=subprocess.PIPE).communicate()[0]

    if rec.startswith("No whois server") == True:
    continue

    if rec.startswith("This TLD has no whois server") == True:
    continue

    rec = trim(rec)

    if domain.endswith(".net"):
    rec = clean_net(rec)

    if domain.endswith(".com"):
    rec = clean_net(rec)

    if domain.endswith(".tv"):
    rec = clean_net(rec)

    if domain.endswith(".co.uk"):
    rec = clean_co_uk(rec)

    if domain.endswith(".info"):
    rec = clean_info(rec)

    rec = clean(rec)

    details = {}

    try:
    for line in rec.split("\n"):
    bits = line.split(': ')
    a = bits.pop(0)
    b = bits.pop(0)
    details[a.strip()] = b.strip().replace("\t", ", ")
    except:
    continue

    record(domain, details)

    ## Cleanup
    src.close()
    dest.close()
     
    Phillip B Oldham, Jun 12, 2008
    #1
    1. Advertisements

  2. Phillip B Oldham

    John Salerno Guest

    I'm not expert, but here are a few thoughts. I hope they help.
    You might want to look into doc strings as a method of providing longer
    documentation like this about what your program does.
    Semicolon!!!! :)
    Is all this properly indented? One thing you can do is put each of these on
    one line, since they are fairly simple:

    if line.strip().startswith('WHOIS'): continue

    although I still like proper indentation. But you have a lot of them so it
    might save a good amount of space to do it this way.

    Also, just my personal preference, I like to be consistent with the type of
    quotes I use for strings. Here, you mix both single and double quotes on
    different lines.
    Semicolon!!!! :) :)
    I don't have Python available to me right now, but I think you can do this
    instead:

    details = [''] * 9
    Non-specific except clauses usually aren't preferred since they catch
    everything, even something you might not want to catch.
    You can say:

    if not domain

    instead of that equivalence test. But what does this if statement do?
    Like above, you don't need "== True" here.
    Hmm, my first thought is to do something like this with all these if tests:

    for extension in [<list all the extensions as strings here>]:
    rec = clean_net(extension)

    But for that to work, you may need to generalize the clean_net function so
    it works for all of them, instead of having to call different functions
    depending on the extension.

    Anyway, I hope some of that helps!
     
    John Salerno, Jun 12, 2008
    #2
    1. Advertisements

  3. Phillip B Oldham

    John Salerno Guest

    Whoops, you'd still need an if test in there I suppose!

    for extension in [<list all the extensions as strings here>]:
    if domain.endswith(extension):
    rec = clean_net(extension)

    Not sure if this is ideal.
     
    John Salerno, Jun 12, 2008
    #3
  4. Phillip B Oldham

    Chris Guest

    Just a few quick things before I leave work.

    #!/usr/bin/env python
    """Open a file containing a list of domains (1 per line),
    request and parse it's whois record and push to a csv
    file.
    """ # Rather use docstrings than multiline commenting like that.

    def trim(txt):
    x = []
    for line in txt.splitlines(): # Strings have a built in function
    if not line.strip() or line.startswith('WHOIS') \
    or line.startswith('>>>') or line.startswith('%'):
    continue # you can do them in one if statement
    if line.startswith('--'): return ''.join(x)
    x.append(' '+line)
    return '\n'.join(x)

    for domain in src:
    if not domain.strip(): continue # A line with nothing is False

    rec = subprocess.Popen(["whois",domain.strip()],
    stdout=subprocess.PIPE).communicate()[0]
    if rec.startswith('No whois server') \
    or rec.startswith('This TLD has no whois server'):
    continue # Startswith will return True/False so it is enough

    rec = trim(rec)
    if domain.endswith('.net'):
    rec = clean_net(rec)
    elif domain.endswith('.com'):
    # Rather use if/elif statements unless somehow you think you
    will match more than one.
    ....

    for line in rec.splitlines():
    try:
    a, b = line.split(': ')[:2]
    details[a.strip()] = b.strip().replace('\t', ', ')
    except IndexError: # No matches
    continue

    Hope that's a start.
     
    Chris, Jun 12, 2008
    #4
  5. Thanks guys. Those comments are really helpful. The odd semi-colon is
    my PHP background. Will probably be a hard habbit to break, that
    one! ;) If I do accidentally drop a semi-colon at the end of the line,
    will that cause any weird errors?

    Also, Chris, can you explain this:
    a, b = line.split(': ')[:2]

    I understand the first section, but I've not seen [:2] before.
     
    Phillip B Oldham, Jun 13, 2008
    #5
  6. Phillip B Oldham

    Chris Guest

    That's slicing at work. What it is doing is only taking the first two
    elements of the list that is built by the line.split.
     
    Chris, Jun 13, 2008
    #6
  7. Phillip B Oldham a écrit :
    Ok, since you asked for it, let's go:
    Might be better to allow the user to pass source and destination as
    arguments, defaulting to stdin and stdout.

    Also, you may want to have a look at the csv module in the stdlib.
    You're doing way to may calls to line.strip(). Call it once and store
    the result.

    def trim_test(line):
    line = line.strip()
    if not line:
    return False
    for test in ("WHOIS", ">>>", "%",):
    if line.startswith(test):
    return False
    return True

    def trim(txt):
    lines = []
    for line in txt.split.splitlines():
    if trim_test(line):
    if line.starstwith("--"):
    return "".join(lines)
    lines.append(" " + line)
    return "\n".join(lines)


    Would be better to extract the regex compilation out of the function.
    If you don't use the match object itself, don't ever bother to bind it:

    for line in txt.split("\n"):
    if not isok(line):
    continue
    x.append(line)

    Then, you may find the intent and flow most obvious if you get rid of
    the double negation (the not and the continue):

    for line in txt.splitlines():
    if isok(line):
    x.append(line)

    which is easy to rewrite as a either a list comprehension:

    x = [line for line in txt.splitlines() if isok(line)]

    or in a more lispish/functional style:

    x = filter(isok, txt.splitlines())

    In both way, you now can get rid of the binding to 'x' (a very bad name
    for a list of lines BTW - what about something more explicit, like
    'lines' ?)
    isok = re.compile("^\s?([^:]+): ").match

    def clean(txt):
    return "\n".join(filter(isok, txt.splitlines()))

    Given the following, this above statement is useless.
    All this could probably be simplified.
    details = [''] * 9
    Ok, let's summarize. On each iteration, you define a dict with the very
    same 21 key:value pairs. Isn't it a bit wasteful ? What about defining
    the dict only once, outside the function ?

    Also, the values in the dict are constant functions. Why not just use
    the constant results of the functions then ? I mean : what's wrong with
    just :

    {
    "registrant": 1,
    "registrant name": 1,
    "registrant type": 4,
    (etc...)
    }

    As an icing on the cake, you build this whole dict, look up a function
    in it, an call the function *before* you even decide if you need that
    result.
    Friendly advice : *never* use a bare except clause that discards the
    exception. Never ever do that.

    Your except clause here should specifically catch KeyError. But anyway
    you don't ever need to worry about exceptions here, you just have to use
    dict.get(key, default) instead.


    FIELDS_POSITIONS = {
    "registrant": 1,
    "registrant name": 1,
    "registrant type": 4,
    "registrant's address": 2,
    (etc...)
    }

    def record(domain, rec):
    details = [domain.lower()] + [''] * 8
    for k, v in record.items():
    if v:
    pos = FIELDS_POSITIONS.get(k.lower(), None)
    if pos is not None:
    details[pos] = v

    # I'm leaving this here, but I'd personnaly split the
    # two unrelated concerns of formatting the record and
    # writing it somewhere.

    dest.write(sep.join(details)+"\n")

    Since the domain is very unlikely to match more than one test, at least
    use if/elif/.../else to avoid redundant useless tests.

    Now *this* would have been a good use of a dict of functions:


    REC_CLEANERS = {
    '.net' : clean_net,
    '.com' : clean_com,
    '.tv' : clean_net,
    '.uk' : clean_co_uk,
    (etc...)
    }

    for domain in rec:
    # code here
    ext = domain.rsplit('.', 1)[1]
    cleaner = REC_CLEANERS.get(ext, None)
    if cleaner:
    rec = cleaner(rec)
    if you expect only one ': ', then:
    a, b = line.split(': ')

    if you can have many but don't care about the others:
    bits = line.split(': ')
    a, b = bits[0], bits[1]
    cf above. Please, *don't* do that.
    There are other possible improvements of course. Like:

    - putting the main loop in it's own function taking source and dest (two
    opened (resp in 'r' and 'w' mode) filelike objects)
    - conditionnally call it from the top-level *if* the script has been
    called as a script (vs imported as a module) so you can reuse this code
    from another script.

    The test is:

    if __name__ == '__main__':
    # has been called as a script
    else:
    # has been imported

    HTH
     
    Bruno Desthuilliers, Jun 13, 2008
    #7
  8. Phillip B Oldham

    Aidan Guest

    slicing is a very handy feature... I'll expand on it a little

    OK so, first I'll create a sequence of integers
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

    "take element with index 4 and everything after it"
    [4, 5, 6, 7, 8, 9]

    "take everything up to, but not including, the element with index 4"
    [0, 1, 2, 3]

    "take the element with index 3 and everything up to, but not including,
    the element with index 6"
    [3, 4, 5]

    then there's the step argument

    "take every second element from the whole sequence"
    [0, 2, 4, 6, 8]

    "take every second element from the element with index 2 up to, but not
    including, the element with index 8"
    [2, 4, 6]


    Hope that helps.
     
    Aidan, Jun 13, 2008
    #8
  9. Good commentary. One small improvement:
    How about this?

    for domain in rec:
    # code here
    ext = domain.rsplit('.', 1)[1]
    rec = REC_CLEANERS.get(ext, lambda x: x)

    I suppose you could predefine the default function as well. This saves
    a binding and a test at the expense of a possible lambda call.
     
    D'Arcy J.M. Cain, Jun 13, 2008
    #9
  10. Phillip B Oldham

    Lie Guest

    Be careful with this, as python's string is immutable, this is ok, but
    if you're replicating a mutable item here, the result would be nasty.
     
    Lie, Jun 13, 2008
    #10
  11. Phillip B Oldham

    Lie Guest

    or:
    a, b = line.split(': ')[:1]
     
    Lie, Jun 13, 2008
    #11
  12. FWIW, the keys should not start with a '.'. My fault...
    Depends on if you want to know if there's a match or if you just don't
    care.
    Yeps. That's usually what I do when I end up using the above variant
    more than once in a module.
     
    bruno.desthuilliers, Jun 14, 2008
    #12
  13. ValueError: need more than 1 value to unpack

    You probably meant:

    a, b = line.split(': ')[:2]
     
    bruno.desthuilliers, Jun 14, 2008
    #13
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.