Comments on my first script?

Phillip B Oldham · Jun 12, 2008

I'm keen on learning python, with a heavy lean on doing things the
"pythonic" way, so threw the following script together in a few hours
as a first-attempt in programming python.

I'd like the community's thoughts/comments on what I've done;
improvements I can make, "don'ts" I should be avoiding, etc. I'm not
so much bothered about the resulting data - for the moment it meets my
needs. But any comment is welcome!

#!/usr/bin/env python
## Open a file containing a list of domains (1 per line),
## request and parse it's whois record and push to a csv
## file.

import subprocess
import re

src = open('./domains.txt')

dest = open('./whois.csv', 'w');

sep = "|"
headers = ["Domain","Registrant","Registrant's
Address","Registrar","Registrant Type","Date Registered","Renewal
Date","Last Updated","Name Servers"]

dest.write(sep.join(headers)+"\n")

def trim( txt ):
x = []
for line in txt.split("\n"):
if line.strip() == "":
continue
if line.strip().startswith('WHOIS'):
continue
if line.strip().startswith('>>>'):
continue
if line.strip().startswith('%'):
continue
if line.startswith("--"):
return ''.join(x)
x.append(" "+line)
return "\n".join(x)

def clean( txt ):
x = []
isok = re.compile("^\s?([^:]+): ").match
for line in txt.split("\n"):
match = isok(line)
if not match:
continue
x.append(line)
return "\n".join(x);

def clean_co_uk( rec ):
rec = rec.replace('Company number:', 'Company number -')
rec = rec.replace("\n\n", "\n")
rec = rec.replace("\n", "")
rec = rec.replace(": ", ":\n")
rec = re.sub("([^(][a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
rec = rec.replace(":\n", ": ")
rec = re.sub("^[ ]+\n", "", rec)
return rec

def clean_net( rec ):
rec = rec.replace("\n\n", "\n")
rec = rec.replace("\n", "")
rec = rec.replace(": ", ":\n")
rec = re.sub("([a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
rec = rec.replace(":\n", ": ")
return rec

def clean_info( rec ):
x = []
for line in rec.split("\n"):
x.append(re.sub("^([^:]+):", "\g<0> ", line))
return "\n".join(x)

def record(domain, record):
details = ['','','','','','','','','']
for k, v in record.items():
try:
details[0] = domain.lower()
result = {
"registrant": lambda: 1,
"registrant name": lambda: 1,
"registrant type": lambda: 4,
"registrant's address": lambda: 2,
"registrant address1": lambda: 2,
"registrar": lambda: 3,
"sponsoring registrar": lambda: 3,
"registered on": lambda: 5,
"registered": lambda: 5,
"domain registeration date": lambda: 5,
"renewal date": lambda: 6,
"last updated": lambda: 7,
"domain last updated date": lambda: 7,
"name servers": lambda: 8,
"name server": lambda: 8,
"nameservers": lambda: 8,
"updated date": lambda: 7,
"creation date": lambda: 5,
"expiration date": lambda: 6,
"domain expiration date": lambda: 6,
"administrative contact": lambda: 2
}[k.lower()]()
if v != '':
details[result] = v
except:
continue

dest.write(sep.join(details)+"\n")

## Loop through domains
for domain in src:

domain = domain.strip()

if domain == '':
continue

rec = subprocess.Popen(["whois",domain],
stdout=subprocess.PIPE).communicate()[0]

if rec.startswith("No whois server") == True:
continue

if rec.startswith("This TLD has no whois server") == True:
continue

rec = trim(rec)

if domain.endswith(".net"):
rec = clean_net(rec)

if domain.endswith(".com"):
rec = clean_net(rec)

if domain.endswith(".tv"):
rec = clean_net(rec)

if domain.endswith(".co.uk"):
rec = clean_co_uk(rec)

if domain.endswith(".info"):
rec = clean_info(rec)

rec = clean(rec)

details = {}

try:
for line in rec.split("\n"):
bits = line.split(': ')
a = bits.pop(0)
b = bits.pop(0)
details[a.strip()] = b.strip().replace("\t", ", ")
except:
continue

record(domain, details)

## Cleanup
src.close()
dest.close()

John Salerno · Jun 12, 2008

Phillip B Oldham said:
I'd like the community's thoughts/comments on what I've done;
improvements I can make, "don'ts" I should be avoiding, etc. I'm not
so much bothered about the resulting data - for the moment it meets my
needs. But any comment is welcome!

I'm not expert, but here are a few thoughts. I hope they help.

#!/usr/bin/env python
## Open a file containing a list of domains (1 per line),
## request and parse it's whois record and push to a csv
## file.

You might want to look into doc strings as a method of providing longer
documentation like this about what your program does.

dest = open('./whois.csv', 'w');

Semicolon!!!!

def trim( txt ):
x = []
for line in txt.split("\n"):
if line.strip() == "":
continue
if line.strip().startswith('WHOIS'):
continue
if line.strip().startswith('>>>'):
continue
if line.strip().startswith('%'):
continue
if line.startswith("--"):
return ''.join(x)

Is all this properly indented? One thing you can do is put each of these on
one line, since they are fairly simple:

if line.strip().startswith('WHOIS'): continue

although I still like proper indentation. But you have a lot of them so it
might save a good amount of space to do it this way.

Also, just my personal preference, I like to be consistent with the type of
quotes I use for strings. Here, you mix both single and double quotes on
different lines.

return "\n".join(x);

Semicolon!!!!

details = ['','','','','','','','','']

I don't have Python available to me right now, but I think you can do this
instead:

details = [''] * 9

except:
continue

Non-specific except clauses usually aren't preferred since they catch
everything, even something you might not want to catch.

if domain == '':
continue

You can say:

if not domain

instead of that equivalence test. But what does this if statement do?

if rec.startswith("No whois server") == True:
continue

if rec.startswith("This TLD has no whois server") == True:
continue

Like above, you don't need "== True" here.

if domain.endswith(".net"):
rec = clean_net(rec)

if domain.endswith(".com"):
rec = clean_net(rec)

if domain.endswith(".tv"):
rec = clean_net(rec)

if domain.endswith(".co.uk"):
rec = clean_co_uk(rec)

if domain.endswith(".info"):
rec = clean_info(rec)

Hmm, my first thought is to do something like this with all these if tests:

for extension in [<list all the extensions as strings here>]:
rec = clean_net(extension)

But for that to work, you may need to generalize the clean_net function so
it works for all of them, instead of having to call different functions
depending on the extension.

Anyway, I hope some of that helps!

John Salerno · Jun 12, 2008

John Salerno said:
if domain.endswith(".net"):
rec = clean_net(rec)

if domain.endswith(".com"):
rec = clean_net(rec)

if domain.endswith(".tv"):
rec = clean_net(rec)

if domain.endswith(".co.uk"):
rec = clean_co_uk(rec)

if domain.endswith(".info"):
rec = clean_info(rec)

Click to expand...

Hmm, my first thought is to do something like this with all these if
tests:

for extension in [<list all the extensions as strings here>]:
rec = clean_net(extension)

Whoops, you'd still need an if test in there I suppose!

for extension in [<list all the extensions as strings here>]:
if domain.endswith(extension):
rec = clean_net(extension)

Not sure if this is ideal.

Chris · Jun 12, 2008

I'm keen on learning python, with a heavy lean on doing things the
"pythonic" way, so threw the following script together in a few hours
as a first-attempt in programming python.

I'd like the community's thoughts/comments on what I've done;
improvements I can make, "don'ts" I should be avoiding, etc. I'm not
so much bothered about the resulting data - for the moment it meets my
needs. But any comment is welcome!

#!/usr/bin/env python
## Open a file containing a list of domains (1 per line),
## request and parse it's whois record and push to a csv
## file.

import subprocess
import re

src = open('./domains.txt')

dest = open('./whois.csv', 'w');

sep = "|"
headers = ["Domain","Registrant","Registrant's
Address","Registrar","Registrant Type","Date Registered","Renewal
Date","Last Updated","Name Servers"]

dest.write(sep.join(headers)+"\n")

def trim( txt ):
x = []
for line in txt.split("\n"):
if line.strip() == "":
continue
if line.strip().startswith('WHOIS'):
continue
if line.strip().startswith('>>>'):
continue
if line.strip().startswith('%'):
continue
if line.startswith("--"):
return ''.join(x)
x.append(" "+line)
return "\n".join(x)

def clean( txt ):
x = []
isok = re.compile("^\s?([^:]+): ").match
for line in txt.split("\n"):
match = isok(line)
if not match:
continue
x.append(line)
return "\n".join(x);

def clean_co_uk( rec ):
rec = rec.replace('Company number:', 'Company number -')
rec = rec.replace("\n\n", "\n")
rec = rec.replace("\n", "")
rec = rec.replace(": ", ":\n")
rec = re.sub("([^(][a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
rec = rec.replace(":\n", ": ")
rec = re.sub("^[ ]+\n", "", rec)
return rec

def clean_net( rec ):
rec = rec.replace("\n\n", "\n")
rec = rec.replace("\n", "")
rec = rec.replace(": ", ":\n")
rec = re.sub("([a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
rec = rec.replace(":\n", ": ")
return rec

def clean_info( rec ):
x = []
for line in rec.split("\n"):
x.append(re.sub("^([^:]+):", "\g<0> ", line))
return "\n".join(x)

def record(domain, record):
details = ['','','','','','','','','']
for k, v in record.items():
try:
details[0] = domain.lower()
result = {
"registrant": lambda: 1,
"registrant name": lambda: 1,
"registrant type": lambda: 4,
"registrant's address": lambda: 2,
"registrant address1": lambda: 2,
"registrar": lambda: 3,
"sponsoring registrar": lambda: 3,
"registered on": lambda: 5,
"registered": lambda: 5,
"domain registeration date": lambda: 5,
"renewal date": lambda: 6,
"last updated": lambda: 7,
"domain last updated date": lambda: 7,
"name servers": lambda: 8,
"name server": lambda: 8,
"nameservers": lambda: 8,
"updated date": lambda: 7,
"creation date": lambda: 5,
"expiration date": lambda: 6,
"domain expiration date": lambda: 6,
"administrative contact": lambda: 2
}[k.lower()]()
if v != '':
details[result] = v
except:
continue

dest.write(sep.join(details)+"\n")

## Loop through domains
for domain in src:

domain = domain.strip()

if domain == '':
continue

rec = subprocess.Popen(["whois",domain],
stdout=subprocess.PIPE).communicate()[0]

if rec.startswith("No whois server") == True:
continue

if rec.startswith("This TLD has no whois server") == True:
continue

rec = trim(rec)

if domain.endswith(".net"):
rec = clean_net(rec)

if domain.endswith(".com"):
rec = clean_net(rec)

if domain.endswith(".tv"):
rec = clean_net(rec)

if domain.endswith(".co.uk"):
rec = clean_co_uk(rec)

if domain.endswith(".info"):
rec = clean_info(rec)

rec = clean(rec)

details = {}

try:
for line in rec.split("\n"):
bits = line.split(': ')
a = bits.pop(0)
b = bits.pop(0)
details[a.strip()] = b.strip().replace("\t", ", ")
except:
continue

record(domain, details)

## Cleanup
src.close()
dest.close()

Just a few quick things before I leave work.

#!/usr/bin/env python
"""Open a file containing a list of domains (1 per line),
request and parse it's whois record and push to a csv
file.
""" # Rather use docstrings than multiline commenting like that.

def trim(txt):
x = []
for line in txt.splitlines(): # Strings have a built in function
if not line.strip() or line.startswith('WHOIS') \
or line.startswith('>>>') or line.startswith('%'):
continue # you can do them in one if statement
if line.startswith('--'): return ''.join(x)
x.append(' '+line)
return '\n'.join(x)

for domain in src:
if not domain.strip(): continue # A line with nothing is False

rec = subprocess.Popen(["whois",domain.strip()],
stdout=subprocess.PIPE).communicate()[0]
if rec.startswith('No whois server') \
or rec.startswith('This TLD has no whois server'):
continue # Startswith will return True/False so it is enough

rec = trim(rec)
if domain.endswith('.net'):
rec = clean_net(rec)
elif domain.endswith('.com'):
# Rather use if/elif statements unless somehow you think you
will match more than one.
....

for line in rec.splitlines():
try:
a, b = line.split(': ')[:2]
details[a.strip()] = b.strip().replace('\t', ', ')
except IndexError: # No matches
continue

Hope that's a start.

Phillip B Oldham · Jun 13, 2008

Thanks guys. Those comments are really helpful. The odd semi-colon is
my PHP background. Will probably be a hard habbit to break, that
one!

If I do accidentally drop a semi-colon at the end of the line,
will that cause any weird errors?

Also, Chris, can you explain this:
a, b = line.split(': ')[:2]

I understand the first section, but I've not seen [:2] before.

Chris · Jun 13, 2008

Thanks guys. Those comments are really helpful. The odd semi-colon is
my PHP background. Will probably be a hard habbit to break, that
one! If I do accidentally drop a semi-colon at the end of the line,
will that cause any weird errors?

Also, Chris, can you explain this:
a, b = line.split(': ')[:2]

I understand the first section, but I've not seen [:2] before.

That's slicing at work. What it is doing is only taking the first two
elements of the list that is built by the line.split.

Bruno Desthuilliers · Jun 13, 2008

Phillip B Oldham a écrit :

I'm keen on learning python, with a heavy lean on doing things the
"pythonic" way, so threw the following script together in a few hours
as a first-attempt in programming python.

I'd like the community's thoughts/comments on what I've done;
improvements I can make, "don'ts" I should be avoiding, etc. I'm not
so much bothered about the resulting data - for the moment it meets my
needs. But any comment is welcome!

Ok, since you asked for it, let's go:

#!/usr/bin/env python
## Open a file containing a list of domains (1 per line),
## request and parse it's whois record and push to a csv
## file.

import subprocess
import re

src = open('./domains.txt')

dest = open('./whois.csv', 'w');

Might be better to allow the user to pass source and destination as
arguments, defaulting to stdin and stdout.

Also, you may want to have a look at the csv module in the stdlib.

sep = "|"
headers = ["Domain","Registrant","Registrant's
Address","Registrar","Registrant Type","Date Registered","Renewal
Date","Last Updated","Name Servers"]

dest.write(sep.join(headers)+"\n")

Click to expand...

def trim( txt ):
x = []
for line in txt.split("\n"):
if line.strip() == "":
continue
if line.strip().startswith('WHOIS'):
continue
if line.strip().startswith('>>>'):
continue
if line.strip().startswith('%'):
continue
if line.startswith("--"):
return ''.join(x)
x.append(" "+line)
return "\n".join(x)

You're doing way to may calls to line.strip(). Call it once and store
the result.

def trim_test(line):
line = line.strip()
if not line:
return False
for test in ("WHOIS", ">>>", "%",):
if line.startswith(test):
return False
return True

def trim(txt):
lines = []
for line in txt.split.splitlines():
if trim_test(line):
if line.starstwith("--"):
return "".join(lines)
lines.append(" " + line)
return "\n".join(lines)

def clean( txt ):
x = []
isok = re.compile("^\s?([^:]+): ").match

Would be better to extract the regex compilation out of the function.

for line in txt.split("\n"):
match = isok(line)
if not match:
continue
x.append(line)

If you don't use the match object itself, don't ever bother to bind it:

for line in txt.split("\n"):
if not isok(line):
continue
x.append(line)

Then, you may find the intent and flow most obvious if you get rid of
the double negation (the not and the continue):

for line in txt.splitlines():
if isok(line):
x.append(line)

which is easy to rewrite as a either a list comprehension:

x = [line for line in txt.splitlines() if isok(line)]

or in a more lispish/functional style:

x = filter(isok, txt.splitlines())

In both way, you now can get rid of the binding to 'x' (a very bad name
for a list of lines BTW - what about something more explicit, like
'lines' ?)

return "\n".join(x);

isok = re.compile("^\s?([^:]+): ").match

def clean(txt):
return "\n".join(filter(isok, txt.splitlines()))

def clean_co_uk( rec ):
rec = rec.replace('Company number:', 'Company number -')
rec = rec.replace("\n\n", "\n")

Given the following, this above statement is useless.

rec = rec.replace("\n", "")

rec = rec.replace(": ", ":\n")
rec = re.sub("([^(][a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
rec = rec.replace(":\n", ": ")
rec = re.sub("^[ ]+\n", "", rec)

All this could probably be simplified.

return rec

def clean_net( rec ):
rec = rec.replace("\n\n", "\n")
rec = rec.replace("\n", "")
rec = rec.replace(": ", ":\n")
rec = re.sub("([a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
rec = rec.replace(":\n", ": ")
return rec
Idem.

def clean_info( rec ):
x = []
for line in rec.split("\n"):
x.append(re.sub("^([^:]+):", "\g<0> ", line))
return "\n".join(x)

def record(domain, record):
details = ['','','','','','','','','']

details = [''] * 9

for k, v in record.items():
try:
details[0] = domain.lower()
result = {
"registrant": lambda: 1,
"registrant name": lambda: 1,
"registrant type": lambda: 4,
"registrant's address": lambda: 2,
"registrant address1": lambda: 2,
"registrar": lambda: 3,
"sponsoring registrar": lambda: 3,
"registered on": lambda: 5,
"registered": lambda: 5,
"domain registeration date": lambda: 5,
"renewal date": lambda: 6,
"last updated": lambda: 7,
"domain last updated date": lambda: 7,
"name servers": lambda: 8,
"name server": lambda: 8,
"nameservers": lambda: 8,
"updated date": lambda: 7,
"creation date": lambda: 5,
"expiration date": lambda: 6,
"domain expiration date": lambda: 6,
"administrative contact": lambda: 2
}[k.lower()]()

Ok, let's summarize. On each iteration, you define a dict with the very
same 21 key:value pairs. Isn't it a bit wasteful ? What about defining
the dict only once, outside the function ?

Also, the values in the dict are constant functions. Why not just use
the constant results of the functions then ? I mean : what's wrong with
just :

{
"registrant": 1,
"registrant name": 1,
"registrant type": 4,
(etc...)
}

if v != '':
details[result] = v

As an icing on the cake, you build this whole dict, look up a function
in it, an call the function *before* you even decide if you need that
result.

except:
> continue

Friendly advice : *never* use a bare except clause that discards the
exception. Never ever do that.

Your except clause here should specifically catch KeyError. But anyway
you don't ever need to worry about exceptions here, you just have to use
dict.get(key, default) instead.

FIELDS_POSITIONS = {
"registrant": 1,
"registrant name": 1,
"registrant type": 4,
"registrant's address": 2,
(etc...)
}

def record(domain, rec):
details = [domain.lower()] + [''] * 8
for k, v in record.items():
if v:
pos = FIELDS_POSITIONS.get(k.lower(), None)
if pos is not None:
details[pos] = v

# I'm leaving this here, but I'd personnaly split the
# two unrelated concerns of formatting the record and
# writing it somewhere.

dest.write(sep.join(details)+"\n")

## Loop through domains
for domain in src:

domain = domain.strip()

if domain == '':
continue

rec = subprocess.Popen(["whois",domain],
stdout=subprocess.PIPE).communicate()[0]

if rec.startswith("No whois server") == True:
continue

if rec.startswith("This TLD has no whois server") == True:
continue

rec = trim(rec)

if domain.endswith(".net"):
rec = clean_net(rec)

if domain.endswith(".com"):
rec = clean_net(rec)

if domain.endswith(".tv"):
rec = clean_net(rec)

if domain.endswith(".co.uk"):
rec = clean_co_uk(rec)

if domain.endswith(".info"):
rec = clean_info(rec)

Since the domain is very unlikely to match more than one test, at least
use if/elif/.../else to avoid redundant useless tests.

Now *this* would have been a good use of a dict of functions:

REC_CLEANERS = {
'.net' : clean_net,
'.com' : clean_com,
'.tv' : clean_net,
'.uk' : clean_co_uk,
(etc...)
}

for domain in rec:
# code here
ext = domain.rsplit('.', 1)[1]
cleaner = REC_CLEANERS.get(ext, None)
if cleaner:
rec = cleaner(rec)

rec = clean(rec)

details = {}

try:
for line in rec.split("\n"):
bits = line.split(': ')
a = bits.pop(0)
b = bits.pop(0)

if you expect only one ': ', then:
a, b = line.split(': ')

if you can have many but don't care about the others:
bits = line.split(': ')
a, b = bits[0], bits[1]

details[a.strip()] = b.strip().replace("\t", ", ")
except:

cf above. Please, *don't* do that.

continue

record(domain, details)

## Cleanup
src.close()
dest.close()

There are other possible improvements of course. Like:

- putting the main loop in it's own function taking source and dest (two
opened (resp in 'r' and 'w' mode) filelike objects)
- conditionnally call it from the top-level *if* the script has been
called as a script (vs imported as a module) so you can reuse this code
from another script.

The test is:

if __name__ == '__main__':
# has been called as a script
else:
# has been imported

HTH

Aidan · Jun 13, 2008

Chris said:
Thanks guys. Those comments are really helpful. The odd semi-colon is
my PHP background. Will probably be a hard habbit to break, that
one! If I do accidentally drop a semi-colon at the end of the line,
will that cause any weird errors?

Also, Chris, can you explain this:
a, b = line.split(': ')[:2]

I understand the first section, but I've not seen [:2] before.

Click to expand...

That's slicing at work. What it is doing is only taking the first two
elements of the list that is built by the line.split.

slicing is a very handy feature... I'll expand on it a little

OK so, first I'll create a sequence of integers
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

"take element with index 4 and everything after it"

>>> seq[4:]

Click to expand...

Click to expand...

[4, 5, 6, 7, 8, 9]

"take everything up to, but not including, the element with index 4"

>>> seq[:4]

Click to expand...

Click to expand...

[0, 1, 2, 3]

"take the element with index 3 and everything up to, but not including,
the element with index 6"

>>> seq[3:6]

Click to expand...

Click to expand...

[3, 4, 5]

then there's the step argument

"take every second element from the whole sequence"

>>> seq[::2]

Click to expand...

Click to expand...

[0, 2, 4, 6, 8]

"take every second element from the element with index 2 up to, but not
including, the element with index 8"

>>> seq[2:8:2]

Click to expand...

Click to expand...

[2, 4, 6]

Hope that helps.

D'Arcy J.M. Cain · Jun 13, 2008

Ok, since you asked for it, let's go:

Good commentary. One small improvement:

REC_CLEANERS = {
'.net' : clean_net,
'.com' : clean_com,
'.tv' : clean_net,
'.uk' : clean_co_uk,
(etc...)
}

for domain in rec:
# code here
ext = domain.rsplit('.', 1)[1]
cleaner = REC_CLEANERS.get(ext, None)
if cleaner:
rec = cleaner(rec)

How about this?

for domain in rec:
# code here
ext = domain.rsplit('.', 1)[1]
rec = REC_CLEANERS.get(ext, lambda x: x)

I suppose you could predefine the default function as well. This saves
a binding and a test at the expense of a possible lambda call.

Lie · Jun 13, 2008

Phillip B Oldham said:
Phillip B Oldham said:

I'd like the community's thoughts/comments on what I've done;
improvements I can make, "don'ts" I should be avoiding, etc. I'm not
so much bothered about the resulting data - for the moment it meets my
needs. But any comment is welcome!

Click to expand...

I'm not expert, but here are a few thoughts. I hope they help.

#!/usr/bin/env python
## Open a file containing a list of domains (1 per line),
## request and parse it's whois record and push to a csv
## file.

Click to expand...

You might want to look into doc strings as a method of providing longer
documentation like this about what your program does.

dest = open('./whois.csv', 'w');

Click to expand...

Semicolon!!!!

def trim( txt ):
x = []
for line in txt.split("\n"):
if line.strip() == "":
continue
if line.strip().startswith('WHOIS'):
continue
if line.strip().startswith('>>>'):
continue
if line.strip().startswith('%'):
continue
if line.startswith("--"):
return ''.join(x)

Click to expand...

Is all this properly indented? One thing you can do is put each of these on
one line, since they are fairly simple:

if line.strip().startswith('WHOIS'): continue

although I still like proper indentation. But you have a lot of them so it
might save a good amount of space to do it this way.

Also, just my personal preference, I like to be consistent with the type of
quotes I use for strings. Here, you mix both single and double quotes on
different lines.

return "\n".join(x);

Click to expand...

Semicolon!!!!

details = ['','','','','','','','','']

Click to expand...

I don't have Python available to me right now, but I think you can do this
instead:

details = [''] * 9

Be careful with this, as python's string is immutable, this is ok, but
if you're replicating a mutable item here, the result would be nasty.

except:
continue

Click to expand...

Non-specific except clauses usually aren't preferred since they catch
everything, even something you might not want to catch.

if domain == '':
continue

Click to expand...

You can say:

if not domain

instead of that equivalence test. But what does this if statement do?

if rec.startswith("No whois server") == True:
continue

Click to expand...

if rec.startswith("This TLD has no whois server") == True:
continue

Click to expand...

Like above, you don't need "== True" here.

if domain.endswith(".net"):
rec = clean_net(rec)

Click to expand...

if domain.endswith(".com"):
rec = clean_net(rec)

Click to expand...

if domain.endswith(".tv"):
rec = clean_net(rec)

Click to expand...

if domain.endswith(".co.uk"):
rec = clean_co_uk(rec)

Click to expand...

if domain.endswith(".info"):
rec = clean_info(rec)

Click to expand...

Hmm, my first thought is to do something like this with all these if tests:

for extension in [<list all the extensions as strings here>]:
rec = clean_net(extension)

But for that to work, you may need to generalize the clean_net function so
it works for all of them, instead of having to call different functions
depending on the extension.

Anyway, I hope some of that helps!

Lie · Jun 13, 2008

Phillip B Oldham a écrit :

I'm keen on learning python, with a heavy lean on doing things the
"pythonic" way, so threw the following script together in a few hours
as a first-attempt in programming python.

Click to expand...

I'd like the community's thoughts/comments on what I've done;
improvements I can make, "don'ts" I should be avoiding, etc. I'm not
so much bothered about the resulting data - for the moment it meets my
needs. But any comment is welcome!

Click to expand...

Ok, since you asked for it, let's go:

#!/usr/bin/env python
## Open a file containing a list of domains (1 per line),
## request and parse it's whois record and push to a csv
## file.

Click to expand...

import subprocess
import re

Click to expand...

src = open('./domains.txt')

Click to expand...

dest = open('./whois.csv', 'w');

Click to expand...

Might be better to allow the user to pass source and destination as
arguments, defaulting to stdin and stdout.

Also, you may want to have a look at the csv module in the stdlib.

sep = "|"
headers = ["Domain","Registrant","Registrant's
Address","Registrar","Registrant Type","Date Registered","Renewal
Date","Last Updated","Name Servers"]
dest.write(sep.join(headers)+"\n")
def trim( txt ):
x = []
for line in txt.split("\n"):
if line.strip() == "":
continue
if line.strip().startswith('WHOIS'):
continue
if line.strip().startswith('>>>'):
continue
if line.strip().startswith('%'):
continue
if line.startswith("--"):
return ''.join(x)
x.append(" "+line)
return "\n".join(x)

Click to expand...

You're doing way to may calls to line.strip(). Call it once and store
the result.

def trim_test(line):
line = line.strip()
if not line:
return False
for test in ("WHOIS", ">>>", "%",):
if line.startswith(test):
return False
return True

def trim(txt):
lines = []
for line in txt.split.splitlines():
if trim_test(line):
if line.starstwith("--"):
return "".join(lines)
lines.append(" " + line)
return "\n".join(lines)

def clean( txt ):
x = []
isok = re.compile("^\s?([^:]+): ").match

Click to expand...

Would be better to extract the regex compilation out of the function.

for line in txt.split("\n"):
match = isok(line)
if not match:
continue
x.append(line)

Click to expand...

If you don't use the match object itself, don't ever bother to bind it:

for line in txt.split("\n"):
if not isok(line):
continue
x.append(line)

Then, you may find the intent and flow most obvious if you get rid of
the double negation (the not and the continue):

for line in txt.splitlines():
if isok(line):
x.append(line)

which is easy to rewrite as a either a list comprehension:

x = [line for line in txt.splitlines() if isok(line)]

or in a more lispish/functional style:

x = filter(isok, txt.splitlines())

In both way, you now can get rid of the binding to 'x' (a very bad name
for a list of lines BTW - what about something more explicit, like
'lines' ?)

return "\n".join(x);

Click to expand...

isok = re.compile("^\s?([^:]+): ").match

def clean(txt):
return "\n".join(filter(isok, txt.splitlines()))

def clean_co_uk( rec ):
rec = rec.replace('Company number:', 'Company number -')
rec = rec.replace("\n\n", "\n")

Click to expand...

Given the following, this above statement is useless.

rec = rec.replace("\n", "")
rec = rec.replace(": ", ":\n")
rec = re.sub("([^(][a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
rec = rec.replace(":\n", ": ")
rec = re.sub("^[ ]+\n", "", rec)

Click to expand...

All this could probably be simplified.

return rec

Click to expand...

def clean_net( rec ):
rec = rec.replace("\n\n", "\n")
rec = rec.replace("\n", "")
rec = rec.replace(": ", ":\n")
rec = re.sub("([a-zA-Z']+\s?[a-zA-Z]*:\n)", "\n\g<0>", rec)
rec = rec.replace(":\n", ": ")
return rec
Idem.

def clean_info( rec ):
x = []
for line in rec.split("\n"):
x.append(re.sub("^([^:]+):", "\g<0> ", line))
return "\n".join(x)

Click to expand...

def record(domain, record):
details = ['','','','','','','','','']

Click to expand...

details = [''] * 9

for k, v in record.items():
try:
details[0] = domain.lower()
result = {
"registrant": lambda: 1,
"registrant name": lambda: 1,
"registrant type": lambda: 4,
"registrant's address": lambda: 2,
"registrant address1": lambda: 2,
"registrar": lambda: 3,
"sponsoring registrar": lambda: 3,
"registered on": lambda: 5,
"registered": lambda: 5,
"domain registeration date": lambda: 5,
"renewal date": lambda: 6,
"last updated": lambda: 7,
"domain last updated date": lambda: 7,
"name servers": lambda: 8,
"name server": lambda: 8,
"nameservers": lambda: 8,
"updated date": lambda: 7,
"creation date": lambda: 5,
"expiration date": lambda: 6,
"domain expiration date": lambda: 6,
"administrative contact": lambda: 2
}[k.lower()]()

Click to expand...

Ok, let's summarize. On each iteration, you define a dict with the very
same 21 key:value pairs. Isn't it a bit wasteful ? What about defining
the dict only once, outside the function ?

Also, the values in the dict are constant functions. Why not just use
the constant results of the functions then ? I mean : what's wrong with
just :

{
"registrant": 1,
"registrant name": 1,
"registrant type": 4,
(etc...)

}

if v != '':
details[result] = v

Click to expand...

As an icing on the cake, you build this whole dict, look up a function
in it, an call the function *before* you even decide if you need that
result.

except:

Click to expand...

> continue

Friendly advice : *never* use a bare except clause that discards the
exception. Never ever do that.

Your except clause here should specifically catch KeyError. But anyway
you don't ever need to worry about exceptions here, you just have to use
dict.get(key, default) instead.

FIELDS_POSITIONS = {
"registrant": 1,
"registrant name": 1,
"registrant type": 4,
"registrant's address": 2,
(etc...)

}

def record(domain, rec):
details = [domain.lower()] + [''] * 8
for k, v in record.items():
if v:
pos = FIELDS_POSITIONS.get(k.lower(), None)
if pos is not None:
details[pos] = v

# I'm leaving this here, but I'd personnaly split the
# two unrelated concerns of formatting the record and
# writing it somewhere.

dest.write(sep.join(details)+"\n")

## Loop through domains
for domain in src:

Click to expand...

domain = domain.strip()

Click to expand...

if domain == '':
continue

Click to expand...

rec = subprocess.Popen(["whois",domain],
stdout=subprocess.PIPE).communicate()[0]

Click to expand...

if rec.startswith("No whois server") == True:
continue

Click to expand...

if rec.startswith("This TLD has no whois server") == True:
continue

Click to expand...

rec = trim(rec)

Click to expand...

if domain.endswith(".net"):
rec = clean_net(rec)

Click to expand...

if domain.endswith(".com"):
rec = clean_net(rec)

Click to expand...

if domain.endswith(".tv"):
rec = clean_net(rec)

Click to expand...

if domain.endswith(".co.uk"):
rec = clean_co_uk(rec)

Click to expand...

if domain.endswith(".info"):
rec = clean_info(rec)

Click to expand...

Since the domain is very unlikely to match more than one test, at least
use if/elif/.../else to avoid redundant useless tests.

Now *this* would have been a good use of a dict of functions:

REC_CLEANERS = {
'.net' : clean_net,
'.com' : clean_com,
'.tv' : clean_net,
'.uk' : clean_co_uk,
(etc...)

}

for domain in rec:
# code here
ext = domain.rsplit('.', 1)[1]
cleaner = REC_CLEANERS.get(ext, None)
if cleaner:
rec = cleaner(rec)

rec = clean(rec)

Click to expand...

details = {}

Click to expand...

try:
for line in rec.split("\n"):
bits = line.split(': ')
a = bits.pop(0)
b = bits.pop(0)

Click to expand...

if you expect only one ': ', then:
a, b = line.split(': ')

if you can have many but don't care about the others:
bits = line.split(': ')
a, b = bits[0], bits[1]

or:
a, b = line.split(': ')[:1]

details[a.strip()] = b.strip().replace("\t", ", ")
except:

Click to expand...

cf above. Please, *don't* do that.

continue

Click to expand...

record(domain, details)

Click to expand...

## Cleanup
src.close()
dest.close()

Click to expand...

There are other possible improvements of course. Like:

- putting the main loop in it's own function taking source and dest (two
opened (resp in 'r' and 'w' mode) filelike objects)
- conditionnally call it from the top-level *if* the script has been
called as a script (vs imported as a module) so you can reuse this code
from another script.

The test is:

if __name__ == '__main__':
# has been called as a script
else:
# has been imported

HTH

bruno.desthuilliers · Jun 14, 2008

Good commentary. One small improvement:

FWIW, the keys should not start with a '.'. My fault...

for domain in rec:
# code here
ext = domain.rsplit('.', 1)[1]
cleaner = REC_CLEANERS.get(ext, None)
if cleaner:
rec = cleaner(rec)

Click to expand...

How about this?

for domain in rec:
# code here
ext = domain.rsplit('.', 1)[1]
rec = REC_CLEANERS.get(ext, lambda x: x)

Depends on if you want to know if there's a match or if you just don't
care.

I suppose you could predefine the default function as well.

Yeps. That's usually what I do when I end up using the above variant
more than once in a module.

bruno.desthuilliers · Jun 14, 2008

Phillip B Oldham a écrit : (snip)

Click to expand...

if you expect only one ': ', then:
a, b = line.split(': ')

Click to expand...

if you can have many but don't care about the others:
bits = line.split(': ')
a, b = bits[0], bits[1]

Click to expand...

or:
a, b = line.split(': ')[:1]

ValueError: need more than 1 value to unpack

You probably meant:

a, b = line.split(': ')[:2]

Need help with this script	4	Mar 12, 2023
First Tkinter script: requesting comments	2	May 21, 2010
Help with my responsive home page	2	Dec 14, 2022
Q's on my first python script	8	May 10, 2009
Request for tips on my first python script.	6	Sep 8, 2006
brand new to python	2	Mar 13, 2005
Python script for MySQL Passwords Unreliable on first boot (rc.local)	1	Sep 13, 2010
My first Python program -- a lexer	25	Nov 8, 2008

Comments on my first script?

Phillip B Oldham

John Salerno

John Salerno

Chris

Phillip B Oldham

Chris

Bruno Desthuilliers

Aidan

D'Arcy J.M. Cain

Lie

Lie

bruno.desthuilliers

bruno.desthuilliers

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads