Refactoring; arbitrary expression in lists

F

Frans Englich

As continuation to a previous thread, "PyChecker messages", I have a question
regarding code refactoring which the following snippet leads to:
That is also advice. Generally you use a dict of functions, or some other
structure to lookup what you want to do.

More specifically, my function looks like this:

#--------------------------------------------------------------
def detectMimeType( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

if extension == "php":
return "application/x-php"

elif extension == "cpp" or extension.endswith("cc"):
return "text/x-c++-src"
# etcetera
elif extension == "xsl":
return "text/xsl"

elif basename.find( "Makefile" ) != -1:
return "text/x-makefile"
else:
raise NoMimeError
#--------------------------------------------------------------
(don't bother if the MIME detection looks like stone age, it's temporary until
PyXDG gets support for the XDG mime type spec..)

I'm now wondering if it's possible to write this in a more compact way, such
that the if-clause isn't necessary? Of course, the current code works, but
perhaps it could be prettier.

I'm thinking along the lines of nested lists, but what is the obstacle for me
is that both the test and return statement are simple expressions; not
functions or a particular data type. Any ideas?


Cheers,

Frans
 
P

Paul McGuire

Frans Englich said:
As continuation to a previous thread, "PyChecker messages", I have a question
regarding code refactoring which the following snippet leads to:
That is also advice. Generally you use a dict of functions, or some other
structure to lookup what you want to do.

More specifically, my function looks like this:

#--------------------------------------------------------------
def detectMimeType( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

if extension == "php":
return "application/x-php"

elif extension == "cpp" or extension.endswith("cc"):
return "text/x-c++-src"
# etcetera
<snip>

Since the majority of your tests will be fairly direct 'extension "XYZ"
means mimetype "aaa/bbb"', this really sounds like a dictionary type
solution is called for. Still, you might have some desire to do some
order-dependent testing. Here are two ideas - the first iterates over a
list of expressions and resulting types, the other uses a dictionary lookup.

-- Paul


import os

extToMimeMap = [
('"php"', "application/x-php"),
('"cpp" or extension.endswith("cc")', "text/x-c++-src"),
('"xsl"', "text/xsl"),
]

def detectMimeType1( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

for exp,mimetype in extToMimeMap:
if eval("extension=="+exp): return mimetype

# do other non-extension-related tests here
if basename.find( "Makefile" ) != -1:
return "text/x-makefile"
else:
raise NoMimeError


extToMimeDict = {
"php": "application/x-php",
"cpp": "text/x-c++-src",
"xsl": "text/xsl",
}

def detectMimeType2( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

# check for straight extension matches
try:
return extToMimeDict[extension]
except KeyError:
pass

# do more complex extension and other non-extension-related tests here
if extension.endswith("cc"):
return extToMimeDict["cpp"]

if basename.find( "Makefile" ) != -1:
return "text/x-makefile"

raise NoMimeError

for detectMimeType in (detectMimeType1, detectMimeType2):
for s in ("a.php","z.acc","Makefile","blork.xsl"):
print s,"->",detectMimeType(s)
 
W

wittempj

I can not break the original code in 2.4, if I try this:
import os, sys
class NoMimeError(Exception):
pass

def detectMimeType( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)
if extension == "php":
return "application/x-php"
elif extension == "cpp" or extension.endswith("cc"):
return "text/x-c++-src"
elif extension == "1":
return 'BlahBlah'
elif extension == "2":
return 'BlahBlah'
elif extension == "3":
return 'BlahBlah'
elif extension == "4":
return 'BlahBlah'
elif extension == "5":
return 'BlahBlah'
elif extension == "6":
return 'BlahBlah'
elif extension == "7":
return 'BlahBlah'
elif extension == "8":
return 'BlahBlah'
elif extension == "9":
return 'BlahBlah'
elif extension == "10":
return 'BlahBlah'
elif extension == "11":
return 'BlahBlah'
elif extension == "12":
return 'BlahBlah'
elif extension == "13":
return 'BlahBlah'
elif extension == "14":
return 'BlahBlah'
elif extension == "15":
return 'BlahBlah'
elif extension == "16":
return 'BlahBlah'
elif extension == "17":
return 'BlahBlah'
elif extension == "18":
return 'BlahBlah'
elif extension == "19":
return 'BlahBlah'
elif extension == "20":
return 'BlahBlah'

elif extension == "xsl":
return "text/xsl"

elif basename.find( "Makefile" ) != -1:
return "text/x-makefile"
else:
raise NoMimeError
try:
print detectMimeType(r'c:\test.php')
print detectMimeType('c:\test.xsl')
print detectMimeType('c:\test.xxx')
except Exception, e:
print >> sys.stderr, '%s: %s' %(e.__class__.__name__, e)

I get
application/x-php
text/xsl
NoMimeError:

So although the dictionary solution is much nicer nothing seems wrong
with your code as it is - or am I missing something?
 
F

Frans Englich

I can not break the original code in 2.4, if I try this:
[...]


So although the dictionary solution is much nicer nothing seems wrong
with your code as it is - or am I missing something?

Nope, the current code works. I'm just looking at Python's cool ways of
solving problems. (the matter about 11 returns was a coding-style report from
PyChecker).


Cheers,

Frans
 
J

Jeff Shannon

Paul said:
Frans Englich said:
#--------------------------------------------------------------
def detectMimeType( filename ):

extension = filename[-3:]

You might consider using os.path.splitext() here, instead of always
assuming that the last three characters are the extension. That way
you'll be consistent even with extensions like .c, .cc, .h, .gz, etc.

Note that os.path.splitext() does include the extension separator (the
dot), so that you'll need to test against, e.g., ".php" and ".cpp".
Since the majority of your tests will be fairly direct 'extension "XYZ"
means mimetype "aaa/bbb"', this really sounds like a dictionary type
solution is called for.

I strongly agree with this. The vast majority of your cases seem to
be a direct mapping of extension-string to mimetype-string; using a
dictionary (i.e. mapping ;) ) for this is ideal. For those cases
where you can't key off of an extension string (such as makefiles),
you can do special-case processing if the dictionary lookup fails.

if extension.endswith("cc"):
return extToMimeDict["cpp"]

If the intent of this is to catch .cc files, it's easy to add an extra
entry into the dict to map '.cc' to the same string as '.cpp'.

Jeff Shannon
Technician/Programmer
Credit International
 
B

Bengt Richter

As continuation to a previous thread, "PyChecker messages", I have a question
regarding code refactoring which the following snippet leads to:
That is also advice. Generally you use a dict of functions, or some other
structure to lookup what you want to do.

More specifically, my function looks like this:

#--------------------------------------------------------------
def detectMimeType( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

if extension == "php":
return "application/x-php"

elif extension == "cpp" or extension.endswith("cc"):
return "text/x-c++-src"
# etcetera
elif extension == "xsl":
return "text/xsl"

elif basename.find( "Makefile" ) != -1:
return "text/x-makefile"
else:
raise NoMimeError
#--------------------------------------------------------------
(don't bother if the MIME detection looks like stone age, it's temporary until
PyXDG gets support for the XDG mime type spec..)

I'm now wondering if it's possible to write this in a more compact way, such
that the if-clause isn't necessary? Of course, the current code works, but
perhaps it could be prettier.

I'm thinking along the lines of nested lists, but what is the obstacle for me
is that both the test and return statement are simple expressions; not
functions or a particular data type. Any ideas?
I think I would refactor along these lines: (untested)

extensiondict = dict(
php = 'application/x-php',
cpp = 'text/x-c-src',
# etcetera
xsl = 'test/xsl'
)

def detectMimeType(filename):
extension = os.path.splitext(filename)[1].replace('.', '')
try: return extensiondict[extension]
except KeyError:
basename = os.path.basename(filename)
if "Makefile" in basename: return 'text/x-makefile' # XXX case sensitivity?
raise NoMimeError

Regards,
Bengt Richter
 
S

Stephen Thorne

extensiondict = dict(
php = 'application/x-php',
cpp = 'text/x-c-src',
# etcetera
xsl = 'test/xsl'
)

def detectMimeType(filename):
extension = os.path.splitext(filename)[1].replace('.', '')
try: return extensiondict[extension]
except KeyError:
basename = os.path.basename(filename)
if "Makefile" in basename: return 'text/x-makefile' # XXX case sensitivity?
raise NoMimeError

Why not use a regexp based approach.
extensionlist = [
(re.compile(r'.*\.php') , "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
for regexp, mimetype in extensionlist:
if regexp.match(filename):
return mimetype

if you were really concerned about efficiency, you could use something like:
class SimpleMatch:
def __init__(self, pattern): self.pattern = pattern
def match(self, subject): return subject[-len(self.pattern):] == self.pattern

Regards,
Stephen Thorne
 
B

Bengt Richter

extensiondict = dict(
php = 'application/x-php',
cpp = 'text/x-c-src',
# etcetera
xsl = 'test/xsl'
)

def detectMimeType(filename):
extension = os.path.splitext(filename)[1].replace('.', '')
extension = os.path.splitext(filename)[1].replace('.', '').lower() # better
try: return extensiondict[extension]
except KeyError:
basename = os.path.basename(filename)
if "Makefile" in basename: return 'text/x-makefile' # XXX case sensitivity?
raise NoMimeError

Why not use a regexp based approach.
ISTM the dict setup closely reflects the OP's if/elif tests and makes for an efficient substitute
for the functionality when later used for lookup. The regex list is O(n) and the regexes themselves
are at least that, so I don't see a benefit. If you are going to loop through extensionlist, you
might as well write (untested)

flowerew = filename.lower().endswith
for ext, mimetype:
if flowerew(ext): return mimetype
else:
if 'makefile' in filename.lower(): return 'text/x-makefile'
raise NoMimeError

using a lower case extension list including the dot. I think it would run faster
than a regex, and not scare anyone unnecessarily ;-)

The dict eliminates the loop, and is easy to understand, so IMO it's a better choice.
extensionlist = [
(re.compile(r'.*\.php') , "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
for regexp, mimetype in extensionlist:
if regexp.match(filename):
return mimetype

if you were really concerned about efficiency, you could use something like:
class SimpleMatch:
def __init__(self, pattern): self.pattern = pattern
def match(self, subject): return subject[-len(self.pattern):] == self.pattern

I'm not clear on what you are doing here, but if you think you are going to compete
with the timbot's dict efficiency with a casual few lines, I suspect you are PUI ;-)
(Posting Under the Influence ;-)

Regards,
Bengt Richter
 
S

Stephen Thorne

extensiondict = dict(
php = 'application/x-php',
cpp = 'text/x-c-src',
# etcetera
xsl = 'test/xsl'
)

def detectMimeType(filename):
extension = os.path.splitext(filename)[1].replace('.', '')
extension = os.path.splitext(filename)[1].replace('.', '').lower() # better
try: return extensiondict[extension]
except KeyError:
basename = os.path.basename(filename)
if "Makefile" in basename: return 'text/x-makefile' # XXX case sensitivity?
raise NoMimeError

Why not use a regexp based approach.
ISTM the dict setup closely reflects the OP's if/elif tests and makes for an efficient substitute
for the functionality when later used for lookup. The regex list is O(n) and the regexes themselves
are at least that, so I don't see a benefit. If you are going to loop through extensionlist, you
might as well write (untested)
<code snipped>

*shrug*, O(n*m) actually, where n is the number of mime-types and m is
the length of the extension.
extensionlist = [
(re.compile(r'.*\.php') , "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
for regexp, mimetype in extensionlist:
if regexp.match(filename):
return mimetype

if you were really concerned about efficiency, you could use something like:
class SimpleMatch:
def __init__(self, pattern): self.pattern = pattern
def match(self, subject): return subject[-len(self.pattern):] == self.pattern

I'm not clear on what you are doing here, but if you think you are going to compete
with the timbot's dict efficiency with a casual few lines, I suspect you are PUI ;-)
(Posting Under the Influence ;-)

Sorry about that, what I was trying to say was something along the lines of:

extensionlist = [
(re.compile(r'.*\.php') , "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
can be made more efficient by doing something like this:
extensionlist = [
SimpleMatch(".php"), "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
Where SimpleMatch uses a slice and a comparison instead of a regular
expression engine. SimpleMatch and re.compile both return an object
that when you call .match(s) returns a value that can be interpreted
as a boolean.

As for the overall efficiency concerns, I feel that talking about any
of this is premature optimisation. The optimisation that is really
required in this situation is the same as with any
large-switch-statement idiom, be it C or Python. First one must do a
frequency analysis of the inputs to the switch statement in order to
discover the optimal order of tests!

Regards,
Stephen Thorne
 
N

Nick Craig-Wood

Stephen Thorne said:
Why not use a regexp based approach.

Good idea... You could also use sre.Scanner which is supposed to be
fast like this...

import re, sre

scanner = sre.Scanner([
(r"\.php$", "application/x-php"),
(r"\.(cc|cpp)$", "text/x-c++-src"),
(r"\.xsl$", "xsl"),
(r"Makefile", "text/x-makefile"),
(r".", None),
])

def detectMimeType( filename ):
t = scanner.scan(filename)[0]
if len(t) < 1:
return None
# raise NoMimeError
return t[0]


for f in ("index.php", "index.php3", "prog.cc", "prog.cpp", "flodge.xsl", "Makefile", "myMakefile", "potato.123"):
print f, detectMimeType(f)

....

prints

index.php application/x-php
index.php3 None
prog.cc text/x-c++-src
prog.cpp text/x-c++-src
flodge.xsl xsl
Makefile text/x-makefile
myMakefile text/x-makefile
potato.123 None
 
P

Paul McGuire

Despite the regexp alternatives, I refuse to post a pyparsing solution to
this! :)

-- Paul
 
J

Jeff Shannon

Stephen said:
As for the overall efficiency concerns, I feel that talking about any
of this is premature optimisation.

I disagree -- using REs is adding unnecessary complication and
dependency. Premature optimization is a matter of using a
conceptually more-complicated method when a simpler one would do; REs
are, in fairly simple cases such as this, clearly *not* simpler than
dict lookups.
The optimisation that is really
required in this situation is the same as with any
large-switch-statement idiom, be it C or Python. First one must do a
frequency analysis of the inputs to the switch statement in order to
discover the optimal order of tests!

But if you're using a dictionary lookup, then the frequency of inputs
is irrelevant. Regardless of the value of the input, you're doing a
single hash-compute and (barring hash collisions) a single
bucket-lookup. Since dicts are unordered, the ordering of the literal
(or of a set of statements adding to the dict) doesn't matter.

Jeff Shannon
Technician/Programmer
Credit International
 
?

=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=

# do other non-extension-related tests here
if basename.find( "Makefile" ) != -1:
return "text/x-makefile"

I believe this can be nicelier written as:

if "Makefile" in basename:
 
S

Steven Bethard

BJörn Lindqvist said:
I believe this can be nicelier written as:

if "Makefile" in basename:

+1 for "nicelier" as VOTW (Vocabulation of the week) =)

Steve
 
S

Steve Holden

Peter said:
Steven said:
BJörn Lindqvist wrote:
[...]
I believe this can be nicelier written as:

if "Makefile" in basename:

+1 for "nicelier" as VOTW (Vocabulation of the week) =)


Me too, because nicelier is nicer than more nicely. :)

Is that really the niceliest way to express that thought?

regards-li-er y'rs - steve
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,042
Latest member
icassiem

Latest Threads

Top