Refactoring; arbitrary expression in lists

Frans Englich · Jan 12, 2005

As continuation to a previous thread, "PyChecker messages", I have a question
regarding code refactoring which the following snippet leads to:

That is also advice. Generally you use a dict of functions, or some other
structure to lookup what you want to do.

More specifically, my function looks like this:

#--------------------------------------------------------------
def detectMimeType( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

if extension == "php":
return "application/x-php"

elif extension == "cpp" or extension.endswith("cc"):
return "text/x-c++-src"
# etcetera
elif extension == "xsl":
return "text/xsl"

elif basename.find( "Makefile" ) != -1:
return "text/x-makefile"
else:
raise NoMimeError
#--------------------------------------------------------------
(don't bother if the MIME detection looks like stone age, it's temporary until
PyXDG gets support for the XDG mime type spec..)

I'm now wondering if it's possible to write this in a more compact way, such
that the if-clause isn't necessary? Of course, the current code works, but
perhaps it could be prettier.

I'm thinking along the lines of nested lists, but what is the obstacle for me
is that both the test and return statement are simple expressions; not
functions or a particular data type. Any ideas?

Cheers,

Frans

Paul McGuire · Jan 12, 2005

Frans Englich said:
As continuation to a previous thread, "PyChecker messages", I have a question
regarding code refactoring which the following snippet leads to:

That is also advice. Generally you use a dict of functions, or some other
structure to lookup what you want to do.

Click to expand...

More specifically, my function looks like this:

#--------------------------------------------------------------
def detectMimeType( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

if extension == "php":
return "application/x-php"

elif extension == "cpp" or extension.endswith("cc"):
return "text/x-c++-src"
# etcetera

<snip>

Since the majority of your tests will be fairly direct 'extension "XYZ"
means mimetype "aaa/bbb"', this really sounds like a dictionary type
solution is called for. Still, you might have some desire to do some
order-dependent testing. Here are two ideas - the first iterates over a
list of expressions and resulting types, the other uses a dictionary lookup.

-- Paul

import os

extToMimeMap = [
('"php"', "application/x-php"),
('"cpp" or extension.endswith("cc")', "text/x-c++-src"),
('"xsl"', "text/xsl"),
]

def detectMimeType1( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

for exp,mimetype in extToMimeMap:
if eval("extension=="+exp): return mimetype

# do other non-extension-related tests here
if basename.find( "Makefile" ) != -1:
return "text/x-makefile"
else:
raise NoMimeError

extToMimeDict = {
"php": "application/x-php",
"cpp": "text/x-c++-src",
"xsl": "text/xsl",
}

def detectMimeType2( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

# check for straight extension matches
try:
return extToMimeDict[extension]
except KeyError:
pass

# do more complex extension and other non-extension-related tests here
if extension.endswith("cc"):
return extToMimeDict["cpp"]

if basename.find( "Makefile" ) != -1:
return "text/x-makefile"

raise NoMimeError

for detectMimeType in (detectMimeType1, detectMimeType2):
for s in ("a.php","z.acc","Makefile","blork.xsl"):
print s,"->",detectMimeType(s)

wittempj · Jan 12, 2005

I can not break the original code in 2.4, if I try this:
import os, sys
class NoMimeError(Exception):
pass

def detectMimeType( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)
if extension == "php":
return "application/x-php"
elif extension == "cpp" or extension.endswith("cc"):
return "text/x-c++-src"
elif extension == "1":
return 'BlahBlah'
elif extension == "2":
return 'BlahBlah'
elif extension == "3":
return 'BlahBlah'
elif extension == "4":
return 'BlahBlah'
elif extension == "5":
return 'BlahBlah'
elif extension == "6":
return 'BlahBlah'
elif extension == "7":
return 'BlahBlah'
elif extension == "8":
return 'BlahBlah'
elif extension == "9":
return 'BlahBlah'
elif extension == "10":
return 'BlahBlah'
elif extension == "11":
return 'BlahBlah'
elif extension == "12":
return 'BlahBlah'
elif extension == "13":
return 'BlahBlah'
elif extension == "14":
return 'BlahBlah'
elif extension == "15":
return 'BlahBlah'
elif extension == "16":
return 'BlahBlah'
elif extension == "17":
return 'BlahBlah'
elif extension == "18":
return 'BlahBlah'
elif extension == "19":
return 'BlahBlah'
elif extension == "20":
return 'BlahBlah'

elif extension == "xsl":
return "text/xsl"

elif basename.find( "Makefile" ) != -1:
return "text/x-makefile"
else:
raise NoMimeError
try:
print detectMimeType(r'c:\test.php')
print detectMimeType('c:\test.xsl')
print detectMimeType('c:\test.xxx')
except Exception, e:
print >> sys.stderr, '%s: %s' %(e.__class__.__name__, e)

I get

application/x-php
text/xsl
NoMimeError:

So although the dictionary solution is much nicer nothing seems wrong
with your code as it is - or am I missing something?

Frans Englich · Jan 12, 2005

I can not break the original code in 2.4, if I try this:
[...]

So although the dictionary solution is much nicer nothing seems wrong
with your code as it is - or am I missing something?

Nope, the current code works. I'm just looking at Python's cool ways of
solving problems. (the matter about 11 returns was a coding-style report from
PyChecker).

Cheers,

Frans

Jeff Shannon · Jan 12, 2005

Paul said:
Frans Englich said:

#--------------------------------------------------------------
def detectMimeType( filename ):

extension = filename[-3:]

Click to expand...

You might consider using os.path.splitext() here, instead of always
assuming that the last three characters are the extension. That way
you'll be consistent even with extensions like .c, .cc, .h, .gz, etc.

Note that os.path.splitext() does include the extension separator (the
dot), so that you'll need to test against, e.g., ".php" and ".cpp".

Since the majority of your tests will be fairly direct 'extension "XYZ"
means mimetype "aaa/bbb"', this really sounds like a dictionary type
solution is called for.

I strongly agree with this. The vast majority of your cases seem to
be a direct mapping of extension-string to mimetype-string; using a
dictionary (i.e. mapping

) for this is ideal. For those cases
where you can't key off of an extension string (such as makefiles),
you can do special-case processing if the dictionary lookup fails.

if extension.endswith("cc"):
return extToMimeDict["cpp"]

If the intent of this is to catch .cc files, it's easy to add an extra
entry into the dict to map '.cc' to the same string as '.cpp'.

Jeff Shannon
Technician/Programmer
Credit International

Bengt Richter · Jan 13, 2005

As continuation to a previous thread, "PyChecker messages", I have a question
regarding code refactoring which the following snippet leads to:

That is also advice. Generally you use a dict of functions, or some other
structure to lookup what you want to do.

Click to expand...

More specifically, my function looks like this:

#--------------------------------------------------------------
def detectMimeType( filename ):

extension = filename[-3:]
basename = os.path.basename(filename)

if extension == "php":
return "application/x-php"

elif extension == "cpp" or extension.endswith("cc"):
return "text/x-c++-src"
# etcetera
elif extension == "xsl":
return "text/xsl"

elif basename.find( "Makefile" ) != -1:
return "text/x-makefile"
else:
raise NoMimeError
#--------------------------------------------------------------
(don't bother if the MIME detection looks like stone age, it's temporary until
PyXDG gets support for the XDG mime type spec..)

I'm now wondering if it's possible to write this in a more compact way, such
that the if-clause isn't necessary? Of course, the current code works, but
perhaps it could be prettier.

I'm thinking along the lines of nested lists, but what is the obstacle for me
is that both the test and return statement are simple expressions; not
functions or a particular data type. Any ideas?

I think I would refactor along these lines: (untested)

extensiondict = dict(
php = 'application/x-php',
cpp = 'text/x-c-src',
# etcetera
xsl = 'test/xsl'
)

def detectMimeType(filename):
extension = os.path.splitext(filename)[1].replace('.', '')
try: return extensiondict[extension]
except KeyError:
basename = os.path.basename(filename)
if "Makefile" in basename: return 'text/x-makefile' # XXX case sensitivity?
raise NoMimeError

Regards,
Bengt Richter

Stephen Thorne · Jan 13, 2005

extensiondict = dict(
php = 'application/x-php',
cpp = 'text/x-c-src',
# etcetera
xsl = 'test/xsl'
)

def detectMimeType(filename):
extension = os.path.splitext(filename)[1].replace('.', '')
try: return extensiondict[extension]
except KeyError:
basename = os.path.basename(filename)
if "Makefile" in basename: return 'text/x-makefile' # XXX case sensitivity?
raise NoMimeError

Why not use a regexp based approach.
extensionlist = [
(re.compile(r'.*\.php') , "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
for regexp, mimetype in extensionlist:
if regexp.match(filename):
return mimetype

if you were really concerned about efficiency, you could use something like:
class SimpleMatch:
def __init__(self, pattern): self.pattern = pattern
def match(self, subject): return subject[-len(self.pattern):] == self.pattern

Regards,
Stephen Thorne

Bengt Richter · Jan 13, 2005

extensiondict = dict(
php = 'application/x-php',
cpp = 'text/x-c-src',
# etcetera
xsl = 'test/xsl'
)

def detectMimeType(filename):
extension = os.path.splitext(filename)[1].replace('.', '')

Click to expand...

extension = os.path.splitext(filename)[1].replace('.', '').lower() # better

try: return extensiondict[extension]
except KeyError:
basename = os.path.basename(filename)
if "Makefile" in basename: return 'text/x-makefile' # XXX case sensitivity?
raise NoMimeError

Click to expand...

Why not use a regexp based approach.

ISTM the dict setup closely reflects the OP's if/elif tests and makes for an efficient substitute
for the functionality when later used for lookup. The regex list is O(n) and the regexes themselves
are at least that, so I don't see a benefit. If you are going to loop through extensionlist, you
might as well write (untested)

flowerew = filename.lower().endswith
for ext, mimetype:
if flowerew(ext): return mimetype
else:
if 'makefile' in filename.lower(): return 'text/x-makefile'
raise NoMimeError

using a lower case extension list including the dot. I think it would run faster
than a regex, and not scare anyone unnecessarily ;-)

The dict eliminates the loop, and is easy to understand, so IMO it's a better choice.

extensionlist = [
(re.compile(r'.*\.php') , "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
for regexp, mimetype in extensionlist:
if regexp.match(filename):
return mimetype

if you were really concerned about efficiency, you could use something like:
class SimpleMatch:
def __init__(self, pattern): self.pattern = pattern
def match(self, subject): return subject[-len(self.pattern):] == self.pattern

I'm not clear on what you are doing here, but if you think you are going to compete
with the timbot's dict efficiency with a casual few lines, I suspect you are PUI ;-)
(Posting Under the Influence ;-)

Regards,
Bengt Richter

Stephen Thorne · Jan 13, 2005

extensiondict = dict(
php = 'application/x-php',
cpp = 'text/x-c-src',
# etcetera
xsl = 'test/xsl'
)

def detectMimeType(filename):
extension = os.path.splitext(filename)[1].replace('.', '')

Click to expand...

Click to expand...

extension = os.path.splitext(filename)[1].replace('.', '').lower() # better

try: return extensiondict[extension]
except KeyError:
basename = os.path.basename(filename)
if "Makefile" in basename: return 'text/x-makefile' # XXX case sensitivity?
raise NoMimeError

Click to expand...

Why not use a regexp based approach.

Click to expand...

ISTM the dict setup closely reflects the OP's if/elif tests and makes for an efficient substitute
for the functionality when later used for lookup. The regex list is O(n) and the regexes themselves
are at least that, so I don't see a benefit. If you are going to loop through extensionlist, you
might as well write (untested)

<code snipped>

*shrug*, O(n*m) actually, where n is the number of mime-types and m is
the length of the extension.

extensionlist = [
(re.compile(r'.*\.php') , "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
for regexp, mimetype in extensionlist:
if regexp.match(filename):
return mimetype

if you were really concerned about efficiency, you could use something like:
class SimpleMatch:
def __init__(self, pattern): self.pattern = pattern
def match(self, subject): return subject[-len(self.pattern):] == self.pattern

Click to expand...

I'm not clear on what you are doing here, but if you think you are going to compete
with the timbot's dict efficiency with a casual few lines, I suspect you are PUI ;-)
(Posting Under the Influence ;-)

Sorry about that, what I was trying to say was something along the lines of:

extensionlist = [
(re.compile(r'.*\.php') , "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
can be made more efficient by doing something like this:
extensionlist = [
SimpleMatch(".php"), "application/x-crap-language"),
(re.compile(r'.*\.(cpp|c)') , 'text/x-c-src'),
(re.compile(r'[Mm]akefile') , 'text/x-makefile'),
]
Where SimpleMatch uses a slice and a comparison instead of a regular
expression engine. SimpleMatch and re.compile both return an object
that when you call .match(s) returns a value that can be interpreted
as a boolean.

As for the overall efficiency concerns, I feel that talking about any
of this is premature optimisation. The optimisation that is really
required in this situation is the same as with any
large-switch-statement idiom, be it C or Python. First one must do a
frequency analysis of the inputs to the switch statement in order to
discover the optimal order of tests!

Regards,
Stephen Thorne

Nick Craig-Wood · Jan 13, 2005

Stephen Thorne said:
Why not use a regexp based approach.

Good idea... You could also use sre.Scanner which is supposed to be
fast like this...

import re, sre

scanner = sre.Scanner([
(r"\.php$", "application/x-php"),
(r"\.(cc|cpp)$", "text/x-c++-src"),
(r"\.xsl$", "xsl"),
(r"Makefile", "text/x-makefile"),
(r".", None),
])

def detectMimeType( filename ):
t = scanner.scan(filename)[0]
if len(t) < 1:
return None
# raise NoMimeError
return t[0]

for f in ("index.php", "index.php3", "prog.cc", "prog.cpp", "flodge.xsl", "Makefile", "myMakefile", "potato.123"):
print f, detectMimeType(f)

....

prints

index.php application/x-php
index.php3 None
prog.cc text/x-c++-src
prog.cpp text/x-c++-src
flodge.xsl xsl
Makefile text/x-makefile
myMakefile text/x-makefile
potato.123 None

Paul McGuire · Jan 13, 2005

Despite the regexp alternatives, I refuse to post a pyparsing solution to
this!

-- Paul

Jeff Shannon · Jan 13, 2005

Stephen said:
As for the overall efficiency concerns, I feel that talking about any
of this is premature optimisation.

I disagree -- using REs is adding unnecessary complication and
dependency. Premature optimization is a matter of using a
conceptually more-complicated method when a simpler one would do; REs
are, in fairly simple cases such as this, clearly *not* simpler than
dict lookups.

The optimisation that is really
required in this situation is the same as with any
large-switch-statement idiom, be it C or Python. First one must do a
frequency analysis of the inputs to the switch statement in order to
discover the optimal order of tests!

But if you're using a dictionary lookup, then the frequency of inputs
is irrelevant. Regardless of the value of the input, you're doing a
single hash-compute and (barring hash collisions) a single
bucket-lookup. Since dicts are unordered, the ordering of the literal
(or of a set of statements adding to the dict) doesn't matter.

Jeff Shannon
Technician/Programmer
Credit International

=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?= · Jan 13, 2005

# do other non-extension-related tests here

if basename.find( "Makefile" ) != -1:
return "text/x-makefile"

I believe this can be nicelier written as:

if "Makefile" in basename:

Steven Bethard · Jan 13, 2005

BJörn Lindqvist said:
I believe this can be nicelier written as:

if "Makefile" in basename:

+1 for "nicelier" as VOTW (Vocabulation of the week) =)

Steve

Peter Maas · Jan 13, 2005

Steven said:
BJörn Lindqvist wrote: [...]

I believe this can be nicelier written as:

if "Makefile" in basename:

Click to expand...

+1 for "nicelier" as VOTW (Vocabulation of the week) =)

Me too, because nicelier is nicer than more nicely.

Steve Holden · Jan 14, 2005

Peter said:
Steven said:

BJörn Lindqvist wrote:
[...]

I believe this can be nicelier written as:

if "Makefile" in basename:

Click to expand...

+1 for "nicelier" as VOTW (Vocabulation of the week) =)

Click to expand...

Me too, because nicelier is nicer than more nicely.

Is that really the niceliest way to express that thought?

regards-li-er y'rs - steve

my latest system for presentations	0	Jul 28, 2004
Modify Python Code - no idea at all	0	Nov 5, 2003
anybody help me	1	Feb 10, 2006
Python source colorizer	1	Aug 9, 2004
how to create a DocumentFragment in Xalan java extension function?	0	Mar 1, 2004
ANN: script to visualize python profiling data with kcachegrind	3	Sep 17, 2003
Getting Apache error when trying to load mod_ruby	0	Jan 16, 2008
python-dev summary for 2005-07-01 to 2005-07-15	1	Jul 31, 2005

Refactoring; arbitrary expression in lists

Frans Englich

Paul McGuire

wittempj

Frans Englich

Jeff Shannon

Bengt Richter

Stephen Thorne

Bengt Richter

Stephen Thorne

Nick Craig-Wood

Paul McGuire

Jeff Shannon

=?ISO-8859-1?Q?BJ=F6rn_Lindqvist?=

Steven Bethard

Peter Maas

Steve Holden

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads