Steven D'Aprano
I have a function which guesses the likely encoding used by text files by
reading the BOM (byte order mark) at the beginning of the file. A
simplified version:
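    def guess_encoding_from_bom(filename, default):
        with open(filename, 'rb') as f:
            sig = f.read(4)
        # Check the 4-byte UTF-32 BOMs first: the UTF-32 little-endian BOM
        # (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
        if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
            return 'utf_32'
        elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
            return 'utf_16'
        else:
            return default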
The idea is that you can call the function with a file name and a default
encoding to return if one can't be guessed. I want to provide a default
value for the default argument (a default default), but one which will
unconditionally fail if you blindly go ahead and use it.
E.g. I want to either provide a default:
    enc = guess_encoding_from_bom("filename", 'latin1')
    f = open("filename", encoding=enc)
or I want to write:
    enc = guess_encoding_from_bom("filename")
    if enc == something:
        # Can't guess, fall back on an alternative strategy
        ...
    else:
        f = open("filename", encoding=enc)
If I forget to check the returned result, I should get an explicit
failure as soon as I try to use it, rather than silently returning the
wrong results.
What should I return as the default default? I have four possibilities:
(1) 'undefined', which is a standard encoding guaranteed to
raise an exception when used;
(2) 'unknown', which best describes the result, and currently
there is no encoding with that name;
(3) None, which is not the name of an encoding; or
(4) Don't return anything, but raise an exception. (But
which exception?)
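For comparison, option (4) might look something like the sketch below. The exception name CannotGuessEncoding and the _MISSING sentinel are made up for illustration, and subclassing LookupError is only one plausible answer to "which exception?". The UTF-32 test comes first because the UTF-32 little-endian BOM begins with the UTF-16 little-endian one.

```python
import os
import tempfile


class CannotGuessEncoding(LookupError):
    """Hypothetical exception: no BOM recognised and no default supplied."""


_MISSING = object()  # hypothetical sentinel meaning "caller gave no default"


def guess_encoding_from_bom(filename, default=_MISSING):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    # 4-byte UTF-32 BOMs are tested before the 2-byte UTF-16 BOMs.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    if default is _MISSING:
        # Nothing to return, so fail here rather than at the point of use.
        raise CannotGuessEncoding(filename)
    return default


# Usage: a file without a BOM and without a caller-supplied default raises.
fd, path = tempfile.mkstemp()
try:
    with os.fdopen(fd, 'wb') as f:
        f.write(b'no BOM here')
    try:
        enc = guess_encoding_from_bom(path)
    except CannotGuessEncoding:
        enc = None  # fall back on an alternative strategy here
    print(enc)
finally:
    os.remove(path)
```

With this shape the failure happens inside the guessing function itself, so forgetting to check the result cannot lead to a bad value reaching open().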
Apart from option (4), here are the exceptions you get from blindly using
options (1) through (3):
    py> 'abc'.encode('undefined')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.3/encodings/undefined.py", line 19, in encode
        raise UnicodeError("undefined encoding")
    UnicodeError: undefined encoding
    py> 'abc'.encode('unknown')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    LookupError: unknown encoding: unknown
    py> 'abc'.encode(None)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: encode() argument 1 must be str, not None
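Since the intended use is open(..., encoding=enc) rather than str.encode, it may be worth running the same three candidates through open() as well. The helper below is only a rough sketch: outcome_of_open is a hypothetical name, and it uses a throwaway temporary file. One thing to watch for is that open() treats encoding=None as "use the default encoding", so option (3) would not be expected to fail there.

```python
import os
import tempfile


def outcome_of_open(encoding):
    """Open and read a small temporary file with the given encoding.

    Returns the name of the exception raised, or None if nothing failed.
    (Hypothetical helper, for illustration only.)
    """
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(b'abc')
        try:
            with open(path, encoding=encoding) as f:
                f.read()
        except (LookupError, UnicodeError, TypeError) as exc:
            return type(exc).__name__
        return None
    finally:
        os.remove(path)


for candidate in ('undefined', 'unknown', None):
    print(repr(candidate), '->', outcome_of_open(candidate))
```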
At the moment, I'm leaning towards option (1). Thoughts?