Guessing the encoding from a BOM

Steven D'Aprano

I have a function which guesses the likely encoding used by text files by
reading the BOM (byte order mark) at the beginning of the file. A
simplified version:


def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    else:
        return default


The idea is that you can call the function with a file name and a default
encoding to return if one can't be guessed. I want to provide a default
value for the default argument (a default default), but one which will
unconditionally fail if you blindly go ahead and use it.

E.g. I want to either provide a default:

enc = guess_encoding_from_bom("filename", 'latin1')
f = open("filename", encoding=enc)


or I want to write:

enc = guess_encoding_from_bom("filename")
if enc == something:
    # Can't guess, fall back on an alternative strategy
    ...
else:
    f = open("filename", encoding=enc)


If I forget to check the returned result, I should get an explicit
failure as soon as I try to use it, rather than silently returning the
wrong results.

What should I return as the default default? I have four possibilities:

(1) 'undefined', which is a standard encoding guaranteed to
raise an exception when used;

(2) 'unknown', which best describes the result, and currently
there is no encoding with that name;

(3) None, which is not the name of an encoding; or

(4) Don't return anything, but raise an exception. (But
which exception?)


Apart from option (4), here are the exceptions you get from blindly using
options (1) through (3):

py> 'abc'.encode('undefined')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/encodings/undefined.py", line 19, in encode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

py> 'abc'.encode('unknown')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: unknown

py> 'abc'.encode(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encode() argument 1 must be str, not None


At the moment, I'm leaning towards option (1). Thoughts?
 
Chris Angelico

if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
    return 'utf_16'
elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
    return 'utf_32'

I'd swap the order of these two checks. If the file starts FF FE 00
00, your code will guess that it's UTF-16 and begins with a U+0000.

ChrisA
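
The overlap Chris describes can be checked directly on the raw bytes. The following is only a sketch: `guess_from_sig` is a hypothetical helper (not from the original post) that takes the four signature bytes instead of a file name, so it can be tried without creating files. It tests the UTF-32 BOMs first.

```python
def guess_from_sig(sig, default):
    """Guess an encoding from the first bytes of a file (hypothetical helper)."""
    # Check UTF-32 first: its little-endian BOM (FF FE 00 00) begins
    # with the UTF-16 little-endian BOM (FF FE).
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    else:
        return default


# The UTF-32 LE BOM also satisfies the UTF-16 LE test, which is why
# the order of the two checks matters.
overlap = b'\xFF\xFE\x00\x00'.startswith(b'\xFF\xFE')

utf32_guess = guess_from_sig(b'\xFF\xFE\x00\x00', None)
utf16_guess = guess_from_sig(b'\xFF\xFEa\x00', None)
fallback = guess_from_sig(b'abcd', 'latin1')
```

With this order, a UTF-16 file whose first character after the BOM is U+0000 would be reported as UTF-32 instead. That seems the rarer case, but it is still a guess.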
 
Steven D'Aprano

I'd swap the order of these two checks. If the file starts FF FE 00 00,
your code will guess that it's UTF-16 and begins with a U+0000.

Good catch, thank you.
 
Steven D'Aprano

Yes, agreed.


+0.5. This describes the outcome of the guess.


+0. This *better* describes the outcome, but I don't think adding a new
name is needed nor very helpful.

And there is a chance -- albeit a small chance -- that someday the std
lib will gain an encoding called "unknown".

+1. I'd like a custom exception class, sub-classed from ValueError.

Why ValueError? It's not really an "invalid value" error, it's more of a
"my heuristic isn't good enough" failure. (Maybe the file starts with
another sort of BOM which I don't know about.)

If I go with an exception, I'd choose RuntimeError, or a custom error
that inherits directly from Exception.



Thanks to everyone for the feedback.
 
Ethan Furman

Why ValueError? It's not really an "invalid value" error, it's more of a
"my heuristic isn't good enough" failure. (Maybe the file starts with
another sort of BOM which I don't know about.)

If I go with an exception, I'd choose RuntimeError, or a custom error
that inherits directly from Exception.

From the docs [1]:
============================

exception RuntimeError

Raised when an error is detected that doesn’t fall in any
of the other categories. The associated value is a string
indicating what precisely went wrong.

It doesn't sound like RuntimeError is any more informative than Exception or AssertionError, and to my mind at least it is
usually close to catastrophic in nature [2].

I'd say a ValueError subclass because, while not strictly an error, these are values you don't know how to deal with.
But either that or plain Exception, just not RuntimeError.

--
~Ethan~


[1] http://docs.python.org/3/library/exceptions.html#RuntimeError
[2] verified by a (very) brief grep of the sources
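
For illustration, option (4) with a ValueError subclass might look like the sketch below. The names `EncodingGuessError`, `_MISSING` and `guess_encoding_from_sig` are hypothetical, and the function works on the signature bytes, not a file name, to keep the example self-contained. Inheriting directly from Exception instead would only change the base class.

```python
_MISSING = object()  # hypothetical sentinel meaning "no default supplied"


class EncodingGuessError(ValueError):
    """Hypothetical error: no known BOM found and no default given."""


def guess_encoding_from_sig(sig, default=_MISSING):
    # UTF-32 is tested first because FF FE 00 00 also starts with FF FE.
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    if default is _MISSING:
        # Fail at the point of guessing, not later when the name is used.
        raise EncodingGuessError('no recognised BOM in %r' % (sig,))
    return default


try:
    guess_encoding_from_sig(b'abcd')
except EncodingGuessError as exc:
    caught = exc
else:
    caught = None
```

A private sentinel is used for the default so that a caller can still pass None explicitly and get None back.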
 
Björn Lindqvist

2014/1/16 Steven D'Aprano said:
def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    else:
        return default

You might want to add the UTF-8 BOM too: b'\xEF\xBB\xBF'.

(4) Don't return anything, but raise an exception. (But
which exception?)

I like this option the most because it is the most "fail fast". If you
return 'undefined' the error might happen hours later or not at all in
some cases.
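
A small sketch of the "fails late" behaviour, using only the standard codecs module: the name 'undefined' can be looked up and passed around without complaint, and the error appears only once something is actually encoded or decoded with it.

```python
import codecs

enc = 'undefined'

# Looking the codec up succeeds, because 'undefined' is a registered codec.
info = codecs.lookup(enc)

# The failure only shows up when the codec is asked to do real work.
try:
    b'abc'.decode(enc)
except UnicodeError as exc:
    failure = exc
else:
    failure = None
```

So code that stores the guessed name and decodes much later, or never, would not notice the problem at the point where the guess failed.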
 
Tim Chase

I'd actually rather not. It would tempt people to pollute UTF-8
files with a BOM, which is not necessary unless you are MS Notepad.

If the intent is to just sniff and parse the file accordingly, I get
enough of these junk UTF-8 BOMs at $DAY_JOB that I've had to create
utility-openers much like Steven is doing here. It's particularly
problematic for me in combination with csv.DictReader, where I go
looking for $COLUMN_NAME and get KeyError exceptions because it wants
me to ask for $UTF_BOM+$COLUMN_NAME for the first column.

-tkc
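
The csv.DictReader symptom can be reproduced in memory. This sketch uses made-up sample data and a hypothetical helper `first_row`; the 'utf-8-sig' codec is part of the standard library and drops a leading UTF-8 BOM when decoding, which is one way to avoid the problem once the BOM has been detected.

```python
import csv
import io

# Made-up sample: a UTF-8 BOM followed by a two-column CSV file.
raw = b'\xEF\xBB\xBFname,qty\r\nwidget,3\r\n'


def first_row(data, encoding):
    """Decode the bytes and return the first data row as a dict."""
    text = io.StringIO(data.decode(encoding), newline='')
    return next(csv.DictReader(text))


# Plain UTF-8 keeps the BOM as U+FEFF, glued to the first column name.
with_bom = first_row(raw, 'utf-8')

# 'utf-8-sig' strips the BOM, so the column name is what you expect.
without_bom = first_row(raw, 'utf-8-sig')
```

With plain 'utf-8', looking up 'name' raises KeyError because the key is really '\ufeffname'; with 'utf-8-sig' the lookup works.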
 
