Guessing the encoding from a BOM

Steven D'Aprano · Jan 15, 2014

I have a function which guesses the likely encoding used by text files by
reading the BOM (byte order mark) at the beginning of the file. A
simplified version:

def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    else:
        return default

The idea is that you can call the function with a file name and a default
encoding to return if one can't be guessed. I want to provide a default
value for the default argument (a default default), but one which will
unconditionally fail if you blindly go ahead and use it.

E.g. I want to either provide a default:

enc = guess_encoding_from_bom("filename", 'latin1')
f = open("filename", encoding=enc)

or I want to write:

enc = guess_encoding_from_bom("filename")
if enc == something:
    # Can't guess, fall back on an alternative strategy
    ...
else:
    f = open("filename", encoding=enc)

If I forget to check the returned result, I should get an explicit
failure as soon as I try to use it, rather than silently returning the
wrong results.

What should I return as the default default? I have four possibilities:

(1) 'undefined', which is a standard encoding guaranteed to
raise an exception when used;

(2) 'unknown', which best describes the result, and currently
there is no encoding with that name;

(3) None, which is not the name of an encoding; or

(4) Don't return anything, but raise an exception. (But
which exception?)

Apart from option (4), here are the exceptions you get from blindly using
options (1) through (3):

py> 'abc'.encode('undefined')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/encodings/undefined.py", line 19, in encode
    raise UnicodeError("undefined encoding")
UnicodeError: undefined encoding

py> 'abc'.encode('unknown')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
LookupError: unknown encoding: unknown

py> 'abc'.encode(None)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: encode() argument 1 must be str, not None

At the moment, I'm leaning towards option (1). Thoughts?

Chris Angelico · Jan 16, 2014

    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'

I'd swap the order of these two checks. If the file starts FF FE 00
00, your code will guess that it's UTF-16 and begins with a U+0000.

ChrisA
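
A minimal sketch of the reordering suggested above. The helper name guess_encoding_from_sig is hypothetical and was split out only so the byte logic can be exercised without a file on disk; everything else follows the original function.

```python
def guess_encoding_from_sig(sig, default):
    """Guess an encoding from the first (up to four) bytes of a file."""
    # Check the 4-byte UTF-32 BOMs first: the UTF-32-LE BOM (FF FE 00 00)
    # begins with the same two bytes as the UTF-16-LE BOM (FF FE).
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    else:
        return default


def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        return guess_encoding_from_sig(f.read(4), default)
```

One ambiguity remains either way: a UTF-16-LE file whose first character after the BOM is U+0000 has the same first four bytes as a UTF-32-LE file, so this ordering simply prefers the likelier reading.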

Ethan Furman · Jan 16, 2014

+1. I'd like a custom exception class, sub-classed from ValueError.

+1

Steven D'Aprano · Jan 16, 2014

I'd swap the order of these two checks. If the file starts FF FE 00 00,
your code will guess that it's UTF-16 and begins with a U+0000.

Good catch, thank you.

Steven D'Aprano · Jan 16, 2014

Yes, agreed.

+0.5. This describes the outcome of the guess.

+0. This *better* describes the outcome, but I don't think adding a new
name is needed nor very helpful.

And there is a chance -- albeit a small chance -- that someday the std
lib will gain an encoding called "unknown".

+1. I'd like a custom exception class, sub-classed from ValueError.

Why ValueError? It's not really an "invalid value" error, it's more "my
heuristic isn't good enough" failure. (Maybe the file starts with another
sort of BOM which I don't know about.)

If I go with an exception, I'd choose RuntimeError, or a custom error
that inherits directly from Exception.

Thanks to everyone for the feedback.

Ethan Furman · Jan 16, 2014

Why ValueError? It's not really an "invalid value" error, it's more "my
heuristic isn't good enough" failure. (Maybe the file starts with another
sort of BOM which I don't know about.)

If I go with an exception, I'd choose RuntimeError, or a custom error
that inherits directly from Exception.

From the docs [1]:
============================

exception RuntimeError

Raised when an error is detected that doesn't fall in any
of the other categories. The associated value is a string
indicating what precisely went wrong.

It doesn't sound like RuntimeError is any more informative than Exception or AssertionError, and to my mind at least is
usually close to catastrophic in nature [2].

I'd say a ValueError subclass because, while not strictly an error, it is values you don't know how to deal with.
But either that or plain Exception, just not RuntimeError.

--
~Ethan~

[1] http://docs.python.org/3/library/exceptions.html#RuntimeError
[2] verified by a (very) brief grep of the sources
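
For illustration, a rough sketch of the ValueError-subclass idea combined with an optional default. The names EncodingGuessError, _MISSING and guess_or_raise are all hypothetical, and the function takes the already-read signature bytes rather than a filename to keep the example self-contained.

```python
class EncodingGuessError(ValueError):
    """Hypothetical exception: no recognised BOM and no default given."""


# Private sentinel, so that None (or any string) remains a legal default.
_MISSING = object()


def guess_or_raise(sig, default=_MISSING):
    if sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    if default is _MISSING:
        # Fail at the point of guessing, not later when the name is used.
        raise EncodingGuessError('no recognised BOM in %r' % (sig,))
    return default
```

Because the class derives from ValueError, callers can catch either the specific class or ValueError. Deriving from Exception instead, as also discussed above, would only change the base class in the first line.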

Björn Lindqvist · Jan 16, 2014

2014/1/16 Steven D'Aprano said:
def guess_encoding_from_bom(filename, default):
    with open(filename, 'rb') as f:
        sig = f.read(4)
    if sig.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif sig.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    else:
        return default

You might want to add the UTF-8 BOM too: b'\xEF\xBB\xBF'.
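
One possible way to fold the UTF-8 signature in is a small lookup table built from the BOM constants in the standard codecs module. The names _BOMS and sniff_bom are hypothetical, and returning 'utf_8_sig' (the stdlib codec that strips the signature on decoding) is an assumption about what the caller would want.

```python
import codecs

# Ordered so that the 4-byte UTF-32 BOMs are tested before the 2-byte
# UTF-16 BOMs; the UTF-32-LE BOM starts with the UTF-16-LE BOM.
_BOMS = (
    (codecs.BOM_UTF32_BE, 'utf_32'),
    (codecs.BOM_UTF32_LE, 'utf_32'),
    (codecs.BOM_UTF8, 'utf_8_sig'),
    (codecs.BOM_UTF16_BE, 'utf_16'),
    (codecs.BOM_UTF16_LE, 'utf_16'),
)


def sniff_bom(sig, default):
    """Return a codec name for the BOM at the start of sig, else default."""
    for bom, encoding in _BOMS:
        if sig.startswith(bom):
            return encoding
    return default
```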

(4) Don't return anything, but raise an exception. (But
which exception?)

I like this option the most because it is the most "fail fast". If you
return 'undefined' the error might happen hours later or not at all in
some cases.

Tim Chase · Jan 16, 2014

I'd actually rather not. It would tempt people to pollute UTF-8
files with a BOM, which is not necessary unless you are MS Notepad.

If the intent is to just sniff and parse the file accordingly, I get
enough of these junk UTF-8 BOMs at $DAY_JOB that I've had to create
utility-openers much like Steven is doing here. It's particularly
problematic for me in combination with csv.DictReader, where I go
looking for $COLUMN_NAME and get KeyError exceptions because it wants
me to ask for $UTF_BOM+$COLUMN_NAME for the first column.

-tkc
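
The csv.DictReader symptom described above can be reproduced in memory. This is only a sketch: the sample data and the helper name first_column are made up for the example, and it assumes the stdlib 'utf_8_sig' codec is an acceptable fix.

```python
import codecs
import csv
import io

# A tiny CSV file as bytes, with a UTF-8 BOM in front (made-up data).
raw = codecs.BOM_UTF8 + b'name,qty\r\nspam,3\r\n'


def first_column(raw_bytes, encoding):
    """Decode raw_bytes with the given codec; return the first CSV header."""
    text = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding,
                            newline='')
    return csv.DictReader(text).fieldnames[0]


# Plain UTF-8 leaves the BOM glued to the first column name, so looking
# up row['name'] would raise KeyError.
with_bom = first_column(raw, 'utf_8')

# The 'utf_8_sig' codec drops a leading BOM while decoding.
without_bom = first_column(raw, 'utf_8_sig')
```

Here with_bom is '\ufeffname' and without_bom is 'name'.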
