remove BOM from string read from utf-8 file


Achim Domma

Hi,

I read some text from a utf-8 encoded text file like this:

text = codecs.open('example.txt','r','utf8').read()

If I pass this text to a COM object, I can see that there is still the BOM
in the file, which marks the file as utf-8. Simply removing the first
character in the string is not ok, because the BOM is optional. So I tried
something like this:

if text.startswith(codecs.BOM_UTF8):
    print "found BOM"

but then I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
ordinal not in range(128)

What's the right way to remove the BOM from the string?

regards,
Achim
 

Piet van Oostrum

AD> Hi,
AD> I read some text from a utf-8 encoded text file like this:

AD> text = codecs.open('example.txt','r','utf8').read()

AD> If I pass this text to a COM object, I can see that there is still the BOM
AD> in the file, which marks the file as utf-8. Simply removing the first
AD> character in the string is not ok, because the BOM is optional. So I tried
AD> something like this:

The BOM is in the file, but not in the string 'text'.
text is a unicode string, which consists of Unicode characters, and the BOM
is not a Unicode character.

Check text[0] and len(text) to verify.

Moreover, BOM_UTF8 is a (non-ASCII) byte string, not a Unicode string, which
is the reason for the complaint.
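
To see that last point in isolation, here is a minimal Python 2 sketch (the string literal is made up): passing the byte string codecs.BOM_UTF8 to startswith on a unicode string forces an implicit ASCII decode of the byte string, which is exactly the error above.

import codecs

print repr(codecs.BOM_UTF8)    # '\xef\xbb\xbf' -- a plain (non-ASCII) byte string

try:
    u'some text'.startswith(codecs.BOM_UTF8)
except UnicodeDecodeError, e:
    # Python tried to decode the byte string as ASCII in order to compare
    # it against the unicode string, which is exactly Achim's error.
    print e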
 

Achim Domma

Check text[0] and len(text) to verify.

That's what I did. The file contains 24 Chinese characters and len(text) is
25. And 0xef is the first byte of the UTF-8 BOM, if I'm not completely wrong.

Achim
 

Matt Gerrans

I found myself often needing to read text files that might be utf-8, unicode
or ansi, without knowing beforehand which, so I wrote a single function to
do it. I don't know if this is the correct way to handle this situation,
but I couldn't find any function that would simply open a file with the
appropriate codec automatically, so I use this (it doesn't handle all cases,
but just the ones I've needed so far):

import os, codecs
#----------------------------------------------------------------------------
# OpenTextFile()
#
# Opens a file correctly whether it is unicode or ansi.  If no BOM is
# found (or the file doesn't exist), the encoding argument is used as
# given; the default of None means a plain (ansi) file object.
#
# Python documentation of the codecs module is pretty weak; for instance
# there are all these:
#     BOM
#     BOM_BE
#     BOM_LE
#     BOM_UTF8
#     BOM_UTF16
#     BOM_UTF16_BE
#     BOM_UTF16_LE
#     BOM_UTF32
#     BOM_UTF32_BE
#     BOM_UTF32_LE
# but no explanation of how they map to the encodings like 'utf-16'.  Some
# can be inferred, but some are not so clear.
#----------------------------------------------------------------------------
def OpenTextFile(filename, mode='r', encoding=None):
    if os.path.isfile(filename):
        f = file(filename, 'rb')
        header = f.read(4)  # Read just the first four bytes.
        f.close()
        # Don't change this to a map, because it is ordered!!!
        encodings = [ (codecs.BOM_UTF32, 'utf-32'),
                      (codecs.BOM_UTF16, 'utf-16'),
                      (codecs.BOM_UTF8,  'utf-8') ]
        for h, e in encodings:
            if header.find(h) == 0:
                encoding = e
                break
    return codecs.open(filename, mode, encoding)
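
As a small aside on the BOM_* constants above: they are all plain byte strings, and printing them makes the mapping to the encoding names reasonably clear. A quick Python 2 sketch (the native-order aliases BOM, BOM_UTF16 and BOM_UTF32 depend on the machine; the comments show a little-endian box):

import codecs

print repr(codecs.BOM_UTF8)        # '\xef\xbb\xbf'      -> 'utf-8'
print repr(codecs.BOM_UTF16_LE)    # '\xff\xfe'          -> 'utf-16-le'
print repr(codecs.BOM_UTF16_BE)    # '\xfe\xff'          -> 'utf-16-be'
print repr(codecs.BOM_UTF32_LE)    # '\xff\xfe\x00\x00'  -> 'utf-32-le'
print repr(codecs.BOM_UTF32_BE)    # '\x00\x00\xfe\xff'  -> 'utf-32-be'
print codecs.BOM == codecs.BOM_UTF16   # True: BOM is the native-order UTF-16 BOM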
 

Piet van Oostrum

Check text[0] and len(text) to verify.

AD> That's what I did. The file contains 24 chinese characters and len(text) is
AD> 25. And 0xef is the hex code for the BOM if I'm not completely wrong.

Sorry, I was wrong.
You have to check for text.startswith(u'\ufeff').
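
So a minimal version of the check and strip, assuming Python 2 and the plain 'utf8' codec (which leaves an optional BOM in the decoded string), looks like this:

import codecs

text = codecs.open('example.txt', 'r', 'utf8').read()
if text.startswith(u'\ufeff'):     # u'\ufeff' == codecs.BOM_UTF8.decode('utf8')
    text = text[1:]                # drop the BOM character

# On Python 2.5 and later the 'utf-8-sig' codec strips an optional BOM itself:
# text = codecs.open('example.txt', 'r', 'utf-8-sig').read()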
 
Note that the file() below should be changed to open(). Otherwise the function relies on the builtin name file, and if that name gets rebound somewhere else in the program (e.g. file = open(...) or file = OpenTextFile(...)), you will get strange errors such as AttributeError: StreamReaderWriter instance has no __call__ method.

Also updated to eat the Byte Order Mark on Windows. I heard this was fixed in Python 2.5 (I am on 2.4).

Matt Gerrans said:
Code:
import os, codecs

def OpenTextFile(filename, mode='r', encoding='utf-8'):
    hasBOM = False
    if os.path.isfile(filename):
        f = open(filename, 'rb')
        header = f.read(4)
        f.close()

        # Don't change this to a map, because it is ordered
        encodings = [ (codecs.BOM_UTF32, 'utf-32'),
                      (codecs.BOM_UTF16, 'utf-16'),
                      (codecs.BOM_UTF8,  'utf-8') ]

        for h, e in encodings:
            if header.startswith(h):
                encoding = e
                hasBOM = True
                break

    f = codecs.open(filename, mode, encoding)
    # Eat the byte order mark (only needed for utf-8; the utf-16/utf-32
    # codecs strip the BOM themselves)
    if hasBOM and encoding == 'utf-8':
        f.read(1)
    return f
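
A quick, hypothetical way to exercise the function above (the file names are made up and assumed to exist):

f = OpenTextFile('chinese-utf8.txt')   # BOM found -> opened as utf-8, BOM skipped
print repr(f.read())
f.close()

f = OpenTextFile('plain-ansi.txt')     # no BOM -> opened with the default 'utf-8'
print repr(f.read())
f.close()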
 
