remove BOM from string read from utf-8 file


Achim Domma

Hi,

I read some text from a utf-8 encoded text file like this:

text = codecs.open('example.txt','r','utf8').read()

If I pass this text to a COM object, I can see that there is still the BOM
in the file, which marks the file as utf-8. Simply removing the first
character in the string is not ok, because the BOM is optional. So I tried
something like this:

if text.startswith(codecs.BOM_UTF8):
    print "found BOM"

but then I get the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 0:
ordinal not in range(128)

What's the right way to remove the BOM from the string?

regards,
Achim
 

Piet van Oostrum

AD> Hi,
AD> I read some text from a utf-8 encoded text file like this:

AD> text = codecs.open('example.txt','r','utf8').read()

AD> If I pass this text to a COM object, I can see that there is still the BOM
AD> in the file, which marks the file as utf-8. Simply removing the first
AD> character in the string is not ok, because the BOM is optional. So I tried
AD> something like this:

The BOM is in the file, but not in the string 'text'.
text is a unicode string, which consists of Unicode characters, and the BOM
is not a Unicode character.

Check text[0] and len(text) to verify.

Moreover, BOM_UTF8 is a (non-ASCII) byte string, not a Unicode string, which
is the reason for the complaint.
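
To see that last point in isolation, here is a minimal Python 2 sketch (the string literal is made up): passing the byte string codecs.BOM_UTF8 to startswith on a unicode string forces an implicit ASCII decode of the byte string, which is exactly the error above.

import codecs

print repr(codecs.BOM_UTF8)    # '\xef\xbb\xbf' -- a plain (non-ASCII) byte string

try:
    u'some text'.startswith(codecs.BOM_UTF8)
except UnicodeDecodeError, e:
    # Python tried to decode the byte string as ASCII in order to compare
    # it against the unicode string, which is exactly Achim's error.
    print e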
 

Achim Domma

Check text[0] and len(text) to verify.

That's what I did. The file contains 24 Chinese characters and len(text) is
25. And 0xef is the first byte of the UTF-8 BOM, if I'm not completely wrong.

Achim
 

Matt Gerrans

I found myself often needing to read text files that might be utf-8, unicode
or ansi, without knowing beforehand which, so I wrote a single function to
do it. I don't know if this is the correct way to handle this situation,
but I couldn't find any function that would simply open a file with the
appropriate codec automatically, so I use this (it doesn't handle all cases,
but just the ones I've needed so far):

import os, codecs
#----------------------------------------------------------------------------
# OpenTextFile()
#
# Opens a file correctly whether it is unicode or ansi.  If no BOM is
# found (or the file doesn't exist), the encoding argument is used as
# given; the default of None means a plain (ansi) file object.
#
# Python documentation of the codecs module is pretty weak; for instance
# there are all these:
#     BOM
#     BOM_BE
#     BOM_LE
#     BOM_UTF8
#     BOM_UTF16
#     BOM_UTF16_BE
#     BOM_UTF16_LE
#     BOM_UTF32
#     BOM_UTF32_BE
#     BOM_UTF32_LE
# but no explanation of how they map to the encodings like 'utf-16'.  Some
# can be inferred, but some are not so clear.
#----------------------------------------------------------------------------
def OpenTextFile(filename, mode='r', encoding=None):
    if os.path.isfile(filename):
        f = file(filename, 'rb')
        header = f.read(4)  # Read just the first four bytes.
        f.close()
        # Don't change this to a map, because it is ordered!!!
        encodings = [ (codecs.BOM_UTF32, 'utf-32'),
                      (codecs.BOM_UTF16, 'utf-16'),
                      (codecs.BOM_UTF8,  'utf-8') ]
        for h, e in encodings:
            if header.find(h) == 0:
                encoding = e
                break
    return codecs.open(filename, mode, encoding)
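
As a small aside on the BOM_* constants above: they are all plain byte strings, and printing them makes the mapping to the encoding names reasonably clear. A quick Python 2 sketch (the native-order aliases BOM, BOM_UTF16 and BOM_UTF32 depend on the machine; the comments show a little-endian box):

import codecs

print repr(codecs.BOM_UTF8)        # '\xef\xbb\xbf'      -> 'utf-8'
print repr(codecs.BOM_UTF16_LE)    # '\xff\xfe'          -> 'utf-16-le'
print repr(codecs.BOM_UTF16_BE)    # '\xfe\xff'          -> 'utf-16-be'
print repr(codecs.BOM_UTF32_LE)    # '\xff\xfe\x00\x00'  -> 'utf-32-le'
print repr(codecs.BOM_UTF32_BE)    # '\x00\x00\xfe\xff'  -> 'utf-32-be'
print codecs.BOM == codecs.BOM_UTF16   # True: BOM is the native-order UTF-16 BOM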
 

Piet van Oostrum

Check text[0] and len(text) to verify.

AD> That's what I did. The file contains 24 chinese characters and len(text) is
AD> 25. And 0xef is the hex code for the BOM if I'm not completely wrong.

Sorry, I was wrong.
You have to check for text.startswith(u'\ufeff').
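
So a minimal version of the check and strip, assuming Python 2 and the plain 'utf8' codec (which leaves an optional BOM in the decoded string), looks like this:

import codecs

text = codecs.open('example.txt', 'r', 'utf8').read()
if text.startswith(u'\ufeff'):     # u'\ufeff' == codecs.BOM_UTF8.decode('utf8')
    text = text[1:]                # drop the BOM character

# On Python 2.5 and later the 'utf-8-sig' codec strips an optional BOM itself:
# text = codecs.open('example.txt', 'r', 'utf-8-sig').read()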
 
Note that the file() below should be changed to open(). Otherwise the function relies on the builtin name file, and if that name gets rebound somewhere else in the program (e.g. file = open(...) or file = OpenTextFile(...)), you will get strange errors such as AttributeError: StreamReaderWriter instance has no __call__ method.

Also updated to eat the Byte Order Mark on Windows. I heard this was fixed in Python 2.5 (I am on 2.4).

Matt Gerrans said:
Code:
import os, codecs

def OpenTextFile(filename, mode='r', encoding='utf-8'):
    hasBOM = False
    if os.path.isfile(filename):
        f = open(filename, 'rb')
        header = f.read(4)
        f.close()

        # Don't change this to a map, because it is ordered
        encodings = [ (codecs.BOM_UTF32, 'utf-32'),
                      (codecs.BOM_UTF16, 'utf-16'),
                      (codecs.BOM_UTF8,  'utf-8') ]

        for h, e in encodings:
            if header.startswith(h):
                encoding = e
                hasBOM = True
                break

    f = codecs.open(filename, mode, encoding)
    # Eat the byte order mark (only needed for utf-8; the utf-16/utf-32
    # codecs strip the BOM themselves)
    if hasBOM and encoding == 'utf-8':
        f.read(1)
    return f
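
A quick, hypothetical way to exercise the function above (the file names are made up and assumed to exist):

f = OpenTextFile('chinese-utf8.txt')   # BOM found -> opened as utf-8, BOM skipped
print repr(f.read())
f.close()

f = OpenTextFile('plain-ansi.txt')     # no BOM -> opened with the default 'utf-8'
print repr(f.read())
f.close()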
 
