print UTF-8 file with BOM


davihigh

Hi Friends:

import codecs

fileObj = codecs.open( filename, "r", "utf-8" )
u = fileObj.read() # Returns a Unicode string from the UTF-8 bytes in the file
print u

It fails with this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

I want to know how to read a UTF-8 file, convert its contents to a
specified locale encoding (defaulting to the current system locale), and
print the resulting string. I would also like the BOM header to be
stripped automatically.

Rgds, David
 
davihigh

FYI: I just received a reply from a friend, who gave me the nice example
below.

I have one more question: how should I write this if I want to specify a
locale other than the current one? For example, the program runs on a
system with a Korean locale and tries to read a UTF-8 file containing
Chinese characters.

-------------- The code is here --------------------
import codecs

def read_utf8_txt_file(filename):
    fileObj = codecs.open(filename, "r", "utf-8")
    content = fileObj.read()
    fileObj.close()
    # Strip the BOM only if the file actually starts with one
    if content.startswith(u"\ufeff"):
        content = content[1:]
    print content
 
Carsten Haese

2005/12/23 said:
Hi Kuan:

Thanks a lot! One more question: how should I write this if I want to
specify a locale other than the current one?

For example, running on a system with a Korean locale and trying to read
a UTF-8 file that contains Chinese.

Use the encode method to translate the unicode object into whatever
encoding you want.

unicodeStr = ...
print unicodeStr.encode('big5')
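
As a minimal sketch of the encode step above: the helper name to_encoding and the sample text are hypothetical, and the fallback to locale.getpreferredencoding() is an assumption about what "current system locale" should mean here. Passing errors="replace" substitutes "?" for characters the target encoding cannot represent instead of raising UnicodeEncodeError.

```python
import locale

def to_encoding(text, encoding=None, errors="replace"):
    """Encode a unicode string to bytes in the requested encoding.

    Hypothetical helper: when no encoding is given, fall back to the
    locale's preferred encoding (an assumption, not part of the thread).
    """
    if encoding is None:
        encoding = locale.getpreferredencoding()
    return text.encode(encoding, errors)

# Sample data: a BOM followed by two Chinese characters.
sample = u"\ufeff\u4e2d\u6587"

# Drop the leading BOM, then encode to an explicitly chosen encoding.
data = to_encoding(sample.lstrip(u"\ufeff"), "big5")
```

The result is a byte string, so it can be written to a file or a binary stream regardless of the locale the program happens to run under.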

Hope this helps,

Carsten.
 
John Bauman

UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.
 
Walter Dörwald

John said:
UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.

However there's a pending patch (http://bugs.python.org/1177307) for a
new encoding named utf-8-sig, that would output a leading BOM on writing
and skip it on reading.
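
For interpreters that ship this codec under the name "utf-8-sig" (an assumption about your Python version, since the patch was still pending at the time of this post), a rough sketch of the difference on reading might look like the following. The temporary file and its contents are made up for illustration, and io.open is used here in place of codecs.open.

```python
import codecs
import io
import os
import tempfile

# Write a small UTF-8 file that starts with a BOM (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "bom_sample.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + u"\u4e2d\u6587".encode("utf-8"))

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
with io.open(path, "r", encoding="utf-8") as f:
    with_bom = f.read()

# "utf-8-sig" skips the BOM while decoding.
with io.open(path, "r", encoding="utf-8-sig") as f:
    without_bom = f.read()
```

With this codec there is no need to slice off the first character by hand, and a file without a BOM is read unchanged.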

Bye,
Walter Dörwald
 
Martin v. Löwis

John said:
UTF-8 shouldn't need a BOM, as it is designed for character streams, and
there is only one logical ordering of the bytes. Only UTF-16 and greater
should output a BOM, AFAIK.

Yes and no. Yes, UTF-8 does not need a BOM to identify endianness. No,
usage of the BOM with UTF-8 is explicitly allowed in the Unicode specs
(so output of the BOM doesn't *have* to be restricted to UTF-16 and
greater), and the BOM has a well-defined meaning for UTF-8 (namely,
as the UTF-8 signature).
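
To make the signature concrete, here is a small sketch using the BOM_UTF8 constant from the standard codecs module. The function name has_utf8_signature is hypothetical.

```python
import codecs

# The UTF-8 signature is U+FEFF encoded as UTF-8: the bytes EF BB BF.
signature = u"\ufeff".encode("utf-8")

def has_utf8_signature(raw_bytes):
    """Return True if the byte string starts with the UTF-8 signature."""
    return raw_bytes.startswith(codecs.BOM_UTF8)
```

Checking the raw bytes this way tells you whether a file carries the signature before you decide how to decode it.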

Regards,
Martin
 