the same strings, different utf-8 repr values?

S

slowness.chen

I have two files:

test.py:
--------------------------------------------------
# -*- encoding : utf8 -*-
print 'in this file', repr('ÖÐÎÄ')

# tt.txt is saved as utf8 encoding
f = file('tt.txt')
line1 = f.readline().strip()
print 'another file', repr(line1)
-------------------------------------------------------

tt.txt:
----------------------------------------------------
ÖÐÎÄ
test
-------------------------------------------------------
run test.py and I get the following output:
in this file '\xe4\xb8\xad\xe6\x96\x87'
another file '\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

and I cann't encode line1 like:
line1.decode('utf8').encode('gbk')
get this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in
position 0:
illegal multibyte sequence

why did I get the different repr values?
 
J

John Machin

I have two files:

test.py:
--------------------------------------------------
# -*- encoding : utf8 -*-
print 'in this file', repr('中文')

# tt.txt is saved as utf8 encoding
f = file('tt.txt')
line1 = f.readline().strip()
print 'another file', repr(line1)
-------------------------------------------------------

tt.txt:
----------------------------------------------------
中文
test
-------------------------------------------------------
run test.py and I get the following output:
in this file '\xe4\xb8\xad\xe6\x96\x87'
another file '\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

and I cann't encode line1 like:
line1.decode('utf8').encode('gbk')
get this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in
position 0:
illegal multibyte sequence

why did I get the different repr values?

Because whatever you used to "save as" that file has retained or
inserted a BOM (byte order mark, U+FEFF) at the start of the file
before encoding as UTF-8. It's the '\xef\xbb\xbf' at the start of the
file, and also the u'\ufeff' that is giving the gbk codec indigestion.
You can remove it in your script.

HTH
John
 
S

Slowness Chen

got it. thanks.
John Machin 写é“:
Because whatever you used to "save as" that file has retained or
inserted a BOM (byte order mark, U+FEFF) at the start of the file
before encoding as UTF-8. It's the '\xef\xbb\xbf' at the start of the
file, and also the u'\ufeff' that is giving the gbk codec indigestion.
You can remove it in your script.

HTH
John
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,780
Messages
2,569,608
Members
45,241
Latest member
Lisa1997

Latest Threads

Top