The same strings, different UTF-8 repr values?

slowness.chen · Sep 7, 2006

I have two files:

test.py:
--------------------------------------------------
# -*- encoding : utf8 -*-
print 'in this file', repr('中文')

# tt.txt is saved as utf8 encoding
f = file('tt.txt')
line1 = f.readline().strip()
print 'another file', repr(line1)
-------------------------------------------------------

tt.txt:
----------------------------------------------------
中文
test
-------------------------------------------------------
When I run test.py, I get the following output:
in this file '\xe4\xb8\xad\xe6\x96\x87'
another file '\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

And I can't encode line1 like this:
line1.decode('utf8').encode('gbk')
I get this error:
UnicodeEncodeError: 'gbk' codec can't encode character u'\ufeff' in position 0: illegal multibyte sequence

Why did I get different repr values?

John Machin · Sep 7, 2006

Because whatever you used to "save as" that file has retained or
inserted a BOM (byte order mark, U+FEFF) at the start of the file
before encoding as UTF-8. It's the '\xef\xbb\xbf' at the start of the
file, and also the u'\ufeff' that is giving the gbk codec indigestion.
You can remove it in your script.
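
One way to do that removal, as a minimal sketch rather than the only approach: check for the three BOM bytes and drop them before decoding, or decode with the 'utf-8-sig' codec (available since Python 2.5), which skips an optional BOM. The sketch below is written for Python 3, whereas the script in this thread is Python 2. The helper name strip_utf8_bom is hypothetical, and the bytes are copied from the output shown above instead of being read from tt.txt.

```python
import codecs


def strip_utf8_bom(raw):
    """Return raw bytes without a leading UTF-8 BOM, if one is present.

    Hypothetical helper for illustration; not part of the original script.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Bytes reported in the thread for the first line of tt.txt.
line1 = b'\xef\xbb\xbf\xe4\xb8\xad\xe6\x96\x87'

# Option 1: drop the BOM bytes by hand, then decode as plain UTF-8.
text = strip_utf8_bom(line1).decode('utf-8')

# Option 2: let the 'utf-8-sig' codec skip the BOM while decoding.
text_sig = line1.decode('utf-8-sig')

# With U+FEFF gone, the gbk codec has nothing to choke on.
gbk_bytes = text.encode('gbk')
```

Either option leaves only the two Chinese characters, so the later encode to gbk should no longer hit u'\ufeff'. Input that has no BOM passes through both options unchanged.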

HTH
John

Slowness Chen · Sep 8, 2006

Got it, thanks.
