Puzzled by code pages

Thread starter: Adam Tauno Williams

Adam Tauno Williams
I'm trying to process OpenStep plist files in Python. I have a parser
which works, but only for strict ASCII. However plist files may contain
accented characters - equivalent to ISO-8859-2 (I believe). For example
I read in the line:
' "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
NSFileName;\n'

What is the correct way to re-encode this data into UTF-8 so I can use
unicode strings, and then write the output back to ISO8859-?

I can read the file using codecs as ISO8859-2, but it still doesn't seem
correct.
u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
NSFileName;\n'
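(Editorial aside: a quick check, sketched in Python 3 syntax, of what those bytes actually are. The byte pair \xc3\xa0 is the UTF-8 encoding of 'à', as in the Italian "località", while decoding the same two bytes as ISO-8859-2 produces exactly the \u0102\xa0 sequence shown above. The sample text is taken from the quoted line.)

```python
# The raw bytes from the plist line quoted above.
raw = b'localit\xc3\xa0 termali'

# Decoded as UTF-8, \xc3\xa0 is one character: U+00E0 ('a' with grave).
as_utf8 = raw.decode('utf-8')

# Decoded as ISO-8859-2, the same two bytes become two characters:
# U+0102 ('A' with breve) followed by U+00A0 (no-break space).
as_latin2 = raw.decode('iso8859-2')

print(repr(as_utf8))     # 'località termali'
print(repr(as_latin2))   # 'localitĂ\xa0 termali'
```

This suggests the file is UTF-8 on disk, and reading it as ISO-8859-2 is what manufactures the strange characters.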
 

Lie Ryan

> I'm trying to process OpenStep plist files in Python. I have a parser
> which works, but only for strict ASCII. However plist files may contain
> accented characters - equivalent to ISO-8859-2 (I believe). For example
> I read in the line:
>
> ' "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
> NSFileName;\n'

I presume you're using Python 2.x.
> What is the correct way to re-encode this data into UTF-8 so I can use
> unicode strings, and then write the output back to ISO8859-?
>
> I can read the file using codecs as ISO8859-2, but it still doesn't seem
> correct.
>
> u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
> NSFileName;\n'

When printing in the interactive interpreter, Python shows the __repr__
representation by default. If you want the __str__ representation, use
"print data" (note: your terminal must support printing unicode
characters). Either way, even though the string looks like '\u0102' when
printed on the terminal, the binary pattern in memory should correctly
represent the accented character.

f = codecs.open("in.txt", 'rb', encoding="iso8859-2")   # decodes iso8859-2 bytes to unicode
f2 = codecs.open("out.txt", 'wb', encoding="utf-8")     # encodes unicode to utf-8 bytes
s = f.read()
f2.write(s)
f.close()
f2.close()
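(Editorial aside: a self-contained round trip of the snippet above, with the input file created first so the example runs on its own. The file names and the sample text are placeholders, not from the thread.)

```python
import codecs
import os
import tempfile

# Create a small ISO-8859-2 file so the example is self-contained.
tmpdir = tempfile.mkdtemp()
in_path = os.path.join(tmpdir, 'in.txt')
out_path = os.path.join(tmpdir, 'out.txt')

with codecs.open(in_path, 'wb', encoding='iso8859-2') as f:
    f.write(u'sample: \u0102\xa0\n')   # characters from the thread

# The transcode itself, as in the snippet above.
with codecs.open(in_path, 'rb', encoding='iso8859-2') as f:
    s = f.read()                       # unicode text
with codecs.open(out_path, 'wb', encoding='utf-8') as f2:
    f2.write(s)                        # re-encoded as utf-8 bytes

# Reading the output back as UTF-8 yields the same unicode text.
with codecs.open(out_path, 'rb', encoding='utf-8') as f3:
    assert f3.read() == s
```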
 

Adam Tauno Williams

> I presume you're using Python 2.x.

Yes. But the days of all-unicode-strings will be wonderful when they
come. :)
> When printing in the interactive interpreter, python uses __repr__
> representation by default. If you want to use __str__ representation use
> "print data" (note, your terminal must support printing unicode
> characters);

Using GNOME Terminal, so Unicode characters should display correctly.
And I do see the characters when I 'cat' the file.
> either way, even though the string looks like '\u0102' when
> printed on the terminal, the binary pattern inside the memory should
> correctly represents the accented character.

Yep. But in the interpreter both unicode() and repr() produce the same
output. Nothing displays the accented character.

h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
data = h.read()
h.close()
str(data)

UnicodeEncodeError: 'ascii' codec can't encode characters in position
33-34: ordinal not in range(128)

unicode(data)
u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
NSFileName;\n'

repr(data)
'u\' "skyp4_filelist_10201/localit\\u0102\\xa0 termali_sortfield" =
NSFileName;\\n\''

I think I'm getting close. Parsing the file seems to work, and while
writing it out does not error, rereading my own output fails. :(
Possibly I'm 'accidentally' writing the output as UTF-8 and not
ISO8859-2. I need the internal data to be UTF-8 but read as ISO8859-2
and rewritten back to ISO8859-2 [at least that is what I believe from
the OpenStep files I'm seeing].

What is the 'official' way to encode something from UTF-8 to another
code page? I *assumed* that if I wrote a unicode stream back through:

h = codecs.open(output_filename, 'wb', encoding='iso8859-2')
data = writer.store(defaults)
h.write(data)
h.close()

that it would be re-encoded [word?]. But maybe not?
 

Lie Ryan

> Yes. But the days of all-unicode-strings will be wonderful when it
> comes. :)
>
> Using GNOME Terminal, so Unicode characters should display correctly.
> And I do see the characters when I 'cat' the file.

'cat' works because it operates on bytes and doesn't try to interpret
the stream it is writing. You can tell Python to output a byte string
instead of unicode to get the same effect.
> h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> data = h.read()
> h.close()
> str(data)
>
> 'ascii' codec can't encode characters in position 33-34: ordinal not in
> range(128)

This means either your terminal can't print unicode, or Python for some
reason thinks the terminal is an ASCII terminal. You can encode the
string manually, e.g.:

print u'\u0102\xa0'.encode('utf-8')

or you can figure out how to set up your terminal so that Python
recognizes it as a UTF-8 terminal; see
http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/

When Python tries to print a unicode object, it first needs to encode
that 'unicode' object into a 'str'; by default, Python uses
sys.stdout.encoding to determine the encoding to use when printing a
unicode object.
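(Editorial aside: a small sketch of that last point. The reported encoding varies by platform and terminal setup, so no particular value is assumed here; the sample characters are the ones from the thread.)

```python
import sys

# The encoding Python will use when printing unicode objects; may be
# None when stdout is a pipe, or e.g. 'UTF-8' on a configured terminal.
print(sys.stdout.encoding)

# If it is ascii (or None), sidestep it by encoding manually first.
text = u'\u0102\xa0'
encoded = text.encode('utf-8')   # the bytes b'\xc4\x82\xc2\xa0'
```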
> unicode(data)
> u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
> NSFileName;\n'

If data is a 'unicode', this is not surprising, as 'unicode(data)'
simply returns 'data'.
> I think I'm getting close. Parsing the file seems to work, and while
> writing it out does not error, rereading my own output fails. :(
> Possibly I'm 'accidentally' writing the output as UTF-8 and not
> ISO8859-2. I need the internal data to be UTF-8 but read as ISO8859-2
> and rewritten back to ISO8859-2 [at least that is what I believe from
> the OpenStep files I'm seeing].

A unicode string doesn't have an encoding (well, Python needs some
encoding to store the unicode data in RAM, but that's an implementation
detail). A unicode string is not a stream of bytes encoded in a specific
way; it's an encoding-agnostic block of text.
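(Editorial aside: a small illustration of that point. One text, several byte representations; 'á' (U+00E1) is used because it exists in both charsets involved here.)

```python
# One encoding-agnostic text ('a' with acute, U+00E1)...
text = u'localit\xe1'

# ...maps to different byte strings depending on the target encoding.
assert text.encode('iso8859-2') == b'localit\xe1'    # one byte for the accent
assert text.encode('utf-8') == b'localit\xc3\xa1'    # two bytes for the accent

# Decoding reverses each mapping; the bytes differ, the text does not.
assert b'localit\xe1'.decode('iso8859-2') == b'localit\xc3\xa1'.decode('utf-8')
```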
> What is the 'official' way to encode something from UTF-8 to another
> code page. I *assumed* that if I wrote a unicode stream back through:
>
> h = codecs.open(output_filename, 'wb', encoding='iso8859-2')
> data = writer.store(defaults)
> h.write(data)
> h.close()

What's "writer.store(defaults)"? It should return a 'unicode' if you
want h.write() to work properly. Otherwise, if data is a 'str', h.write
will try to decode the 'str' to 'unicode' using the default decoder
(usually ascii), then encode that 'unicode' to 'iso8859-2'.
 

Mark Tolonen

[snip]
> Yep. But in the interpreter both unicode() and repr() produce the same
> output. Nothing displays the accented character.
>
> h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> data = h.read()
> h.close()
> str(data)

Here you are correctly reading an iso8859-2-encoded file and converting it
to Unicode.

Try "print data". "str(data)" converts from Unicode strings to byte
strings, but only uses the default encoding, which is 'ascii'. print will
use the stdout encoding of your terminal, if known. Try this on your
system (mine is Windows XP):

>>> import sys
>>> sys.stdout.encoding
'cp437'

You should only attempt to "print" Unicode strings or byte strings encoded
in the stdout encoding. Printing byte strings in any other encoding will
often print garbage.

[snip]
> I think I'm getting close. Parsing the file seems to work, and while
> writing it out does not error, rereading my own output fails. :(
> Possibly I'm 'accidentally' writing the output as UTF-8 and not
> ISO8859-2. I need the internal data to be UTF-8 but read as ISO8859-2
> and rewritten back to ISO8859-2 [at least that is what I believe from
> the OpenStep files I'm seeing].

"internal data" is Unicode, not UTF-8. Unicode is the absence of an
encoding (Python uses UTF-16 or UTF-32 internally, but that is an
implementation detail). UTF-8 is a byte-encoding.

If you actually need the internal data as UTF-8 (maybe you are working
with a library that expects UTF-8 byte strings), then:

(process data as UTF-8 here).
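(Editorial aside: the code for that step did not survive in this copy of the post. A hedged reconstruction of the read-as-iso8859-2 / process-as-utf-8 / write-as-iso8859-2 flow might look like the following; the file and its contents are stand-ins, created first so the sketch runs on its own.)

```python
import codecs
import os
import tempfile

# Set up a small iso8859-2 file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')
with codecs.open(path, 'wb', encoding='iso8859-2') as h:
    h.write(u'localit\xe1\n')

# Read the file, decoding its iso8859-2 bytes to unicode.
with codecs.open(path, 'rb', encoding='iso8859-2') as h:
    text = h.read()

# Encode the unicode text to UTF-8 *bytes* for a UTF-8-only library.
utf8_bytes = text.encode('utf-8')
# (process utf8_bytes as UTF-8 here)

# Decode back to unicode, then write it out as iso8859-2 again.
with codecs.open(path, 'wb', encoding='iso8859-2') as h:
    h.write(utf8_bytes.decode('utf-8'))
```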

Note you *decode* byte strings to Unicode and *encode* Unicode into byte
strings.

-Mark
 
