Puzzled by code pages

Thread starter: Adam Tauno Williams

Adam Tauno Williams
I'm trying to process OpenStep plist files in Python. I have a parser
which works, but only for strict ASCII. However plist files may contain
accented characters - equivalent to ISO-8859-2 (I believe). For example
I read in the line:
' "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
NSFileName;\n'

What is the correct way to re-encode this data into UTF-8 so I can use
unicode strings, and then write the output back to ISO8859-?

I can read the file using codecs as ISO8859-2, but it still doesn't seem
correct.
u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
NSFileName;\n'
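(Editorial aside: a quick check, sketched in Python 3 syntax, of what those bytes actually are. The byte pair \xc3\xa0 is the UTF-8 encoding of 'à', as in the Italian "località", while decoding the same two bytes as ISO-8859-2 produces exactly the \u0102\xa0 sequence shown above. The sample text is taken from the quoted line.)

```python
# The raw bytes from the plist line quoted above.
raw = b'localit\xc3\xa0 termali'

# Decoded as UTF-8, \xc3\xa0 is one character: U+00E0 ('a' with grave).
as_utf8 = raw.decode('utf-8')

# Decoded as ISO-8859-2, the same two bytes become two characters:
# U+0102 ('A' with breve) followed by U+00A0 (no-break space).
as_latin2 = raw.decode('iso8859-2')

print(repr(as_utf8))     # 'località termali'
print(repr(as_latin2))   # 'localitĂ\xa0 termali'
```

This suggests the file is UTF-8 on disk, and reading it as ISO-8859-2 is what manufactures the strange characters.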
 

Lie Ryan

> I'm trying to process OpenStep plist files in Python. I have a parser
> which works, but only for strict ASCII. However plist files may contain
> accented characters - equivalent to ISO-8859-2 (I believe). For example
> I read in the line:
>
> ' "skyp4_filelist_10201/localit\xc3\xa0 termali_sortfield" =
> NSFileName;\n'

I presume you're using Python 2.x.
> What is the correct way to re-encode this data into UTF-8 so I can use
> unicode strings, and then write the output back to ISO8859-?
>
> I can read the file using codecs as ISO8859-2, but it still doesn't seem
> correct.
>
> u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
> NSFileName;\n'

When printing in the interactive interpreter, Python shows the __repr__
representation by default. If you want the __str__ representation, use
"print data" (note: your terminal must support printing unicode
characters). Either way, even though the string looks like '\u0102' when
printed on the terminal, the binary pattern in memory should correctly
represent the accented character.

f = codecs.open("in.txt", 'rb', encoding="iso8859-2")   # decodes iso8859-2 bytes to unicode
f2 = codecs.open("out.txt", 'wb', encoding="utf-8")     # encodes unicode to utf-8 bytes
s = f.read()
f2.write(s)
f.close()
f2.close()
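(Editorial aside: a self-contained round trip of the snippet above, with the input file created first so the example runs on its own. The file names and the sample text are placeholders, not from the thread.)

```python
import codecs
import os
import tempfile

# Create a small ISO-8859-2 file so the example is self-contained.
tmpdir = tempfile.mkdtemp()
in_path = os.path.join(tmpdir, 'in.txt')
out_path = os.path.join(tmpdir, 'out.txt')

with codecs.open(in_path, 'wb', encoding='iso8859-2') as f:
    f.write(u'sample: \u0102\xa0\n')   # characters from the thread

# The transcode itself, as in the snippet above.
with codecs.open(in_path, 'rb', encoding='iso8859-2') as f:
    s = f.read()                       # unicode text
with codecs.open(out_path, 'wb', encoding='utf-8') as f2:
    f2.write(s)                        # re-encoded as utf-8 bytes

# Reading the output back as UTF-8 yields the same unicode text.
with codecs.open(out_path, 'rb', encoding='utf-8') as f3:
    assert f3.read() == s
```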
 

Adam Tauno Williams

> I presume you're using Python 2.x.

Yes. But the days of all-unicode-strings will be wonderful when they
come. :)
> When printing in the interactive interpreter, python uses __repr__
> representation by default. If you want to use __str__ representation use
> "print data" (note, your terminal must support printing unicode
> characters);

Using GNOME Terminal, so Unicode characters should display correctly.
And I do see the characters when I 'cat' the file.
> either way, even though the string looks like '\u0102' when
> printed on the terminal, the binary pattern inside the memory should
> correctly represents the accented character.

Yep. But in the interpreter both unicode() and repr() produce the same
output. Nothing displays the accented character.

h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
data = h.read()
h.close()
str(data)

UnicodeEncodeError: 'ascii' codec can't encode characters in position
33-34: ordinal not in range(128)

unicode(data)
u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
NSFileName;\n'

repr(data)
'u\' "skyp4_filelist_10201/localit\\u0102\\xa0 termali_sortfield" =
NSFileName;\\n\''

I think I'm getting close. Parsing the file seems to work, and while
writing it out does not error, rereading my own output fails. :(
Possibly I'm 'accidentally' writing the output as UTF-8 and not
ISO8859-2. I need the internal data to be UTF-8 but read as ISO8859-2
and rewritten back to ISO8859-2 [at least that is what I believe from
the OpenStep files I'm seeing].

What is the 'official' way to encode something from UTF-8 to another
code page? I *assumed* that if I wrote a unicode stream back through:

h = codecs.open(output_filename, 'wb', encoding='iso8859-2')
data = writer.store(defaults)
h.write(data)
h.close()

that it would be re-encoded [word?]. But maybe not?
 

Lie Ryan

> Yes. But the days of all-unicode-strings will be wonderful when it
> comes. :)
>
> Using GNOME Terminal, so Unicode characters should display correctly.
> And I do see the characters when I 'cat' the file.

'cat' works because it operates on bytes and doesn't try to interpret
the stream it is writing. You can tell Python to output a byte string
instead of unicode to get the same effect.
> h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> data = h.read()
> h.close()
> str(data)
>
> 'ascii' codec can't encode characters in position 33-34: ordinal not in
> range(128)

This means either your terminal can't print unicode, or Python for some
reason thinks the terminal is an ASCII terminal. You can encode the
string manually, e.g.:

print u'\u0102\xa0'.encode('utf-8')

or you can figure out how to set up your terminal so that Python
recognizes it as a UTF-8 terminal; see
http://drj11.wordpress.com/2007/05/14/python-how-is-sysstdoutencoding-chosen/

When Python tries to print a unicode object, it first needs to encode
that 'unicode' object into a 'str'; by default, Python uses
sys.stdout.encoding to determine the encoding to use when printing a
unicode object.
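(Editorial aside: a small sketch of that last point. The reported encoding varies by platform and terminal setup, so no particular value is assumed here; the sample characters are the ones from the thread.)

```python
import sys

# The encoding Python will use when printing unicode objects; may be
# None when stdout is a pipe, or e.g. 'UTF-8' on a configured terminal.
print(sys.stdout.encoding)

# If it is ascii (or None), sidestep it by encoding manually first.
text = u'\u0102\xa0'
encoded = text.encode('utf-8')   # the bytes b'\xc4\x82\xc2\xa0'
```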
> unicode(data)
> u' "skyp4_filelist_10201/localit\u0102\xa0 termali_sortfield" =
> NSFileName;\n'

If data is a 'unicode', this is not surprising, as 'unicode(data)'
simply returns 'data'.
> I think I'm getting close. Parsing the file seems to work, and while
> writing it out does not error, rereading my own output fails. :(
> Possibly I'm 'accidentally' writing the output as UTF-8 and not
> ISO8859-2. I need the internal data to be UTF-8 but read as ISO8859-2
> and rewritten back to ISO8859-2 [at least that is what I believe from
> the OpenStep files I'm seeing].

A unicode string doesn't have an encoding (well, Python needs some
encoding to store the unicode data in RAM, but that's an implementation
detail). A unicode string is not a stream of bytes encoded in a specific
way; it's an encoding-agnostic block of text.
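(Editorial aside: a small illustration of that point. One text, several byte representations; 'á' (U+00E1) is used because it exists in both charsets involved here.)

```python
# One encoding-agnostic text ('a' with acute, U+00E1)...
text = u'localit\xe1'

# ...maps to different byte strings depending on the target encoding.
assert text.encode('iso8859-2') == b'localit\xe1'    # one byte for the accent
assert text.encode('utf-8') == b'localit\xc3\xa1'    # two bytes for the accent

# Decoding reverses each mapping; the bytes differ, the text does not.
assert b'localit\xe1'.decode('iso8859-2') == b'localit\xc3\xa1'.decode('utf-8')
```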
> What is the 'official' way to encode something from UTF-8 to another
> code page. I *assumed* that if I wrote a unicode stream back through:
>
> h = codecs.open(output_filename, 'wb', encoding='iso8859-2')
> data = writer.store(defaults)
> h.write(data)
> h.close()

What's "writer.store(defaults)"? It should return a 'unicode' if you
want h.write() to work properly. Otherwise, if data is a 'str', h.write
will try to decode the 'str' to 'unicode' using the default decoder
(usually ascii), then encode that 'unicode' to 'iso8859-2'.
 

Mark Tolonen

[snip]
> Yep. But in the interpreter both unicode() and repr() produce the same
> output. Nothing displays the accented character.
>
> h = codecs.open('file.txt', 'rb', encoding='iso8859-2')
> data = h.read()
> h.close()
> str(data)

Here you are correctly reading an iso8859-2-encoded file and converting it
to Unicode.

Try "print data". "str(data)" converts from Unicode strings to byte
strings, but only uses the default encoding, which is 'ascii'. print will
use the stdout encoding of your terminal, if known. Try this on your
system (mine is Windows XP):

>>> import sys
>>> sys.stdout.encoding
'cp437'

You should only attempt to "print" Unicode strings or byte strings encoded
in the stdout encoding. Printing byte strings in any other encoding will
often print garbage.

[snip]
> I think I'm getting close. Parsing the file seems to work, and while
> writing it out does not error, rereading my own output fails. :(
> Possibly I'm 'accidentally' writing the output as UTF-8 and not
> ISO8859-2. I need the internal data to be UTF-8 but read as ISO8859-2
> and rewritten back to ISO8859-2 [at least that is what I believe from
> the OpenStep files I'm seeing].

"internal data" is Unicode, not UTF-8. Unicode is the absence of an
encoding (Python uses UTF-16 or UTF-32 internally, but that is an
implementation detail). UTF-8 is a byte-encoding.

If you actually need the internal data as UTF-8 (maybe you are working
with a library that expects UTF-8 byte strings), then:

(process data as UTF-8 here).
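(Editorial aside: the code for that step did not survive in this copy of the post. A hedged reconstruction of the read-as-iso8859-2 / process-as-utf-8 / write-as-iso8859-2 flow might look like the following; the file and its contents are stand-ins, created first so the sketch runs on its own.)

```python
import codecs
import os
import tempfile

# Set up a small iso8859-2 file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), 'file.txt')
with codecs.open(path, 'wb', encoding='iso8859-2') as h:
    h.write(u'localit\xe1\n')

# Read the file, decoding its iso8859-2 bytes to unicode.
with codecs.open(path, 'rb', encoding='iso8859-2') as h:
    text = h.read()

# Encode the unicode text to UTF-8 *bytes* for a UTF-8-only library.
utf8_bytes = text.encode('utf-8')
# (process utf8_bytes as UTF-8 here)

# Decode back to unicode, then write it out as iso8859-2 again.
with codecs.open(path, 'wb', encoding='iso8859-2') as h:
    h.write(utf8_bytes.decode('utf-8'))
```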

Note you *decode* byte strings to Unicode and *encode* Unicode into byte
strings.

-Mark
 
