Character Encodings and display of strings

JKPeck

I am trying to understand why, with non-Western strings, I sometimes get
a hex display and sometimes get the string printed as characters.

With my Python locale set to Japanese, and with or without a # coding
declaration of cp932 (this is Windows) at the top of the file, I read
Japanese strings into a list, say, catlis.

With this code
for item in catlis:
    print item
print catlis
print " ".join(catlis)

the first print (print item) displays the Japanese text as characters.
The second print (print catlis) displays a list with the double-byte
characters in hex notation.
The third print (print " ".join(catlis)) prints a combined string of
Japanese characters properly.

According to the print documentation,
"If an object is not a string, it is first converted to a string using
the rules for string conversions"

but the result is different with a list of strings.

The hex display looks like this:
['id', '\x90\xab\x95\xca', '\x90\xb6\x94N\x8c\x8e\x93\xfa',
'\x8fA\x8aw\x94N\x90\x94', '\x90E\x8e\xed', '\x8b\x8b\x97^',
'\x8f\x89\x94C\x8b\x8b', '\x8d\xdd\x90\xd0\x8c\x8e\x90\x94',
'\x90E\x96\xb1\x8co\x97\xf0', '\x90l\x8e\xed']

and correctly shows the hex values of the Japanese characters.

Why are these different?

TIA,
Jon Peck
 
Fredrik Lundh

JKPeck said:
According to the print documentation,
"If an object is not a string, it is first converted to a string using
the rules for string conversions"

but the result is different with a list of strings.

a list is not a string, so it's converted to one using the standard list representation
rules -- which is to do repr() on all the items, and add brackets and commas as
necessary.

for some more tips on printing, see:

http://effbot.org/zone/python-list.htm#printing

</F>
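
A minimal sketch of the rule described above, using the first two entries of the list from the question. It uses bytes literals so that it behaves the same way on the Python 2 of this thread and on Python 3 (where each item is shown with a b prefix). The names as_list, by_hand, joined and text are only for illustration.

```python
# Two cp932-encoded byte strings taken from the list in the question.
catlis = [b"id", b"\x90\xab\x95\xca"]

# str() of a list calls repr() on each item and adds brackets and commas.
as_list = str(catlis)
by_hand = "[" + ", ".join(repr(item) for item in catlis) + "]"
print(as_list == by_hand)  # True: both routes produce the same text

# Joining produces a single byte string, so no repr() is involved.
# Decoding it with the right codec gives text that a console able to
# show Japanese can display as characters.
joined = b" ".join(catlis)
text = joined.decode("cp932")
```

Printing each item, or the joined string, sends the bytes themselves to the console, which is why only print catlis shows the escapes.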
 
JKPeck

Thanks for the quick answer. I thought repr was involved here, but
when I use repr explicitly I get a notation where the backslashes are
escaped. I also thought that, with the encoding explicitly declared in
the source, repr would take that into account and use the
character form, but obviously it doesn't.
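
The thread does not show exactly what was typed, so this is only a guess at the cause: doubled backslashes usually appear when repr is applied twice, for example when the interactive prompt echoes the result of repr(x), or when a list of repr strings is printed. A short sketch under that assumption (runs on Python 2 and 3):

```python
item = b"\x90\xab"   # one cp932 character as raw bytes

once = repr(item)    # text containing single backslashes
twice = repr(once)   # repr of that text: each backslash is escaped again

print(once)
print(twice)
```

The first line printed matches the notation inside the list display; the second shows the extra level of escaping.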
 
Diez B. Roggisch

JKPeck said:
Thanks for the quick answer. I thought repr was involved here, but
when I use repr explicitly I get a notation where the backslashes are
escaped. I also thought that, with the encoding explicitly declared in
the source, repr would take that into account and use the
character form, but obviously it doesn't.

The encoding in the source has nothing to do with that. How should an
encoding (and possibly a gazillion different ones in a gazillion other
source files of yours) influence the list repr code?

The encoding declared in the source file is used solely to parse unicode
literals correctly: each literal has to be decoded from the bytes in
which it appears in the source code, and that needs a known encoding.

Diez
 
JKPeck

It seemed to me that this sentence

For many types, this function makes an attempt to return a string that
would yield an object with the same value when passed to eval().

might mean that the encoding setting of the source file might influence
how repr represented the contents of the string. Nothing to do with
Unicode. If a source file could have a declared encoding of, say,
cp932 via the # coding comment, I thought there was a chance that eval
would respond to that, too.
 
Leo Kislov

JKPeck said:
It seemed to me that this sentence

For many types, this function makes an attempt to return a string that
would yield an object with the same value when passed to eval().

might mean that the encoding setting of the source file might influence
how repr represented the contents of the string. Nothing to do with
Unicode. If a source file could have a declared encoding of, say,
cp932 via the # coding comment, I thought there was a chance that eval
would respond to that, too.

Not a chance :) Encoding is a property of an input/output object
(console, web page, plain text file, MS Word file, etc.). Every
input/output object has specific rules determining its encoding, and
there is absolutely no connection between the encoding of the source
file and that of any other input/output object.

repr escapes bytes 128..255 because it doesn't know where you're going
to output its result, so it uses the safest encoding: ascii.

-- Leo
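
A small sketch of both points, using bytes from the question. It should run unchanged on Python 2 and 3, and the variable names are illustrative only.

```python
raw = b"\x90\xab\x95\xca"   # cp932 bytes from the question

shown = repr(raw)
# Every character of the repr is plain ASCII, whatever the bytes were.
print(all(ord(ch) < 128 for ch in shown))   # True

# The escaped form is a valid literal, so eval() rebuilds the same bytes
# without needing to know any source-file encoding.
print(eval(shown) == raw)                   # True
```

This is how the escaped form satisfies the sentence quoted from the documentation: it round-trips through eval() regardless of any # coding declaration.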
 
Martin Miller

It is possible to derive your own string class from the built-in one and
override what 'repr' does (and make it do whatever you want). Here's an
example of what I mean:

##### Sample #####

# -*- coding: iso-8859-1 -*-

# Special string class to override the default
# representation method. Main purpose is to
# prefer using double quotes and avoid hex
# representation on chars with an ord > 127
class MsgStr(str):

    def __repr__(self):
        asciispace = ord(' ')
        # Pick the quote character that needs the least escaping.
        if self.count("'") >= self.count('"'):
            quotechar = '"'
        else:
            quotechar = "'"

        rep = [quotechar]
        for ch in self:
            if ord(ch) < asciispace:
                # Control characters keep their standard escape (\t, \n, ...).
                rep += repr(str(ch)).strip("'")
            elif ch == quotechar:
                rep += "\\"
                rep += ch
            else:
                # Everything else, including bytes above 127, is kept as-is.
                rep += ch
        rep += quotechar

        return "".join(rep)

if __name__ == "__main__":
    s = MsgStr("\tWürttemberg\"")
    print s
    print repr(s)
    print str(s)
    print repr(str(s))
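
Because a list builds its display by calling repr() on each item, an override like this also takes effect when the whole list is printed. The sketch below shows that with a deliberately tiny subclass. Quoted is a hypothetical name, not part of the sample above, and it keeps to ASCII so that it runs the same on Python 2 and 3.

```python
class Quoted(str):
    """Hypothetical str subclass whose repr always uses double quotes."""

    def __repr__(self):
        # str + Quoted concatenation yields a plain str.
        return '"' + self + '"'

plain = ["id", "name"]
custom = [Quoted("id"), Quoted("name")]

print(plain)    # ['id', 'name']
print(custom)   # ["id", "name"]
```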
 
