Character Encodings and display of strings

JKPeck · Nov 13, 2006

I am trying to understand why, with nonwestern strings, I sometimes get
a hex display and sometimes get the string printed as characters.

With my Python locale set to Japanese and with or without a # coding of
cp932 (this is Windows) at the top of the file, I read a list of
Japanese strings into a list, say, catlis.

With this code
for item in catlis:
    print item
print catlis
print " ".join(catlis)

The first print (print item) displays the Japanese text as characters.
The second print (print catlis) displays a list with the double-byte
characters in hex notation.
The third print (print " ".join(catlis)) prints a combined string of
Japanese characters properly.

According to the print documentation,
"If an object is not a string, it is first converted to a string using
the rules for string conversions"

but the result is different with a list of strings.

The hex display looks like this:
['id', '\x90\xab\x95\xca', '\x90\xb6\x94N\x8c\x8e\x93\xfa',
'\x8fA\x8aw\x94N\x90\x94', '\x90E\x8e\xed', '\x8b\x8b\x97^',
'\x8f\x89\x94C\x8b\x8b', '\x8d\xdd\x90\xd0\x8c\x8e\x90\x94',
'\x90E\x96\xb1\x8co\x97\xf0', '\x90l\x8e\xed']

and correctly shows the hex values of the Japanese characters.

Why are these different?

TIA,
Jon Peck

Fredrik Lundh · Nov 13, 2006

JKPeck said:
I am trying to understand why, with nonwestern strings, I sometimes get
a hex display and sometimes get the string printed as characters.

With my Python locale set to Japanese and with or without a # coding of
cp932 (this is Windows) at the top of the file, I read a list of
Japanese strings into a list, say, catlis.

With this code
for item in catlis:
    print item
print catlis
print " ".join(catlis)

The first print (print item) displays the Japanese text as characters.
The second print (print catlis) displays a list with the double-byte
characters in hex notation.
The third print (print " ".join(catlis)) prints a combined string of
Japanese characters properly.

According to the print documentation,
"If an object is not a string, it is first converted to a string using
the rules for string conversions"

but the result is different with a list of strings.

a list is not a string, so it's converted to one using the standard list representation
rules -- which is to do repr() on all the items, and add brackets and commas as
necessary.
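
A minimal sketch of that rule, written for Python 3 as an assumption (the thread itself uses Python 2, where str is a byte string, so bytes stands in for it here). The name catlis comes from the question, and the two sample values are copied from the hex dump in the original post.

```python
# Python 3 sketch: bytes objects stand in for Python 2 byte strings.
catlis = [b"id", b"\x90\xab\x95\xca"]

# str() of a list calls repr() on every item and adds brackets and commas,
# so the non-ASCII bytes show up as \x.. escapes.
as_list = str(catlis)
by_hand = "[" + ", ".join(repr(item) for item in catlis) + "]"
print(as_list)
print(as_list == by_hand)

# Joining the items yields a single byte string rather than a list, so no
# list representation (and no repr() of the items) is involved. In Python 2,
# printing that string sends the raw bytes straight to the console.
joined = b" ".join(catlis)
print(type(joined).__name__)
```

The same reasoning explains the first print in the question: each item is already a string, so it is written as-is instead of going through repr().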

for some more tips on printing, see:

http://effbot.org/zone/python-list.htm#printing

</F>

JKPeck · Nov 13, 2006

Thanks for the quick answer. I thought repr was involved here, but
when I use repr explicitly I get a notation where the backslashes are
escaped. I also thought that, with the encoding explicitly declared in
the source, repr would take that into account and use the
character form, but obviously it doesn't.

Diez B. Roggisch · Nov 13, 2006

JKPeck said:
Thanks for the quick answer. I thought repr was involved here, but
when I use repr explicitly I get a notation where the backslashes are
escaped. I also thought that, with the encoding explicitly declared in
the source, repr would take that into account and use the
character form, but obviously it doesn't.

The encoding in the source has nothing to do with that. How should an
encoding (and possibly a gazillion different ones in a gazillion other
source files of yours) influence the list repr code?

The encoding of the source file is used solely to parse unicode
literals correctly: the parser needs it to turn the bytes that make up
the literal in the source code into a unicode object.
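
A rough illustration of that point, assuming Python 3 (where every plain string literal is a unicode literal). The helper literal_under is hypothetical and exists only for this demo: it compiles the same source bytes under two different coding lines. The sample bytes come from the hex dump in the question, and cp437 is an arbitrary second encoding picked for contrast.

```python
# Python 3 sketch. RAW holds the cp932 bytes of one value from the hex dump.
RAW = b"\x90\xab\x95\xca"

def literal_under(coding):
    """Compile a tiny source file with the given coding line; return its literal."""
    source = (b"# -*- coding: " + coding.encode("ascii") + b" -*-\n"
              b"text = '" + RAW + b"'\n")
    namespace = {}
    exec(compile(source, "<demo>", "exec"), namespace)
    return namespace["text"]

as_cp932 = literal_under("cp932")
as_cp437 = literal_under("cp437")

# The declaration decides how the parser decodes the bytes of the literal...
print(as_cp932 == RAW.decode("cp932"))
print(as_cp437 == RAW.decode("cp437"))
print(as_cp932 == as_cp437)

# ...but repr() of the byte string itself never consults it.
print(repr(RAW))
```

The declaration only matters at parse time; once the program runs, repr() has no record of which coding line the source file carried.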

Diez

JKPeck · Nov 13, 2006

It seemed to me that this sentence

For many types, this function makes an attempt to return a string that
would yield an object with the same value when passed to eval().

might mean that the encoding setting of the source file might influence
how repr represented the contents of the string. Nothing to do with
Unicode. If a source file could have a declared encoding of, say,
cp932 via the # coding comment, I thought there was a chance that eval
would respond to that, too.

Leo Kislov · Nov 13, 2006

JKPeck said:
It seemed to me that this sentence

For many types, this function makes an attempt to return a string that
would yield an object with the same value when passed to eval().

might mean that the encoding setting of the source file might influence
how repr represented the contents of the string. Nothing to do with
Unicode. If a source file could have a declared encoding of, say,
cp932 via the # coding comment, I thought there was a chance that eval
would respond to that, too.

Not a chance

Encoding is a property of an input/output object
(console, web page, plain text file, MS Word file, etc.). Every
input/output object has its own rules that determine its encoding;
there is absolutely no connection between the encoding of the source file
and that of any other input/output object.

repr escapes bytes 128..255 because it doesn't know where you're going
to output its result, so it uses the safest encoding: ascii.
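
One way to act on this, sketched in Python 3 as an assumption (bytes again stands in for a Python 2 byte string): leave repr() alone and decode explicitly once you know the encoding of the data. The helper show_items is hypothetical, and the sample values are copied from the hex dump in the question.

```python
# Python 3 sketch; bytes plays the role of a Python 2 byte string.
raw_items = [b"id", b"\x90\xab\x95\xca", b"\x90E\x8e\xed"]

# repr() escapes every byte from 128 to 255, so its result is pure ASCII
# and can be written to any output without knowing that output's encoding.
print(all(ord(ch) < 128 for item in raw_items for ch in repr(item)))

def show_items(items, encoding):
    """Hypothetical helper: decode each item for display in a known encoding."""
    return "[" + ", ".join(item.decode(encoding) for item in items) + "]"

# Once the encoding of the data is known, decode explicitly for display.
readable = show_items(raw_items, "cp932")

# print(readable) would show the characters on a console able to encode them;
# ascii() gives an escaped, ASCII-only form that is safe on any output.
print(ascii(readable))
```

The decision about the encoding is made where the output happens, which is the only place that knows what the console or file expects.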

-- Leo

Martin Miller · Nov 14, 2006

It is possible to derive your own string class from the built-in one and
override what 'repr' does (and make it do whatever you want). Here's an
example of what I mean:

##### Sample #####

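# -*- coding: iso-8859-1 -*-

# Special string class to override the default
# representation method. Main purpose is to
# prefer using double quotes and avoid hex
# representation on chars with an ord >= 128
class MsgStr(str):

    def __repr__(self):
        asciispace = ord(' ')
        # Pick the quote character that occurs less often in the string;
        # on a tie, prefer double quotes.
        if self.count("'") >= self.count('"'):
            quotechar = '"'
        else:
            quotechar = "'"

        rep = [quotechar]
        for ch in self:
            if ord(ch) < asciispace:
                # Control characters keep their escaped form, e.g. \t
                rep += repr(str(ch)).strip("'")
            elif ch == quotechar or ch == "\\":
                # Escape the chosen quote character and the backslash
                # itself so that the result stays unambiguous.
                rep += "\\"
                rep += ch
            else:
                # Everything else, including chars >= 128, is kept as-is.
                rep += ch
        rep += quotechar

        return "".join(rep)

if __name__ == "__main__":
    s = MsgStr("\tWürttemberg\"")
    print s
    print repr(s)
    print str(s)
    print repr(str(s))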
