string processing question

Kurt Mueller · Apr 30, 2009

Hi,

on a Linux system and python 2.5.1 I have the
following behaviour which I do not understand:

case 1

python -c 'a="ä"; print a ; print a.center(6,"-") ; b=unicode(a, "utf8"); print b.center(6,"-")'
ä
--ä--
--ä---

case 2
----- a UnicodeEncodeError in this case:

python -c 'a="ä"; print a ; print a.center(20,"-") ; b=unicode(a, "utf8"); print b.center(20,"-")' | cat

Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 9: ordinal not in range(128)
ä
--ä--

The behaviour changes if I pipe the output to another program or to a file.
Also, centering is not correct with the string a, but it is with the string b.

Could somebody please explain this to me?

Thanks in advance

Paul McGuire · Apr 30, 2009

Kurt Mueller said:
Hi,

on a Linux system and python 2.5.1 I have the
following behaviour which I do not understand:

case 1
python -c 'a="ä"; print a ; print a.center(6,"-") ; b=unicode(a, "utf8"); print b.center(6,"-")'

ä
--ä--
--ä---

Weird. What happens if you change the last print statement to:

print b.center(6,u"-")

-- Paul

Kurt Mueller · May 1, 2009

Same behavior.

I have an even more minimal example:

:> python -c 'print unicode("ä", "utf8")'
ä

:> python -c 'print unicode("ä", "utf8")' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)

The only difference is whether the output is piped to another program or to
a file.
Let's leave the other issue, the different centering, aside for the moment.

My goal is to have my python programs unicode enabled.

TIA

Kurt Mueller · May 1, 2009

Scott said:
To discover what is happening, try something like:
python -c 'for a in "ä", unicode("ä"): print len(a), a'

I suspect that in your encoding, "ä" is two bytes long, and in
unicode it is converted to a single character.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>

Yes it is. That is one of the two problems I see.
The solution for this is to unicode(<string>, <coding>) each string.
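
The length difference is also what throws off center(): the byte string is padded as if "ä" took two columns. A minimal sketch of that effect, written for Python 3, where bytes and text are separate types (in the Python 2 of this thread, str plays the role of bytes and unicode the role of text). The variable names are only illustrative:

```python
raw = b"\xc3\xa4"            # UTF-8 bytes for "ä", like the plain Python 2 str
text = raw.decode("utf-8")   # one code point, like unicode(a, "utf8")

raw_centered = raw.center(6, b"-")    # pads around 2 bytes
text_centered = text.center(6, "-")   # pads around 1 character

# Lengths of the two representations of "ä"
print(len(raw), len(text))

# Width as seen on a UTF-8 terminal: the byte version is one column short
print(len(raw_centered.decode("utf-8")), len(text_centered))
```

Padding is computed from len(), so centering only comes out right once the string has been decoded to text.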

I'd like to have my python programs unicode enabled.

:> python -c 'for a in "ä", unicode("ä"): print len(a), a'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

It seems that the default encoding is "ascii", so unicode() cannot cope
with "ä".
If I specify "utf8" for the encoding, unicode() works.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>

But the print statement yields a UnicodeEncodeError
if I pipe the output to a program or a file.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 0: ordinal not in range(128)
2 ä
1 :>

So it seems to me that piping the output changes the behavior of the
print statement:

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a,
len(a), type(a)'
ä 2 <type 'str'>
ä 1 <type 'unicode'>

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a,
len(a), type(a)' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 0: ordinal not in range(128)
ä 2 <type 'str'>
:>

How can I make my python programs unicode enabled, so that:
- Input strings can have different encodings (mostly ascii, latin_1 or utf8)
- My python programs always output "utf8"

Is that a good idea?

TIA

Sion Arrowsmith · May 1, 2009

Kurt Mueller said:
:> python -c 'print unicode("ä", "utf8")'
ä

:> python -c 'print unicode("ä", "utf8")' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)

$ python -c 'import sys; print sys.stdout.encoding'
UTF-8
$ python -c 'import sys; print sys.stdout.encoding' | cat
None

If print gets a Unicode string, it does an implicit
.encode(sys.stdout.encoding or sys.getdefaultencoding()) on it.
If you want your output to be guaranteed UTF-8, you'll need to
explicitly .encode("utf8") it yourself.

(I dare say this is slightly different in 3.x.)
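
The fallback described above can be sketched roughly as follows. This is only an illustration, not the interpreter's actual code: encode_for_stream is a hypothetical helper, and "ascii" stands in for what sys.getdefaultencoding() returned on Python 2.

```python
def encode_for_stream(text, stream_encoding):
    """Mimic the implicit step: use the stream's encoding if it has one,
    otherwise fall back to the default (hard-coded here as ASCII, which is
    what Python 2 reported; this helper is hypothetical)."""
    return text.encode(stream_encoding or "ascii")

text = u"\xe4"  # "ä"

# Terminal case: sys.stdout.encoding was "UTF-8"
to_terminal = encode_for_stream(text, "UTF-8")

# Piped case: sys.stdout.encoding was None, so ASCII is tried and fails
try:
    encode_for_stream(text, None)
    piped_error = None
except UnicodeEncodeError as exc:
    piped_error = exc

# Encoding explicitly does not depend on the stream at all
explicit = text.encode("utf8")

print(repr(to_terminal), repr(explicit))
print(piped_error)
```

The piped case fails for the same reason as the tracebacks earlier in the thread, while the explicit .encode("utf8") gives the same bytes either way.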

Kurt Mueller · May 1, 2009

Sion said:
$ python -c 'import sys; print sys.stdout.encoding'
UTF-8
$ python -c 'import sys; print sys.stdout.encoding' | cat
None

If print gets a Unicode string, it does an implicit
.encode(sys.stdout.encoding or sys.getdefaultencoding()) on it.
If you want your output to be guaranteed UTF-8, you'll need to
explicitly .encode("utf8") it yourself.

This now works correctly with and without piping:

python -c 'a=unicode("ä", "utf8") ; print (a.encode("utf8"))'
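
For the wider goal (input in ascii, latin_1 or utf8, output always utf8), one possible approach is to decode at the input boundary and encode at the output boundary. A minimal sketch under these assumptions: to_text is a hypothetical helper, the candidate list is only a guess at the inputs mentioned above, and an in-memory buffer stands in for the real output stream so the result can be inspected.

```python
import codecs
import io

def to_text(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each candidate encoding in order.
    ASCII is a subset of UTF-8, so it needs no separate entry.
    latin-1 accepts any byte sequence, so it must come last."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")

# Stand-in for sys.stdout (Python 2) or sys.stdout.buffer (Python 3)
sink = io.BytesIO()
out = codecs.getwriter("utf-8")(sink)   # encodes everything written as UTF-8

for raw in (b"abc", b"\xc3\xa4", b"\xe4"):   # ASCII, UTF-8 and Latin-1 input
    out.write(to_text(raw) + u"\n")

print(repr(sink.getvalue()))
```

Guessing the input encoding this way is a heuristic: Latin-1 text that happens to form valid UTF-8 would be misread, so a known encoding is preferable whenever one is available.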

My python source files start with these two lines:
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :

So the source code itself and the strings in the source code
are interpreted as utf-8.

But from the command line python interprets the code
as 'latin_1' I presume. That is why I have to convert
the "ä" with unicode().
Am I right?

Sion said:
(I dare say this is slightly different in 3.x .)

I have heard about it, but I will wait to move to 3.x until it's time to...

Thanks

Piet van Oostrum · May 1, 2009

Kurt Mueller said:
KM> But from the command line python interprets the code
KM> as 'latin_1' I presume. That is why I have to convert
KM> the "ä" with unicode().
KM> Am I right?

There are a couple of stages:
1. Your terminal emulator interprets your keystrokes, encodes them in a
   sequence of bytes and passes them to the shell. How the characters
   are encoded depends on the encoding used in the terminal emulator. So
   for example when the terminal is set to utf-8, your "ä" is converted
   to two bytes: \xc3 and \xa4.
2. The shell passes these bytes to the python command.
3. The python interpreter must interpret these bytes with some decoding.
   If you use them in a byte string they are copied as such, so in the
   example above the string "ä" will consist of the 2 bytes '\xc3\xa4'.
   If your terminal encoding had been iso-8859-1, the string
   would have had a single byte '\xe4'. If you use it in a unicode
   string the Python parser has to convert it to unicode. If there is an
   encoding declaration in the source then that is used. Of course it
   should be the same as the actual encoding used by the shell (or the
   editor when you have a script saved in a file), otherwise you have a
   problem. If there is no encoding declaration in the source Python has
   to guess. It appears that in Python 2.x the default is iso-8859-1 but
   in Python 3.x it will be utf-8. You should avoid making any
   assumptions about this default.
4. During runtime unicode characters that have to be printed, written to
   a file, passed as file names or arguments to other processes etc.
   have to be encoded again to a sequence of bytes. In this case Python
   refuses to guess. Also you can't use the same encoding as in step 3,
   because the program can run on a completely different system than
   where it was compiled to byte code. So if the (unicode) string isn't
   ASCII and no encoding is given you get an error. The encoding can be
   given explicitly, or depending on the context, by sys.stdout.encoding,
   sys.getdefaultencoding() or PYTHONIOENCODING (from 2.6 on).

Unfortunately there is no equivalent to PYTHONIOENCODING for the
interpretation of the source text; it only works at run time.
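
To see the run-time effect of PYTHONIOENCODING on piped output, something like the following could be tried. It is a sketch only: run_piped is a hypothetical helper, and it starts a child interpreter with its stdout connected to a pipe, which is the situation that produced the errors above.

```python
import os
import subprocess
import sys

def run_piped(io_encoding):
    """Run a child interpreter with stdout piped and return its raw bytes.
    (Hypothetical helper, for illustration only.)"""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    # The child prints "ä"; the escape keeps the child's source pure ASCII
    child_code = 'print(u"\\xe4")'
    output = subprocess.check_output([sys.executable, "-c", child_code], env=env)
    return output.strip()   # drop the trailing newline

as_utf8 = run_piped("utf-8")
as_latin1 = run_piped("latin-1")
print(repr(as_utf8), repr(as_latin1))
```

The same print produces different bytes depending only on the environment variable, which is why the output encoding cannot be fixed at the time the source is read.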

Example:
python -c 'print len(u"ä")'
prints 2 on my system, because my terminal is utf-8 so the ä is passed
as 2 bytes (\xc3\xa4), but these are interpreted by Python 2.6.2 as two
iso-8859-1 bytes.

If I do
python -c 'print u"ä"' in my terminal I therefore get two characters: Ã¤
but if I do this in Emacs I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)
because my Emacs doesn't pass the encoding of its terminal emulation.

However:
python -c '# -*- coding:utf-8 -*-
print len(u"ä")'
will correctly print 1.
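
The difference between the two results can be reproduced without a terminal by decoding the same two bytes both ways. A small sketch (Python 3 syntax, names chosen only for illustration):

```python
raw = b"\xc3\xa4"   # what a utf-8 terminal sends for "ä"

as_latin1 = raw.decode("iso-8859-1")   # each byte becomes one character
as_utf8 = raw.decode("utf-8")          # both bytes form one character

print(len(as_latin1), len(as_utf8))

# Re-encoding the wrongly decoded text as UTF-8 gives four bytes,
# which a utf-8 terminal shows as the two characters "Ã¤"
print(repr(as_latin1.encode("utf-8")))
```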

norseman · May 2, 2009

Piet said:
python -c '# -*- coding:utf-8 -*-
print len(u"ä")'
will correctly print 1.

===============================

Thank you. I knew there had to be something simpler than brute force.

I have missed seeing the explanations for:
python -c '# -*- coding:utf-8 -*-
in the 2.5 docs. Where can I find these? (the python -c is for config,
I presume?)

By the way - the however: python...\nprint... snippet bombs in 2.5.2
1st bomb: looking for closing ' #so I add one and remove one below
2nd bomb: bad syntax # I play awhile and join EMACS
3rd bomb: Non-ASCII character '\xe4' in file....no encoding declared..

Python flatly states it's not ASCII and quits. Python print refuses to
handle high bit set bytes in 2.5.2....

The thank you is for pointing out how it works. I can use sed to fix for
file listing purposes. (Python won't like them, but a second pass thru
sed can give me something python can use and the two names can go on a
line on the cheat sheet.)

Barry, Kurt - do you understand using sed to change the incoming names?
Put the python in a box and use the Linux mc, ls, sed and echo routines
to get the names into a form python can use while making the cheat sheet
at the same time. Substitutions like a for ä will generally be
acceptable. Yes or No? The cheat sheet can show the ä in the original
name because the OS functions allow it. I have no doubt there will be
some exceptions.

Once the names are "ASCII" you can get the python out & put it to work.

Just to head off the comments that it's not .... whatever

ls -1 | cheater.scr | python_program.py IS PURE UNIX

Unix is designed for this. Files from different parts of the world? If
you can see the name as something besides ????? make a cheater for each
'Page'.
    mc /path/to/dir/of/choice
    ls -1 >dummy
    highlight dummy
    F3
    F4 and read the hex
takes me longer to type it in here than to do it. (leading spaces)

Today: 20090430

Steve

ps. Piet - thanks for including the version specifics. It makes a huge
difference in expectations and allowances.
