string processing question

Kurt Mueller

Hi,


on a Linux system and python 2.5.1 I have the
following behaviour which I do not understand:



case 1
python -c 'a="ä"; print a ; print a.center(6,"-") ; b=unicode(a, "utf8"); print b.center(6,"-")'
ä
--ä--
--ä---


case 2
----- a UnicodeEncodeError in this case:
python -c 'a="ä"; print a ; print a.center(20,"-") ; b=unicode(a, "utf8"); print b.center(20,"-")' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 9: ordinal not in range(128)
ä
--ä--

The behaviour changes if I pipe the output to another program or to a file.
Also, centering is not correct with the string a, but it is correct with the string b.



Could somebody please explain this to me?




Thanks in advance
 
Paul McGuire

Hi,

on a Linux system and python 2.5.1 I have the
following behaviour which I do not understand:

case 1> python -c 'a="ä"; print a ; print a.center(6,"-") ; b=unicode(a, "utf8"); print b.center(6,"-")'

ä
--ä--
--ä---

Weird. What happens if you change the last print statement to:

print b.center(6,u"-")

-- Paul
 
Kurt Mueller

Same behavior.


I have an even more minimal example:


:> python -c 'print unicode("ä", "utf8")'
ä

:> python -c 'print unicode("ä", "utf8")' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)


Just the difference of having piped the output to another program or to
a file.
Maybe we leave the other issue with the different centering for the moment.

My goal is to have my python programs unicode enabled.





TIA
 
Kurt Mueller

Scott said:
To discover what is happening, try something like:
python -c 'for a in "ä", unicode("ä"): print len(a), a'

I suspect that in your encoding, "ä" is two bytes long, and in
unicode it is converted to a single character.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>

Yes it is. That is one of the two problems I see.
The solution for this is to unicode(<string>, <coding>) each string.
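
The two lengths, and their effect on center(), can be put side by side. This is only a minimal sketch: it is written for Python 3 rather than the 2.5.1 used in this thread, it assumes the bytes are UTF-8, and the names raw and text are illustrative.

```python
# Python 3 sketch: bytes plays the role of the 2.x str, str the role of 2.x unicode.
raw = "\u00e4".encode("utf-8")   # the two bytes a UTF-8 terminal sends for "ä"
text = raw.decode("utf-8")       # the equivalent of unicode(a, "utf8")

print(len(raw), len(text))       # byte count versus character count

# center() pads to a width counted in items: bytes for raw, characters for text.
centered_raw = raw.center(6, b"-")
centered_text = text.center(6, "-")

# Once displayed, the byte version is one column short of the requested width.
print(len(centered_raw.decode("utf-8")), len(centered_text))
```

Padding the byte string counts the two bytes of "ä" as two positions, which is why the first centered line in case 1 comes out one dash short.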


I'd like to have my python programs unicode enabled.




:> python -c 'for a in "ä", unicode("ä"): print len(a), a'
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0:
ordinal not in range(128)

It seems that the default encoding is "ascii", so unicode() cannot cope
with "ä".
If I specify "utf8" for the encoding, unicode() works.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a'
2 ä
1 ä
:>


But the print statement yields a UnicodeEncodeError
if I pipe the output to a program or a file.

:> python -c 'for a in "ä", unicode("ä", "utf8"): print len(a), a' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 0: ordinal not in range(128)
2 ä
1 :>


So it seems to me that piping the output changes the behavior of the
print statement:

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a,
len(a), type(a)'
ä 2 <type 'str'>
ä 1 <type 'unicode'>

:> python -c 'for a in "ä", unicode("ä", "utf8", "ignore"): print a,
len(a), type(a)' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in
position 0: ordinal not in range(128)
ä 2 <type 'str'>
:>




How can I make my Python programs Unicode-enabled?
- Input strings can have different encodings (mostly ascii, latin_1 or utf8)
- My python programs should always output "utf8".

Is that a good idea??
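
One way to picture that plan is the sketch below. It is written for Python 3, not 2.5, and both helpers (to_text and to_output) are hypothetical names invented for illustration. It assumes the only candidate input encodings are UTF-8 and latin-1, with ASCII covered by UTF-8 because it is a subset.

```python
def to_text(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each candidate encoding in order (hypothetical helper)."""
    for name in encodings:
        try:
            return raw.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


def to_output(text):
    """Encode text as UTF-8 explicitly, independent of terminal or pipe."""
    return text.encode("utf-8")


# ASCII input, UTF-8 input and latin-1 input all end up as UTF-8 bytes.
for sample in (b"abc", b"\xc3\xa4", b"\xe4"):
    print(to_output(to_text(sample)))
```

UTF-8 is tried first because latin-1 accepts any byte sequence and would never fail. The guess can still be wrong: latin-1 text that happens to form valid UTF-8 is decoded as UTF-8, so knowing the real input encoding is always better than guessing.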



TIA
 
Sion Arrowsmith

Kurt Mueller said:
:> python -c 'print unicode("ä", "utf8")'
ä

:> python -c 'print unicode("ä", "utf8")' | cat
Traceback (most recent call last):
File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)

$ python -c 'import sys; print sys.stdout.encoding'
UTF-8
$ python -c 'import sys; print sys.stdout.encoding' | cat
None

If print gets a Unicode string, it does an implicit
.encode(sys.stdout.encoding or sys.getdefaultencoding()) on it.
If you want your output to be guaranteed UTF-8, you'll need to
explicitly .encode("utf8") it yourself.

(I dare say this is slightly different in 3.x .)
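
The fallback rule can be modelled in a few lines. This is a sketch in Python 3 syntax, not the interpreter's actual code: implicit_encode is a hypothetical function, and "ascii" is assumed as the default encoding because that is what the tracebacks in this thread report.

```python
def implicit_encode(text, stream_encoding, default_encoding="ascii"):
    """Model the rule above: use the stream's encoding, else the default."""
    return text.encode(stream_encoding or default_encoding)


text = "\u00e4"  # the character ä

# Terminal case: the stream encoding is known.
print(implicit_encode(text, "UTF-8"))

# Piped case: the stream encoding is None, so the ASCII default is used.
try:
    implicit_encode(text, None)
except UnicodeEncodeError as exc:
    print("piped:", exc.reason)

# An explicit encode gives the same bytes in both cases.
print(text.encode("utf8"))
```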
 
Kurt Mueller

Sion said:
$ python -c 'import sys; print sys.stdout.encoding'
UTF-8
$ python -c 'import sys; print sys.stdout.encoding' | cat
None

If print gets a Unicode string, it does an implicit
.encode(sys.stdout.encoding or sys.getdefaultencoding()) on it.
If you want your output to be guaranteed UTF-8, you'll need to
explicitly .encode("utf8") it yourself.

This now works correctly with and without piping:

python -c 'a=unicode("ä", "utf8") ; print (a.encode("utf8"))'



In my python source code I have these two lines first:
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :

So the source code itself and the strings in the source code
are interpreted as utf-8.

But from the command line python interprets the code
as 'latin_1' I presume. That is why I have to convert
the "ä" with unicode().
Am I right?

(I dare say this is slightly different in 3.x .)
I have heard about it, but I will wait to move to 3.x until it's time to...




Thanks
 
Piet van Oostrum

Kurt Mueller said:
KM> But from the command line python interprets the code
KM> as 'latin_1' I presume. That is why I have to convert
KM> the "ä" with unicode().
KM> Am I right?

There are a couple of stages:
1. Your terminal emulator interprets your keystrokes, encodes them in a
sequence of bytes and passes them to the shell. How the characters
are encoded depends on the encoding used in the terminal emulator. So
for example when the terminal is set to utf-8, your "ä" is converted
to two bytes: \xc3 and \xa4.
2. The shell passes these bytes to the python command.
3. The python interpreter must interpret these bytes with some decoding.
If you use them in a bytes string they are copied as such, so in the
example above the string "ä" will consist of the 2 bytes '\xc3\xa4'.
If your terminal encoding had been iso-8859-1, the string
would have had a single byte '\xe4'. If you use it in a unicode
string the Python parser has to convert it to unicode. If there is an
encoding declaration in the source then that is used. Of course it
should be the same as the actual encoding used by the shell (or the
editor when you have a script saved in a file) otherwise you have a
problem. If there is no encoding declaration in the source Python has
to guess. It appears that in Python 2.x the default is iso-8859-1 but
in Python 3.x it will be utf-8. You should avoid making any
assumptions about this default.
4. During runtime unicode characters that have to be printed, written to
a file, passed as file names or arguments to other processes etc.
have to be encoded again to a sequence of bytes. In this case Python
refuses to guess. Also you can't use the same encoding as in step 3,
because the program can run on a completely different system than
where it was compiled to byte code. So if the (unicode) string isn't
ASCII and no encoding is given you get an error. The encoding can be
given explicitly, or depending on the context, by sys.stdout.encoding,
sys.getdefaultencoding or PYTHONIOENCODING (from 2.6 on).

Unfortunately there is no equivalent to PYTHONIOENCODING for the
interpretation of the source text; it only works at run time.
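
The run-time effect of PYTHONIOENCODING can be checked with a small sketch. It assumes Python 3 (subprocess.run does not exist in 2.x), and child_output is a hypothetical helper name. It starts a child interpreter with its output piped and looks at the raw bytes that arrive.

```python
import os
import subprocess
import sys


def child_output(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and capture raw stdout."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", r"print('\u00e4')"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.strip()


# The same character leaves the child as different bytes per setting.
print(child_output("utf-8"))
print(child_output("latin-1"))
```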

Example:
python -c 'print len(u"ä")'
prints 2 on my system, because my terminal is utf-8 so the ä is passed
as 2 bytes (\xc3\xa4), but these are interpreted by Python 2.6.2 as two
iso-8859-1 bytes.

If I do
python -c 'print u"ä"' in my terminal I therefore get two characters: Ã¤
but if I do this in Emacs I get:
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-1: ordinal not in range(128)
because my Emacs doesn't pass the encoding of its terminal emulation.

However:
python -c '# -*- coding:utf-8 -*-
print len(u"ä")'
will correctly print 1.
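
The same difference can be reproduced without depending on any terminal by decoding the two bytes by hand. This is a sketch in Python 3 syntax; the variable names are illustrative.

```python
typed = b"\xc3\xa4"   # what a UTF-8 terminal sends for the one typed character

as_latin1 = typed.decode("iso-8859-1")   # wrong guess: one character per byte
as_utf8 = typed.decode("utf-8")          # matching declaration: one character

print(len(as_latin1), len(as_utf8))

# Code points of each result, to show which characters were produced.
print([hex(ord(ch)) for ch in as_latin1], [hex(ord(ch)) for ch in as_utf8])
```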
 
norseman


Thank you. I knew there had to be something simpler than brute force.

I have missed seeing the explanations for:
python -c '# -*- coding:utf-8 -*-
in the 2.5 docs. Where can I find these? (the python -c is for config,
I presume?)

By the way - the however: python...\nprint... snippet bombs in 2.5.2
1st bomb: looking for closing ' #so I add one and remove one below
2nd bomb: bad syntax # I play awhile and join EMACS
3rd bomb: Non-ASCII character '\xe4' in file....no encoding declared..

Python flatly states it's not ASCII and quits. Python print refuses to
handle high bit set bytes in 2.5.2....

The thank you is for pointing out how it works. I can use sed to fix for
file listing purposes. (Python won't like them, but a second pass thru
sed can give me something python can use and the two names can go on a
line on the cheat sheet.)

Barry, Kurt - do you understand using sed to change the incoming names?
Put the python in a box and use the Linux mc, ls, sed and echo routines
to get the names into a form python can use while making the cheat sheet
at the same time. Substitutions like a for ä will generally be
acceptable. Yes or No? The cheat sheet can show the ä in the original
name because the OS functions allow it. I have no doubt there will be
some exceptions. :(
Once the names are "ASCII" you can get the python out & put it to work.

Just to head off the comments that it's not .... whatever

ls -1 | cheater.scr | python_program.py IS PURE UNIX

Unix is designed for this. Files from different parts of the world? If
you can see the name as something besides ????? make a cheater for each
'Page'.
  mc /path/to/dir/of/choice
  ls -1 >dummy
  highlight dummy
  F3
  F4 and read the hex
takes me longer to type it in here than to do it. (leading spaces) :)

Today: 20090430


Steve

ps. Piet - thanks for including the version specifics. It makes a huge
difference in expectations and allowances.
 
