Print formatted Strings with Umlauts

Joerg Lehmann

I am using Python 2.2.3 (Fedora Core 1). The problem is that strings containing
umlauts do not work as I would expect. Here is my example:
äöü äöü
123 123

I would expect the displayed width of a and b to be the same: 5 characters.
I also see that len(a) is 6 (2 bytes per umlaut), whereas len(b) is 3:
6 3

I have tried to set the encoding in site.py to 'latin-1', but it did not change
my results. Is there no way to store umlauts in 1 byte? What is the right way
to print strings containing umlauts in a tabular way (same field width)?

Thanks!
 
Jeff Epler

If you work with Unicode strings instead of byte strings in the UTF-8
encoding, you'll get the desired results for characters in the German
character set:
äöü äöü
123 123
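
As a rough illustration of the difference, here is a minimal sketch. It assumes a modern Python 3 (3.5 or later), where `str` is already a Unicode string and `bytes` plays the role of the old byte string; in the Python 2 of this thread the equivalent step would be decoding first, e.g. `unicode(s, "utf-8")`. The variable names and the `|` column marker are only for illustration.

```python
# Sketch only: Python 3.5+ assumed, names are illustrative.
a = "\u00e4\u00f6\u00fc"   # the three umlauts as a Unicode string
b = "123"

# Byte-string formatting pads by byte count. The UTF-8 form of a is
# already 6 bytes long, so a field width of 5 adds no padding at all.
bytes_row = b"%-5s|" % a.encode("utf-8")

# Unicode formatting pads by character count, so both rows line up.
text_row_a = "%-5s|" % a
text_row_b = "%-5s|" % b

print(bytes_row.decode("utf-8"))
print(text_row_a)
print(text_row_b)
```

The first printed row has no padding before the marker, while the other two both have two spaces of padding.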

However, this isn't good enough in general. For instance, in the
presence of Unicode combining characters, you won't get what you want:
äöü äöü
123 123


You'll also run into problems with characters that have "Wide" or
"Ambiguous" East Asian Width properties in Unicode. For example,
uuu uuu
123 123
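
One way to approximate the column count for these cases is sketched below, assuming Python 3 and the standard `unicodedata` module. `display_width` and `pad` are hypothetical helper names, not library functions. Counting "Ambiguous" characters as one column is an assumption that depends on the terminal, and other zero-width characters (format controls, for instance) are not handled.

```python
import unicodedata

def display_width(text):
    """Estimate how many terminal columns text occupies (rough sketch)."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                      # combining marks take no column
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                    # Wide and Fullwidth take two columns
        else:
            width += 1                    # assumption: everything else is one
    return width

def pad(text, columns):
    """Left-justify text to the given number of columns, never truncating."""
    return text + " " * max(0, columns - display_width(text))

decomposed = "a\u0308o\u0308u\u0308"   # a, o, u, each followed by a combining diaeresis
fullwidth = "\uff55\uff55\uff55"       # three FULLWIDTH LATIN SMALL LETTER U

for cell in (decomposed, fullwidth, "123"):
    print(pad(cell, 5) + "|", len(cell), display_width(cell))
```

Here `len()` and the estimated width disagree in both directions: the decomposed umlauts have six code points in three columns, and the fullwidth letters have three code points in six columns.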

Jeff
 
Martin v. Löwis

Joerg said:
I am using Python 2.2.3 (Fedora Core 1). ...
I have tried to set the encoding in site.py to 'latin-1', but it did not change
my results. Is there no way to store umlauts in 1 byte???

There is, but Fedora Core 1 does not use it. Instead, it uses an
encoding where an umlaut character needs two bytes (namely, UTF-8).
Changing site.py does not change the way your system represents
these characters.
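
To make the byte counts concrete, a small sketch (Python 3 assumed, variable names illustrative) encodes the same three umlauts both ways:

```python
umlauts = "\u00e4\u00f6\u00fc"                # ä, ö, ü

latin1_bytes = umlauts.encode("iso-8859-1")   # one byte per umlaut
utf8_bytes = umlauts.encode("utf-8")          # two bytes per umlaut

print(len(latin1_bytes), len(utf8_bytes))

# Reading UTF-8 bytes as if they were Latin-1 yields six characters
# instead of three, which is why byte-based padding comes out wrong.
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread))
```

Both byte sequences decode back to the same text when the matching codec is used; only the stored representation differs.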

Joerg also asked:
What is the right way
to print strings containing umlauts in a tabular way (same field width)?

As Jeff explains: In the specific case, using Unicode strings would
help. He is also right that, in general, it is very difficult to find
out how many columns a single character uses, as some characters have
width 0, and other characters have width 2 (in a mono-spaced terminal;
for variable-spaced output, adding space characters to achieve
formatting will never work reliably).

Regards,
Martin
 
Joerg Lehmann

Martin v. Löwis said:
There is, but Fedora Core 1 does not use it. Instead, it uses an
encoding where an umlaut character needs two bytes (namely, UTF-8).
Changing site.py does not change the way your system represents
these characters.


As Jeff explains: In the specific case, using Unicode strings would
help. He is also right that, in general, it is very difficult to find
out how many columns a single character uses, as some characters have
width 0, and other characters have width 2 (in a mono-spaced terminal;
for variable-spaced output, adding space characters to achieve
formatting will never work reliably).

Regards,
Martin

I have found a fix myself. I'm not sure if this is "the right way",
but it solves my problem:

I changed the settings in /etc/sysconfig/i18n from UTF-8 to
ISO-8859-1:

LANG="en_US.ISO-8859-1"
SUPPORTED="en_US.ISO-8859-1:en_US:en"
SYSFONT="latarcyrheb-sun16"

This fixed my problem; umlauts are stored in one byte now.

Thanks for your inspirations.

PS: Installing Python 2.3 (rpm for Fedora from www.python.org) did not
help.
 
