raw_input() and utf-8 formatted chars

7stud · Oct 12, 2007

s = 'A\xcc\x88' #capital A with umlaut
print s #displays capital A with umlaut

s = raw_input('Enter: ') #A\xcc\x88
print s #displays A\xcc\x88

print len(s) #9

It looks like every character of the string I enter in utf-8 is being
interpreted literally as 9 separate characters rather than one
character. How do I enter a capital A with an umlaut so that python
treats it as one character?

kyosohma · Oct 12, 2007

I don't know. This works for me:

I'm using Python 2.4 with Default Source Encoding set to None on
Windows XP SP2.

Mike

7stud · Oct 12, 2007

Yeah, but what happens when you enter A\xcc\x88? And what is it that
your keyboard enters to produce an 'a' with an umlaut?

Marc 'BlackJack' Rintsch · Oct 12, 2007

Yeah, but what happens when you enter A\xcc\x88?

You mean literally!? Then of course I get A\xcc\x88 because that's what I
entered. In string literals in source code the backslash has a special
meaning but `raw_input()` does not "interpret" the input in any way.

And what is it that your keyboard enters to produce an 'a' with an umlaut?

*I* just hit the ä key. The one right next to the ö key. ;-)

Ciao,
Marc 'BlackJack' Rintsch

7stud · Oct 13, 2007

You mean literally!? Then of course I get A\xcc\x88 because that's what I
entered. In string literals in source code the backslash has a special
meaning but `raw_input()` does not "interpret" the input in any way.

Then why don't I end up with the same situation as this:

*I* just hit the ä key. The one right next to the ö key. ;-)

...and what if you don't have an a-with-umlaut key?

Dennis Lee Bieber · Oct 13, 2007

...and what if you don't have an a-with-umlaut key?

You find out how your OS keyboard driver handles the entry of
extended characters, and follow that procedure.

The ancient Amiga made them fairly easy -- <alt-f/g/h/j/k> were
treated as dead-keys which would result in the marker (acute accent,
grave accent, circumflex, diaeresis, tilde) being applied to the next
character hit (presuming it is a valid combination).

In Windows, using "Windows: Western" character set, <alt>0196 (numeric
pad) gives: Ä and <alt>0228 gives: ä
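As a rough cross-check of those two codes: assuming the "Western" character set here is Windows code page 1252 (the post does not name the code page, so that is a guess), the digits typed after <alt>0 are simply the byte value in that code page. A minimal Python 3 sketch, with `ALT_CODES` being a made-up name for illustration:

```python
import unicodedata

# Assumption: "Windows: Western" refers to code page 1252.
# Decode the single byte that each <alt>0nnn sequence stands for.
ALT_CODES = {code: bytes([code]).decode('cp1252') for code in (196, 228)}

for code, char in sorted(ALT_CODES.items()):
    # Print the Unicode name so the output stays ASCII-only.
    print(code, unicodedata.name(char))
```

On a system with a different code page the same numbers may map to other characters.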

Marc 'BlackJack' Rintsch · Oct 13, 2007

Then why don't I end up with the same situation as this:

I don't get the question!? In string literals in source code the
backslash has a special meaning, like I wrote above. When Python compiles
that above snippet you end up with a string of three bytes, one with the
ASCII value of an 'A' and two bytes where you typed in the byte value in
hexadecimal:

In [191]: s = 'A\xcc\x88'

In [192]: len(s)
Out[192]: 3

In [193]: map(ord, s)
Out[193]: [65, 204, 136]

In [194]: print s
Ä

The last works this way only if the receiving/displaying program expected
UTF-8 as encoding. Otherwise something other than an Ä would have been
shown.
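For readers on Python 3, where byte strings and text are separate types, the same three bytes can be inspected as in the sketch below. It also shows one possible way to get a single character out of them: the bytes decode to two code points ('A' plus a combining diaeresis), and `unicodedata.normalize('NFC', ...)` composes them into one. This is only an illustration, not something the posters in this thread used.

```python
import unicodedata

raw = b'A\xcc\x88'           # the three bytes from the session above
text = raw.decode('utf-8')   # 'A' followed by U+0308 COMBINING DIAERESIS
composed = unicodedata.normalize('NFC', text)  # one code point, U+00C4

print(len(raw), len(text), len(composed))
print([unicodedata.name(c) for c in text])
print(composed.encode('utf-8'))  # UTF-8 bytes of the composed character
```

So "one character" depends on what is counted: three bytes, two code points, or one composed code point.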

If you type in that text when asked by `raw_input()` then you get exactly
what you typed because there is no Python source code compiled:

In [195]: s = raw_input()
A\xcc\x88

In [196]: len(s)
Out[196]: 9

In [197]: map(ord, s)
Out[197]: [65, 92, 120, 99, 99, 92, 120, 56, 56]

In [198]: print s
A\xcc\x88

...and what if you don't have an a-with-umlaut key?

I find other means to enter it. <Alt> + some magic number on the numeric
keypad in windows, or <Compose>, <a>, <"> on Unix/Linux. Some text editors
offer special sequences too. If all fails there are character map
programs that show all unicode characters to choose from and copy'n'paste
them.

Ciao,
Marc 'BlackJack' Rintsch

MRAB · Oct 13, 2007

Then why don't I end up with the same situation as this:

...and what if you don't have an a-with-umlaut key?

raw_input() returns the string exactly as you entered it. You can
decode that into the actual UTF-8 string with decode("string_escape"):

s = raw_input('Enter: ') #A\xcc\x88
s = s.decode("string_escape")

It looks like your system already understands UTF-8 and will decode
the UTF-8 string you print to the Unicode character.
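The "string_escape" codec only exists in Python 2. A rough Python 3 equivalent, assuming the typed text is pure ASCII with `\x..` escapes that spell out UTF-8 bytes, might look like the sketch below. The helper name `unescape_to_text` is made up for this example.

```python
def unescape_to_text(typed):
    """Turn literally typed escapes such as A\\xcc\\x88 into real text.

    Assumes `typed` is ASCII-only and that the escaped bytes are UTF-8.
    """
    # Interpret the backslash escapes; each \xNN becomes one code point
    # in the range 0-255.
    escaped = typed.encode('ascii').decode('unicode_escape')
    # Map those code points back to the bytes they stand for,
    # then decode the bytes as UTF-8.
    return escaped.encode('latin-1').decode('utf-8')


typed = 'A\\xcc\\x88'          # what input() returns: 9 characters
result = unescape_to_text(typed)
print(len(typed), len(result))
```

The result is 'A' followed by a combining diaeresis, that is, two code points rather than nine characters.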

7stud · Nov 2, 2007

You can
decode that into the actual UTF-8 string with decode("string_escape"):

s = raw_input('Enter: ') #A\xcc\x88
s = s.decode("string_escape")

Ahh. Thanks for that.

*I* just hit the ä key. The one right next to the ö key. ;-)

BeautifulSoup can convert an html entity representing an 'A' with
umlaut, e.g.:

&Auml;

into an Ä without ever touching my keyboard. How does BeautifulSoup
do it?

from BeautifulSoup import BeautifulStoneSoup as bss

s1 = "<h1>&Auml;</h1>" #the html entity for 'A' with umlaut

soup = bss(s1)
text = soup.contents[0].string #gets the 'A' with umlaut out of the html

new_s = bss(text, convertEntities=bss.HTML_ENTITIES)
print repr(new_s)
print new_s

I see the same output for both print statements, and what I see is an
'A' with umlaut. I expected that the first print statement would show
the utf-8 encoding for the character.

Marc 'BlackJack' Rintsch · Nov 2, 2007

BeautifulSoup can convert an html entity representing an 'A' with
umlaut, e.g.:

&Auml;

into an Ä without ever touching my keyboard. How does BeautifulSoup
do it?

It maps the HTML entity names to unicode characters. Take a look at the
`htmlentitydefs` module.
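In Python 3 the `htmlentitydefs` module is named `html.entities`. A minimal sketch of the lookup described above, using only the standard library (this shows the mapping itself, not BeautifulSoup's internals):

```python
import html
import unicodedata
from html.entities import name2codepoint

# Entity name -> Unicode code point -> character.
code_point = name2codepoint['Auml']
char = chr(code_point)
print(code_point, unicodedata.name(char))

# html.unescape applies the same kind of mapping to a whole string.
unescaped = html.unescape('<h1>&Auml;</h1>')
```

Here `char` is the precomposed character U+00C4, whose UTF-8 encoding is the two bytes `\xc3\x84`.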

I see the same output for both print statements, and what I see is an
'A' with umlaut. I expected that the first print statement would show
the utf-8 encoding for the character.

Well it does, and apparently your terminal, or wherever the output goes,
decodes that UTF-8 encoded 'Ä' and shows it. If you expected the output
'\xc3\x84' then remember that you ask the soup object for its
representation and not a string. The object itself decides what
`repr(obj)` returns. Soup objects represent themselves as UTF-8 encoded
strings.
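The point that an object chooses its own `repr()` can be sketched in Python 3 as follows. The `Node` class is a hypothetical stand-in written for this example, not BeautifulSoup's actual class; it only mimics the behaviour described above.

```python
class Node(object):
    """Hypothetical stand-in for a soup object."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # The object decides what repr() returns: here, the bare text.
        return self.text


encoded = '\u00c4'.encode('utf-8')   # a plain byte string
node = Node('\u00c4')

print(repr(encoded))        # escaped form of the two bytes
print(ascii(repr(node)))    # the character itself, no escapes added by repr
```

A plain byte string shows its escapes under `repr()`, while the custom object returns the character unescaped, which is why both print statements in the question looked the same.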

Ciao,
Marc 'BlackJack' Rintsch
