raw_input() and utf-8 formatted chars

7stud

s = 'A\xcc\x88' #capital A with umlaut
print s #displays capital A with umlaut

s = raw_input('Enter: ') #A\xcc\x88
print s #displays A\xcc\x88

print len(s) #9


It looks like the string I enter is being interpreted literally, as 9
separate characters rather than as one character. How do I enter a
capital A with an umlaut so that Python treats it as one character?
 
kyosohma

I don't know. This works for me:

I'm using Python 2.4 with Default Source Encoding set to None on
Windows XP SP2.

Mike
 
7stud

Yeah, but what happens when you enter A\xcc\x88? And what is it that
your keyboard enters to produce an 'a' with an umlaut?
 
Marc 'BlackJack' Rintsch

> Yeah, but what happens when you enter A\xcc\x88?

You mean literally!? Then of course I get A\xcc\x88 because that's what I
entered. In string literals in source code the backslash has a special
meaning but `raw_input()` does not "interpret" the input in any way.

> And what is it that your keyboard enters to produce an 'a' with an umlaut?

*I* just hit the ä key. The one right next to the ö key. ;-)

Ciao,
Marc 'BlackJack' Rintsch
 
7stud

> You mean literally!? Then of course I get A\xcc\x88 because that's what I
> entered. In string literals in source code the backslash has a special
> meaning but `raw_input()` does not "interpret" the input in any way.

Then why don't I end up with the same situation as this:

> *I* just hit the ä key. The one right next to the ö key. ;-)

...and what if you don't have an a-with-umlaut key?
 
Dennis Lee Bieber

> ...and what if you don't have an a-with-umlaut key?

You find out how your OS keyboard driver handles the entry of
extended characters, and follow that procedure.

The ancient Amiga made them fairly easy -- <alt-f/g/h/j/k> were
treated as dead-keys which would result in the marker (acute accent,
grave accent, circumflex, diaeresis, tilde) being applied to the next
character hit (presuming it is a valid combination).

In Windows, using "Windows: Western" character set, <alt>0196 (numeric
pad) gives: Ä and <alt>0228 gives: ä
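
Those numbers are simply the character's byte value in the Western code page. A small sketch in Python 3 syntax, assuming that "Windows: Western" corresponds to the codec Python calls `cp1252`:

```python
# Assumption: "Windows: Western" is the code page Python calls cp1252.
upper = bytes([196]).decode('cp1252')   # what Alt+0196 produces
lower = bytes([228]).decode('cp1252')   # what Alt+0228 produces

# Print the code points in ASCII so the output does not depend on the terminal.
print(hex(ord(upper)), hex(ord(lower)))
```

Both come out as single code points (U+00C4 and U+00E4), i.e. the precomposed letters rather than a letter plus a combining mark.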

--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
 
Marc 'BlackJack' Rintsch

> Then why don't I end up with the same situation as this:

I don't get the question!? In string literals in source code the
backslash has a special meaning, like I wrote above. When Python compiles
the snippet above you end up with a string of three bytes: one with the
ASCII value of an 'A', and two whose byte values you typed in
hexadecimal:

In [191]: s = 'A\xcc\x88'

In [192]: len(s)
Out[192]: 3

In [193]: map(ord, s)
Out[193]: [65, 204, 136]

In [194]: print s
Ä

The last works this way only if the receiving/displaying program expects
UTF-8 as encoding. Otherwise something other than an Ä would be shown.
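
To make that concrete, here is a rough sketch in Python 3 syntax (where byte strings are written `b'...'`). It decodes the same three bytes the way two differently configured terminals might; `cp1252` is only an example of a non-UTF-8 encoding, not something taken from this thread:

```python
data = b'A\xcc\x88'   # the three bytes from the string literal above

# A UTF-8 terminal sees 'A' followed by U+0308 COMBINING DIAERESIS.
as_utf8 = data.decode('utf-8')

# A cp1252 terminal (assumed here purely for illustration) sees three
# unrelated characters instead.
as_cp1252 = data.decode('cp1252')

# Show code points in ASCII so the output is terminal-independent.
print([hex(ord(c)) for c in as_utf8])
print([hex(ord(c)) for c in as_cp1252])
```

Note that even in the UTF-8 case the result is two code points, a plain 'A' plus a combining mark, which the display merges into one glyph.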

If you type in that text when asked by `raw_input()` then you get exactly
what you typed because there is no Python source code compiled:

In [195]: s = raw_input()
A\xcc\x88

In [196]: len(s)
Out[196]: 9

In [197]: map(ord, s)
Out[197]: [65, 92, 120, 99, 99, 92, 120, 56, 56]

In [198]: print s
A\xcc\x88

> ...and what if you don't have an a-with-umlaut key?

I find other means to enter it. <Alt> + some magic number on the numeric
keypad in Windows, or <Compose>, <a>, <"> on Unix/Linux. Some text editors
offer special sequences too. If all else fails there are character map
programs that show all Unicode characters to choose from and copy'n'paste
them.

Ciao,
Marc 'BlackJack' Rintsch
 
MRAB

> Then why don't I end up with the same situation as this:

> ...and what if you don't have an a-with-umlaut key?

raw_input() returns the string exactly as you entered it. You can
decode that into the actual UTF-8 string with decode("string_escape"):

s = raw_input('Enter: ') #A\xcc\x88
s = s.decode("string_escape")

It looks like your system already understands UTF-8 and will decode
the UTF-8 string you print to the Unicode character.
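
The "string_escape" codec only exists in Python 2. For anyone reading this with Python 3, a rough equivalent might look like the sketch below. It assumes the typed text contains only ASCII characters and `\xNN` escapes, and the variable names are made up for illustration:

```python
import unicodedata

# Stands in for what raw_input() (input() in Python 3) returns:
# nine separate characters, backslashes included.
typed = r'A\xcc\x88'

# Interpret the escapes with "unicode_escape", then map each resulting
# code point (all below 256 here) back to one byte via latin-1.
raw_bytes = typed.encode('ascii').decode('unicode_escape').encode('latin-1')

# Decode those bytes as UTF-8: 'A' followed by a combining diaeresis.
text = raw_bytes.decode('utf-8')

# Normalising to NFC merges the pair into the single code point U+00C4.
composed = unicodedata.normalize('NFC', text)

print(len(typed), len(raw_bytes), len(text), len(composed))
```

The last step matters for the original question: the decoded text is still two code points, and only the NFC-normalised form has length 1.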
 
7stud

> You can
> decode that into the actual UTF-8 string with decode("string_escape"):
>
> s = raw_input('Enter: ') #A\xcc\x88
> s = s.decode("string_escape")

Ahh. Thanks for that.

> *I* just hit the ä key. The one right next to the ö key. ;-)

BeautifulSoup can convert an HTML entity representing an 'A' with
umlaut, e.g.:

&Auml;

into an Ä without ever touching my keyboard. How does BeautifulSoup
do it?


from BeautifulSoup import BeautifulStoneSoup as bss


s1 = "<h1>&Auml;</h1>" #&_Auml;_
#I added the comment after the line to show the
#format of the html entity. In case a browser
#might render the comment into the actual character,
#I added underscores to the html entity:

soup = bss(s1)
text = soup.contents[0].string #gets the 'A' with umlaut out of the html

new_s = bss(text, convertEntities=bss.HTML_ENTITIES)
print repr(new_s)
print new_s

I see the same output for both print statements, and what I see is an
'A' with umlaut. I expected that the first print statement would show
the utf-8 encoding for the character.
 
Marc 'BlackJack' Rintsch

> BeautifulSoup can convert an HTML entity representing an 'A' with
> umlaut, e.g.:
>
> &Auml;
>
> into an Ä without ever touching my keyboard. How does BeautifulSoup
> do it?

It maps the HTML entity names to Unicode characters. Take a look at the
`htmlentitydefs` module.
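
As a rough illustration of that kind of lookup, here is a sketch using the standard library table directly. It is written for Python 3, where `htmlentitydefs` was renamed to `html.entities`. This shows the table itself, not BeautifulSoup's actual internals:

```python
import html
from html.entities import name2codepoint   # Python 2: from htmlentitydefs import ...

# Entity name -> Unicode code point.
code_point = name2codepoint['Auml']
char = chr(code_point)                      # unichr() in Python 2

# html.unescape() applies the same kind of mapping to a whole string.
unescaped = html.unescape('<h1>&Auml;</h1>')

print(code_point, hex(code_point))
print(char.encode('utf-8'))                 # the UTF-8 bytes for that character
```

The entity maps to the single precomposed code point U+00C4, whose UTF-8 encoding is the two bytes `\xc3\x84`.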

> I see the same output for both print statements, and what I see is an
> 'A' with umlaut. I expected that the first print statement would show
> the utf-8 encoding for the character.

Well it does, and apparently your terminal, or wherever the output goes,
decodes that UTF-8 encoded 'Ä' and shows it. If you expected the output
'\xc3\x84' then remember that you ask the soup object for its
representation and not a string. The object itself decides what
`repr(obj)` returns. Soup objects represent themselves as UTF-8 encoded
strings.
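
A small sketch of that point in Python 3 syntax. The `Fragment` class is purely hypothetical and is not BeautifulSoup's code; it only shows that an object's `__repr__` can return unescaped text, while `repr()` of a plain byte string shows the escapes:

```python
class Fragment(object):
    """Hypothetical stand-in for a soup-like object (not BeautifulSoup)."""

    def __init__(self, text):
        self.text = text

    def __repr__(self):
        # The object decides what repr() returns: here, the text itself.
        return self.text


frag = Fragment('\u00c4')
plain_bytes = '\u00c4'.encode('utf-8')

shown_by_object = repr(frag)         # the character itself, no escapes
shown_by_bytes = repr(plain_bytes)   # the escaped form of the two bytes

print(ascii(shown_by_object))
print(shown_by_bytes)
```

So seeing the character instead of `\xc3\x84` says something about the object being printed, not about the encoding being wrong.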

Ciao,
Marc 'BlackJack' Rintsch
 
