Py3.3 unicode literal and input()

jmfauth

Python 3 made several backwards-incompatible changes over Python 2.
First of all, input() in Python 3 is equivalent to raw_input() in
Python 2. It always returns a string. If you want the equivalent of
Python 2's input(), eval the result. Second, Python 3 is now unicode
by default. The "str" class is a unicode string. There is a separate
bytes class, denoted by b"", for byte strings. The u prefix is only
there to make it easier to port a codebase from Python 2 to Python 3.
It doesn't actually do anything.
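
The claims above are easy to check. A minimal sketch, assuming Python 3.3 or later (the variable names are only illustrative):

```python
# In Python 3.3+, the u prefix is accepted but changes nothing:
# both literals create the same one-character str.
plain = 'a'
prefixed = u'a'

same_value = (plain == prefixed)
same_type = (type(plain) is type(prefixed))

# Byte strings are a separate type and never compare equal to a str.
raw = b'a'
bytes_differ = (raw != plain)

print(same_value, same_type, bytes_differ)
```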


It does. I shew it!

Related:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/3aefd602507d2fbe#

http://mail.python.org/pipermail/python-dev/2012-June/120341.html

jmf
 
Steven D'Aprano

What is input() supposed to return?

Whatever you type.

This demonstrates that in Python 3.3, u'a' gives a string equal to 'a'.

Since you typed the letter a, r1 is the string "a" (a single character).

Since you typed four characters, namely lowercase u, single quote,
lowercase a, single quote, r2 is the string "u'a'" (four characters).


(<class 'str'>, 4)

If you call print(r1) and print(r2), that will show you what they hold.
If in doubt, calling print(repr(r1)) will show extra information about
the object.
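
As a small illustration of that advice (r1 and r2 are assigned directly here, standing in for the values input() would have returned):

```python
r1 = "a"       # what input() returns if the user types: a
r2 = "u'a'"    # what input() returns if the user types: u'a'

# print() shows the text itself; repr() adds delimiters so you can
# see exactly which characters the string holds.
print(r1, len(r1))
print(r2, len(r2))
print(repr(r1))
print(repr(r2))
```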

sys.argv?

What about it?
 
Steven D'Aprano

It does. I shew it!

Incorrect. You are assuming that Python 3's input() evals the input like
Python 2 does. That is wrong. All you show is that the one-character
string "a" is not equal to the four-character string "u'a'", which is
hardly a surprise. You wouldn't expect the string "3" to equal the string
"int('3')" would you?
 
jmfauth

Incorrect. You are assuming that Python 3's input() evals the input like
Python 2 does. That is wrong. All you show is that the one-character
string "a" is not equal to the four-character string "u'a'", which is
hardly a surprise. You wouldn't expect the string "3" to equal the string
"int('3')" would you?


A string is a string, a "piece of text", period.

I do not see why a unicode literal and an (well, I do not
know how the call it) a "normal class <str>" should behave
differently in code source or as an answer to an input().

Should a user write two derived functions?

input_for_entering_text()
and
input_if_you_are_entering_a_text_as_litteral()

---

Side effect from the unicode litteral reintroduction.
I do not mind about this, but I expect it does
work logically and correctly. And it does not.

PS English is not my native language. I never know
to reply to an (interro)-negative sentence.

jmf
 
Dave Angel

A string is a string, a "piece of text", period. I do not see why a
unicode literal and an (well, I do not know how the call it) a "normal
class <str>" should behave differently in code source or as an answer
to an input().

Wrong. The rules for parsing source code are NOT applied in general to
Python 3's input data, nor to file I/O done with methods like
myfile.readline(). We do not expect the runtime code to look for def
statements, nor for class statements, and not for literals. A literal
is a portion of source code where there are specific rules applied,
starting with the presence of some quote characters.

This is true of nearly all languages, and in most languages, the
difference is so obvious that the question seldom gets raised. For
example, in C code a literal is evaluated at compile time, and by the
time an end user sees an input prompt, he probably doesn't even have a
compiler on the same machine.

When an end user types in his data (into an input statement, typically),
he does NOT use quote literals, he does not use hex escape codes, he
does not escape things with backslash. If he wants an o with an umlaut
on it, he'd better have such a character available on his keyboard.

I'd suggest playing around a little with literal assignments, input
statements and print functions. In those literals, try entering escape
sequences (e.g. "ab\x41cd"). Run such programs from the command line,
and observe the output from the prints. Do this without using the
interactive interpreter, as by default it "helpfully" displays
expressions with the repr() function, which confuses the issue.
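
One way to run that experiment as a script is sketched below. The helper read_simulated is hypothetical: it feeds text to input() through a temporary sys.stdin so the example runs without a keyboard.

```python
import io
import sys

def read_simulated(typed):
    """Return what input() yields when the user types `typed` and presses Enter."""
    saved = sys.stdin
    sys.stdin = io.StringIO(typed + "\n")
    try:
        return input()
    finally:
        sys.stdin = saved

# In source code, \x41 is an escape sequence for the letter A.
literal = "ab\x41cd"

# As input, the backslash, x, 4 and 1 arrive as four ordinary characters.
typed = read_simulated(r"ab\x41cd")

print(literal, len(literal))
print(typed, len(typed))
```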

Should a user write two derived functions? input_for_entering_text()
and input_if_you_are_entering_a_text_as_litteral() --- Side effect
from the unicode litteral reintroduction. I do not mind about this,
but I expect it does work logically and correctly. And it does not. PS
English is not my native language. I never know to reply to an
(interro)-negative sentence. jmf

The user doesn't write functions, the programmer does. Until you learn
to distinguish between those two phases, you'll continue having this
confusion.

If you (the programmer) want a function that asks the user to enter a
literal at the input prompt, you'll have to write a post-processing for
it, which looks for prefixes, for quotes, for backslashes, etc., and
encodes the result. There very well may be such a decoder in the Python
library, but input does nothing of the kind.
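
One candidate for such a decoder is ast.literal_eval from the standard library. A minimal sketch, where parse_as_literal is a hypothetical helper name and falling back to the raw text is just one possible design choice:

```python
import ast

def parse_as_literal(text):
    """Hypothetical helper: interpret text as a Python literal, else return it unchanged."""
    try:
        return ast.literal_eval(text)
    except (ValueError, SyntaxError):
        # Not a valid literal, so treat it as plain text.
        return text

print(repr(parse_as_literal("u'a'")))
print(repr(parse_as_literal("42")))
print(repr(parse_as_literal("hello")))
```

Note that with this fallback, text such as 42 comes back as an int rather than a string, so the caller still has to decide what it wants.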


The literal modifiers (u"" or r"") are irrelevant here. The "problem"
you're having is universal, and not new. The characters in source code
have different semantic meanings than those entered in input, or read
from file I/O.
 
Ulrich Eckhardt

Am 18.06.2012 16:00, schrieb jmfauth:
A string is a string, a "piece of text", period.

No. There are different representations for the same piece of text even
in the context of just Python. b'fou', u'fou', 'fou' are three different
source code representations, resulting in two different runtime
representations, and they all represent the same text: fou.
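
A short sketch of that point, assuming Python 3.3 or later (the names are illustrative):

```python
as_bytes = b'fou'
as_unicode = u'fou'
as_str = 'fou'

# Three spellings in source code, but only two types at runtime.
runtime_types = {type(as_bytes), type(as_unicode), type(as_str)}
print(len(runtime_types))

# Once the bytes are decoded, all three hold the same text.
print(as_bytes.decode('ascii') == as_unicode == as_str)
```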

I do not see why a unicode literal and an (well, I do not
know how the call it) a "normal class <str>" should behave
differently in code source or as an answer to an input().

input() retrieves a string from a user, not from a programmer that can
be expected to know the difference between b'\x81' and u'\u20ac'.

Should a user write two derived functions?

input_for_entering_text()
and
input_if_you_are_entering_a_text_as_litteral()

With "user" above, I guess you mean "Python programmer". In that case,
the answer is yes. Although asking the user of your program to learn
about Python's string literal formatting options is a bit much.

Side effect from the unicode litteral reintroduction.
I do not mind about this, but I expect it does
work logically and correctly. And it does not.

Yes it does. The user enters something. Python receives this and
provides it as string. You as a programmer are now supposed to
interpret, parse etc this string according to your program logic.


BTW: Just in case there is a language (native language, not programming
language) problem, don't hesitate to write in your native language, too.
Chances are good that someone here understands you.

Good luck!

Uli
 
jmfauth

Things are very clear to me. I wrote enough interactive
interpreters with all available toolkits for Windows
since I know Python (v. 1.5.6).

I do not see why the semantic may vary differently
in code source or in an interactive interpreter,
esp. if Python allow it!

If you have to know by advance what an end user
is supposed to type and/or check it ('str' or unicode
literal) in order to know if the answer has to be
evaluated or not, then it is better to reintroduce
input() and raw_input().

jmf
 
Chris Angelico

I do not see why the semantic may vary differently
in code source or in an interactive interpreter,
esp. if Python allow it!

When you're asking for input, you usually aren't looking for code. It
doesn't matter about string literal formats, because you don't need to
delimit it. In code, you need to make it clear to the interpreter
where your string finishes, and that's traditionally done with quote
characters:

name = "Chris Angelico" # this isn't part of the string, because the
two quotes mark off the ends of it

And you can include characters in your literals that you don't want in
your source code:

bad_chars = "\x00\x1A\x0A" # three characters NUL, SUB, LF

Everything about raw strings, Unicode literals, triple-quoted strings,
etc, etc, etc, is just variants on these two basic concepts. The
interpreter needs to know what you mean.

With input, though, the end of the string is defined in some other way
(such as by the user pushing Enter). The interpreter knows without any
extra hints where it's to stop parsing. Also, there's no need to
protect certain characters from getting into your code. It's a much
easier job for the interpreter, which translates to being much simpler
for the user: just type what you want and hit Enter. Quote characters
have no meaning.

Chris Angelico
 
Jussi Piitulainen

jmfauth said:
Things are very clear to me. I wrote enough interactive
interpreters with all available toolkits for Windows

u'a
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
SyntaxError: u'a

Er, no, not really :)
 
jmfauth

We are turning in circles. You are somehow
legitimating the reintroduction of unicode
literals and I shew, not to say proofed, it may
be a source of problems.

Typical Python disease. Introduce a problem,
then discuss how to solve it, but surely and
definitively do not remove that problem.

As far as I know, Python 3.2 is working very
well.

jmf
 
Andrew Berg

u'a
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
SyntaxError: u'a

Er, no, not really :)

You're using 2.x; this thread concerns 3.3, which, as has been repeated
several times, does not evaluate strings passed via input() like 2.x.
That code does not raise a SyntaxError in 3.x.
 
John Roth

Things are very clear to me. I wrote enough interactive
interpreters with all available toolkits for Windows
since I know Python (v. 1.5.6).

I do not see why the semantic may vary differently
in code source or in an interactive interpreter,
esp. if Python allow it!

If you have to know by advance what an end user
is supposed to type and/or check it ('str' or unicode
literal) in order to know if the answer has to be
evaluated or not, then it is better to reintroduce
input() and raw_input().

The change between Python 2.x and 3.x was made for security reasons. The developers felt, correctly in my opinion, that the simpler operation should not pose a security risk of a malicious user entering an expression that would corrupt the program.

In Python 3.x the equivalent of Python 2.x's input() function is eval(input()). It poses the same security risk: acting on unchecked user data.
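
To make the risk concrete, here is a hedged sketch. The variable typed stands in for whatever a user might enter, and a harmless expression is used in place of a malicious one. ast.literal_eval is shown only as one stricter alternative, not as something this thread prescribes.

```python
import ast

typed = "len('abc')"    # stands in for something a user might type

# eval() executes the text as code, which is what Python 2's input() did.
evaluated = eval(typed)

# ast.literal_eval accepts only literals, so the same text is rejected.
try:
    ast.literal_eval(typed)
    rejected = False
except ValueError:
    rejected = True

print(evaluated, rejected)
```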

John Roth
 
Dave Angel

You're using 2.x; this thread concerns 3.3, which, as has been repeated
several times, does not evaluate strings passed via input() like 2.x.
That code does not raise a SyntaxError in 3.x.

And you're missing the context. jmfauth thinks we should re-introduce
the input/raw-input distinction so he could parse literal strings. So
Jussi demonstrated that the 2.x input did NOT satisfy jmfauth's dreams.
 
Andrew Berg

And you're missing the context. jmfauth thinks we should re-introduce
the input/raw-input distinction so he could parse literal strings. So
Jussi demonstrated that the 2.x input did NOT satisfy jmfauth's dreams.

You're right. I missed that part of jmfauth's post.
 
Jussi Piitulainen

Andrew said:
You're using 2.x; this thread concerns 3.3, which, as has been
repeated several times, does not evaluate strings passed via input()
like 2.x. That code does not raise a SyntaxError in 3.x.

I used 3.1.2, and I really meant the "not really". And the ":)". I
edited out the command that raised the exception.

This thread is weird. If I didn't know that things are very clear to
jmfauth, I would think that the behaviour of input() that I observe
has absolutely nothing to do with the u'' syntax in source code.
 
Terry Reedy

We are turning in circles.

You are, not we. Please stop.

You are somehow legitimating the reintroduction of unicode
literals

We are not 'reintroducing' unicode literals. In Python 3, string
literals *are* unicode literals.

Other developers reintroduced a now meaningless 'u' prefix for the
purpose of helping people write 2&3 code that runs on both Python 2 and
Python 3. Read about it here: http://python.org/dev/peps/pep-0414/

In Python 3.3, 'u' should *only* be used for that purpose and should be
ignored by anyone not writing or editing 2&3 code. If you are not
writing such code, ignore it.

and I shew, not to say proofed, it may
be a source of problems.

You are the one making it be a problem.

Typical Python disease. Introduce a problem,
then discuss how to solve it, but surely and
definitively do not remove that problem.

The simultaneous reintroduction of 'ur', but with a different meaning
than in 2.7, *was* a problem and it should be removed in the next release.

As far as I know, Python 3.2 is working very
well.

Except that many public libraries that we would like to see ported to
Python 3 have not been. The purpose of reintroducing 'u' is to encourage
more porting of Python 2 code. Period.
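
For readers on a released interpreter, the prefixes can be checked with a small sketch like the one below. It assumes a final release of Python 3.3 or later (the 3.3 alphas discussed in this thread behaved differently), and compiles is a hypothetical helper.

```python
def compiles(source):
    """Return True if `source` is accepted as a Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False

# u'' and r'' are accepted; the combined ur'' prefix is not.
print(compiles("u'a'"), compiles("r'a'"), compiles("ur'a'"))
```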
 
jmfauth

It's a matter of perspective. I expected to have
finally a clean Python, the goal is missed.

I have nothing to object. It is "your" (core devs)
project, not mine. At least, you understood my point
of view.

I'm a more than two decades TeX user. At the release
of XeTeX (a pure unicode TeX-engine), the devs had,
like Python2/3, to make anything incompatible. A success.
It did not happen a week without seeing a updated
package or a refreshed documentation.

Luckily for me, Xe(La)TeX is more important than
Python.

As a scientist, Python is perfect.
From an educational point of view, I'm becoming
more and more skeptical about this language, a
moving target.

Note that I'm not complaining, only "desappointed".

jmf
 
Steven D'Aprano

A string is a string, a "piece of text", period.

I do not see why a unicode literal and an (well, I do not know how the
call it) a "normal class <str>" should behave differently in code source
or as an answer to an input().

They do not. As you showed earlier, in Python 3.3 the literal strings
u'a' and 'a' have the same meaning: both create a one-character string
containing the Unicode letter LOWERCASE-A.

Note carefully that the quotation marks are not part of the string. They
are delimiters. Python 3.3 allows you to create a string by using
delimiters:

' '
" "
u' '
u" "

plus triple-quoted versions of the same. The delimiter is not part of the
string. They are only there to mark the start and end of the string in
source code so that Python can tell the difference between the string "a"
and the variable named "a".

Note carefully that quotation marks can exist inside strings:

my_string = "This string has 'quotation marks'."

The " at the start and end of the string literal are delimiters, not part
of the string, but the internal ' characters *are* part of the string.

When you read data from a file, or from the keyboard using input(),
Python takes the data and returns a string. You don't need to enter
delimiters, because there is no confusion between a string (all data you
read) and other programming tokens.

For example:

py> s = input("Enter a string: ")
Enter a string: 42
py> print(s, type(s))
42 <class 'str'>

Because what I type is automatically a string, I don't need to enclose it
in quotation marks to distinguish it from the integer 42.

py> s = input("Enter a string: ")
Enter a string: This string has 'quotation marks'.
py> print(s, type(s))
This string has 'quotation marks'. <class 'str'>


What you type is exactly what you get, no more, no less.

If you type 42, you get the two character string "42" and not the int 42.

If you type [1, 2, 3], then you get the nine character string "[1, 2, 3]"
and not a list containing integers 1, 2 and 3.

If you type 3**0.5 then you get the six character string "3**0.5" and not
the float 1.7320508075688772.

If you type u'a' then you get the four character string "u'a'" and not
the single character 'a'.

There is nothing new going on here. The behaviour of input() in Python 3,
and raw_input() in Python 2, has not changed.

Should a user write two derived functions?

input_for_entering_text()
and
input_if_you_are_entering_a_text_as_litteral()

If you, the programmer, want to force the user to write input in Python
syntax, then yes, you have to write a function to do so. input() is very
simple: it just reads strings exactly as typed. It is up to you to
process those strings however you wish.
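
A compact sketch of the examples above. The strings are assigned directly, standing in for what input() would return, and the int() conversion is just one illustration of processing the text yourself:

```python
# Each entry is the text a user might type at an input() prompt.
typed_examples = ["42", "[1, 2, 3]", "3**0.5", "u'a'"]

# Whatever was typed, the result is a str holding exactly those characters.
for text in typed_examples:
    print(repr(text), len(text), type(text).__name__)

# Turning the text into something else is an explicit, separate step.
number = int(typed_examples[0])
print(number + 1)
```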
 
jmfauth

Python 3.3.0a4 (v3.3.0a4:7c51388a3aa7+, May 31 2012, 20:15:21) [MSC v.1600 32 bit (Intel)] on win32
running smidzero.py...
....smidzero has been executed
input(':')
:éléphant
'éléphant'
input(':')
:u'éléphant'
'éléphant'
input(':')
:u'\u00e9l\xe9phant'
'éléphant'
input(':')
:u'\U000000e9léphant'
'éléphant'
input(':')
:\U000000e9léphant
'éléphant'
input(':')
:b'éléphant'
"b'éléphant'"
len(input(':'))
:b'éléphant'
11
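
The last two entries can be reproduced without a prompt. A hedged sketch, assuming standard Python 3 (typed stands in for the value returned by input()): the text is an ordinary 11-character string, and it would not even be valid as source code, because bytes literals may contain only ASCII characters.

```python
typed = "b'éléphant'"    # the text exactly as entered at the prompt

# Counts characters: the b, two quotes and eight letters.
print(len(typed))

# As source code, this text is not a valid bytes literal.
try:
    compile(typed, "<typed>", "eval")
    valid_source = True
except SyntaxError:
    valid_source = False
print(valid_source)
```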

---

Good news on the ru''/ur'' front:
http://bugs.python.org/issue15096

---

Finally I'm just wondering if this unicode_literal
reintroduction is not a bad idea.

b'these_are_bytes'
u'this_is_a_unicode_string'

I wrote all my Py2 code in a "unicode mode" since ... Py2.3 (?).

jmf
 
