How do I display unicode value stored in a string variable using ord()

Charles Jensen

Everyone knows that the python command

ord(u'…')

will output the number 8230, which is the Unicode code point of the horizontal ellipsis.

How would I use ord() to find the unicode value of a string stored in a variable?

So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?

a = '…'
ord(a)
 
Chris Angelico

How would I use ord() to find the unicode value of a string stored in a variable?

So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?

a = '…'
ord(a)

I presume you're talking about Python 2, because in Python 3 your
string variable is a Unicode string and will behave as you describe
above.

You'll need to look into what the encoding is, and figure it out from there.

ChrisA
 
Dave Angel

Everyone knows that the python command

ord(u'…')

will output the number 8230 which is the unicode character for the horizontal ellipsis.

How would I use ord() to find the unicode value of a string stored in a variable?

So the following 2 lines of code will give me the ascii value of the variable a. How do I specify ord to give me the unicode value of a?

a = '…'
ord(a)

You omitted the print statement. You also didn't specify which version
of Python you're using; I'll assume Python 2.x, because in Python 3.0
through 3.2 the u"xx" notation would have been a syntax error.

To get the ord of a unicode variable, you do it the same as a unicode
literal:

# Note: for a non-ASCII literal to work reliably, you probably need the
# correct encoding declaration in line 1 or 2 of the file.
a = u"j"
print ord(a)

But if you have a byte string containing some binary bits, and you want
to get a unicode character value out of it, you'll need to explicitly
convert it to unicode.

First, decide which encoding the byte string was encoded with. If you
specify the wrong encoding, you're likely to get an exception, or maybe
just a nonsense answer.

a = "\xe2\x80\xa6"  # the UTF-8 bytes of the horizontal ellipsis
b = a.decode("utf-8")  # b is now the one-character unicode string u"\u2026"
print ord(b)  # prints 8230
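
For readers on Python 3, the same decode-then-ord steps can be sketched as follows. This is only an illustration: the byte values are assumed inputs chosen for the example, not data from the original question.

```python
# Python 3 sketch: bytes must be decoded before ord() can report a code point.
raw = b"\xe2\x80\xa6"  # assumed input: UTF-8 bytes of the horizontal ellipsis

text = raw.decode("utf-8")
code_point = ord(text)
print(code_point)  # 8230

# Decoding the same bytes with the wrong codec gives a nonsense answer:
# three unrelated characters instead of one ellipsis.
wrong = raw.decode("cp1252")
wrong_points = [ord(ch) for ch in wrong]
print(wrong_points)  # [226, 8364, 166]

# Bytes that are not valid in the chosen codec raise an exception instead.
try:
    b"\xc1\xc1".decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)  # True
```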
 
Terry Reedy

a = '…'
print(ord(a))
8230

Most things with unicode are easier in 3.x, and some are even better in
3.3. The current beta is good enough for most informal work. 3.3.0 will
be out in a month.
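
A slightly longer Python 3 sketch of the same idea, using only the standard library. The escape "\u2026" is used instead of the literal character so the snippet does not depend on the source file's encoding.

```python
import unicodedata

a = "\u2026"  # the same character as '…', written as an escape
code_point = ord(a)

print(code_point)           # 8230
print(hex(code_point))      # 0x2026
print(unicodedata.name(a))  # HORIZONTAL ELLIPSIS
print(chr(code_point) == a) # True: chr() is the inverse of ord()
```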
 
wxjmfauth

On Friday, 17 August 2012 at 01:59:31 UTC+2, Terry Reedy wrote:
a = '…'
print(ord(a))
8230

Most things with unicode are easier in 3.x, and some are even better in
3.3. The current beta is good enough for most informal work. 3.3.0 will
be out in a month.

Slightly off topic.

The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
is one of these characters existing in the cp1252, mac-roman
coding schemes and not in iso-8859-1 (latin-1) and obviously
not in ascii. It causes Py3.3 to work a few 100% slower
than Py<3.3 versions due to the flexible string representation
(ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).

Traceback (most recent call last):
  File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026' in position 0: ordinal not in range(256)

If one could neglect this (typographically important) glyph, what
to say about the characters of the European scripts (languages)
present in cp1252 or in mac-roman but not in latin-1 (eg. the
French script/language)?
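
The encoding claim above (present in cp1252 and mac-roman, absent from latin-1 and ascii) can be checked with a short Python 3 sketch. The helper name try_encode is hypothetical, and 'œ' is added here only as an example of a French character with the same property.

```python
def try_encode(text, codecs):
    """Return {codec: encoded bytes, or None if the codec cannot encode text}."""
    results = {}
    for codec in codecs:
        try:
            results[codec] = text.encode(codec)
        except UnicodeEncodeError:
            results[codec] = None
    return results

CODECS = ["ascii", "latin-1", "cp1252", "mac_roman", "utf-8"]

ellipsis_results = try_encode("\u2026", CODECS)  # HORIZONTAL ELLIPSIS
oe_results = try_encode("\u0153", CODECS)        # LATIN SMALL LIGATURE OE

for codec in CODECS:
    print(codec, ellipsis_results[codec], oe_results[codec])
```

None in the output marks a codec that raised UnicodeEncodeError, which is the same failure shown in the traceback above.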

Very nice. Python 2 was built for ascii users, and now Python 3 is
*optimized* for, let's say, ascii users!

The future is bright for Python. French users are better
served by Apple or MS products, simply because these
corporations know you cannot write French with iso-8859-1.

PS When "TeX" moved from the ascii encoding to iso-8859-1
and the so-called Cork encoding, "they" knew this and provided
all the complementary packages to circumvent it. It was
in 199? (Python was not even born).

Ditto for the foundries (Adobe, Linotype, ...)

jmf
 
Jerry Hill

The character '…', Unicode name 'HORIZONTAL ELLIPSIS',
is one of these characters existing in the cp1252, mac-roman
coding schemes and not in iso-8859-1 (latin-1) and obviously
not in ascii. It causes Py3.3 to work a few 100% slower
than Py<3.3 versions due to the flexible string representation
(ascii/latin-1/ucs-2/ucs-4) (I found cases up to 1000%).

Traceback (most recent call last):
File "<eta last command>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2026'
in position 0: ordinal not in range(256)

If one could neglect this (typographically important) glyph, what
to say about the characters of the European scripts (languages)
present in cp1252 or in mac-roman but not in latin-1 (eg. the
French script/language)?

So... python should change the longstanding definition of the latin-1
character set? This isn't some sort of python limitation, it's just
the reality of legacy encodings that actually exist in the real world.

Very nice. Python 2 was built for ascii users, and now Python 3 is
*optimized* for, let's say, ascii users!

The future is bright for Python. French users are better
served by Apple or MS products, simply because these
corporations know you cannot write French with iso-8859-1.

PS When "TeX" moved from the ascii encoding to iso-8859-1
and the so-called Cork encoding, "they" knew this and provided
all the complementary packages to circumvent it. It was
in 199? (Python was not even born).

Ditto for the foundries (Adobe, Linotype, ...)


I don't understand what any of this has to do with Python. Just
output your text in UTF-8 like any civilized person in the 21st
century, and none of that is a problem at all. Python makes that easy.
It also makes it easy to interoperate with older encodings if you
have to.
 
wxjmfauth

On Friday, 17 August 2012 at 20:21:34 UTC+2, Jerry Hill wrote:
So... python should change the longstanding definition of the latin-1
character set? This isn't some sort of python limitation, it's just
the reality of legacy encodings that actually exist in the real world.

I don't understand what any of this has to do with Python. Just
output your text in UTF-8 like any civilized person in the 21st
century, and none of that is a problem at all. Python makes that easy.
It also makes it easy to interoperate with older encodings if you
have to.

Sorry, you missed the point.

My comment had nothing to do with the source code encoding,
the encoding of a Python "string" in the source code, or
the display of a Python3 <str>.
I wrote about the *internal* Python "coding", the
way Python keeps "strings" in memory. See PEP 393.

jmf
 
Dave Angel

On Friday, 17 August 2012 at 20:21:34 UTC+2, Jerry Hill wrote:
Sorry, you missed the point.

My comment had nothing to do with the source code encoding,
the encoding of a Python "string" in the source code, or
the display of a Python3 <str>.
I wrote about the *internal* Python "coding", the
way Python keeps "strings" in memory. See PEP 393.

jmf

The internal coding described in PEP 393 has nothing to do with latin-1
encoding. So what IS your point? Make it clearly, without all the
snide side-comments.
 
Dave Angel

It certainly does. PEP 393 provides for Unicode strings to be represented
internally as any of Latin-1, UCS-2, or UCS-4, whichever is smallest and
sufficient to contain the data. I understand the complaint to be that while
the change is great for strings that happen to fit in Latin-1, it is less
efficient than previous versions for strings that do not.

That's not the way I interpreted the PEP 393. It takes a pure unicode
string, finds the largest code point in that string, and chooses 1, 2 or
4 bytes for every character, based on how many bits it'd take for that
largest code point. Further, I read it to mean that only 00 bytes would
be dropped in the process, no other bytes would be changed. I take it
as a coincidence that it happens to match latin-1; that's the way
Unicode happened historically, and is not Python's fault. Am I reading
it wrong?
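
That reading of the PEP can be modelled in a few lines of Python 3. This is a sketch of the rule as described in the paragraph above, not CPython's actual implementation, and the function name storage_width is hypothetical.

```python
def storage_width(text):
    """Bytes per character under the rule described above (a model only).

    The width is chosen from the largest code point in the string:
    below 256 fits in 1 byte, below 65536 in 2 bytes, anything else in 4.
    """
    largest = max(map(ord, text), default=0)
    if largest < 0x100:
        return 1
    if largest < 0x10000:
        return 2
    return 4

for sample in ("abc", "caf\u00e9", "ab\u2026", "\U0001F600"):
    print(ascii(sample), storage_width(sample))
```

Under this model a French string such as "café" still gets the 1-byte width, while a single '…' moves the whole string to the 2-byte width.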

I also figure this is going to be more space efficient than Python 3.2
for any string which had a max code point of 65535 or less (in Windows),
or 4 billion or less (in real systems). So unless French has code points
over 64k, I can't figure that anything is lost.

I have no idea about the times involved, so I wanted a more specific
complaint.

I don't know how much merit there is to this claim. It would seem to me
that even in non-western locales, most strings are likely to be Latin-1 or
even ASCII, e.g. class and attribute and function names.

The jmfauth rant I was responding to was saying that French isn't
efficiently encoded, and that the performance of some vague operations
was somehow reduced several-fold. I was just trying to get him to be
more specific.
 
Steven D'Aprano

On Friday, 17 August 2012 at 20:21:34 UTC+2, Jerry Hill wrote: [...]
Sorry, you missed the point.

My comment had nothing to do with the source code encoding, the encoding
of a Python "string" in the source code, or the display of a Python3
<str>.
I wrote about the *internal* Python "coding", the way Python keeps
"strings" in memory. See PEP 393.


The PEP does not support your claim that flexible string storage is 100%
to 1000% slower. It claims 1% - 30% slowdown, with a saving of up to 60%
of the memory used for strings.

I don't really understand what message you are trying to give here. Are
you saying that PEP 393 is a good thing or a bad thing?

In Python 1.x, there was no support for Unicode at all. You could only
work with pure byte strings. Support for non-ascii characters like … ∞ é ñ
£ π Ж ش was purely by accident -- if your terminal happened to be set to
an encoding that supported a character, and you happened to use the
appropriate byte value, you might see the character you wanted.

In Python 2.0, Python gained support for Unicode. You could now guarantee
support for any Unicode character in the Basic Multilingual Plane (BMP)
by writing your strings using the u"..." style. In Python 3, you no
longer need the leading u; all strings are unicode.

But there is a problem: if your Python interpreter is a "narrow build",
it *only* supports Unicode characters in the BMP. When Python is a "wide
build", compiled with support for the additional character planes, then
strings take much more memory, even if they are in the BMP, or are simple
ASCII strings.

PEP 393 fixes this problem and gets rid of the distinction between narrow
and wide builds. From Python 3.3 onwards, all Python builds will have
the same support for unicode, rather than most being BMP-only. Each
individual string's internal storage will use only as many bytes-per-
character as needed to store the largest character in the string.
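
A quick way to see the end of the narrow/wide distinction, assuming the interpreter is Python 3.3 or later (on older narrow builds the numbers would differ):

```python
import sys

astral = "\U0001F600"  # a character outside the BMP, chosen as an example

# From 3.3 onwards every build reports the full Unicode range.
print(hex(sys.maxunicode))

# A non-BMP character counts as one character, not a surrogate pair.
print(len(astral))
print(hex(ord(astral)))
```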

This will save a lot of memory for those using mostly ASCII or Latin-1
but a few multibyte characters. While the increased complexity causes a
small slowdown, the increased functionality makes it well worthwhile.
 
Steven D'Aprano

Unicode strings are not represented as Latin-1 internally. Latin-1 is a
byte encoding, not a unicode internal format. Perhaps you mean to say
that they are represented as a single byte format?

That's not the way I interpreted the PEP 393. It takes a pure unicode
string, finds the largest code point in that string, and chooses 1, 2 or
4 bytes for every character, based on how many bits it'd take for that
largest code point.

That's how I interpret it too.

Further i read it to mean that only 00 bytes would
be dropped in the process, no other bytes would be changed.

Just to clarify, you aren't talking about the \0 character, but only about
extraneous "padding" 00 bytes.

I also figure this is going to be more space efficient than Python 3.2
for any string which had a max code point of 65535 or less (in Windows),
or 4billion or less (in real systems). So unless French has code points
over 64k, I can't figure that anything is lost.

I think that on narrow builds, it won't make terribly much difference.
The big savings are for wide builds.
 
wxjmfauth

sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764

'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit (Intel)]'
1.2918679017971044

timeit.timeit("('ab…' * 10).replace('…', '€…')")
1.2484133226156757

* I intuitively and empirically noticed, this happens for
cp1252 or mac-roman characters and not characters which are
elements of the latin-1 coding scheme.

* Bad luck, such characters are usual characters in French scripts
(and in some other European language).

* I do not recall the extreme cases I found. Believe me, when
I'm speaking about a few 100%, I do not lie.

My take of the subject.

This is a typical Python disease. Do not solve a problem, but
find a way, a workaround, which is expected to solve the problem
and which finally solves nothing. As far as I know, to break
the "BMP limit", the tools are here. They are called utf-8 or
ucs-4/utf-32.

One day, I came across a very, very old mail message, dating from the
time of the introduction of the unicode type in Python 2.
If I recall correctly it was from Victor Stinner. He wrote
something like this: "Let's go with ucs-4, and the problems
are solved for ever". He was so right.

I have been following the dev-list for years, and my feeling is that
there is always a latent and permanent conflict between
"ascii users" and "non ascii users" (see the unicode
literal reintroduction).

Please, do not get me wrong. As a non-computer scientist,
I'm very happy with Python. But when I try to take a step
back, I become more and more sceptical.

PS Py3.3b2 is still crashing, silently exiting, with
cp65001.

jmf
 
Steven D'Aprano

sys.version
'3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]'
timeit.timeit("('ab…' * 1000).replace('…', '……')")
37.32762490493721
timeit.timeit("('ab…' * 10).replace('…', 'œ…')")
0.8158757139801764
'3.3.0b2 (v3.3.0b2:4972a8f1b2aa, Aug 12 2012, 15:02:36) [MSC v.1600 32 bit (Intel)]'
61.919225272152346

"imeit"?

It is hard to take your results seriously when you have so obviously
edited your timing results, not just copied and pasted them.


Here are my results, on my laptop running Debian Linux. First, testing on
Python 3.2:

steve@runes:~$ python3.2 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 50.2 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 45.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 51.3 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 47.6 usec per loop
steve@runes:~$ python3.2 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 45.9 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 57.5 usec per loop
steve@runes:~$ python3.2 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 49.7 usec per loop


As you can see, the timing results are all consistently around 50
microseconds per loop, regardless of which characters I use, whether they
are in Latin-1 or not. The differences between one test and another are
not meaningful.


Now I do them again using Python 3.3:

steve@runes:~$ python3.3 -m timeit "('abc' * 1000).replace('c', 'de')"
10000 loops, best of 3: 64.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '……')"
10000 loops, best of 3: 67.8 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'x…')"
10000 loops, best of 3: 66 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', 'œ…')"
10000 loops, best of 3: 67.6 usec per loop
steve@runes:~$ python3.3 -m timeit "('ab…' * 1000).replace('…', '€…')"
10000 loops, best of 3: 68.3 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('X', 'éç')"
10000 loops, best of 3: 67.9 usec per loop
steve@runes:~$ python3.3 -m timeit "('XYZ' * 1000).replace('Y', 'πЖ')"
10000 loops, best of 3: 66.9 usec per loop

The results are all consistently around 67 microseconds. So Python 3.3's
string handling is about 30% slower in the examples shown here.
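
The same comparison can also be run from inside Python rather than from the shell. The following is a minimal sketch using the standard timeit module; the helper name usec_per_loop and the loop counts are choices made for this example, and absolute numbers will vary by machine and Python version.

```python
import timeit

# (base string, substring to replace, replacement), written with escapes
CASES = [
    ("abc", "c", "de"),
    ("ab\u2026", "\u2026", "\u2026\u2026"),  # ellipsis -> two ellipses
    ("ab\u2026", "\u2026", "\u0153\u2026"),  # ellipsis -> oe ligature + ellipsis
    ("ab\u2026", "\u2026", "\u20ac\u2026"),  # ellipsis -> euro sign + ellipsis
    ("XYZ", "Y", "\u03c0\u0416"),            # Y -> Greek pi + Cyrillic Zhe
]

def usec_per_loop(base, old, new, number=200, repeat=3):
    """Best-of-`repeat` time for one replace call, in microseconds."""
    timer = timeit.Timer(
        "(base * 1000).replace(old, new)",
        globals={"base": base, "old": old, "new": new},
    )
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e6

results = {case: usec_per_loop(*case) for case in CASES}
for (base, old, new), usec in results.items():
    print(f"{base!a}.replace({old!a}, {new!a}): {usec:.1f} usec per loop")
```

Running the script under two interpreters and comparing the printed values gives the kind of side-by-side figures shown above.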

If you can consistently replicate a 100% to 1000% slowdown in string
handling, please report it as a performance bug:


http://bugs.python.org/

Don't forget to report your operating system.


My take of the subject.

This is a typical Python disease. Do not solve a problem, but find a
way, a workaround, which is expected to solve the problem and which
finally solves nothing. As far as I know, to break the "BMP limit", the
tools are here. They are called utf-8 or ucs-4/utf-32.

The problem with UCS-4 is that every character requires four bytes.
Every. Single. One.

So under UCS-4, the pure-ascii string "hello world" takes 44 bytes plus
the object overhead. Under UCS-2, it takes half that space: 22 bytes, but
of course UCS-2 can only represent characters in the BMP. A pure ASCII
string would only take 11 bytes, but we're not going back to pure ASCII.

(There is an extension to UCS-2, UTF-16, which encodes non-BMP characters
using two 16-bit code units, a surrogate pair. This is fragile and doesn't
work very well, because string-handling methods can break the surrogate
pairs apart, leaving you with an invalid unicode string. Not good.)
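
To make the surrogate-pair point concrete, here is a small Python 3 sketch that shows the 16-bit units UTF-16 uses for a BMP and a non-BMP character. The helper utf16_code_units is hypothetical and the sample characters are arbitrary.

```python
import struct

def utf16_code_units(ch):
    """Return the 16-bit code units that UTF-16 uses for the given text."""
    data = ch.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(data) // 2), data)

bmp = utf16_code_units("\u2026")         # inside the BMP: one unit
astral = utf16_code_units("\U0001F600")  # outside the BMP: a surrogate pair

print([hex(u) for u in bmp])     # ['0x2026']
print([hex(u) for u in astral])  # ['0xd83d', '0xde00']

# Half of a pair on its own is not a valid character: it cannot be encoded.
high_only = chr(astral[0])
try:
    high_only.encode("utf-8")
    lone_ok = True
except UnicodeEncodeError:
    lone_ok = False
print(lone_ok)  # False
```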

The difference between 44 bytes and 22 bytes for one little string is not
very important, but when you double the memory required for every single
string it becomes huge. Remember that every class, function and method
has a name, which is a string; every attribute and variable has a name,
all strings; functions and classes have doc strings, all strings. Strings
are used everywhere in Python, and doubling the memory needed by Python
means that it will perform worse.

With PEP 393, each Python string will be stored in the most efficient
format possible:

- if it only contains ASCII or other Latin-1 characters (code points
below 256), it will be stored using 1 byte per character;

- if it only contains characters in the BMP, it will be stored using
UCS-2 (2 bytes per character);

- if it contains non-BMP characters, the string will be stored using
UCS-4 (4 bytes per character).
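
The per-character cost can be measured rather than taken on trust. This sketch assumes CPython 3.3 or later, where sys.getsizeof reflects the flexible representation; other implementations may report different numbers. The helper name bytes_per_char is hypothetical. Note that 'é', a Latin-1 character with a code point below 256, lands in the 1-byte case.

```python
import sys

def bytes_per_char(ch):
    """Estimate storage per character by comparing two string lengths.

    The fixed object overhead cancels out in the subtraction, leaving
    only the cost of the 100 extra characters.
    """
    short = ch * 100
    longer = ch * 200
    return (sys.getsizeof(longer) - sys.getsizeof(short)) // 100

widths = {ch: bytes_per_char(ch) for ch in ("a", "\u00e9", "\u2026", "\U0001F600")}
for ch, width in widths.items():
    print(ascii(ch), hex(ord(ch)), width)
```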
 
wxjmfauth

On Saturday, 18 August 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
[...]
The problem with UCS-4 is that every character requires four bytes.
[...]

I'm aware of this (and all the blah blah blah you are
explaining). This is always the same song. Memory.

Let me ask. Is Python an "American" product for US users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

This flexible string representation is a regression (ascii users
or not).

I recognize that in practice the real impact is, for many users,
close to zero (including me), but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This is not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf
 
Ian Kelly

(Resending this to the list because I previously sent it only to
Steven by mistake. Also showing off a case where top-posting is
reasonable, since this bit requires no context. :)
 
Mark Lawrence

On Saturday, 18 August 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
[...]
The problem with UCS-4 is that every character requires four bytes.
[...]

I'm aware of this (and all the blah blah blah you are
explaining). This is always the same song. Memory.

Let me ask. Is Python an "American" product for US users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

This flexible string representation is a regression (ascii users
or not).

I recognize that in practice the real impact is, for many users,
close to zero (including me), but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

[*] This is not even true, if we consider the €uro currency
symbol used all around the world (banking, accounting
applications).

jmf

Sorry but you've got me completely baffled. Could you please explain in
words of one syllable or less so I can attempt to grasp what the hell
you're on about?
 
Chris Angelico

I'm aware of this (and all the blah blah blah you are
explaining). This is always the same song. Memory.

Let me ask. Is Python an "American" product for US users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

Regardless of your own native language, "len" is the name of a popular
Python function. And "dict" is a well-used class. Both those names are
representable in ASCII, even if every quoted string in your code
requires more bytes to store.

And memory usage has significance in many other areas, too. CPU cache
utilization turns a space saving into a time saving. That's why
structure packing still exists, even though member alignment has other
advantages.

You'd be amazed how many non-USA strings still fit inside seven bits,
too. Are you appending a space to something? Splitting on newlines?
You'll have lots of strings that are now going to be space-optimized.
Of course, the performance gains from shortening some of the strings
may be offset by costs when comparing one-byte and multi-byte strings,
but presumably that's all been gone into in great detail elsewhere.

ChrisA
 
Ian Kelly

On Saturday, 18 August 2012 at 14:27:23 UTC+2, Steven D'Aprano wrote:
[...]
The problem with UCS-4 is that every character requires four bytes.
[...]

I'm aware of this (and all the blah blah blah you are
explaining). This is always the same song. Memory.

Let me ask. Is Python an "American" product for US users
or is it a tool for everybody [*]?
Is there any reason why non ascii users are somehow penalized
compared to ascii users?

The change does not just benefit ASCII users. It primarily benefits
anybody using a wide unicode build with strings mostly containing only
BMP characters. Even for narrow build users, there is the benefit
that with approximately the same amount of memory usage in most cases,
they no longer have to worry about non-BMP characters sneaking in and
breaking their code.

There is some additional benefit for Latin-1 users, but this has
nothing to do with Python. If Python is going to have the option of a
1-byte representation (and as long as we have the flexible
representation, I can see no reason not to), then it is going to be
Latin-1 by definition, because that's what 1-byte Unicode (UCS-1, if
you will) is. If you have an issue with that, take it up with the
designers of Unicode.
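
The "Latin-1 by definition" point can be checked directly: the first 256 Unicode code points coincide with the 256 byte values of Latin-1. A minimal Python 3 sketch:

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding as Latin-1 maps byte value N to code point N, with no exceptions.
decoded = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in decoded]

print(code_points == list(range(256)))         # True
print(decoded.encode("latin-1") == all_bytes)  # True: the round trip is lossless
```

So a 1-byte-per-character store for code points below 256 is indistinguishable from Latin-1, whatever name it is given.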

This flexible string representation is a regression (ascii users
or not).

I recognize that in practice the real impact is, for many users,
close to zero (including me), but I have shown (I think) that
this flexible representation is, by design, not as optimal
as it is supposed to be. This is in my mind the relevant point.

You've shown nothing of the sort. You've demonstrated only one out of
many possible benchmarks, and other users on this list can't even
reproduce that.
 
