Performance of int/long in Python 3

Steven D'Aprano

PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which case wstr_length differs from length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string is
null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""

All the words are in English (well, most of them...) but what does it
mean?

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip.

Under what circumstances will a string be created from a wchar_t string?
How, and why, would such a string be created? Why would Python still
support strings containing surrogates when it now has a nice, shiny,
surrogate-free flexible representation?


... the UTF-8 version. It'll keep it if it has it, and not else. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Not to me. That almost doubles the size of the string, on the off-chance
that you'll need the UTF-8 encoding. Which for many uses, you don't, and
even if you do, it seems like premature optimization to keep it around
just in case. Encoding to UTF-8 will be fast for small N, and for large
N, why carry around (potentially) multiple megabytes of duplicated data
just in case the encoded version is needed some time?
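
To put rough numbers on that (a quick sketch anyone can run; the
timings are machine-dependent and purely illustrative):

import timeit

# Cost of encoding on demand, small N versus large N.
for n in (10, 100, 10**6):
    t = min(timeit.repeat("s.encode('utf-8')",
                          setup="s = 'a' * %d" % n, number=1000))
    print("N=%d: %.6f sec per 1000 encodes" % (n, t))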
 
Chris Angelico

Under what circumstances will a string be created from a wchar_t string?
How, and why, would such a string be created? Why would Python still
support strings containing surrogates when it now has a nice, shiny,
surrogate-free flexible representation?

Strings are created from some form of content. If not from another
Python string, then - most likely - it's from a stream of bytes. If
from a C API that returns wchar_t, then it'd make sense to have that
form around.
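
For instance, ctypes traffics in wchar_t buffers, so a string can
arrive in Python by that route (a tiny illustrative sketch, not the
actual construction path inside CPython):

import ctypes

buf = ctypes.create_unicode_buffer("naïve text")  # a C wchar_t array
s = buf.value                             # str built from wchar_t data
assert s == "naïve text"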

ChrisA
 
Ethan Furman

Steven D'Aprano:


It doesn't horrify me - I've been working this way for over 10 years and it seems completely natural.

Horrifying or not, I am willing to give up a small amount of speed for correctness. Heck, I'm willing to give up a lot
of speed for correctness. Once I have my slow but correct prototype going I can recode in a faster language (if needed)
and compare its blazingly fast output with my slowly-generated but known-good output.

You can wrap access in iterators that hide the byte offsets if you like.
This then ensures that all operations on those iterators are safe,
only allowing the iterator to point at the start/end of valid characters.

Sure. Or I can let Python handle it for me.

The counter-problem is that a French document that needs to include one mathematical symbol (or emoji) outside
Latin-1 will double in size as a Python string.

True. But how often do you have the entire document as a single string? Use readlines() instead of read(). Besides,
memory is cheap.
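
For instance, processing a document line by line means no single str
ever holds the whole thing (a sketch with stand-in data; io.StringIO
plays the role of the real file):

import io

doc = io.StringIO("line one\nline two has \U0001D11E\nline three\n")
for line in doc:    # each line is its own small str
    pass            # only the middle line pays the 4-bytes-per-char rate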
 
Chris Angelico

It doesn't horrify me - I've been working this way for over 10 years and
it seems completely natural. You can wrap access in iterators that hide the
byte offsets if you like. This then ensures that all operations on those
iterators are safe, only allowing the iterator to point at the start/end of
valid characters.

But both this and your example of case conversion are, fundamentally,
iterating over the string. What if you aren't doing that? What if you
want to parse and process?

ChrisA
 
Ian Kelly

Not to me. That almost doubles the size of the string, on the off-chance
that you'll need the UTF-8 encoding. Which for many uses, you don't, and
even if you do, it seems like premature optimization to keep it around
just in case. Encoding to UTF-8 will be fast for small N, and for large
N, why carry around (potentially) multiple megabytes of duplicated data
just in case the encoded version is needed some time?

From the PEP:

"""
A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible (which
generates a new string object every time). APIs that implicitly
converts a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.
"""

So the utf8 representation is not populated when the string is
created, but when a utf8 representation is requested, and only when
requested by the API that returns a char*, not by the API that returns
a bytes object.
 
Ian Kelly

From the PEP:

"""
A new function PyUnicode_AsUTF8 is provided to access the UTF-8
representation. It is thus identical to the existing
_PyUnicode_AsString, which is removed. The function will compute the
utf8 representation when first called. Since this representation will
consume memory until the string object is released, applications
should use the existing PyUnicode_AsUTF8String where possible (which
generates a new string object every time). APIs that implicitly
converts a string to a char* (such as the ParseTuple functions) will
use PyUnicode_AsUTF8 to compute a conversion.
"""

So the utf8 representation is not populated when the string is
created, but when a utf8 representation is requested, and only when
requested by the API that returns a char*, not by the API that returns
a bytes object.

Since the PEP specifically mentions ParseTuple string conversion, I am
thinking that this is probably the motivation for caching it. A
string that is passed into a C function (that uses one of the various
UTF-8 char* format specifiers) is perhaps likely to be passed into
that function again at some point, so the UTF-8 representation is kept
around to avoid the need to recompute it on each call.
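
A rough Python analogy of the caching behaviour the PEP describes
(illustrative only -- the real mechanism lives in the C implementation,
and CachedUTF8 is a made-up name):

class CachedUTF8:
    def __init__(self, text):
        self.text = text
        self._utf8 = None              # not populated at creation time

    def as_utf8(self):                 # analogue of PyUnicode_AsUTF8
        if self._utf8 is None:
            self._utf8 = self.text.encode('utf-8')
        return self._utf8              # cached until the object dies

s = CachedUTF8("déjà vu")
assert s.as_utf8() is s.as_utf8()      # second call reuses the cache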
 
Grant Edwards

I cannot speak for the borg mind, but for myself a troll is anyone
who continually posts rants (such as RR & XL) or who continuously
hijacks threads to talk about their pet peeve (such as jmf).

Assuming jmf actually does care deeply and genuinely about Unicode
implementations, and his postings reflect his actual position/opinion,
then he's not a troll. Traditionally, a troll is someone who posts
statements purely to provoke a response -- they don't really care
about the topic and often don't believe what they're posting.
 
Ethan Furman

Assuming jmf actually does care deeply and genuinely about Unicode
implementations, and his postings reflect his actual position/opinion,
then he's not a troll. Traditionally, a troll is someone who posts
statements purely to provoke a response -- they don't really care
about the topic and often don't believe what they're posting.

Even if he does care deeply and genuinely he still hijacks threads, still refuses the challenges to try X or Y and
report back, and (ISTM) still refuses to learn.

If that's not trollish behavior, what is it?

FWIW I don't think he does care deeply and genuinely (at least not genuinely) or he would do more than whine about micro
benchmarks and make sweeping statements like "nobody here understands unicode" (paraphrased).
 
Grant Edwards

Even if he does care deeply and genuinely he still hijacks threads,
still refuses the challenges to try X or Y and report back, and
(ISTM) still refuses to learn.

If that's not trollish behavior, what is it?

He might indeed be trolling. But what defines a troll is
motive/intent, not behavior. Those behaviors are all common in
non-troll net.kooks. Maybe I'm being a bit too "old-school Usenet",
but being rude, ignorant (even stubbornly so), wrong, or irrational
doesn't make you a troll. What makes you a troll is intent. If you
don't actually care about the topic but are posting because you enjoy
poking people with a stick to watch them jump and howl, then you're a
troll.

FWIW I don't think he does care deeply and genuinely (at least not
genuinely) or he would do more than whine about micro benchmarks and
make sweeping statements like "nobody here understands unicode"
(paraphrased).

Perhaps he doesn't care about Unicode or Python performance. If so
he's putting on a pretty good act -- if he's a troll, he's a good one
and he's running a long game. Personally, I don't think he's a troll.
I think he's obsessed with what he perceives as an issue with Python's
string implementation. IOW, if he's a troll, he's got me fooled.
 
Terry Reedy

Under what circumstances will a string be created from a wchar_t string?
How, and why, would such a string be created? Why would Python still
support strings containing surrogates when it now has a nice, shiny,
surrogate-free flexible representation?

I believe because surrogates are legal codepoints and users may put them
in strings even though Python does not (except for surrogateescape
error handling).
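
For example, the surrogateescape error handler is precisely how lone
surrogates end up inside an ordinary str (a minimal sketch):

raw = b"caf\xe9"                              # latin-1 bytes, invalid UTF-8
s = raw.decode("utf-8", "surrogateescape")    # -> 'caf\udce9'
assert "\udce9" in s                          # a lone surrogate in the str
assert s.encode("utf-8", "surrogateescape") == raw   # lossless round-trip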

I believe some of the internal complexity comes from supporting the old
C-api so as to not immediately invalidate existing extensions.
 
rurpy

It can also be the case when language less strong would be useless.

I don't get your point.
I was pointing out the fallacy in Steven's logic (which you cut).
How is your statement relevant to that?

Non-objective? If today poster B says X, and tomorrow poster B says
s/he was unaware of X until just now, is not "liar" a reasonable
conclusion?

Of course not. People forget what they posted previously, change
their mind, don't express what they intended perfectly, sometimes
express a complex thought that the reader inaccurately perceives
as contradictory, don't realize themselves that their thinking
is contradictory, ...
And of course, who among us is *not* a "liar", since we all lie from
time to time.

Lying involves intent to deceive. I haven't been following jmfauth's
claims since they are not of interest to me, but going back and quickly
looking at the posts that triggered the "liar" and "idiot" posts, I
did not see anything that made me think that jmfauth was not sincere
in his beliefs. Being wrong and being sincere are not exclusive.
Nor did Steven even try to justify the "liar" claim. As to Mark
Lawrence, that seemed like a pure "I don't like you" insult whose
proper place is /dev/null.

Even if the odds are 80% that the person is lying, why risk your
own credibility by making a nearly impossible to substantiate claim?
Someone may praise some company's product constantly online and be
discovered to be a salesperson at that company. Most of the time
you would be right to accuse the person of dishonesty. But I knew
a person who was very young and naive, who really believed in the
product and truly didn't see anything wrong in doing that. That
doesn't make it good behavior but those who claimed he was hiding
his identity for personal gain were wrong (at least as far as I
could tell, knowing the person personally.) Just post the facts
and let people draw their own conclusions; that's better than making
aggressive and offensive claims that can never be proven.

Calling people liars or idiots not only damages the reputation of
the Python community in general [*1] but hurts your own credibility
as well, since any sensible reader will wonder if other opinions
you post are more influenced by your emotions than by your intelligence.

Correct. Do you not agree?

Don't ask me, ask Steven. He was the one who wrote two sentences
earlier, "...we want a...community where everyone is welcome."

I'll snip the rest of your post because it is your opinions
and I've already said why I disagree. Most people are smart enough
to make their own evaluations of posters here and if they are not,
and reject python based on what they read from a single poster
who obviously has "strong" views, then perhaps that's for the
best. That possibility (which I think is very close to zero) is
a tiny price to pay to avoid all the hostility and noise.

----
[*1] See for example the blog post at
http://joepie91.wordpress.com/2013/02/19/the-python-documentation-is-bad-and-you-should-feel-bad/
which was recently discussed in this list and in which the
author wrote, "the community around Python is one of the most
hostile and unhelpful communities around any programming-related
topic that I have ever seen".
 
Ethan Furman

I don't get your point.
I was pointing out the fallacy in Steven's logic (which you cut).
How is your statement relevant to that?

Ah. I thought you were saying that in all cases helpful strong language would be even more helpful if less strong.

Of course not. People forget what they posted previously, change
their mind, don't express what they intended perfectly, sometimes
express a complex thought that the reader inaccurately perceives
as contradictory, don't realize themselves that their thinking
is contradictory, ...

I agree, which is why I resisted my own impulse to call him a liar; however, he has been harping on this subject for
months now, so I would be surprised if he actually was surprised and had forgotten...

Lying involves intent to deceive. I haven't been following jmfauth's
claims since they are not of interest to me, but going back and quickly
looking at the posts that triggered the "liar" and "idiot" posts, I
did not see anything that made me think that jmfauth was not sincere
in his beliefs. Being wrong and being sincere are not exclusive.
Nor did Steven even try to justify the "liar" claim. As to Mark
Lawrence, that seemed like a pure "I don't like you" insult whose
proper place is /dev/null.

After months of jmf's antagonistic posts, I don't blame them.
Don't ask me, ask Steven. He was the one who wrote two sentences
earlier, "...we want a...community where everyone is welcome."

Ah, right -- missed that!
 
jmfauth

------

Neil Hodgson:

"The counter-problem is that a French document that needs to include
one mathematical symbol (or emoji) outside Latin-1 will double in size
as a Python string."

Serious developers/typographers/users know that you can not compose
a text in French with "latin-1". This is now also the case with
German (Germany).

---

Neil's comment is correct:

>>> sys.getsizeof('a' * 1000 + 'ẞ')
2040

This is not really the problem. "Serious users" may
notice sooner or later, Python and Unicode are walking in
opposite directions (technically and in spirit).
>>> timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1088995672090292, 1.0842266613261913, 1.1010779011941594]
>>> timeit.repeat("'a' * 1000 + 'z'")
[0.6362570846925735, 0.6159128762502917, 0.6200501673623791]


(Just an opinion)

jmf
 
Steven D'Aprano

This is not really the problem. "Serious users" may notice sooner or
later, Python and Unicode are walking in opposite directions
(technically and in spirit).
>>> timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1088995672090292, 1.0842266613261913, 1.1010779011941594]
>>> timeit.repeat("'a' * 1000 + 'z'")
[0.6362570846925735, 0.6159128762502917, 0.6200501673623791]

Perhaps you should stick to Python 3.2, where ASCII strings are no faster
than non-ASCII strings.


Python 3.2 versus Python 3.3, no significant difference:

# 3.2
py> timeit.repeat("'a' * 1000 + 'ẞ'")
[1.7418999671936035, 1.7198870182037354, 1.763346004486084]

# 3.3
py> timeit.repeat("'a' * 1000 + 'ẞ'")
[1.8083378580026329, 1.818592812011484, 1.7922867869958282]



Python 3.2, ASCII vs Non-ASCII:

py> timeit.repeat("'a' * 1000 + 'z'")
[1.756322135925293, 1.8002049922943115, 1.721085958480835]
py> timeit.repeat("'a' * 1000 + 'ẞ'")
[1.7209150791168213, 1.7162668704986572, 1.7260780334472656]



In other words, if you stick to non-ASCII strings, Python 3.3 is no
slower than Python 3.2.
 
Mark Lawrence

------

Neil Hodgson:

"The counter-problem is that a French document that needs to include
one mathematical symbol (or emoji) outside Latin-1 will double in size
as a Python string."

Serious developers/typographers/users know that you can not compose
a text in French with "latin-1". This is now also the case with
German (Germany).

---

Neil's comment is correct:

>>> sys.getsizeof('a' * 1000 + 'ẞ')
2040

This is not really the problem. "Serious users" may
notice sooner or later, Python and Unicode are walking in
opposite directions (technically and in spirit).
>>> timeit.repeat("'a' * 1000 + 'ẞ'")
[1.1088995672090292, 1.0842266613261913, 1.1010779011941594]
>>> timeit.repeat("'a' * 1000 + 'z'")
[0.6362570846925735, 0.6159128762502917, 0.6200501673623791]


(Just an opinion)

jmf

I'm feeling very sorry for this horse, it's been flogged so often it's
down to bare bones.
 
rusi

I'm feeling very sorry for this horse, it's been flogged so often it's
down to bare bones.

While I am now joining the camp of those fed up with jmf's whining, I
do wonder if we are shooting the messenger…

From a recent Roy mysqldb-unicode thread:
My unicode-fu is a bit weak. Are we looking at a Python problem, a
MySQLdb problem, or a problem with the underlying MySQL server? We've
certainly inserted utf-8 data before without any problems. It's
possible this is the first time we've tried to handle a character
outside the BMP. :
:
OK, that leads to the next question. Is there any way I can (in Python
2.7) detect when a string is not entirely in the BMP? If I could find
all the non-BMP characters, I could replace them with U+FFFD
(REPLACEMENT CHARACTER) and life would be good (enough).

Steven's:
But it means that if you're one of the 99.9% of users who mostly use
characters in the BMP, …

And from http://www.tlg.uci.edu/~opoudjis/unicode/unicode_astral.html
The informal name for the supplementary planes of Unicode is "astral planes", since
(especially in the late '90s) their use seemed to be as remote as
the theosophical "great beyond". …
As of this writing for instance, Dreamweaver MX for MacOSX (which I am currently using
to prepare this) will let you paste BMP text into its WYSIWYG window; but pasting
Supplementary Plane text there will make it crash.

So I really wonder: Is python losing more by supporting SMP with
performance hit on BMP?
The problem as I see it is that a choice that is sufficiently skew is
no more a straightforward choice. An example will illustrate:

I can choose to drive or not -- a choice.
Statistics tell me that on average there are 3 fatalities every day; I
am very concerned that I could get killed so I choose not to drive.
Which neglects that there are a couple of million safe-drives at the
same time as the '3 fatalities'

[What, if anything, this has to do with jmf's rants I don't know, because
I don't know if anyone (including jmf) knows what he is ranting about.]
 
Ian Kelly

So I really wonder: Is python losing more by supporting SMP with
performance hit on BMP?

I don't believe so. Although performance is undeniably worse for some
benchmarks, it is also better for some others. Nobody has yet
demonstrated an actual, real-world program that is affected negatively
by the Unicode change. All of jmf's complaints amount to
cherry-picking data and leaping to conclusions.
 
Chris Angelico

So I really wonder: Is python losing more by supporting SMP with
performance hit on BMP?

If your strings fit entirely within the BMP, then you should see no
penalty compared to previous versions of Python. If they happen to fit
inside ASCII, then there may well be significant improvements. But
regardless, what you gain is the ability to work with *any* string,
regardless of its content, without worrying about it. You can count
characters regardless of their content. Imagine if a tuple of integers
behaved differently if some of those integers flipped to being long
ints:

x = (1, 2, 4, 8, 1<<30, 1<<300, 1<<10)

Wouldn't you be surprised if len(x) returned 8? I certainly would be.
And that's what a narrow build of Python does with Unicode.

Unicode strings are approximately comparable to tuples of integers. In
fact, they can be interchanged fairly readily:

string = "Treble clef: \U0001D11E"
array = tuple(map(ord,string))
assert(len(array) == 14)
out_string = ''.join(map(chr,array))
assert(out_string == string)

This doesn't work in Python 2.6 on Windows, partly because of
surrogates, but also because chr() isn't designed for Unicode strings.
There's probably a solution to the second, but not really to the
first. The tuple of ords should match the way the characters are laid
out to a human.

ChrisA
 
Steven D'Aprano

While I am now joining the camp of those fed up with jmf's whining, I do
wonder if we are shooting the messenger…

No. The trouble is that the messenger is shouting that the Unicode world
is ending on December 21st 2012, and hasn't noticed that was over three
months ago and the world didn't end.


[...]
Of course you can do this, but you should not. If your input data
includes character C, you should deal with character C and not just throw
it away unnecessarily. That would be rude, and in Python 3.3 it should be
unnecessary.

Although, since the person you are quoting is stuck in Python 2.7, it may
be less bad than having to deal with potentially broken Unicode strings.

Steven's:

Yes. "Mostly" does not mean exclusively, and given (say) a billion
computer users, that leaves about a million users who have significant
need for non-BMP characters.

If you don't agree with my estimate, feel free to invent your own :)


That was nearly two decades ago. Two decades ago, the idea that the
entire computing world could standardize on a single character set,
instead of having to deal with dozens of different "code pages", seemed
as likely as people landing on the Moon seemed in 1940.

Today, the entire computing world has standardized on such a system,
"code pages" (encodings) are mostly only needed for legacy data and
shitty applications, but most implementations don't support the entire
Unicode range. A couple of programming languages, including Pike and
Python, support Unicode fully and correctly. Pike has never had the same
high-profile as Python, but now that Python can support the entire
Unicode range without broken surrogate support, maybe users of other
languages will start to demand the same.

So I really wonder: Is python losing more by supporting SMP with
performance hit on BMP?

No.

As many people have demonstrated, both with code snippets and whole-
program benchmarks, Python 3.3 is *as fast* or *faster* than Python 3.2
narrow builds. In practice, Python 3.3 saves enough memory by using
sensible string implementations that real world software is faster in
Python 3.3 than in 3.2.
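
The size classes are easy to see from Python itself (a sketch; the
exact byte counts vary by version and platform):

import sys

# PEP 393 stores 1, 2 or 4 bytes per code point, chosen by the
# widest character present in the string.
for s in ["abcd", "abc\xe9", "abc\u20ac", "abc\U0001D11E"]:
    print(ascii(s), sys.getsizeof(s))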

The problem as I see it is that a choice that is sufficiently skew is no
more a straightforward choice. An example will illustrate:

I can choose to drive or not -- a choice. Statistics tell me that on
average there are 3 fatalities every day; I am very concerned that I
could get killed so I choose not to drive. Which neglects that there are
a couple of million safe-drives at the same time as the '3 fatalities'

Clear as mud. What does this have to do with supporting Unicode?
 
Roy Smith

Steven D'Aprano said:
[...]
Of course you can do this, but you should not. If your input data
includes character C, you should deal with character C and not just throw
it away unnecessarily. That would be rude, and in Python 3.3 it should be
unnecessary.

The import job isn't done yet, but so far we've processed 116 million
records and had to clean up four of them. I can live with that.
Sometimes practicality trumps correctness.

It turns out, the problem is that the version of MySQL we're using
doesn't support non-BMP characters. Newer versions do (but you have to
declare the column to use the utf8mb4 character set). I could upgrade
to a newer MySQL version, but it's just not worth it.

Actually, I did try spinning up a 5.5 instance (one of the nice things
of being in the cloud) and experimented with that, but couldn't get it
to work there either. I'll admit that I didn't invest a huge amount of
effort to make that work before just writing this:

def bmp_filter(self, s):
    """Filter a unicode string to remove all non-BMP (basic
    multilingual plane) characters. All such characters are
    replaced with U+FFFD (Unicode REPLACEMENT CHARACTER).

    """
    if all(ord(c) <= 0xffff for c in s):
        return s
    else:
        self.logger.warning("making %r BMP-clean", s)
        bmp_chars = [(c if ord(c) <= 0xffff else u'\ufffd')
                     for c in s]
        return ''.join(bmp_chars)
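
A hypothetical harness for the filter above (it assumes a wide build
or Python 3, where an astral character is a single code point; on a
narrow 2.x build non-BMP characters arrive as surrogate pairs whose
ord() values this check cannot see):

import logging
logging.basicConfig()

class Cleaner(object):
    logger = logging.getLogger("import")

Cleaner.bmp_filter = bmp_filter    # attach the function above as a method

c = Cleaner()
assert c.bmp_filter(u"plain text") == u"plain text"        # untouched
assert c.bmp_filter(u"clef \U0001D11E") == u"clef \ufffd"  # warned, replaced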
 
