Performance of int/long in Python 3

Mark Lawrence · Apr 3, 2013

jmfauth:

3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
[0.8343414906182101, 0.8336184057396241, 0.8330473419738562]
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit
[1.3840254166697845, 1.3933888932429768, 1.391664674507438]

That's a larger performance decrease than the 64-bit version.

Reported the issue as
http://bugs.python.org/issue17615

Neil

FTR, this has been closed as fixed; see
http://bugs.python.org/issue17615#msg185862

Chris Angelico · Apr 3, 2013

This has to inspect the entire string, no? I posted (essentially) this
a few days ago:

    if all(ord(c) <= 0xffff for c in s):
        return "it's all bmp"
    else:
        return "it's got astral crap in it"

I'm reasonably sure all() is smart enough to stop at the first False
value.

Probably, but it still has to scan the body of the string. It'd not be
too bad if it's all astral, but if it's all BMP, it has to scan the
whole string. In the max() case, it has to scan the whole string
anyway, as there's no other way to determine the maximum. I'm thinking
here of this function:

http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/width.html

It's implemented as a simple lookup into the header. (Pike strings,
like PEP 393 strings, are stored in the most compact way possible - 1,
2, or 4 bytes per character - with a conceptually similar header
structure.) Is this something that would be worth having available?
Should I post an issue about it?

ChrisA

More for my own reference than anyone else's, the source of Pike's String.width():
http://pike-git.lysator.liu.se/gitweb.cgi?p=pike.git;a=blob;f=src/builtin.cmod;hb=HEAD#l1077

Roy Smith · Apr 3, 2013

Steven D'Aprano said:
I seem to recall that "sort relies only on < operator" is a language
promise, but I can't seem to find it documented anywhere official.

That's pretty typical for sort implementations in all languages. Except
for those which rely on "less than or equal to".

Steven D'Aprano · Apr 3, 2013

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF: return 4
            if n > 0xFF: return 2
        return 1

I posted (essentially) this a few days ago:

    if all(ord(c) <= 0xffff for c in s):
        return "it's all bmp"
    else:
        return "it's got astral crap in it"

It's not "astral crap". People use it, and they'll use it more in the
future. Just because you don't, doesn't give you leave to make
disparaging remarks about it.

Honestly, it's really painful to see how history repeats itself:

"Bah humbug, why do we need to support the SMP astral crap? The Unicode
BMP is more than enough for everybody."

"Bah humbug, why do we need to support Unicode crap? Latin1 is more than
enough for everybody."

"Bah humbug, why do we need to support Latin1 crap? ASCII is more than
enough for everybody."

"Bah humbug, why do we need to support ASCII crap? Uppercase A-Z is more
than enough for everybody."

Seriously. Go back long enough, to the telegraph days, and you have
people arguing that there was no need for upper and lower case letters.

I'm reasonably sure all() is smart enough to stop at the first False
value.

Yes, all() and any() are guaranteed to be short-circuit functions. They
will stop as soon as they see a False or a True value respectively.
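
A quick way to see the short-circuiting for yourself is to count how many characters actually get inspected. This is only a sketch; the helper names count_checked and is_bmp are made up for illustration.

```python
def count_checked(s):
    """Return (all_bmp, number of characters all() actually looked at)."""
    seen = 0

    def is_bmp(c):
        nonlocal seen
        seen += 1
        return ord(c) <= 0xFFFF

    result = all(is_bmp(c) for c in s)
    return result, seen


# Astral character first: all() stops after a single check.
print(count_checked("\U0001F388" + "a" * 1000))   # (False, 1)
# All BMP: every character has to be checked.
print(count_checked("a" * 1000))                  # (True, 1000)
```

So the early exit only helps when an astral character turns up; an all-BMP string is still scanned from end to end.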

Steven D'Aprano · Apr 3, 2013

Probably, but it still has to scan the body of the string. It'd not be
too bad if it's all astral, but if it's all BMP, it has to scan the
whole string. In the max() case, it has to scan the whole string anyway,
as there's no other way to determine the maximum. I'm thinking here of
this function:

http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/width.html

It's implemented as a simple lookup into the header. (Pike strings, like
PEP 393 strings, are stored in the most compact way possible - 1, 2, or
4 bytes per character - with a conceptually similar header structure.)
Is this something that would be worth having available? Should I post an
issue about it?

I'm not really sure why I would want to know, apart from pure
intellectual curiosity, but sure, post a feature request. Be sure to
mention that Pike supports this feature.

rusi · Apr 3, 2013

This has to inspect the entire string, no? I posted (essentially) this
a few days ago:

    if all(ord(c) <= 0xffff for c in s):
        return "it's all bmp"
    else:
        return "it's got astral crap in it"

Astral crap? CRAP?
Verily sir I am offended!

You don't play with Mahjong characters? How crude!
You don't know about cuneiform? How illiterate!
You don't compose poetry with Egyptian hieroglyphs? How rude!
Shavian has not reformed you? How backward!

In short, you are a complete philistine.
No? On second thoughts I take that back. For all we know philistine
may be one of the blessings of the Unicode gods?
So following the illustrious example of jmf, I shall pronounce upon you
the ultimate curse:

You are American!

Ian Kelly · Apr 3, 2013

I'm also puzzled. I thought that the sort algorithm used a hash of all the
items to be sorted, and only reverted to a raw comparison of the original
values when the hash collided. Is that not the case? Or is the code you
post here only used when the hash collides?

I think you are mistaken, because I don't see how that could work. If
the hashes of two items are different then you can assume they are not
equal, but sorting requires a partial ordering comparison, not simply
an equality comparison. You cannot determine which item is less or
greater than the other from the hash values alone.

Ian Kelly · Apr 3, 2013

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF: return 4
            if n > 0xFF: return 2
        return 1

That's an incorrect implementation, as it would return 2 at the first
non-Latin-1 BMP character, even if there were SMP characters later in the
string. It's only safe to short-circuit return 4, not 2 or 1.
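
To make the failure concrete, here is a small counterexample. char_size is the function as quoted; char_size_max is a hypothetical reference version built on the max() approach from earlier in the thread, not anything official.

```python
def char_size(s):
    # The version quoted above, bug included.
    for n in map(ord, s):
        if n > 0xFFFF: return 4
        if n > 0xFF: return 2
    return 1


def char_size_max(s):
    # Reference answer: scan everything and take the widest character.
    n = max(map(ord, s), default=0)
    return 4 if n > 0xFFFF else 2 if n > 0xFF else 1


s = "\u0100\U0001F388"   # a 2-byte BMP character followed by an astral one
print(char_size(s))      # 2 (wrong: it returned at the first character)
print(char_size_max(s))  # 4
```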

Ian Kelly · Apr 3, 2013

(sys.getsizeof(s) - sys.getsizeof(''))/len(s)

3.0

I didn't know there was a 3-byte-width representation.

More seriously, it fails because '' is ASCII and s is not, and the
overhead for the two strings is different.
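
One way to dodge the differing overhead, as far as I can tell, is to compare two strings of the same kind so that the header cancels out. This leans entirely on CPython's PEP 393 layout, which is an implementation detail, and per_char_bytes is a name invented here.

```python
import sys


def per_char_bytes(s):
    """Estimate bytes per character by differencing two same-kind strings.

    Assumes CPython 3.3+ (PEP 393); other implementations may differ.
    """
    if not s:
        raise ValueError("need a non-empty string")
    # s * 2 and s * 3 contain the same characters, so they should share the
    # same internal representation and header; only the payload differs.
    return (sys.getsizeof(s * 3) - sys.getsizeof(s * 2)) // len(s)


for sample in ("abc", "\u0100bc", "\U0001F388bc"):
    print(ascii(sample), per_char_bytes(sample))
```

On a PEP 393 build I would expect this to report 1, 2 and 4 for the three samples, but treat that as an observation about one implementation rather than a guarantee.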

Ethan Furman · Apr 3, 2013

Astral crap? CRAP?
Verily sir I am offended!

You don't play with Mahjong characters? How crude!
You don't know about cuneiform? How illiterate!
You don't compose poetry with Egyptian hieroglyphs? How rude!
Shavian has not reformed you? How backward!

In short, you are a complete philistine.
No? On second thoughts I take that back. For all we know philistine
may be one of the blessings of the Unicode gods?
So following the illustrious example of jmf, I shall pronounce upon you
the ultimate curse:

You are American!

LOL!

Steven D'Aprano · Apr 3, 2013

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]

n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF: return 4
            if n > 0xFF: return 2
        return 1

That's an incorrect implementation, as it would return 2 at the first
non-Latin-1 BMP character, even if there were SMP characters later in
the string. It's only safe to short-circuit return 4, not 2 or 1.

Doh!

I mean, well done sir, you have successfully passed my little test!

Dave Angel · Apr 3, 2013

I think you are mistaken, because I don't see how that could work. If
the hashes of two items are different then you can assume they are not
equal, but sorting requires a partial ordering comparison, not simply
an equality comparison. You cannot determine which item is less or
greater than the other from the hash values alone.

You are of course correct. The particular data that Neil had provided
might well have had many duplicates, but that won't be the typical case,
so there's not much point in doing an unordered hash. I guess I was
confusing it with the key= argument for modifying sort order, where the
key function might replace a slow-to-compare data type with something
faster.
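
For what it's worth, the key= mechanism computes one key per item and then compares the keys rather than the items. A minimal sketch, using a hypothetical Slow class whose comparison we would rather not call:

```python
class Slow:
    """Hypothetical record type with an expensive comparison."""
    compare_calls = 0

    def __init__(self, name):
        self.name = name

    def __lt__(self, other):
        Slow.compare_calls += 1
        return self.name.lower() < other.name.lower()


items = [Slow(n) for n in ["pear", "Apple", "fig", "Banana"]]

# Plain sort: every comparison goes through Slow.__lt__.
by_lt = sorted(items)
calls_without_key = Slow.compare_calls

# key= sort: one cheap key per item, and the keys are compared instead.
Slow.compare_calls = 0
by_key = sorted(items, key=lambda item: item.name.lower())

print([i.name for i in by_key])   # ['Apple', 'Banana', 'fig', 'pear']
print(calls_without_key, Slow.compare_calls)
```

Note that the keys still have to be orderable; a hash would not do as a key unless you only cared about grouping equal items together.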

Chris Angelico · Apr 3, 2013

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]
n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF: return 4
            if n > 0xFF: return 2
        return 1

That's an incorrect implementation, as it would return 2 at the first
non-Latin-1 BMP character, even if there were SMP characters later in
the string. It's only safe to short-circuit return 4, not 2 or 1.

Doh!

I mean, well done sir, you have successfully passed my little test!

Try this:

    def str_width(s):
        width = 1
        for ch in map(ord, s):
            if ch > 0xFFFF: return 4
            if ch > 0xFF: width = 2
        return width

ChrisA

Chris Angelico · Apr 3, 2013

I'm not really sure why I would want to know, apart from pure
intellectual curiosity, but sure, post a feature request. Be sure to
mention that Pike supports this feature.

http://bugs.python.org/issue17629 opened.

ChrisA

Mark Lawrence · Apr 3, 2013

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]
n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF: return 4
            if n > 0xFF: return 2
        return 1

That's an incorrect implementation, as it would return 2 at the first
non-Latin-1 BMP character, even if there were SMP characters later in
the string. It's only safe to short-circuit return 4, not 2 or 1.

Doh!

I mean, well done sir, you have successfully passed my little test!

Try this:

    def str_width(s):
        width = 1
        for ch in map(ord, s):
            if ch > 0xFFFF: return 4
            if ch > 0xFF: width = 2
        return width

ChrisA

Given the quality of some code posted here recently this patch can't be
accepted until there are some unit tests
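
In that spirit, a rough sketch of what such tests might look like. The function body is the one quoted above with the loop variable spelled consistently; the test class and the cases chosen are my own assumptions about what is worth checking.

```python
import unittest


def str_width(s):
    # The function quoted above, with the loop variable spelled consistently.
    width = 1
    for ch in map(ord, s):
        if ch > 0xFFFF:
            return 4
        if ch > 0xFF:
            width = 2
    return width


class StrWidthTests(unittest.TestCase):
    def test_empty_and_ascii(self):
        self.assertEqual(str_width(""), 1)
        self.assertEqual(str_width("spam"), 1)

    def test_latin1_boundary(self):
        self.assertEqual(str_width("\xff"), 1)
        self.assertEqual(str_width("\u0100"), 2)

    def test_bmp_boundary(self):
        self.assertEqual(str_width("\uffff"), 2)
        self.assertEqual(str_width("\U00010000"), 4)

    def test_astral_after_bmp(self):
        # The case that tripped up the earlier char_size().
        self.assertEqual(str_width("\u0100\U0001F388"), 4)
```

Saved to a file, this should run with python -m unittest.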

Roy Smith · Apr 3, 2013

rusi said:
This has to inspect the entire string, no? I posted (essentially) this
a few days ago:

    if all(ord(c) <= 0xffff for c in s):
        return "it's all bmp"
    else:
        return "it's got astral crap in it"

Astral crap? CRAP?
Verily sir I am offended!
[...]
You are American!

This is true.

But, to be fair, in the (I don't have the exact number here) roughly 200
million records in our recent big data import job, I found exactly FOUR
strings with astral characters. Which boiled down to two versions of
each of two different song titles.

One had a Unicode Character 'BALLOON' (U+1F388). The other had some
heart symbol (sorry, I don't remember the exact code point). These
hardly seem a matter of national pride.

And, if you don't believe there is astral crap, how do you explain
U+1F4A9?

Roy Smith · Apr 3, 2013

Steven D'Aprano said:
On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

    def char_size(s):
        for n in map(ord, s):
            if n > 0xFFFF: return 4
            if n > 0xFF: return 2
        return 1

I posted (essentially) this a few days ago:

    if all(ord(c) <= 0xffff for c in s):
        return "it's all bmp"
    else:
        return "it's got astral crap in it"

It's not "astral crap". People use it, and they'll use it more in the
future. Just because you don't, doesn't give you leave to make
disparaging remarks about it.

Honestly, it's really painful to see how history repeats itself:

"Bah humbug, why do we need to support the SMP astral crap? The Unicode
BMP is more than enough for everybody."

Come on, guys. It was a joke. I'm the guy who was complaining that my
database doesn't support non-BMP, remember?

Serhiy Storchaka · Apr 6, 2013

04.04.13 00:57, Chris Angelico wrote:

http://bugs.python.org/issue17629 opened.

See also the discussion at
http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with
rejection. This is an implementation detail and different Python
implementations (including future CPython versions) can have different
internal string implementations.

Chris Angelico · Apr 6, 2013

04.04.13 00:57, Chris Angelico wrote:

See also the discussion at
http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with
rejection. This is an implementation detail and different Python
implementations (including future CPython versions) can have different
internal string implementations.

I really don't see why this means that there can't be a function in
sys, or something. I mean, other Pythons aren't expected to return the
exact same values from sys.getsizeof, are they? But clearly the weight
of opinion is against me, so fine, I don't care that much.

ChrisA

Ethan Furman · Apr 6, 2013

I really don't see why this means that there can't be a function in
sys, or something. I mean, other Pythons aren't expected to return the
exact same values from sys.getsizeof, are they?

What it boils down to is:

- it can easily be done by hand now
- it's a very uncommon need

ergo:

- it's not worth the time and on-going effort required
