Performance of int/long in Python 3

M

Mark Lawrence

jmfauth:
3.2.3 (default, Apr 11 2012, 07:15:24) [MSC v.1500 32 bit (Intel)]
[0.8343414906182101, 0.8336184057396241, 0.8330473419738562]
3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 10:55:48) [MSC v.1600 32 bit
[1.3840254166697845, 1.3933888932429768, 1.391664674507438]

That's a larger performance decrease than the 64-bit version.

Reported the issue as
http://bugs.python.org/issue17615

Neil

FTR this has been closed as fixed see
http://bugs.python.org/issue17615#msg185862
 
C

Chris Angelico

This has to inspect the entire string, no? I posted (essentially) this
a few days ago:

if all(ord(c) <= 0xffff for c in s):
return "it's all bmp"
else:
return "it's got astral crap in it"

I'm reasonably sure all() is smart enough to stop at the first False
value.

Probably, but it still has to scan the body of the string. It'd not be
too bad if it's all astral, but if it's all BMP, it has to scan the
whole string. In the max() case, it has to scan the whole string
anyway, as there's no other way to determine the maximum. I'm thinking
here of this function:

http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/width.html

It's implemented as a simple lookup into the header. (Pike strings,
like PEP 393 strings, are stored in the most compact way possible - 1,
2, or 4 bytes per character - with a conceptually similar header
structure.) Is this something that would be worth having available?
Should I post an issue about it?

ChrisA

more for self-ref than anyone else's: source of Pike's String.width():
http://pike-git.lysator.liu.se/gitweb.cgi?p=pike.git;a=blob;f=src/builtin.cmod;hb=HEAD#l1077
 
R

Roy Smith

Steven D'Aprano said:
I seem to recall that "sort relies only on < operator" is a language
promise, but I can't seem to find it documented anywhere official.

That's pretty typical for sort implementations in all languages. Except
for those which rely on "less than and equal to" :)
 
S

Steven D'Aprano

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]
This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

def char_size(s):
for n in map(ord, s):
if n > 0xFFFF: return 4
if n > 0xFF: return 2
return 1


I posted (essentially) this a few days ago:

if all(ord(c) <= 0xffff for c in s):
return "it's all bmp"
else:
return "it's got astral crap in it"


It's not "astral crap". People use it, and they'll use it more in the
future. Just because you don't, doesn't give you leave to make
disparaging remarks about it.

Honestly, it's really painful to see how history repeats itself:

"Bah humbug, why do we need to support the SMP astral crap? The Unicode
BMP is more than enough for everybody."

"Bah humbug, why do we need to support Unicode crap? Latin1 is more than
enough for everybody."

"Bah humbug, why do we need to support Latin1 crap? ASCII is more than
enough for everybody."

"Bah humbug, why do we need to support ASCII crap? Uppercase A-Z is more
than enough for everybody."

Seriously. Go back long enough, to the telegraph days, and you have
people arguing that there was no need for upper and lower case letters.


I'm reasonably sure all() is smart enough to stop at the first False
value.

Yes, all() and any() are guaranteed to be short-circuit functions. They
will stop as soon as they see a False or a True value respectively.
 
S

Steven D'Aprano

Probably, but it still has to scan the body of the string. It'd not be
too bad if it's all astral, but if it's all BMP, it has to scan the
whole string. In the max() case, it has to scan the whole string anyway,
as there's no other way to determine the maximum. I'm thinking here of
this function:

http://pike.lysator.liu.se/generated/manual/modref/ex/7.2_3A_3A/String/ width.html

It's implemented as a simple lookup into the header. (Pike strings, like
PEP 393 strings, are stored in the most compact way possible - 1, 2, or
4 bytes per character - with a conceptually similar header structure.)
Is this something that would be worth having available? Should I post an
issue about it?

I'm not really sure why I would want to know, apart from pure
intellectual curiosity, but sure, post a feature request. Be sure to
mention that Pike supports this feature.
 
R

rusi

This has to inspect the entire string, no? ?I posted (essentially) this
a few days ago:

? ? ? ?if all(ord(c) <= 0xffff for c in s):
? ? ? ? ? ? return "it's all bmp"
? ? ? ? else:
? ? ? ? ? ? return "it's got astral crap in it"

Astral crap? CRAP?
Verily sir I am offended!

You dont play with Mahjong characters? How crude!
You dont know about cuneiform? How illiterate!
You dont compose poetry with Egyptian hieroglyphs? How rude!
Shavian has not reformed you? How backward!

In short you are a complete philistine
No? On second thoughts I take that back. For all we know philistine
may be one of the blessings of the Unicode gods?
So following the ilustrious example of jmf, I shall pronounce upon you
the ultimate curse:

You are American!
 
I

Ian Kelly

I'm also puzzled. I thought that the sort algorithm used a hash of all the
items to be sorted, and only reverted to a raw comparison of the original
values when the hash collided. Is that not the case? Or is the code you
post here only used when the hash collides?

I think you are mistaken, because I don't see how that could work. If
the hashes of two items are different then you can assume they are not
equal, but sorting requires a partial ordering comparison, not simply
an equality comparison. You cannot determine which item is less or
greater than the other from the hash values alone.
 
I

Ian Kelly

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]
This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

def char_size(s):
for n in map(ord, s):
if n > 0xFFFF: return 4
if n > 0xFF: return 2
return 1

That's an incorrect implementation, as it would return 2 at the first
non-Latin-1 BMP character, even if there were SMP characters later in the
string. It's only safe to short-circuit return 4, not 2 or 1.
 
I

Ian Kelly

(sys.getsizeof(s) - sys.getsizeof(''))/len(s)
3.0

I didn't know there was a 3-byte-width representation. :)

More seriously, it fails because '' is ASCII and s is not, and the
overhead for the two strings is different.
 
E

Ethan Furman

Astral crap? CRAP?
Verily sir I am offended!

You dont play with Mahjong characters? How crude!
You dont know about cuneiform? How illiterate!
You dont compose poetry with Egyptian hieroglyphs? How rude!
Shavian has not reformed you? How backward!

In short you are a complete philistine
No? On second thoughts I take that back. For all we know philistine
may be one of the blessings of the Unicode gods?
So following the ilustrious example of jmf, I shall pronounce upon you
the ultimate curse:

You are American!

LOL!
 
S

Steven D'Aprano

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]
n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

def char_size(s):
for n in map(ord, s):
if n > 0xFFFF: return 4
if n > 0xFF: return 2
return 1

That's an incorrect implementation, as it would return 2 at the first
non-Latin-1 BMP character, even if there were SMP characters later in
the string. It's only safe to short-circuit return 4, not 2 or 1.


Doh!

I mean, well done sir, you have successfully passed my little test!
 
D

Dave Angel

I think you are mistaken, because I don't see how that could work. If
the hashes of two items are different then you can assume they are not
equal, but sorting requires a partial ordering comparison, not simply
an equality comparison. You cannot determine which item is less or
greater than the other from the hash values alone.

You are of course correct. The particular data that Neil had provided
might well have had many duplicates, but that won't be the typical case,
so there's not much point in doing an unordered hash. I guess I was
confusing it with the key= argument for modifying sort order, where the
key function might replace a slow-to-compare data type with something
faster.
 
C

Chris Angelico

On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]
n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

def char_size(s):
for n in map(ord, s):
if n > 0xFFFF: return 4
if n > 0xFF: return 2
return 1

That's an incorrect implementation, as it would return 2 at the first
non-Latin-1 BMP character, even if there were SMP characters later in
the string. It's only safe to short-circuit return 4, not 2 or 1.


Doh!

I mean, well done sir, you have successfully passed my little test!

Try this:

def str_width(s):
width=1
for ch in map(ord,s):
if ch > 0xFFFF: return 4
if cn > 0xFF: width=2
return width

ChrisA
 
M

Mark Lawrence

43:06 -0400, Roy Smith wrote:

[...]
n = max(map(ord, s))
4 if n > 0xffff else 2 if n > 0xff else 1

This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

def char_size(s):
for n in map(ord, s):
if n > 0xFFFF: return 4
if n > 0xFF: return 2
return 1

That's an incorrect implementation, as it would return 2 at the first
non-Latin-1 BMP character, even if there were SMP characters later in
the string. It's only safe to short-circuit return 4, not 2 or 1.


Doh!

I mean, well done sir, you have successfully passed my little test!

Try this:

def str_width(s):
width=1
for ch in map(ord,s):
if ch > 0xFFFF: return 4
if cn > 0xFF: width=2
return width

ChrisA

Given the quality of some code posted here recently this patch can't be
accepted until there are some unit tests :)
 
R

Roy Smith

rusi said:
This has to inspect the entire string, no? ?I posted (essentially) this
a few days ago:

? ? ? ?if all(ord(c) <= 0xffff for c in s):
? ? ? ? ? ? return "it's all bmp"
? ? ? ? else:
? ? ? ? ? ? return "it's got astral crap in it"

Astral crap? CRAP?
Verily sir I am offended!
[...]
You are American!

This is true.

But, to be fair, in the (I don't have the exact number here) roughly 200
million records in our recent big data import job, I found exactly FOUR
strings with astral characters. Which boiled down to two versions of
each of two different song titles.

One had a Unicode Character 'BALLOON' (U+1F388). The other had some
heart symbol (sorry, I don't remember the exact code point). These
hardly seem a matter of national pride.

And, if you don't believe there is astral crap, how do you explain
U+1F4A9?
 
R

Roy Smith

Steven D'Aprano said:
On Wed, 03 Apr 2013 09:43:06 -0400, Roy Smith wrote:

[...]
This has to inspect the entire string, no?

Correct. A more efficient implementation would be:

def char_size(s):
for n in map(ord, s):
if n > 0xFFFF: return 4
if n > 0xFF: return 2
return 1


I posted (essentially) this a few days ago:

if all(ord(c) <= 0xffff for c in s):
return "it's all bmp"
else:
return "it's got astral crap in it"


It's not "astral crap". People use it, and they'll use it more in the
future. Just because you don't, doesn't give you leave to make
disparaging remarks about it.

Honestly, it's really painful to see how history repeats itself:

"Bah humbug, why do we need to support the SMP astral crap? The Unicode
BMP is more than enough for everybody."

Come on, guys. It was a joke. I'm the guy who was complaining that my
database doesn't support non-BMP, remember?
 
C

Chris Angelico

04.04.13 00:57, Chris Angelico ???????(??):


See also the discussion at
http://comments.gmane.org/gmane.comp.python.ideas/15640 . I agree with
rejection. This is an implementation detail and different Python
implementations (including future CPython versions) can have different
internal string implementations.

I really don't see why this means that there can't be a function in
sys, or something. I mean, other Pythons aren't expected to return the
exact same values from sys.getsizeof, are they? But clearly the weight
of opinion is against me, so fine, I don't care that much.

ChrisA
 
E

Ethan Furman

I really don't see why this means that there can't be a function in
sys, or something. I mean, other Pythons aren't expected to return the
exact same values from sys.getsizeof, are they?

What it boils down to is:

- it can easily be done by hand now
- it's a very uncommon need

ergo:

- it's not worth the time and on-going effort required
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,764
Messages
2,569,567
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top