Performance of int/long in Python 3

J

jmfauth

I'm not following your grammar perfectly here, but if Python were
implementing Unicode correctly, there would be no difference between
any of those characters, which is the way a *wide* build works. With a
narrow build, there is a difference between BMP and non-BMP
characters.

ChrisA

--------

The wide build (which I never used) is in my mind as correct as
the narrow build. It "just" covers a different range in Unicode
(the whole range).

Claiming that the narrow build is buggy because it does not
cover the whole of Unicode is not correct.

Unicode does not stipulate that one has to cover the whole range.
Unicode expects that every character in a range behaves the same
way. This is clearly not realized with the flexible string
representation. A user should not be somehow penalized
simply because they are not an ASCII user.

If you take fonts into consideration (btw a problem nobody
is speaking about) and you ensure your application, toolkit, ...
is MES-X or WGL4 compliant, you are also deliberately (and
correctly) working with a restricted Unicode range.

jmf
 
B

Benjamin Kaplan

-----

You know, we can discuss this ad nauseam. What is important
is Unicode.

You have transformed Python back into an ASCII-oriented product.

If Python had implemented Unicode correctly, there would
be no difference in using an "a", "é", "€" or any character,
which is what the narrow builds did.

If I am practically the only one who speaks/discusses
this, I can assure you, this has been noticed.

Now, it's time to prepare the asparagus, the "jambon cru"
and a good bottle of dry white wine.

jmf
You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.
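
As an illustration of points 6 and 7, a minimal interpreter sketch
(assuming a Python 2.x narrow build on one side and CPython 3.3+ on the
other):

    # Python 2.x narrow build (or Java/.NET): the API exposes UTF-16 code units
    >>> len(u"\U0001F435")    # U+1F435, a character outside the BMP
    2

    # CPython 3.3+ (PEP 393): one character, however it is stored internally
    >>> len("\U0001F435")
    1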
 
E

Ethan Furman

For someone who delights in pointing out the logical errors
of others, you are often remarkably sloppy in your own logic.

Of course language can be both helpful and excessively strong.
That is the case when language less strong would be
equally or more helpful.

It can also be the case when language less strong would be useless.

Further, "liar" is both so non-objective and so pejoratively
emotive that it is a word much more likely to be used by
someone interested in trolling than in a serious discussion,
so most sensible people here likely would not bite.

Non-objective? If today poster B says X, and tomorrow poster B says s/he was unaware of X until just now, is not "liar"
a reasonable conclusion?

I hope so too but it is likely that some people want a place
to develop and assert some sense of influence, engage in verbal
duels, instigate arguments, etc. That can be true of regulars
here as well as drive-by posters.


In other words, everyone is NOT welcome.

Correct. Do you not agree?

Where those terms are defined by you and a handful of other
voracious posters. "Troll" in particular is often used to
mean someone who disagrees with the borg mind here, or who
says anything negative about Python, or who, due to attitude or
lack of full English fluency, does not express themselves in
a sufficiently submissive way.

I cannot speak for the borg mind, but for myself a troll is anyone who continually posts rants (such as RR & XL) or who
continuously hijacks threads to talk about their pet peeve (such as jmf).

No, we disagree on who fits those definitions and even
how tolerant we are to those who do fit the definitions.
The policing that you and a handful of other self-appointed
net-cops try to do is far more obnoxious than the original
posts are.

I completely disagree, and I am grateful to those who bother to take the time to continually point out the errors from
those posters and to warn newcomers that those posters should not be believed.

Believe it or not, most of the rest of us here are smart enough to
form our own opinions of such posters without you and the other
c.l.p truthsquad members telling us what to think.

If one of my first few posts on c.l.p netted a response from a troll, I would greatly appreciate a reply from one of the
regulars saying that it was a troll, so I didn't waste time trying to use whatever they said, or be concerned that the
language I was trying to use and learn was horribly flawed.

If the "truthsquad" posts are so offensive to you, why don't you kill-file them?
 
J

jmfauth

You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.

----------


I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only this, you
are penalized.

I'm sorry, saying Python now just covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ...

jmf
 
J

jmfauth

You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:
1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.
2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.
3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.
4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).
5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.
6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.
7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.

----------

I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only this, you
are penalized.

I'm sorry, saying Python now just covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ...

jmf

-----

Addendum.

And you know what? Py34 will suffer from the same disease.
You are spending your time improving chunks of bytes,
when the problem is elsewhere.
In fact you are working for peanuts, e.g. the replace method.


If you are not satisfied with my examples, just pick up
GvR's examples (ascii-string) on the bug tracker, "timeit"
them and you will see there is already a problem.

Better, "timeit" them after having replaced his ASCII strings
with non-ASCII characters...

jmf
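
A minimal timeit sketch along these lines (the strings and counts are
illustrative placeholders, not GvR's actual tracker examples; the euro
sign merely forces a 2-bytes-per-code-point representation under PEP 393):

    import timeit

    stmt = "s.replace('d', 'x')"
    setup_ascii = "s = 'abcdefgh' * 1000"        # ASCII: 1 byte per code point
    setup_bmp   = "s = 'abcdefg\u20ac' * 1000"   # euro sign: 2 bytes per code point

    print(timeit.timeit(stmt, setup_ascii, number=10000))
    print(timeit.timeit(stmt, setup_bmp, number=10000))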

 
C

Chris Angelico

The wide build (which I never used) is in my mind as correct as
the narrow build. It "just" covers a different range in Unicode
(the whole range).

Actually it does; it covers all of the Unicode range, by using
(effectively) UTF-16. Characters that cannot be represented in one
16-bit number are represented in two. That's not "just" covering a
different range. It's being buggy. And it's creating a way for code to
unexpectedly behave fundamentally differently on Windows and Linux
(since the most common builds for Windows were narrow and for Linux
were wide). This is a Bad Thing for Python.

ChrisA
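
A short sketch of what those "two 16-bit numbers" look like, assuming
CPython 3.5+ (for bytes.hex()): encoding a non-BMP character as UTF-16
shows the surrogate pair that a narrow build stored and exposed for it.

    >>> "\U0001F435".encode('utf-16-be').hex()      # big-endian UTF-16 bytes
    'd83ddc35'                                      # surrogate pair D83D DC35
    >>> len("\U0001F435".encode('utf-16-be')) // 2  # number of 16-bit code units
    2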
 
M

MRAB

You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.

----------


I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only this, you
are penalized.

I'm sorry, saying Python now just covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ...
If you're that concerned about it, why don't you modify the source code so
that the string representation chooses between only 2 bytes and 4 bytes per
codepoint, and then see whether you prefer that situation. How do
the memory usage and speed compare?
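
A rough sketch of the memory half of that measurement, without patching
anything, assuming CPython 3.3+ (absolute sizes vary by platform and
version):

    import sys

    for s in ('a' * 1000,              # ASCII        -> 1 byte per code point
              '\u00e9' * 1000,         # Latin-1 'é'  -> 1 byte per code point
              '\u20ac' * 1000,         # BMP-only '€' -> 2 bytes per code point
              '\U0001F435' * 1000):    # non-BMP      -> 4 bytes per code point
        print(len(s), sys.getsizeof(s))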
 
B

Benjamin Kaplan

You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.

----------


I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only this, you
are penalized.

I'm sorry, saying Python now just covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ...

jmf

By that logic, we should all be using ASCII because it's "correct" for
the 128 characters that I (as an English speaker) use, and therefore
it's all that we should care about. I don't care if é counts as two
characters; it's faster and more memory efficient for all of my
strings to just count bytes. There are certain domains where
characters outside the basic multilingual plane are used. Python's job
is to be correct in all of those circumstances, not just the ones you
care about.
 
8

88888 Dihedral

Chris Angelico wrote on Thursday, 28 March 2013 at 11:40:17 AM UTC+8:
Has anybody else thought that [jmf's] last few responses are starting to sound

Yes, I did wonder. It's like he and Dihedral have been trading
accounts sometimes. Hey, Dihedral, I hear there's a discussion of
Unicode and PEP 393 and Python 3.3 and Unicode and lots of keywords
for you to trigger on and Python and bots are funny and this text is
almost grammatical!

There. Let's see if he takes the bait.

ChrisA

Well, we need some cheap RAM to hold 4 bytes per character
in a text segment to be observed.

For those not to be observed or shown, the old way still works.

Windows got this job done right to collect taxes in areas
of different languages.
 
T

Terry Reedy

On 3/28/2013 4:26 PM, jmfauth wrote:

Please provide references for your assertions. I have read the Unicode
standard, parts more than once, and your assertions contradict my memory.
Unicode does not stipulate that one has to cover the whole range.

I believe it does. As I remember, the recognized encodings all encode
the entire Unicode codepoint range.
Unicode expects that every character in a range behaves the same
way.

I have no idea what you mean by 'same way'. Each codepoint is supposed
to behave differently in some way. That is the reason for having
multiple codepoints. One causes an 'a' to appear, another a 'b'. Indeed,
the standard defines multiple categories of codepoints, and chars in
different categories are supposed to act differently (or be treated
differently). Glyphic chars versus control chars are one example.
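
For instance, the unicodedata module exposes those categories directly
(a quick illustrative session):

    >>> import unicodedata
    >>> unicodedata.category('a')       # Ll: lowercase letter (a glyphic char)
    'Ll'
    >>> unicodedata.category('\n')      # Cc: control character
    'Cc'
    >>> unicodedata.category('\u20ac')  # Sc: currency symbol
    'Sc'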
 
D

Dennis Lee Bieber

At some point we have to stop being gentle / polite / politically correct and call a shovel a shovel... er, spade.

Call it an Instrument For the Transplantation of Dirt

(Is an antique Steam Shovel ever a Steam Spade?)
 
C

Chris Angelico

Call it an Instrument For the Transplantation of Dirt

(Is an antique Steam Shovel ever a Steam Spade?)

I don't know, but I'm pretty sure there's a private detective who
wouldn't appreciate being called Sam Shovel.

ChrisA
 
M

Mark Lawrence

Call it an Instrument For the Transplantation of Dirt

(Is an antique Steam Shovel ever a Steam Spade?)

Surely you can spade a lot more things than dirt?
 
S

Steven D'Aprano

Even if you personally would prefer someone to respond by calling you a
liar, your personal preferences do not form a basis for desirable
posting behavior here.

Whereas yours apparently do.

Thanks for the feedback, I'll take it under advisement.
 
S

Steven D'Aprano

The only difference for ASCII-only strings is that they are kept in a
struct with a smaller header. The smaller header omits the utf8 pointer
(which optionally points to an additional UTF-8 representation of the
string) and its associated length variable. These are not needed for
ASCII-only strings because an ASCII string can be directly interpreted
as a UTF-8 string for the same result. The smaller header also omits
the "wstr_length" field which, according to the PEP, "differs from
length only if there are surrogate pairs in the representation." For an
ASCII string, of course there would not be any surrogate pairs.


I wonder why they need to care about surrogate pairs?

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings in the SMPs that could need surrogate pairs,
and they don't need them in Python's implementation since it's a full 32-
bit implementation. So where do the surrogate pairs come into this?

I also wonder why the implementation bothers keeping a UTF-8
representation. That sounds like premature optimization to me. Surely you
only need it when writing to a file with UTF-8 encoding? For most
strings, that will never happen.
 
C

Chris Angelico

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings in the SMPs that could need surrogate pairs,
and they don't need them in Python's implementation since it's a full 32-
bit implementation. So where do the surrogate pairs come into this?

PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...
I also wonder why the implementation bothers keeping a UTF-8
representation. That sounds like premature optimization to me. Surely you
only need it when writing to a file with UTF-8 encoding? For most
strings, that will never happen.

... the UTF-8 version. It'll keep it if it has it, and not otherwise. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?

Minor nitpick, btw:
(in which cast wstr_length differs form length)
Should be "in which case" and "from". Who has the power to correct
typos in PEPs?

ChrisA
 
M

MRAB

PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...


... the UTF-8 version. It'll keep it if it has it, and not otherwise. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?
You could ask the same question about any encoding.

It's only an issue if it's passed to a C function which expects a
null-terminated string.
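
A small sketch of that distinction, assuming CPython 3.x: at the Python
level an embedded U+0000 is a character like any other; it only matters
once the encoded bytes reach C code that stops at the first NUL.

    >>> s = 'ab\x00cd'
    >>> len(s)                                 # Python strings may contain NUL
    5
    >>> s.encode('utf-8')                      # NUL encodes like any other character
    b'ab\x00cd'
    >>> s.encode('utf-8').split(b'\x00')[0]    # what a strlen-style C consumer sees
    b'ab'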
 
