Performance of int/long in Python 3

J

jmfauth

I'm not following your grammar perfectly here, but if Python were
implementing Unicode correctly, there would be no difference between
any of those characters, which is the way a *wide* build works. With a
narrow build, there is a difference between BMP and non-BMP
characters.

ChrisA

--------

The wide build (which I never used) is in my mind as correct as
the narrow build. It "just" covers a different range in Unicode
(the whole range).

Claiming that the narrow build is buggy because it does not
cover the whole of Unicode is not correct.

Unicode does not stipulate that one has to cover the whole range.
Unicode expects that every character in a range behaves the same
way. This is clearly not realized with the flexible string
representation. A user should not be somehow penalized
simply because they are not an ASCII user.

If you take fonts into consideration (btw a problem nobody
is speaking about) and you ensure your application, toolkit, ...
is MES-X or WGL4 compliant, you are also deliberately (and
correctly) working with a restricted Unicode range.

jmf
 
B

Benjamin Kaplan

-----

You know, we can discuss this ad nauseam. What is important
is Unicode.

You have transformed Python back into an ASCII-oriented product.

If Python had implemented Unicode correctly, there would
be no difference in using an "a", "é", "€" or any character,
which is what the narrow builds did.

If I am practically the only one who speaks/discusses
this, I can assure you, this has been noticed.

Now, it's time to prepare the asparagus, the "jambon cru"
and a good bottle of dry white wine.

jmf
You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.
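
As an illustration of points 6 and 7, a minimal interpreter sketch
(assuming a Python 2.x narrow build on one side and CPython 3.3+ on the
other):

    # Python 2.x narrow build (or Java/.NET): the API exposes UTF-16 code units
    >>> len(u"\U0001F435")    # U+1F435, a character outside the BMP
    2

    # CPython 3.3+ (PEP 393): one character, however it is stored internally
    >>> len("\U0001F435")
    1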
 
E

Ethan Furman

For someone who delights in pointing out the logical errors
of others, you are often remarkably sloppy in your own logic.

Of course language can be both helpful and excessively strong.
That is the case when language less strong would be
equally or more helpful.

It can also be the case when language less strong would be useless.

Further, "liar" is both so non-objective and so pejoratively
emotive that it is a word much more likely to be used by
someone interested in trolling than in a serious discussion,
so most sensible people here likely would not bite.

Non-objective? If today poster B says X, and tomorrow poster B says s/he was unaware of X until just now, is not "liar"
a reasonable conclusion?

I hope so too but it is likely that some people want a place
to develop and assert some sense of influence, engage in verbal
duels, instigate arguments, etc. That can be true of regulars
here as well as drive-by posters.


In other words, everyone is NOT welcome.

Correct. Do you not agree?

Where those terms are defined by you and a handful of other
voracious posters. "Troll" in particular is often used to
mean someone who disagrees with the borg mind here, or who
says anything negative about Python, or who, due to attitude or
lack of full English fluency, does not express themselves in
a sufficiently submissive way.

I cannot speak for the borg mind, but for myself a troll is anyone who continually posts rants (such as RR & XL) or who
continuously hijacks threads to talk about their pet peeve (such as jmf).

No, we disagree on who fits those definitions and even
how tolerant we are to those who do fit the definitions.
The policing that you and a handful of other self-appointed
net-cops try to do is far more obnoxious than the original
posts are.

I completely disagree, and I am grateful to those who bother to take the time to continually point out the errors from
those posters and to warn newcomers that those posters should not be believed.

Believe it or not, most of the rest of us here are smart enough to
form our own opinions of such posters without you and the other
c.l.p truthsquad members telling us what to think.

If one of my first few posts on c.l.p netted a response from a troll, I would greatly appreciate a reply from one of the
regulars saying that it was a troll, so I didn't waste time trying to use whatever they said, or be concerned that the
language I was trying to use and learn was horribly flawed.

If the "truthsquad" posts are so offensive to you, why don't you kill-file them?
 
J

jmfauth

You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.

----------


I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only this, you
are penalized.

I'm sorry, saying Python now just covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ...

jmf
 
J

jmfauth

You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:
1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.
2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.
3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.
4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).
5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.
6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.
7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.

----------

I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only this, you
are penalized.

I'm sorry, saying Python now just covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ...

jmf

-----

Addendum.

And you know what? Py34 will suffer from the same disease.
You are spending your time improving chunks of bytes,
when the problem is elsewhere.
In fact you are working for peanuts, e.g. the replace method.


If you are not satisfied with my examples, just pick up
GvR's examples (ascii-string) on the bug tracker, "timeit"
them and you will see there is already a problem.

Better, "timeit" them after having replaced his ASCII strings
with non-ASCII characters...

jmf
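
A minimal timeit sketch along these lines (the strings and counts are
illustrative placeholders, not GvR's actual tracker examples; the euro
sign merely forces a 2-bytes-per-code-point representation under PEP 393):

    import timeit

    stmt = "s.replace('d', 'x')"
    setup_ascii = "s = 'abcdefgh' * 1000"        # ASCII: 1 byte per code point
    setup_bmp   = "s = 'abcdefg\u20ac' * 1000"   # euro sign: 2 bytes per code point

    print(timeit.timeit(stmt, setup_ascii, number=10000))
    print(timeit.timeit(stmt, setup_bmp, number=10000))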

 
C

Chris Angelico

The wide build (which I never used) is in my mind as correct as
the narrow build. It "just" covers a different range in Unicode
(the whole range).

Actually it does; it covers all of the Unicode range, by using
(effectively) UTF-16. Characters that cannot be represented in one
16-bit number are represented in two. That's not "just" covering a
different range. It's being buggy. And it's creating a way for code to
unexpectedly behave fundamentally differently on Windows and Linux
(since the most common builds for Windows were narrow and for Linux
were wide). This is a Bad Thing for Python.

ChrisA
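
A short sketch of what those "two 16-bit numbers" look like, assuming
CPython 3.5+ (for bytes.hex()): encoding a non-BMP character as UTF-16
shows the surrogate pair that a narrow build stored and exposed for it.

    >>> "\U0001F435".encode('utf-16-be').hex()      # big-endian UTF-16 bytes
    'd83ddc35'                                      # surrogate pair D83D DC35
    >>> len("\U0001F435".encode('utf-16-be')) // 2  # number of 16-bit code units
    2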
 
M

MRAB

You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.

----------


I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only this, you
are penalized.

I'm sorry, saying Python now just covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ...
If you're that concerned about it, why don't you modify the source code so
that the string representation chooses between only 2 bytes and 4 bytes per
codepoint, and then see whether you prefer that situation. How do
the memory usage and speed compare?
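
A rough sketch of the memory half of that measurement, without patching
anything, assuming CPython 3.3+ (absolute sizes vary by platform and
version):

    import sys

    for s in ('a' * 1000,              # ASCII        -> 1 byte per code point
              '\u00e9' * 1000,         # Latin-1 'é'  -> 1 byte per code point
              '\u20ac' * 1000,         # BMP-only '€' -> 2 bytes per code point
              '\U0001F435' * 1000):    # non-BMP      -> 4 bytes per code point
        print(len(s), sys.getsizeof(s))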
 
B

Benjamin Kaplan

You still have yet to explain how Python's string representation is
wrong. Just that it isn't optimal for one specific case. Here's how I
understand it:

1) Strings are sequences of stuff. Generally, we talk about strings as
either sequences of bytes or sequences of characters.

2) Unicode is a format used to represent characters. Therefore,
Unicode strings are character strings, not byte strings.

3) Encodings are functions that map characters to bytes. They
typically also define an inverse function that converts from bytes
back to characters.

4) UTF-8 IS NOT UNICODE. It is an encoding, one of those functions I
mentioned in the previous point. It happens to be one of the five
standard encodings that are defined for all characters in the Unicode
standard (the others being the little- and big-endian variants of
UTF-16 and UTF-32).

5) The internal representation of a character string DOES NOT MATTER.
All that matters is that the API represents it as a string of
characters, regardless of the representation. We could implement
character strings by putting the Unicode code points in binary-coded
decimal and it would still be a Unicode character string.

6) The String type that .NET and Java (and the unicode type in Python
narrow builds) use is not a character string. It is a string of
shorts, each of which corresponds to a UTF-16 code unit. I know this
is the case because in all of these, the length of "\U0001F435" is 2
even though it only consists of one character.

7) The new string representation in Python 3.3 can successfully
represent all characters in the Unicode standard. The actual number of
bytes that each character consumes is invisible to the user.

----------


I showed enough examples. As soon as you are using non-Latin-1 chars,
your "optimization" just becomes irrelevant, and not only this, you
are penalized.

I'm sorry, saying Python now just covers the whole Unicode
range is not a valid excuse. I prefer a "correct" version with
a narrower range of chars, especially if this range represents
the "daily used chars".

I can go a step further: if I wish to write an application for
Western European users, I'm better served if I'm using a coding
scheme covering all these languages/scripts. What about cp1252 [*]?
Does this not remind you of something?

Python can do better; it only succeeds in doing worse!

[*] yes, I know, internally ...

jmf

By that logic, we should all be using ASCII because it's "correct" for
the 128 characters that I (as an English speaker) use, and therefore
it's all that we should care about. I don't care if é counts as two
characters; it's faster and more memory efficient for all of my
strings to just count bytes. There are certain domains where
characters outside the basic multilingual plane are used. Python's job
is to be correct in all of those circumstances, not just the ones you
care about.
 
8

88888 Dihedral

Chris Angelico wrote on Thursday, 28 March 2013 at 11:40:17 AM UTC+8:
Has anybody else thought that [jmf's] last few responses are starting to sound

Yes, I did wonder. It's like he and Dihedral have been trading
accounts sometimes. Hey, Dihedral, I hear there's a discussion of
Unicode and PEP 393 and Python 3.3 and Unicode and lots of keywords
for you to trigger on and Python and bots are funny and this text is
almost grammatical!

There. Let's see if he takes the bait.

ChrisA

Well, we need some cheap RAM to hold 4 bytes per character
in a text segment to be observed.

For those not to be observed or shown, the old way still works.

Windows got this job done right to collect taxes in areas
of different languages.
 
T

Terry Reedy

On 3/28/2013 4:26 PM, jmfauth wrote:

Please provide references for your assertions. I have read the Unicode
standard, parts more than once, and your assertions contradict my memory.
Unicode does not stipulate that one has to cover the whole range.

I believe it does. As I remember, the recognized encodings all encode
the entire Unicode codepoint range.
Unicode expects that every character in a range behaves the same
way.

I have no idea what you mean by 'same way'. Each codepoint is supposed
to behave differently in some way. That is the reason for having
multiple codepoints. One causes an 'a' to appear, another a 'b'. Indeed,
the standard defines multiple categories of codepoints, and chars in
different categories are supposed to act differently (or be treated
differently). Glyphic chars versus control chars are one example.
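
For instance, the unicodedata module exposes those categories directly
(a quick illustrative session):

    >>> import unicodedata
    >>> unicodedata.category('a')       # Ll: lowercase letter (a glyphic char)
    'Ll'
    >>> unicodedata.category('\n')      # Cc: control character
    'Cc'
    >>> unicodedata.category('\u20ac')  # Sc: currency symbol
    'Sc'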
 
D

Dennis Lee Bieber

At some point we have to stop being gentle / polite / politically correct and call a shovel a shovel... er, spade.

Call it an Instrument For the Transplantation of Dirt

(Is an antique Steam Shovel ever a Steam Spade?)
 
C

Chris Angelico

Call it an Instrument For the Transplantation of Dirt

(Is an antique Steam Shovel ever a Steam Spade?)

I don't know, but I'm pretty sure there's a private detective who
wouldn't appreciate being called Sam Shovel.

ChrisA
 
M

Mark Lawrence

Call it an Instrument For the Transplantation of Dirt

(Is an antique Steam Shovel ever a Steam Spade?)

Surely you can spade a lot more things than dirt?
 
S

Steven D'Aprano

Even if you personally would prefer someone to respond by calling you a
liar, your personal preferences do not form a basis for desirable
posting behavior here.

Whereas yours apparently do.

Thanks for the feedback, I'll take it under advisement.
 
S

Steven D'Aprano

The only difference for ASCII-only strings is that they are kept in a
struct with a smaller header. The smaller header omits the utf8 pointer
(which optionally points to an additional UTF-8 representation of the
string) and its associated length variable. These are not needed for
ASCII-only strings because an ASCII string can be directly interpreted
as a UTF-8 string for the same result. The smaller header also omits
the "wstr_length" field which, according to the PEP, "differs from
length only if there are surrogate pairs in the representation." For an
ASCII string, of course there would not be any surrogate pairs.


I wonder why they need to care about surrogate pairs?

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings in the SMPs that could need surrogate pairs,
and they don't need them in Python's implementation since it's a full 32-
bit implementation. So where do the surrogate pairs come into this?

I also wonder why the implementation bothers keeping a UTF-8
representation. That sounds like premature optimization to me. Surely you
only need it when writing to a file with UTF-8 encoding? For most
strings, that will never happen.
 
C

Chris Angelico

ASCII and Latin-1 strings obviously do not have them. Nor do BMP-only
strings. It's only strings in the SMPs that could need surrogate pairs,
and they don't need them in Python's implementation since it's a full 32-
bit implementation. So where do the surrogate pairs come into this?

PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...
I also wonder why the implementation bothers keeping a UTF-8
representation. That sounds like premature optimization to me. Surely you
only need it when writing to a file with UTF-8 encoding? For most
strings, that will never happen.

... the UTF-8 version. It'll keep it if it has it, and not otherwise. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?

Minor nitpick, btw:
(in which cast wstr_length differs form length)
Should be "in which case" and "from". Who has the power to correct
typos in PEPs?

ChrisA
 
M

MRAB

PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.

utf8_length, utf8: UTF-8 representation (null-terminated).

data: shortest-form representation of the unicode string. The string
is null-terminated (in its respective representation).

All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""

If the string was created from a wchar_t string, that string will be
retained, and presumably can be used to re-output the original for a
clean and fast round-trip. Same with...


... the UTF-8 version. It'll keep it if it has it, and not otherwise. A lot
of content will go out in the same encoding it came in in, so it makes
sense to hang onto it where possible.

Though, from the same quote: The UTF-8 representation is
null-terminated. Does this mean that it can't be used if there might
be a \0 in the string?
You could ask the same question about any encoding.

It's only an issue if it's passed to a C function which expects a
null-terminated string.
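
A small sketch of that distinction, assuming CPython 3.x: at the Python
level an embedded U+0000 is a character like any other; it only matters
once the encoded bytes reach C code that stops at the first NUL.

    >>> s = 'ab\x00cd'
    >>> len(s)                                 # Python strings may contain NUL
    5
    >>> s.encode('utf-8')                      # NUL encodes like any other character
    b'ab\x00cd'
    >>> s.encode('utf-8').split(b'\x00')[0]    # what a strlen-style C consumer sees
    b'ab'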
 
