Steven D'Aprano
PEP 393 says:
"""
wstr_length, wstr: representation in platform's wchar_t
(null-terminated). If wchar_t is 16-bit, this form may use surrogate
pairs (in which cast wstr_length differs form length). wstr_length
differs from length only if there are surrogate pairs in the
representation.
utf8_length, utf8: UTF-8 representation (null-terminated).
data: shortest-form representation of the unicode string. The string is
null-terminated (in its respective representation).
All three representations are optional, although the data form is
considered the canonical representation which can be absent only while
the string is being created. If the representation is absent, the
pointer is NULL, and the corresponding length field may contain
arbitrary data.
"""
All the words are in English (well, most of them...) but what does it
mean?
> If the string was created from a wchar_t string, that string will be
> retained, and presumably can be used to re-output the original for a
> clean and fast round-trip.
Under what circumstances will a string be created from a wchar_t string?
How, and why, would such a string be created? Why would Python still
support strings containing surrogates when it now has a nice, shiny,
surrogate-free flexible representation?
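For anyone who wants to see the flexible representation in action, here is a rough sketch that estimates how many bytes each character costs by watching how the object grows. It assumes CPython 3.3 or later, where sys.getsizeof reflects the internal buffer; the helper name bytes_per_char is made up for illustration and other implementations may report something different.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by measuring how the object grows.

    Hypothetical helper, CPython-specific: compares the size of a string
    of 2*n copies of ch with one of n copies, so the fixed object header
    cancels out.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

samples = [
    ("ASCII", "a"),
    ("Latin-1", "\xe9"),
    ("BMP", "\u20ac"),
    ("astral", "\U0001F600"),
]
for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On a PEP 393 build I would expect this to report 1, 1, 2 and 4 bytes per character respectively, with no surrogate pairs involved for the astral case.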
> ... the UTF-8 version. It'll keep it if it has it, and not else. A lot
> of content will go out in the same encoding it came in in, so it makes
> sense to hang onto it where possible.
Not to me. That almost doubles the size of the string, on the off-chance
that you'll need the UTF-8 encoding. Which for many uses, you don't, and
even if you do, it seems like premature optimization to keep it around
just in case. Encoding to UTF-8 will be fast for small N, and for large
N, why carry around (potentially) multiple megabytes of duplicated data
just in case the encoded version is needed some time?
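To put rough numbers on the duplication, here is a back-of-the-envelope sketch comparing the size of the canonical data buffer with the size of a UTF-8 copy. It ignores object headers and null terminators, the function name utf8_overhead is invented for this example, and the widths are my reading of the PEP (1, 2 or 4 bytes per character depending on the widest code point), not a measurement of the interpreter.

```python
def utf8_overhead(s):
    """Return (canonical_bytes, utf8_bytes) for s, ignoring headers.

    Hypothetical helper: canonical_bytes assumes the PEP 393 rule that
    the data form uses the narrowest of 1, 2 or 4 bytes per character
    that can hold the widest code point in the string.
    """
    widest = max(map(ord, s), default=0)
    if widest < 0x100:
        width = 1
    elif widest < 0x10000:
        width = 2
    else:
        width = 4
    return width * len(s), len(s.encode("utf-8"))

for text in ["abc", "\xe9" * 10, "\u20ac" * 10, "\U0001F600" * 10]:
    data, utf8 = utf8_overhead(text)
    print(ascii(text[:1]), data, utf8)
```

By this estimate the extra copy costs anywhere from the same again to more than double the data form for non-ASCII text. As I read the PEP, pure ASCII strings are the exception, since their UTF-8 form is identical to the data and can be shared, not duplicated.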