unicode encoding usability problem

Martin v. Löwis · Feb 20, 2005

Nick said:
Having "", u"", and r"" be immutable, while b"" was mutable would seem
rather inconsistent.

Yes. However, this inconsistency might be desirable. It would, of
course, mean that the literal cannot be a singleton. Instead, it has
to be a display (?), similar to list or dict displays: each execution
of the byte string literal creates a new object.
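
The difference between a singleton literal and a display can be sketched roughly as follows. This is only an illustration in current Python, not the syntax under discussion: the function names are made up for the example, bytearray stands in for a hypothetical mutable byte literal, and the identity of the string constant is a CPython implementation detail.

```python
def text_literal():
    # An immutable literal is stored once as a constant of the function,
    # so CPython hands back the same object on every call.
    return "spam and eggs"


def list_display():
    # A display builds a fresh object each time it is executed.
    return [1, 2, 3]


def mutable_bytes():
    # Stand-in for a hypothetical mutable byte literal: a new buffer per call.
    return bytearray(b"spam")


first, second = mutable_bytes(), mutable_bytes()
first[0] = ord("S")   # mutating one buffer leaves the other untouched

same_text = text_literal() is text_literal()    # one shared constant (CPython)
same_list = list_display() is list_display()    # two distinct lists
```

If a mutable literal were a singleton, the assignment to first[0] would also change what every later execution of the literal returns, which is why it would have to behave like a display.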

An alternative would be to have "bytestr" be the immutable type
corresponding to the current str (with b"" literals producing
bytestr's), while reserving the "bytes" name for a mutable byte
sequence.

Indeed. This maze of options has caused the process to get stuck.
People also argue that with such an approach, we could as well
tell users to use array.array for the mutable type. But then,
people complain that it doesn't have all the library support that
strings have.

The main point being, the replacement for 'str' needs to be immutable or
the upgrade process is going to be a serious PITA.

Somebody really needs to take this in his hands, completing the PEP,
writing a patch, checking applications to find out what breaks.

Regards,
Martin

Nick Coghlan · Feb 20, 2005

Martin said:
People also argue that with such an approach, we could as well
tell users to use array.array for the mutable type. But then,
people complain that it doesn't have all the library support that
strings have.

Indeed - I've got a data manipulating program that I figured I could make
slightly less memory hungry by using arrays instead of strings.

I discovered very quickly just how inconvenient such a change would be in terms
of the available API for manipulation of the byte array (the loss of 'join'
support was a serious drawback). The program still uses strings for that reason.

However, I wonder if that might not be better solved by providing an
"array.bytearray" that supported relevant portions of the string API (and easy
conversion to a string), rather than blurring the concept of immutable strings.
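
A rough sketch of the inconvenience, written for current Python 3 rather than the Python of this thread. The built-in bytearray used at the end was added in Python releases after this discussion; it is shown only as one possible shape for a mutable byte type with string-style methods, not necessarily the "array.bytearray" imagined here.

```python
import array

chunks = [b"ab", b"cd", b"ef"]

# Immutable byte strings join directly.
joined = b"-".join(chunks)

# array.array has no join method, so the concatenation is spelled out by hand.
buf = array.array("B")
for index, chunk in enumerate(chunks):
    if index:
        buf.frombytes(b"-")
    buf.frombytes(chunk)

# bytearray is mutable but still offers join and other string-style methods.
mutable = bytearray(b"-").join(chunks)
mutable[0:2] = b"AB"   # in-place modification, impossible on an immutable string
```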

Hmm - something else the PEP needs to discuss: What happens to __str__ and
__unicode__? Is there a new __bytes__ slot?

I wonder if Skip is still up for championing this one. . .

Cheers,
Nick.
One PEP's enough for me (even though 338 doesn't seem to generate much interest)

aurora · Feb 21, 2005

if you don't know where a and b come from, how can you be sure that
your program works at all? how can you be sure they're both strings?

("a op b" can fail in many ways, depending on what "a", "b", and "op"
are)

a and b are both strings. The issue is whether each one is an 8-bit
string or a unicode string.

if you have unit tests, why don't they include Unicode tests?

</F>

How do I structure the test cases to guarantee coverage? It is not
practical to test every combination of unicode/8-bit strings. Adding
non-ascii characters to the test data probably makes problems pop up
earlier. But it is arduous, and it is hard to spot if you left any out.

aurora · Feb 21, 2005

Yes. However, this inconsistency might be desirable. It would, of
course, mean that the literal cannot be a singleton. Instead, it has
to be a display (?), similar to list or dict displays: each execution
of the byte string literal creates a new object.

Indeed. This maze of options has caused the process to get stuck.
People also argue that with such an approach, we could as well
tell users to use array.array for the mutable type. But then,
people complain that it doesn't have all the library support that
strings have.

Somebody really needs to take this in his hands, completing the PEP,
writing a patch, checking applications to find out what breaks.

Regards,
Martin

What is the process for getting a PEP worked out? Is the work and
discussion carried out on the python-dev mailing list? I would be glad to
help out, especially on this particular issue.

Fredrik Lundh · Feb 21, 2005

aurora said:
a and b are both strings.

how do you know that?

How do I structure the test cases to guarantee coverage? It is not practical to test every
combination of unicode/8-bit strings. Adding non-ascii characters to the test data probably makes
problems pop up earlier. But it is arduous

sounds like you don't want to test for it. sorry, cannot help. I prefer
to design libraries so they can be tested, and design tests so they test all
important aspects of my libraries. if you prefer another approach, there's
not much I can do, other than repeating what I said at the start: if you do
things the right way (decode on the way in, encode on the way out), it
just works.

</F>
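
A minimal sketch of "decode on the way in, encode on the way out", combined with test data that deliberately includes non-ASCII input. It is written for current Python 3, where byte strings and text are distinct types. The helpers load_text, dump_text and greet are hypothetical names invented for the illustration, not part of any library mentioned in this thread.

```python
def load_text(raw, encoding="utf-8"):
    """Decode bytes at the input boundary (hypothetical helper)."""
    return raw.decode(encoding)


def dump_text(text, encoding="utf-8"):
    """Encode text at the output boundary (hypothetical helper)."""
    return text.encode(encoding)


def greet(name):
    # Internal code only ever sees text, never encoded bytes.
    return "Hello, " + name + "!"


# Test data covers plain ASCII plus non-ASCII names in two encodings.
samples = [
    (b"Nick", "ascii"),
    ("L\u00f6wis".encode("latin-1"), "latin-1"),
    ("Fr\u00e9d\u00e9ric".encode("utf-8"), "utf-8"),
]

for raw, encoding in samples:
    name = load_text(raw, encoding)
    out = dump_text(greet(name))
    assert out.decode("utf-8") == "Hello, " + name + "!"
```

Because the encoding is handled only at the two boundaries, the tests need one non-ASCII sample per input encoding rather than every combination of 8-bit and unicode strings.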

Dieter Maurer · Feb 21, 2005

I do understand aurora's problems very well.

I have suffered from this occasionally, too:

* some library decides to use unicode (without my having asked it to do so)

* Python then decides to convert other strings to unicode,
and boom: "Unicode decode error".

I solve these issues with a "sys.setdefaultencoding(ourDefaultEncoding)"
in "sitecustomize.py".

I know that almost all the characters I have to handle
are encoded in "ourDefaultEncoding" and if something
converts to Unicode without being asked for, then this
is precisely the correct encoding.

I know that Unicode fanatics do not like "setdefaultencoding",
but until we have completely converted to Unicode (which we probably
will do in the more distant future), this is essential to stay sane...

Dieter
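
The failure described above can be imitated roughly as follows. Python 2 applied this conversion implicitly, using ASCII as its stock default encoding, whenever an 8-bit string met a unicode string. The sketch is written for Python 3, which dropped the implicit conversion, so the step is made explicit through a hypothetical helper named coerce.

```python
def coerce(byte_string, default_encoding="ascii"):
    # Explicit stand-in for the implicit conversion Python 2 performed when
    # an 8-bit string was combined with a unicode string.
    return byte_string.decode(default_encoding)


data = b"caf\xe9"   # "café" encoded as Latin-1

try:
    coerce(data)
    failed = False
except UnicodeDecodeError:
    failed = True   # the "Unicode decode error" from the post above

# With the default set to the encoding the data really uses, the conversion
# succeeds. Latin-1 accepts every byte value, so data in some other encoding
# would decode without error into the wrong characters.
text = coerce(data, "latin-1")
```

Changing the default only helps when, as in the post above, almost all of the data really is in that one encoding.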

Martin v. Löwis · Feb 21, 2005

aurora said:
What is the process for getting a PEP worked out? Is the work and
discussion carried out on the python-dev mailing list? I would be glad to
help out, especially on this particular issue.

See PEP 1 for the PEP process. The main point is that discussion is
*not* carried out on any specific forum. But instead, the PEP serves
as a container for all possible considerations people come up with,
formally by writing to the PEP author. Of course, they will use
comp.lang.python and python-dev (and perhaps SIG mailing lists)
instead of writing to the PEP author, so the PEP author may need to
track these as well.

The process is triggered by the author posting revisions of the
PEP at a moderate rate, each time claiming "now I think it is
complete". Then, if nobody comes up with a reasoning that is
not yet covered in the PEP, it becomes ready for BDFL
pronouncement. It had better also have an implementation at some
point in time.

For a dormant PEP, the prospective author should contact the
original author, and offer co-authoring. Perhaps the original
author even proposes that you can take over the entire thing
sometime.

Notice that, at some point, a patch implementing the PEP will
be needed. So you should indicate from the beginning whether you
are also willing to work on the implementation. If not, there is
a good chance that the PEP again goes dormant after the
specification is complete.

Regards,
Martin

Vinay Sajip · Feb 25, 2005

This will help in your code, but there is a big pile of modules in stdlib
that are not unicode-friendly. From my daily practice come shlex
(tokenizer works only with encoded strings) and logging (you can't
specify encoding for FileHandler).

You can, of course, pass in a stream opened using codecs.open to StreamHandler.
Not quite as friendly, I'll grant you.
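
A sketch of that workaround, assuming a throwaway log file in a temporary directory and a made-up logger name. It is written for current Python 3, where logging.FileHandler also accepts an encoding argument directly, and where recent releases deprecate codecs.open in favour of the built-in open.

```python
import codecs
import logging
import os
import tempfile

# Hypothetical log location, used only for this sketch.
log_path = os.path.join(tempfile.mkdtemp(), "app.log")

# Open the file with an explicit encoding and hand the stream to StreamHandler.
stream = codecs.open(log_path, "a", encoding="utf-8")
handler = logging.StreamHandler(stream)

logger = logging.getLogger("encoding_demo")   # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info("caf\u00e9 opened")   # non-ASCII text in the log message

# StreamHandler does not close a stream it was given, so close it here.
handler.flush()
logger.removeHandler(handler)
stream.close()

with open(log_path, "rb") as log_file:
    contents = log_file.read()
```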

Regards,

Vinay Sajip

Pierre-Frédéric Caillaud · Apr 11, 2005

Hello !

I've been trying desperately to access http://www.stackless.com but it's
been down, for about a week now !
I desperately need to download stackless python...
Of course the stackless mailing list is on their server, so it's down,
too.

Does anybody have any info ?
Does anybody have a tarball of a recent version of stackless that I may
use (with the docs ?)

Thanks !

Regards,
P.F. Caillaud

cfbolz · Apr 12, 2005

Hi!

Pierre-Frédéric Caillaud said:
I've been trying desperately to access http://www.stackless.com but
it's been down, for about a week now !

The stackless webpage is working again.

Regards,

Carl Friedrich Bolz

Pierre-Frédéric Caillaud · Apr 12, 2005

Great !
Thanks !

Hi!

The stackless webpage is working again.

Regards,

Carl Friedrich Bolz

