Bug in slice type


Steve Holden

Antoon said:
Sometimes it is convenient to have the exception thrown at a later
time.




And maybe the more convenient place for this "if" is in a whole different
part of your program, a part where using -1 as an invalid index isn't
at all obvious.




> You always seem to look at such things in a very narrow scope. You never
> seem to consider that various parts of a program have to work together.

Or perhaps it's just that I try not to mix parts inappropriately.

> So what happens if you have a module that is collecting string-index
> pairs, collected from various other parts. In one part you
> want to select the last letter, so you pythonically choose -1 as
> index. In another part you get a result of find and are happy
> with -1 as an indication for an invalid index. Then these
> data meet.

That's when debugging has to start. Mixing data of such types is
somewhat inadvisable, don't you agree?

I suppose I can't deny that people do things like that, myself included,
but mixing data sets where -1 is variously an error flag and a valid
index is only going to lead to trouble when the combined data is used.
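
To make the "then these data meet" scenario concrete, here is a minimal sketch in present-day Python 3. The collector list and the sample strings are invented for illustration; only str.find() and negative indexing are real Python behaviour.

```python
# Hypothetical collector: (string, index) pairs arriving from two callers.
pairs = []

# Caller 1 deliberately uses -1 to mean "the last character".
pairs.append(("spam", -1))

# Caller 2 stores the result of str.find(); -1 here means "not found".
pairs.append(("eggs", "eggs".find("z")))

# Once combined, the two meanings of -1 can no longer be told apart:
# both pairs index the last character without any error.
letters = [text[index] for text, index in pairs]
print(letters)  # ['m', 's']
```

The second result is a silent bug: 's' was never searched for, it is simply the last character of "eggs".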

regards
Steve
 

Terry Reedy

Steve Holden said:
That's when debugging has to start. Mixing data of such types is
somewhat inadvisable, don't you agree?

I suppose I can't deny that people do things like that, myself included,
but mixing data sets where -1 is variously an error flag and a valid
index is only going to lead to trouble when the combined data is used.

The fact that the -1 return *has* led to bugs in actual code is the
primary reason Guido has currently decided that find and rfind should go.
A careful review of current usages in the standard library revealed at
least a couple bugs even there.

Terry J. Reedy
 

Paul Rubin

Terry Reedy said:
The fact that the -1 return *has* led to bugs in actual code is the
primary reason Guido has currently decided that find and rfind should go.
A careful review of current usages in the standard library revealed at
least a couple bugs even there.

Really it's x[-1]'s behavior that should go, not find/rfind.

Will socket.connect_ex also go? How about dict.get? Being able to
return some reasonable value for "failure" is a good thing, if failure
is expected. Exceptions are for unexpected, i.e., exceptional failures.
 

Antoon Pardon

Op 2005-08-29 said:
> Or perhaps it's just that I try not to mix parts inappropriately.

I didn't know it was inappropriate to mix certain parts. Can you
give a list of modules in the standard library I shouldn't mix?

> That's when debugging has to start. Mixing data of such types is
> somewhat inadvisable, don't you agree?

The type of both data is the same, it is a string-index pair in
both cases. The problem is that a module from the standard lib
uses a certain value to indicate an illegal index, that has
a very legal value in python in general.

> I suppose I can't deny that people do things like that, myself included,

It is not about what people do. If this was about someone implementing
find himself and using -1 as an illegal index, I would certainly agree
that it was inadvisable to do so. Yet when this is what python with
its library offers the programmer, you seem reluctant to find fault with
it.

> but mixing data sets where -1 is variously an error flag and a valid
> index is only going to lead to trouble when the combined data is used.

Yet this is what python does. Using -1 variously as an error flag and
a valid index and when people complain about that, you say it sounds like
whining.
 

Bryan Olson

Steve said:
> I'm all in favor of discussions to make 3.0 a better
> language.

This one should definitely be two-phase. First, the non-code-
breaking change that replaces-and-deprecates the warty handling
of negative indexes, and later the removal of the old style. For
the former, there's no need to wait for a X.0 release; for the
latter, 3.0 may be too early.

The draft PEP went to the PEP editors a couple days ago. Haven't
heard back yet.
 

Antoon Pardon

Op 2005-08-29 said:
Antoon said:
I think a properly implemented find is better than an index.

See the current thread in python-dev[1], which proposes a new method,
str.partition(). I believe that Raymond Hettinger has shown that almost
all uses of str.find() can be more clearly represented with his
proposed function.

Do we really need this? As far as I understand, most of this
functionality is already provided by str.split and str.rsplit.

I think adding an optional third parameter 'full=False' to these
methods would be all that is useful here. If full was set
to True, split and rsplit would enforce that a list with
maxsplit + 1 elements was returned, filling up the list with
None's if necessary.


head, sep, tail = str.partition(sep)

would then almost be equivalent to

head, tail = str.split(sep, 1, True)


Code like the following:

head, found, tail = result.partition(' ')
if not found:
    break
result = head + tail


Could be replaced by:

head, tail = result.split(' ', 1, full = True)
if tail is None:
    break
result = head + tail
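
The 'full' parameter is only a proposal in this post; str.split() has no such argument. As a rough sketch of the intended behaviour, the hypothetical helper split_full() below pads the result of the real str.split() with None, and is compared against str.partition(), which was later added to the language.

```python
def split_full(text, sep, maxsplit):
    """Hypothetical stand-in for the proposed text.split(sep, maxsplit, full=True).

    Always returns maxsplit + 1 items, padding with None when the
    separator occurs fewer than maxsplit times.
    """
    parts = text.split(sep, maxsplit)
    return parts + [None] * (maxsplit + 1 - len(parts))


# Separator present: both approaches yield the same head and tail.
print(split_full("key value", " ", 1))   # ['key', 'value']
print("key value".partition(" "))        # ('key', ' ', 'value')

# Separator absent: None marks the missing tail, where partition()
# reports an empty separator and an empty tail instead.
print(split_full("novalue", " ", 1))     # ['novalue', None]
print("novalue".partition(" "))          # ('novalue', '', '')
```

One difference the padding makes visible: with None as the filler, "separator missing" and "separator present but tail empty" stay distinguishable, just as they do through the middle element of partition()'s result.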


I also think that code like this:

while tail:
    head, _, tail = tail.partition('.')
    mname = "%s.%s" % (m.__name__, head)
    m = self.import_it(head, mname, m)
    ...


Would probably better be written as follows:

for head in tail.split('.'):
    mname = "%s.%s" % (m.__name__, head)
    m = self.import_it(head, mname, m)
    ...


Unless I'm missing something.
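
A quick way to check that "unless" is to run both loop shapes over a few inputs. The sketch below uses Python 3, where str.partition() exists; the two helper functions are invented for the comparison and only collect the 'head' values each loop would see.

```python
def heads_with_partition(tail):
    """Collect components the way the `while tail:` loop does."""
    heads = []
    while tail:
        head, _, tail = tail.partition(".")
        heads.append(head)
    return heads


def heads_with_split(tail):
    """Collect components the way the `for head in tail.split('.')` loop does."""
    return list(tail.split("."))


for sample in ("a.b.c", "a..b", "", "a."):
    print(repr(sample), heads_with_partition(sample), heads_with_split(sample))
```

For ordinary dotted names the two agree. They differ on an empty string (no iterations versus one iteration with an empty head) and on a trailing dot (the split version sees one extra empty component), so the rewrite is equivalent only if such inputs cannot occur.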
 

Terry Reedy

> Really it's x[-1]'s behavior that should go, not find/rfind.

I completely disagree, x[-1] as an abbreviation of x[len(x)-1] is extremely
useful, especially when 'x' is an expression instead of a name. But even
if -1 were not a legal subscript, I would still consider it a design error
for Python to mistype a non-numeric singleton indicator as an int. Such
mistyping is only necessary in a language like C that requires all return
values to be of the same type, even when the 'return value' is not really a
return value but an error signal. Python does not have that typing
restriction and should not act as if it does by copying C.

> Will socket.connect_ex also go?

Not familiar with it.

> How about dict.get?

A default value is not necessarily an error indicator. One can regard a
dict that is 'getted' as an infinite dict matching all keys with the
default except for a finite subset of keys, as recorded in the dict.

If the default is to be regarded a 'Nothing to return' indicator, then that
indicator *must not* be in the dict. A recommended idiom is to then create
a new, custom subset of object which *cannot* be a value in the dict.
Return values can then safely be compared with that indicator by using the
'is' operator.

In either case, .get is significantly different from .find.

Terry J. Reedy
 

Paul Rubin

Terry Reedy said:
>> Really it's x[-1]'s behavior that should go, not find/rfind.
>
> I completely disagree, x[-1] as an abbreviation of x[len(x)-1] is extremely
> useful, especially when 'x' is an expression instead of a name.

There are other abbreviations possible, for example the one in the
proposed PEP at the beginning of this thread.

> But even
> if -1 were not a legal subscript, I would still consider it a design error
> for Python to mistype a non-numeric singleton indicator as an int.

OK, .find should return None if the string is not found.
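
As a sketch of what that would change, the hypothetical wrapper find_or_none() below returns None instead of -1. It is not a real str method; it only wraps the existing str.find() to show that a None result cannot be used as an index by accident.

```python
def find_or_none(text, sub):
    """Hypothetical wrapper: like str.find(), but None signals 'not found'."""
    position = text.find(sub)
    return None if position == -1 else position


s = "buggy"
print(find_or_none(s, "g"))  # 2

# Using the failure value as a subscript now fails loudly.
try:
    s[find_or_none(s, "w")]
    outcome = "no error"
except TypeError:
    outcome = "TypeError"
print(outcome)  # TypeError
```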
 

Bryan Olson

Terry said:
> "Paul Rubin" wrote:
>
>>Really it's x[-1]'s behavior that should go, not find/rfind.
>
> I completely disagree, x[-1] as an abbreviation of x[len(x)-1] is extremely
> useful, especially when 'x' is an expression instead of a name.

Hear us out; your disagreement might not be so complete as you
think. From-the-far-end indexing is too useful a feature to
trash. If you look back several posts, you'll see that the
suggestion here is that the index expression should explicitly
call for it, rather than treat negative integers as a special
case.

I wrote up and sent off my proposal, and once the PEP-Editors
respond, I'll be pitching it on the python-dev list. Below is
the version I sent (not yet a listed PEP).


--
--Bryan


PEP: -1
Title: Improved from-the-end indexing and slicing
Version: $Revision: 1.00 $
Last-Modified: $Date: 2005/08/26 00:00:00 $
Author: Bryan G. Olson (e-mail address removed)
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 26 Aug 2005
Post-History:


Abstract

To index or slice a sequence from the far end, we propose
using a symbol, '$', to stand for the length, instead of
Python's current special-case interpretation of negative
subscripts. Where Python currently uses:

sequence[-i]

We propose:

sequence[$ - i]

Python's treatment of negative indexes as offsets from the
high end of a sequence causes minor obvious problems and
major subtle ones. This PEP proposes a consistent meaning
for indexes, yet still supports from-the-far-end
indexing. Use of new syntax avoids breaking existing code.


Specification

We propose a new style of slicing and indexing for Python
sequences. Instead of:

sequence[start : stop : step]

new-style slicing uses the syntax:

sequence[start ; stop ; step]

It works like current slicing, except that negative start or
stop values do not trigger from-the-high-end interpretation.
Omissions and 'None' work the same as in old-style slicing.

Within the square-brackets, the '$' symbol stands for the
length of the sequence. One can index from the high end by
subtracting the index from '$'. Instead of:

seq[3 : -4]

we write:

seq[3 ; $ - 4]

When square-brackets appear within other square-brackets,
the inner-most bracket-pair determines which sequence '$'
describes. The length of the next-outer sequence is denoted
by '$1', and the next-outer after that by '$2', and so on. The
symbol '$0' behaves identically to '$'. Resolution of $x is
syntactic; a callable object invoked within square brackets
cannot use the symbol to examine the context of the call.

The '$' notation also works in simple (non-slice) indexing.
Instead of:

seq[-2]

we write:

seq[$ - 2]

If we did not care about backward compatibility, new-style
slicing would define seq[-2] to be out-of-bounds. Of course
we do care about backward compatibility, and rejecting
negative indexes would break way too much code. For now,
simple indexing with a negative subscript (and no '$') must
continue to index from the high end, as a deprecated
feature. The presence of '$' always indicates new-style
indexing, so a programmer who needs a negative index to
trigger a range error can write:

seq[($ - $) + index]
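
The '$' syntax does not exist in Python, so it cannot be run. The intended semantics can be approximated with ordinary functions, though. In this sketch the names strict_getitem and from_end are invented; seq[$ - i] is spelled as len(seq) - i, and negative indexes are rejected instead of wrapping around.

```python
def strict_getitem(seq, index):
    """Hypothetical helper: index without the from-the-end special case."""
    if index < 0:
        raise IndexError("negative index %d is out of range" % index)
    return seq[index]


def from_end(seq, offset):
    """Approximate seq[$ - offset] as seq[len(seq) - offset]."""
    return strict_getitem(seq, len(seq) - offset)


print(from_end("buggy", 1))  # y
print(from_end("buggy", 2))  # g

# A failed find() now triggers a range error instead of returning 'y'.
try:
    strict_getitem("buggy", "buggy".find("w"))
    caught = False
except IndexError:
    caught = True
print(caught)  # True
```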


Motivation

From-the-far-end indexing is such a useful feature that we
cannot reasonably propose its removal; nevertheless Python's
current method, which is to treat a range of negative
indexes as special cases, is warty. The wart bites novice or
imperfect Pythoners by not raising an exception when they
need to know about a bug. For example, the following code
prints 'y' with no sign of error:

s = 'buggy'
print s[s.find('w')]
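
The same behaviour can be reproduced in Python 3, where print is a function. The sketch also shows the existing str.index() method, which raises ValueError instead of returning -1.

```python
s = "buggy"

# find() returns -1 for a missing substring, and s[-1] is the last character.
silent = s[s.find("w")]
print(silent)  # y

# index() performs the same search but raises when nothing is found.
try:
    s[s.index("w")]
    raised = False
except ValueError as exc:
    raised = True
    print("index() raised:", exc)
```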

The wart becomes an even bigger problem with more
sophisticated use of Python sequences. What is the 'stop'
value for a slice when the step is negative and the slice
includes the zero index? An instance of Python's slice type
will report that the stop value is -1, but if we use this
stop value to slice, it gets misinterpreted as the last
index in the sequence. Here's an example:

class BuggerAll:

    def __init__(self, somelist):
        self.sequence = somelist[:]

    def __getitem__(self, key):
        if isinstance(key, slice):
            start, stop, step = key.indices(len(self.sequence))
            # print 'Slice says start, stop, step are:', start, stop, step
            return self.sequence[start : stop : step]


print range(10) [None : None : -2]
print BuggerAll(range(10))[None : None : -2]

The above prints:

[9, 7, 5, 3, 1]
[]

Un-commenting the print statement in __getitem__ shows:

Slice says start, stop, step are: 9 -1 -2

The slice object seems to think that -1 is a valid exclusive
bound, but when using it to actually slice, Python
interprets the negative number as an offset from the high
end of the sequence.

Steven Bethard offered the simpler example:

py> range(10)[slice(None, None, -2)]
[9, 7, 5, 3, 1]
py> slice(None, None, -2).indices(10)
(9, -1, -2)
py> range(10)[9:-1:-2]
[]

The double-meaning of -1, as both an exclusive stopping
bound and an alias for the highest valid index, is just
plain whacked. So what should the slice object return? With
Python's current indexing/slicing, there is no value that
just works. 'None' will work as a stop value in a slice, but
index arithmetic will fail. The value 0 - (len(sequence) +
1) will work as a stop value, and slice arithmetic and
range() will happily use it, but the result is not what the
programmer probably intended.
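
The stop-value problem described in this section still reproduces in Python 3. The sketch below uses list(range(10)) because range() no longer returns a list, and tries each candidate stop value mentioned above.

```python
seq = list(range(10))
start, stop, step = slice(None, None, -2).indices(len(seq))
print(start, stop, step)  # 9 -1 -2

# Fed back into a slice, -1 is reread as "the last index", giving nothing.
as_slice = seq[start:stop:step]
print(as_slice)  # []

# Fed into range(), the same triple is an ordinary exclusive bound.
via_range = [seq[i] for i in range(start, stop, step)]
print(via_range)  # [9, 7, 5, 3, 1]

# None works as a slice stop, but None cannot take part in index arithmetic.
print(seq[start:None:step])  # [9, 7, 5, 3, 1]

# 0 - (len(seq) + 1) also works as a slice stop ...
far_stop = 0 - (len(seq) + 1)
print(seq[start:far_stop:step])  # [9, 7, 5, 3, 1]

# ... but range() takes it literally and runs far below zero.
overrun = list(range(start, far_stop, step))
print(overrun)  # [9, 7, 5, 3, 1, -1, -3, -5, -7, -9]
```

So a custom __getitem__ that wants to honour slice.indices() can iterate over range(start, stop, step) rather than re-slicing with the returned values.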

The problem is subtle. A Python sequence starts at index
zero. There is some appeal to giving negative indexes a
useful interpretation, on the theory that they were invalid
as subscripts and thus useless otherwise. That theory is
wrong, because negative indexes were already useful, even
though not legal subscripts, and the reinterpretation often
breaks their existing use. Specifically, negative indexes are
useful in index arithmetic, and as exclusive stopping
bounds.

The problem is fixable. We propose that negative indexes not
be treated as a special case. To index from the far end of a
sequence, we use a syntax that explicitly calls for far-end
indexing.


Rationale

New-style slicing/indexing is designed to fix the problems
described above, yet live happily in Python along-side the
old style. The new syntax leaves the meaning of existing
code unchanged, and is even more Pythonic than current
Python.

Semicolons look a lot like colons, so the new semicolon
syntax follows the rule that things that are similar should
look similar. The semicolon syntax is currently illegal, so
its addition will not break existing code. Python is
historically tied to C, and the semicolon syntax is
evocative of the similar start-stop-step expressions of C's
'for' loop. JPython is tied to Java, which uses a similar
'for' loop syntax.

The '$' character currently has no place in a Python index,
so its new interpretation will not break existing code. We
chose it over other unused symbols because the usage roughly
corresponds to its meaning in the Python library's regular
expression module.

We expect use of the $0, $1, $2 ... syntax to be rare;
nevertheless, it has a Pythonic consistency. Thanks to Paul
Rubin for advocating it over the inferior multiple-$ syntax
that this author initially proposed.


Backwards Compatibility

To avoid breaking code, we use new syntax that is currently
illegal. The new syntax more-or-less looks like current
Python, which may help Python programmers adjust.

User-defined classes that implement the sequence protocol
are likely to work, unchanged, with new-style slicing.
'Likely' is not certain; we've found one subtle issue (and
there may be others):

Currently, user-defined classes can implement Python
subscripting and slicing without implementing Python's len()
function. In our proposal, the '$' symbol stands for the
sequence's length, so classes must be able to report their
length in order for $ to work within their slices and
indexes.

Specifically, to support new-style slicing, a class that
accepts index or slice arguments to any of:

__getitem__
__setitem__
__delitem__
__getslice__
__setslice__
__delslice__

must also consistently implement:

__len__

Sane programmers already follow this rule.



Copyright:

This document has been placed in the public domain.
 

Paul Rubin

Bryan Olson said:
Specifically, to support new-style slicing, a class that
accepts index or slice arguments to any of:

__getitem__
__setitem__
__delitem__
__getslice__
__setslice__
__delslice__

must also consistently implement:

__len__

Sane programmers already follow this rule.

It should be ok to use new-style slicing without implementing __len__
as long as you don't use $ in any slices. Using $ in a slice without
__len__ would throw a runtime error. I expect using negative
subscripts in old-style slices on objects with no __len__ also throws
an error.

Not every sequence needs __len__; for example, infinite sequences, or
sequences that implement slicing and subscripts by doing lazy
evaluation of iterators:

digits_of_pi = memoize(generate_pi_digits())  # 3,1,4,1,5,9,2,...
print digits_of_pi[5]    # computes 6 digits and prints '9'
print digits_of_pi[$-5]  # raises exception
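
memoize() and generate_pi_digits() are not defined anywhere in the thread. As a runnable stand-in, this Python 3 sketch uses an invented LazySequence class over an endless stream of squares. It supports subscripts but defines no __len__, so there is nothing a from-the-end index could be measured against.

```python
import itertools


class LazySequence:
    """Hypothetical memoizing wrapper around an iterator; defines no __len__."""

    def __init__(self, iterable):
        self._source = iter(iterable)
        self._cache = []

    def __getitem__(self, index):
        if index < 0:
            # There is no length to measure a from-the-end offset against.
            raise IndexError("from-the-end index on a sequence with no length")
        # Pull just enough items from the source to reach `index`.
        needed = index + 1 - len(self._cache)
        self._cache.extend(itertools.islice(self._source, max(needed, 0)))
        if index >= len(self._cache):
            raise IndexError("sequence exhausted before index %d" % index)
        return self._cache[index]


squares = LazySequence(n * n for n in itertools.count())
print(squares[5])  # computes six items and prints 25

try:
    squares[-5]
    negative_failed = False
except IndexError:
    negative_failed = True
print(negative_failed)  # True
```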
 

Antoon Pardon

Op 2005-08-30 said:
>> Really it's x[-1]'s behavior that should go, not find/rfind.
>
> I completely disagree, x[-1] as an abbreviation of x[len(x)-1] is extremely
> useful, especially when 'x' is an expression instead of a name.

I don't think the ability to easily index sequences from the right is
in dispute. Just the fact that negative numbers on their own provide
this functionality.

Because I sometimes find it useful to have a sequence start and
end at arbitrary indexes, I have written a table class. So I
can have a table that is indexed from e.g. -4 to +6. So how am
I supposed to easily get at that last value?
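
The table class itself is not shown in the thread. The sketch below is a guess at its general shape, with an invented Table class whose valid indexes run from `low` to `high` inclusive. Here -1 names an ordinary slot, so the far end has to be reached some other way, for example through an explicit attribute.

```python
class Table:
    """Hypothetical table indexed from `low` to `high` inclusive."""

    def __init__(self, low, high, fill=None):
        self.low, self.high = low, high
        self._items = [fill] * (high - low + 1)

    def _offset(self, index):
        # Translate a table index into a position in the backing list.
        if not self.low <= index <= self.high:
            raise IndexError("index %d outside %d..%d" % (index, self.low, self.high))
        return index - self.low

    def __getitem__(self, index):
        return self._items[self._offset(index)]

    def __setitem__(self, index, value):
        self._items[self._offset(index)] = value


t = Table(-4, 6)
t[-1] = "minus one"  # an ordinary slot, not "the last element"
t[t.high] = "last"   # the far end has to be named explicitly
print(t[-1], t[6])   # minus one last
```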
 

Robert Kern

Bryan said:
Currently, user-defined classes can implement Python
subscripting and slicing without implementing Python's len()
function. In our proposal, the '$' symbol stands for the
sequence's length, so classes must be able to report their
length in order for $ to work within their slices and
indexes.

Specifically, to support new-style slicing, a class that
accepts index or slice arguments to any of:

__getitem__
__setitem__
__delitem__
__getslice__
__setslice__
__delslice__

must also consistently implement:

__len__

Sane programmers already follow this rule.

Incorrect. Some sane programmers have multiple dimensions they need to
index.

from Numeric import *
A = array([[0, 1], [2, 3], [4, 5]])
A[$-1, $-1]

The result of len(A) has nothing to do with the second $.
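
The point can be checked without Numeric by using nested lists in Python 3. This is only an analogy: tuple subscripts like A[i, j] are replaced by A[i][j], and each '$' is written out by hand.

```python
A = [[0, 1], [2, 3], [4, 5]]  # 3 rows, 2 columns, as nested lists

rows = len(A)     # 3: the only length len(A) can report
cols = len(A[0])  # 2: the length the second '$' would need

print(A[rows - 1][cols - 1])  # 5

# Substituting len(A) for the second '$' points past the end of the row.
try:
    A[rows - 1][rows - 1]
    second_axis_ok = True
except IndexError:
    second_axis_ok = False
print(second_axis_ok)  # False
```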

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 

Antoon Pardon

Op 2005-08-30 said:
> Incorrect. Some sane programmers have multiple dimensions they need to
> index.

I don't see how that contradicts Bryan's statement.

> from Numeric import *
> A = array([[0, 1], [2, 3], [4, 5]])
> A[$-1, $-1]
>
> The result of len(A) has nothing to do with the second $.

But that is irrelevant to the question of whether or not sane
programmers follow Bryan's stated rule. That the second
$ has nothing to do with len(A) doesn't contradict that
__len__ has to be implemented, nor that sane programmers
already do.
 

Robert Kern

Antoon said:
Op 2005-08-30 said:
>> Incorrect. Some sane programmers have multiple dimensions they need to
>> index.
>
> I don't see how that contradicts Bryan's statement.
>
>> from Numeric import *
>> A = array([[0, 1], [2, 3], [4, 5]])
>> A[$-1, $-1]
>>
>> The result of len(A) has nothing to do with the second $.
>
> But that is irrelevant to the question of whether or not sane
> programmers follow Bryan's stated rule. That the second
> $ has nothing to do with len(A) doesn't contradict that
> __len__ has to be implemented, nor that sane programmers
> already do.

Except that the *consistent* implementation is supposed to support the
interpretation of $. It clearly can't for multiple dimensions.

 

Bryan Olson

Robert said:
> Bryan Olson wrote:
>
>
>> Currently, user-defined classes can implement Python
>> subscripting and slicing without implementing Python's len()
>> function. In our proposal, the '$' symbol stands for the
>> sequence's length, so classes must be able to report their
>> length in order for $ to work within their slices and
>> indexes.
>>
>> Specifically, to support new-style slicing, a class that
>> accepts index or slice arguments to any of:
>>
>> __getitem__
>> __setitem__
>> __delitem__
>> __getslice__
>> __setslice__
>> __delslice__
>>
>> must also consistently implement:
>>
>> __len__
>>
>> Sane programmers already follow this rule.
>
>
> Incorrect. Some sane programmers have multiple dimensions they need to
> index.
>
> from Numeric import *
> A = array([[0, 1], [2, 3], [4, 5]])
> A[$-1, $-1]
>
> The result of len(A) has nothing to do with the second $.

I think you have a good observation there, but I'll stand by my
correctness.

My initial post considered re-interpreting tuple arguments, but
I abandoned that alternative after Steven Bethard pointed out
how much code it would break. Modules/classes would remain free
to interpret tuple arguments in any way they wish. I don't think
my proposal breaks any sane existing code.

Going forward, I would advocate that user classes which
implement their own kind of subscripting adopt the '$' syntax,
and interpret it as consistently as possible. For example, they
could respond to __len__() by returning a type that supports the
"Emulating numeric types" methods from the Python Language
Reference 3.3.7, and also allows the class's methods to tell
that it stands for the length of the dimension in question.
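
The post does not spell out a mechanism. One way to let a class see "length of this dimension, minus k" is an explicit placeholder object; note that this is a different technique from having __len__ return a special type. Everything in this Python 3 sketch (_FromEnd, END, Grid) is hypothetical.

```python
class _FromEnd:
    """Hypothetical placeholder meaning 'length of this axis, minus offset'."""

    def __init__(self, offset=0):
        self.offset = offset

    def __sub__(self, amount):
        # END - 1 yields a new placeholder that remembers the offset.
        return _FromEnd(self.offset + amount)

    def resolve(self, length):
        return length - self.offset


END = _FromEnd()


class Grid:
    """Hypothetical 2-D container that resolves END separately per axis."""

    def __init__(self, rows):
        self._rows = [list(row) for row in rows]

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, key):
        r, c = key
        if isinstance(r, _FromEnd):
            r = r.resolve(len(self._rows))     # first axis: number of rows
        if isinstance(c, _FromEnd):
            c = c.resolve(len(self._rows[0]))  # second axis: number of columns
        return self._rows[r][c]


A = Grid([[0, 1], [2, 3], [4, 5]])
print(A[END - 1, END - 1])  # 5
print(A[END - 3, 0])        # 0
```

Because the container itself resolves the placeholder, each axis gets its own length, which a single call to len() cannot provide.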
 

Robert Kern

Bryan said:
> Robert Kern wrote:
>> from Numeric import *
>> A = array([[0, 1], [2, 3], [4, 5]])
>> A[$-1, $-1]
>>
>> The result of len(A) has nothing to do with the second $.
>
> I think you have a good observation there, but I'll stand by my
> correctness.

len() cannot be used to determine the value of $ in the context of
multiple dimensions.

> My initial post considered re-interpreting tuple arguments, but
> I abandoned that alternative after Steven Bethard pointed out
> how much code it would break. Modules/classes would remain free
> to interpret tuple arguments in any way they wish. I don't think
> my proposal breaks any sane existing code.

What it does do is provide a second way to do indexing from the end that
can't be extended to multiple dimensions.

> Going forward, I would advocate that user classes which
> implement their own kind of subscripting adopt the '$' syntax,
> and interpret it as consistently as possible.

How? You haven't proposed how an object gets the information that
$-syntax is being used. You've proposed a syntax and some semantics; you
also need to flesh out the pragmatics.

> For example, they
> could respond to __len__() by returning a type that supports the
> "Emulating numeric types" methods from the Python Language
> Reference 3.3.7, and also allows the class's methods to tell
> that it stands for the length of the dimension in question.

I have serious doubts about __len__() returning anything but a bona-fide
integer. We shouldn't need to use incredible hacks like that to support
a core language feature.

 

phil hunt

> Specifically, to support new-style slicing, a class that
> accepts index or slice arguments to any of:
>
> __getitem__
> __setitem__
> __delitem__
> __getslice__
> __setslice__
> __delslice__
>
> must also consistently implement:
>
> __len__
>
> Sane programmers already follow this rule.


Wouldn't it be more sensible to have an abstract IndexedCollection
superclass, which implements all the slicing stuff, then when someone
writes their own collection class they just have to implement
__len__ and __getitem__ and slicing works automatically?
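
A rough Python 3 sketch of such a superclass is below. IndexedCollection is not an existing class, and the sketch departs from the suggestion in one respect: subclasses supply a get_item(i) hook rather than __getitem__, so that the base class can own __getitem__ and handle slices itself.

```python
class IndexedCollection:
    """Hypothetical base class: subclasses supply __len__ and get_item(i)."""

    def __getitem__(self, key):
        if isinstance(key, slice):
            # slice.indices() normalizes start/stop/step for this length;
            # range() then yields exactly the positions to fetch.
            return [self.get_item(i) for i in range(*key.indices(len(self)))]
        if key < 0:
            key += len(self)  # keep the usual from-the-end convention
        if not 0 <= key < len(self):
            raise IndexError("index out of range")
        return self.get_item(key)


class Squares(IndexedCollection):
    """Example subclass: the first n square numbers, computed on demand."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def get_item(self, i):
        return i * i


sq = Squares(10)
print(sq[2:5])   # [4, 9, 16]
print(sq[::-2])  # [81, 49, 25, 9, 1]
print(sq[-1])    # 81
```

Iterating over range(*key.indices(len(self))) also sidesteps the -1 stop-value problem from the draft PEP, since the positions are never fed back into a slice.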
 

Steve Holden

Antoon said:
> I didn't know it was inappropriate to mix certain parts. Can you
> give a list of modules in the standard library I shouldn't mix?

> The type of both data is the same, it is a string-index pair in
> both cases. The problem is that a module from the standard lib
> uses a certain value to indicate an illegal index, that has
> a very legal value in python in general.

Since you are clearly feeling pedantic enough to beat this one to death
with a 2 x 4 please let me substitute "usages" for "types".

In the case of a find() result -1 *isn't* a string index, it's a failure
flag. Which is precisely why it should be filtered out of any set of
indexes. Once it's been inserted it can no longer be distinguished as a
failure indication.

> It is not about what people do. If this was about someone implementing
> find himself and using -1 as an illegal index, I would certainly agree
> that it was inadvisable to do so. Yet when this is what python with
> its library offers the programmer, you seem reluctant to find fault with
> it.

I've already admitted that the choice of -1 as a return value wasn't
smart. However you appear to be saying that it's sensible to mix return
values from find() with general-case index values. I'm saying that you
should do so only with caution. The fact that the naive user will often
not have the wisdom to apply such caution is what makes a change desirable.

> Yet this is what python does. Using -1 variously as an error flag and
> a valid index and when people complain about that, you say it sounds like
> whining.

What I am trying to say is that this doesn't make sense: if you want to
combine find() results with general-case indexes (i.e. both positive and
negative index values) it behooves you to strip out the -1's before you
do so. Any other behaviour is asking for trouble.

regards
Steve
 

Bengt Richter

> Specification
>
> We propose a new style of slicing and indexing for Python
> sequences. Instead of:
>
> sequence[start : stop : step]
>
> new-style slicing uses the syntax:
>
> sequence[start ; stop ; step]

I don't mind the semantics, but I don't like the semicolons ;-)

What about if when brackets trail as if attributes, it means
your-style slicing written with colons instead of semicolons?

sequence.[start : stop : step]

I think that would just be a tweak on the trailer syntax.
I just really dislike the semicolons ;-)

Regards,
Bengt Richter
 

Paul Rubin

> What about if when brackets trail as if attributes, it means
> your-style slicing written with colons instead of semicolons?
>
> sequence.[start : stop : step]

This is nice. It gets rid of the whole $1,$2,etc syntax as well.
 
