set and dict iteration

Aaron Brady · Aug 16, 2012

Hello,

I observed an inconsistency in the behavior of 'set' and 'dict' iterators. It is "by design" according to the docs.

'''
http://docs.python.org/dev/library/stdtypes.html#dict-views

iter(dictview). Iterating views while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries.
'''

The 'set' has the same behavior. Iteration might also complete successfully.

The inconsistency is, if we remove an element from a set and add another during iteration, the new element might appear later in the iteration, and might not, depending on the hash code; therefore comparing the size of the set between iterations isn't adequate. Example:

http://home.comcast.net/~castironpi-misc/clpy-0062 set iterators.py

'''
# py: { 'ver': '3' }

set0= set( ( 1, 2 ) )

iter0= iter( set0 )
print( next( iter0 ) )

set0.add( 3 )
set0.remove( 2 )
print( next( iter0 ) )

print( )

set0= set( ( 6, 7 ) )

iter0= iter( set0 )
print( next( iter0 ) )

set0.add( 8 )
set0.remove( 7 )
print( next( iter0 ) )
'''

Output:

'''
1
3

6
Traceback (most recent call last):
File [...] line 22, in <module>
print( next( iter0 ) )
StopIteration
'''

Iteration should behave the same regardless of the contents of the set. Continuing iteration over sets and dicts after a modification isn't defined; it should unconditionally raise an error.

What's going on, is '8' is added before the position of the iterator due tohashing in the second part, but the size doesn't change, so the iterator reaches the end of the set after '7' is removed.

The inconsistency isn't easily solved. One possibility is to use a timestamp or other serial index in the object and iterators, and compare them on every iteration to determine if a modification has occurred.

Another possibility which the author prefers, is to maintain a secondary collection of the iterators of an object, and invalidate them upon modification. The applicable collection structure is a doubly-linked linked list, informally depicted:

http://home.comcast.net/~castironpi-misc/clpy-0062 set iterators.png

Upon modification, the set traverses its iterators, setting an 'invalid' flag on each; and subsequent calls to any of them raise an 'IterationError'. Adding and removing iterators to and from the secondary list is performed in O( 1 ) time with no penalty.

The above example depicted a 'Set'. 'Dicts' have the same anomaly, but thesolution is ambiguous, since dict values can be changed meaningfully without altering the structure of the object. In the author's opinion, the dictshould not raise an 'IterationError' on value changes, only key changes like the set, but the argument isn't conclusive.

Dave Angel · Aug 16, 2012

Hello,

I observed an inconsistency in the behavior of 'set' and 'dict' iterators. It is "by design" according to the docs.

'''
http://docs.python.org/dev/library/stdtypes.html#dict-views

iter(dictview). Iterating views while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries.
'''

The 'set' has the same behavior. Iteration might also complete successfully.

The inconsistency is, if we remove an element from a set and add another during iteration, the new element might appear later in the iteration, and might not, depending on the hash code; therefore comparing the size of the set between iterations isn't adequate. Example:
<SNIP>

Iteration should behave the same regardless of the contents of the set. Continuing iteration over sets and dicts after a modification isn't defined; it should unconditionally raise an error.

Why is it the iterator's job to protect against the user's bug? The
doc is clear enough. If you don't change the collection, you won't have
a problem.

<SNIP more details>.

Everything else is implementation defined. Why should an implementation
be forced to have ANY extra data structure to detect a static bug in the
caller's code?

Paul Rubin · Aug 16, 2012

Dave Angel said:
Everything else is implementation defined. Why should an implementation
be forced to have ANY extra data structure to detect a static bug in the
caller's code?

For the same reason the interpreter checks for type errors at runtime
and raises TypeError, instead of letting the program go into the weeds.

Ian Kelly · Aug 17, 2012

The inconsistency is, if we remove an element from a set and add anotherduring iteration, the new element might appear later in the iteration, andmight not, depending on the hash code; therefore comparing the size of theset between iterations isn't adequate. Example:

Click to expand...

It can be more than just the new element. For example, here the
entire set is repeated (Python 3.2):

s = set(range(8, 13))
it = iter(s)
from itertools import islice
list(islice(it, 5)) # avoid exhausting the iterator [8, 9, 10, 11, 12]
s.add(13)
s.remove(13)
list(it)

Click to expand...

Click to expand...

[8, 9, 10, 11, 12]

This occurs because the addition of the sixth item triggers a resize
of the underlying hash table, and the existing items, which were
originally in slots 0-4, are now in slots 8-12.

Another curious example:

s = set(range(8, 48, 8))
s {8, 16, 40, 24, 32}
it = iter(s)
from itertools import islice
list(islice(it, 4)) [8, 16, 40, 24]
s.add(48)
s.remove(48)
list(it)

Click to expand...

Click to expand...

[8, 16, 40, 24]

Hey, what happened to 32?

Dave Angel · Aug 17, 2012

For the same reason the interpreter checks for type errors at runtime
and raises TypeError, instead of letting the program go into the weeds.

There's an enormous difference between type errors, which affect the low
level dispatch, and checking for whether a dict has changed and may have
invalidated the iterator. If we were really going to keep track of what
iterators are tracking a given dict or set, why stop there? Why not
check if another process has changed a file we're iterating through? Or ...

Ian Kelly · Aug 17, 2012

There's an enormous difference between type errors, which affect the low
level dispatch, and checking for whether a dict has changed and may have
invalidated the iterator. If we were really going to keep track of what
iterators are tracking a given dict or set, why stop there? Why not
check if another process has changed a file we're iterating through? Or ...

How does this affect low-level dispatch (Python 2.7)?
.... def bar(self):
.... return self
....Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: unbound method bar() must be called with Foo instance as
first argument (got object instance instead)

There is no low-level need for this TypeError -- it's purely a case of
not letting the developer shoot himself in the foot. Although to be
honest the interpreter doesn't give quite enough rope (to mix
metaphors) in this case, and I'm glad for the sake of duck typing that
they removed this particular error in Python 3.

With regard to key insertion and deletion while iterating over a dict
or set, though, there is just no good reason to be doing that
(especially as the result is very implementation-specific), and I
wouldn't mind a more complete low-level check against it as long as
it's not too expensive (which is not clearly the case with the current
suggestion at all).

Paul Rubin · Aug 17, 2012

Ian Kelly said:
With regard to key insertion and deletion while iterating over a dict
or set, though, there is just no good reason to be doing that
(especially as the result is very implementation-specific), and I
wouldn't mind a more complete low-level check against it as long as
it's not too expensive (which is not clearly the case with the current
suggestion at all).

One possible approach is to freeze the dictionary against modification
while any iterator is open on it. You could keep a count of active
iterators in the dict structure, adjusting it whenever an iterator is
created or closed/destroyed.

Steven D'Aprano · Aug 17, 2012

There's an enormous difference between type errors, which affect the low
level dispatch, and checking for whether a dict has changed and may have
invalidated the iterator. If we were really going to keep track of what
iterators are tracking a given dict or set, why stop there? Why not
check if another process has changed a file we're iterating through? Or
...

Which is why Python doesn't do it -- because it is (claimed to be)
excessively expensive for the benefit that you would get.

Not because it is a matter of principle that data integrity is
unimportant. Data integrity *is* important, but in the opinion of the
people who wrote these particular data structures, the effort required to
guarantee correct iteration in the face of mutation is too expensive for
the benefit.

Are they right? I don't know. I know that the list sort method goes to a
lot of trouble to prevent code from modifying lists while they are being
sorted. During the sort, the list temporarily appears to be empty to
anything which attempts to access it. So at least sometimes, the Python
developers spend effort to ensure data integrity.

Luckily, Python is open source. If anyone thinks that sets and dicts
should include more code protecting against mutation-during-iteration,
they are more than welcome to come up with a patch. Don't forget unit and
regression tests, and also a set of timing results which show that the
slow-down isn't excessive.

Paul Rubin · Aug 17, 2012

Steven D'Aprano said:
Luckily, Python is open source. If anyone thinks that sets and dicts
should include more code protecting against mutation-during-iteration,
they are more than welcome to come up with a patch. Don't forget unit and
regression tests, and also a set of timing results which show that the
slow-down isn't excessive.

It could be a debugging option, in which case even a fairly significant
slowdown is acceptable.

Ulrich Eckhardt · Aug 17, 2012

Am 17.08.2012 03:01, schrieb Paul Rubin:

One possible approach is to freeze the dictionary against modification
while any iterator is open on it. You could keep a count of active
iterators in the dict structure, adjusting it whenever an iterator is
created or closed/destroyed.

What if there is an iterator left over from a loop that was terminated
early? That could block access to the sequence even though nothing is
/really/ iterating over it.

I personally prefer a reliable error, at least when __debug__ is set.
Someone suggested a timestamp or a list of active iterators, which both
sound reasonable. The two should be O(1) and O(#iterators) in complexity
on all mutating operations and O(1) on iteration, so they should be
acceptable. With a C implementation it probably boils down to very few
cycles (checking a pointer/incrementing an integer). I can't say if this
is feasible without compromising performance though, at the very least
it requires an additional member in all dicts and iterators.

Uli

Aaron Brady · Aug 17, 2012

It can be more than just the new element. For example, here the

Click to expand...

entire set is repeated (Python 3.2):

Click to expand...

[8, 9, 10, 11, 12]

s.add(13)
s.remove(13)
list(it)

Click to expand...

Click to expand...

[8, 9, 10, 11, 12]
This occurs because the addition of the sixth item triggers a resize

Click to expand...

of the underlying hash table, and the existing items, which were

Click to expand...

originally in slots 0-4, are now in slots 8-12.

Click to expand...

Another curious example:

{8, 16, 40, 24, 32}

[8, 16, 40, 24]

[8, 16, 40, 24]

Hey, what happened to 32?

Good examples. The former occurs without the 'islice' as well.

s= set( range( 8, 13 ) )
it= iter( s )
print( [ next( it ) for _ in range( 5 ) ] )
s.add( 13 )
s.remove( 13 )
print( [ next( it ) for _ in range( 5 ) ] )

Output:

[8, 9, 10, 11, 12]
[8, 9, 10, 11, 12]

Aaron Brady · Aug 17, 2012

It can be more than just the new element. For example, here the

Click to expand...

entire set is repeated (Python 3.2):

Click to expand...

[8, 9, 10, 11, 12]

s.add(13)
s.remove(13)
list(it)

Click to expand...

Click to expand...

[8, 9, 10, 11, 12]
This occurs because the addition of the sixth item triggers a resize

Click to expand...

of the underlying hash table, and the existing items, which were

Click to expand...

originally in slots 0-4, are now in slots 8-12.

Click to expand...

Another curious example:

{8, 16, 40, 24, 32}

[8, 16, 40, 24]

[8, 16, 40, 24]

Hey, what happened to 32?

Good examples. The former occurs without the 'islice' as well.

s= set( range( 8, 13 ) )
it= iter( s )
print( [ next( it ) for _ in range( 5 ) ] )
s.add( 13 )
s.remove( 13 )
print( [ next( it ) for _ in range( 5 ) ] )

Output:

[8, 9, 10, 11, 12]
[8, 9, 10, 11, 12]

Aaron Brady · Aug 17, 2012

One possible approach is to freeze the dictionary against modification

while any iterator is open on it. You could keep a count of active

iterators in the dict structure, adjusting it whenever an iterator is

created or closed/destroyed.

Good point. Your approach is another consistent solution.

The difference is in where the exception is raised. Invalidating the iterators raises an exception when they're called. Locking the set/dict raises an exception on 'add' and 'remove' calls.

The latter also forces additional statements to delete iterators before they leave scope in some cases. We wouldn't be able to make modifications in a 'for' suite, even if followed by a 'break', which could be a problem for existing code.

Aaron Brady · Aug 17, 2012

It could be a debugging option, in which case even a fairly significant

slowdown is acceptable.

Another possibility is to use the 'gc.get_referrers' mechanism to obtain the iterators.

import gc
a= set( ( 0, 1, 2 ) )
b= iter( a )
c= iter( a )
d= iter( a )
print( gc.get_referrers( a ) )

Output:

[<set_iterator object at 0x00C0B9E0>, <set_iterator object at 0x00C0BA08>, <set_iterator object at 0x00C0BA30>, [others] ]

This approach wouldn't be as time-efficient as a dedicated secondary structure, due to the other objects which refer to the set, including variable namespaces.

Aaron Brady · Aug 17, 2012

Which is why Python doesn't do it -- because it is (claimed to be)

excessively expensive for the benefit that you would get.

Not because it is a matter of principle that data integrity is

unimportant. Data integrity *is* important, but in the opinion of the

people who wrote these particular data structures, the effort required to

guarantee correct iteration in the face of mutation is too expensive for

the benefit.

Are they right? I don't know. I know that the list sort method goes to a

lot of trouble to prevent code from modifying lists while they are being

sorted. During the sort, the list temporarily appears to be empty to

anything which attempts to access it. So at least sometimes, the Python

developers spend effort to ensure data integrity.

Luckily, Python is open source. If anyone thinks that sets and dicts

should include more code protecting against mutation-during-iteration,

they are more than welcome to come up with a patch. Don't forget unit and

regression tests, and also a set of timing results which show that the

slow-down isn't excessive.

I contribute a patch some time ago. It wasn't accepted. However this thread seems to show a moderately more favorable sentiment than that one.

Is there a problem with hacking on the Beta? Or should I wait for the Release? Does anyone want to help me with the changes? Perhaps P. Rubin could contribute the variation he suggested as well.

Chris Angelico · Aug 17, 2012

Is there a problem with hacking on the Beta?

Nope. Hack on the beta, then when the release arrives, rebase your
work onto it. I doubt that anything of this nature will be changed
between now and then.

ChrisA

Aaron Brady · Aug 18, 2012

Nope. Hack on the beta, then when the release arrives, rebase your

work onto it. I doubt that anything of this nature will be changed

between now and then.

ChrisA

Thanks Chris, your post was encouraging.

I have a question about involving the 'tp_clear' field of the types.

http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear

'''
....The tuple type does not implement a tp_clear function, because it’s possible to prove that no reference cycle can be composed entirely of tuples.
'''

I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient. Nevertheless, the earlier diagram contains anovert homogeneous reference cycle.

Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062 set iterators.png

In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set doesn't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in thelinked list don't need to be included in the iterators' reference counts.

Can someone who is more familiar with the cycle detector and cycle breaker,help prove or disprove the above?

Aaron Brady · Aug 18, 2012

Nope. Hack on the beta, then when the release arrives, rebase your

work onto it. I doubt that anything of this nature will be changed

between now and then.

ChrisA

Thanks Chris, your post was encouraging.

I have a question about involving the 'tp_clear' field of the types.

http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear

'''
....The tuple type does not implement a tp_clear function, because it’s possible to prove that no reference cycle can be composed entirely of tuples.
'''

I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient. Nevertheless, the earlier diagram contains anovert homogeneous reference cycle.

Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062 set iterators.png

In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set doesn't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in thelinked list don't need to be included in the iterators' reference counts.

Can someone who is more familiar with the cycle detector and cycle breaker,help prove or disprove the above?

MRAB · Aug 18, 2012

Thanks Chris, your post was encouraging.

I have a question about involving the 'tp_clear' field of the types.

http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear

'''
...The tuple type does not implement a tp_clear function, because it’s possible to prove that no reference cycle can be composed entirely of tuples.
'''

I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient. Nevertheless, the earlier diagram contains an overt homogeneous reference cycle.

Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062 set iterators.png

In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set doesn't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in the linked list don't need to be included in the iterators' reference counts.

Can someone who is more familiar with the cycle detector and cycle breaker, help prove or disprove the above?

In simple terms, when you create an immutable object it can contain
only references to pre-existing objects, but in order to create a cycle
you need to make an object refer to another which is created later, so
it's not possible to create a cycle out of immutable objects.

However, using Python's C API it _is_ possible to create such a cycle,
by mutating an otherwise-immutable tuple (see PyTuple_SetItem and
PyTuple_SET_ITEM).

Aaron Brady · Aug 19, 2012

In simple terms, when you create an immutable object it can contain

only references to pre-existing objects, but in order to create a cycle

you need to make an object refer to another which is created later, so

it's not possible to create a cycle out of immutable objects.

However, using Python's C API it _is_ possible to create such a cycle,

by mutating an otherwise-immutable tuple (see PyTuple_SetItem and

PyTuple_SET_ITEM).

Are there any precedents for storing uncounted references to PyObject's?

One apparent problematic case is creating an iterator to a set, then addingit to the set. However the operation is a modification, and causes the iterator to be removed from the secondary list before the set is examined forcollection.

Otherwise, the iterator keeps a counted reference to the set, but the set does not keep a counted reference to the iterator, so the iterator will always be freed first. Therefore, the set's secondary list will be empty when the set is freed.

Concurrent addition and deletion of iterators should be disabled, and the iterators should remove themselves from the set's secondary list before theydecrement their references to the set.

Please refresh the earlier diagram; counted references are distinguished separately. Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062 set iterators.png

Modify dict/set during iteration?	5	Oct 30, 2009
set, dict and other structures	8	Jan 31, 2005
MDP Modeling in GAMS	1	Mar 10, 2023
expanding a variable to a dict	8	Apr 3, 2008
generating unique set of dicts from a list of dicts	0	Jan 10, 2012
Builtin classes list, set, dict reimplemented via B-trees	0	Sep 15, 2005
list/tuple to dict...	17	Sep 15, 2004
set update in 2.5	2	Jan 28, 2007

set and dict iteration

Aaron Brady

Dave Angel

Paul Rubin

Ian Kelly

Dave Angel

Ian Kelly

Paul Rubin

Steven D'Aprano

Paul Rubin

Ulrich Eckhardt

Aaron Brady

Aaron Brady

Aaron Brady

Aaron Brady

Aaron Brady

Chris Angelico

Aaron Brady

Aaron Brady

MRAB

Aaron Brady

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads