set and dict iteration

A

Aaron Brady

In simple terms, when you create an immutable object it can contain

only references to pre-existing objects, but in order to create a cycle

you need to make an object refer to another which is created later, so

it's not possible to create a cycle out of immutable objects.



However, using Python's C API it _is_ possible to create such a cycle,

by mutating an otherwise-immutable tuple (see PyTuple_SetItem and

PyTuple_SET_ITEM).

Are there any precedents for storing uncounted references to PyObjects?

One apparent problematic case is creating an iterator to a set, then adding it to the set. However the operation is a modification, and causes the iterator to be removed from the secondary list before the set is examined for collection.

Otherwise, the iterator keeps a counted reference to the set, but the set does not keep a counted reference to the iterator, so the iterator will always be freed first. Therefore, the set's secondary list will be empty when the set is freed.

Concurrent addition and deletion of iterators should be disabled, and the iterators should remove themselves from the set's secondary list before they decrement their references to the set.

Please refresh the earlier diagram; counted references are distinguished separately. Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062 set iterators.png
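For context, current CPython already catches the easy case, where the set's size changes mid-iteration, but a balanced add-plus-remove slips past the check. A quick demonstration (behavior as observed in CPython):

```python
s = set(range(8))
it = iter(s)
next(it)
s.add(100)                  # size changed: detected on the next step
try:
    next(it)
except RuntimeError as e:
    print(e)                # "Set changed size during iteration"

# The loophole: one add plus one remove leaves the size unchanged, so
# the size check passes and the iteration order is simply undefined.
s2 = set(range(8))
it2 = iter(s2)
next(it2)
s2.add(100)
s2.remove(7)
next(it2)                   # no error, despite the modification
```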
 
A

Aaron Brady


The patch for the above is only 40-60 lines. However it introduces two new concepts.

The first is a "linked list", a classic dynamic data structure, first developed in 1955, cf. http://en.wikipedia.org/wiki/Linked_list . Linked lists are absent in Python, including the standard library and CPython implementation, beyond the weak reference mechanism and garbage collector. The "collections.deque" structure shares some of the linked list interface but uses arrays.

The second is "uncounted references". The uncounted references are references to "set iterators" exclusively, exist only internally to "set" objects, and are invisible to the rest of the program. The reason for the exception is that iterators are unique in the Python Data Model; iterators consist of a single immutable reference, unlike both immutable types such as strings and numbers, as well as container types. Counted references could be used instead, but would be consistently wasted work for the garbage collector, though the benefit to programmers' peace of mind could be significant.

Please share your opinion! Do you agree that the internal list resolves the inconsistency? Do you agree with the strategy? Do you agree that uncounted references are justified to introduce, or are counted references preferable?
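To make the proposal concrete, here is a rough pure-Python model of the linked-list idea (all names hypothetical; the real patch works at the C level with an intrusive doubly linked list and uncounted links, not a Python list of counted references):

```python
class IterationError(Exception):
    """Models the proposed global exception."""

class TrackedSet:
    """Toy model: the set keeps references to its live iterators and
    marks every one of them invalid whenever it is modified."""
    def __init__(self, items=()):
        self._data = set(items)
        self._iters = []                    # stands in for the linked list

    def _invalidate(self):
        for it in self._iters:              # walk the "linked list"
            it._valid = False
        self._iters.clear()

    def add(self, item):
        if item not in self._data:          # only a real change invalidates
            self._invalidate()
            self._data.add(item)

    def __iter__(self):
        it = _TrackedSetIter(self)
        self._iters.append(it)              # register the new iterator
        return it

class _TrackedSetIter:
    def __init__(self, owner):
        self._valid = True
        self._it = iter(set(owner._data))   # snapshot, to keep the toy runnable
    def __iter__(self):
        return self
    def __next__(self):
        if not self._valid:                 # checked before every step
            raise IterationError("set modified during iteration")
        return next(self._it)
```

In this toy version the set holds ordinary counted references to its iterators; in the proposed patch the links are uncounted and each iterator unlinks itself in its destructor, which is exactly the point under debate.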
 
 
S

Steven D'Aprano

On Thu, 23 Aug 2012 09:49:41 -0700, Aaron Brady wrote:

[...]
The patch for the above is only 40-60 lines. However it introduces two
new concepts.

The first is a "linked list", a classic dynamic data structure, first
developed in 1955, cf. http://en.wikipedia.org/wiki/Linked_list .
Linked lists are absent in Python

They certainly are not. There's merely no named "linked list" class.

Linked lists are used by collections.ChainMap, tracebacks, xml.dom,
Abstract Syntax Trees, and probably many other places. (Well, technically
some of these are trees rather than lists.) You can trivially create a
linked list:

x = [a, [b, [c, [d, [e, None]]]]]

is equivalent to a singly-linked list with five nodes. Only less
efficient.
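Unrolled, that nested-list encoding walks exactly like a textbook singly-linked list:

```python
# Each node is a two-element list: [value, next_node]; None terminates.
node = ["a", ["b", ["c", ["d", ["e", None]]]]]
values = []
while node is not None:
    head, node = node      # unpack the value and advance to the next node
    values.append(head)
print(values)              # ['a', 'b', 'c', 'd', 'e']
```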

The second is "uncounted references". The uncounted references are
references to "set iterators" exclusively, exist only internally to
"set" objects, and are invisible to the rest of the program. The reason
for the exception is that iterators are unique in the Python Data Model;
iterators consist of a single immutable reference, unlike both immutable
types such as strings and numbers, as well as container types. Counted
references could be used instead, but would be consistently wasted work
for the garbage collector, though the benefit to programmers' peace of
mind could be significant.

The usual way to implement "uncounted references" is by using weakrefs.
Why invent yet another form of weakref?
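The appeal of weakrefs here is that the tracking container forgets dead iterators on its own, with no explicit unlink step. A small illustration (a hypothetical iterator class; CPython's immediate refcount-based cleanup assumed):

```python
import gc
import weakref

class SetIter:                    # hypothetical stand-in for a set iterator
    pass

live = weakref.WeakSet()          # the set's bookkeeping container
it = SetIter()
live.add(it)
assert len(live) == 1

del it                            # the iterator dies...
gc.collect()                      # (belt and braces on non-CPython runtimes)
assert len(live) == 0             # ...and the WeakSet forgot it automatically
```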
 
A

Aaron Brady

SNIP.

The usual way to implement "uncounted references" is by using weakrefs.
Why invent yet another form of weakref?


Hello S. D'Aprano. Thanks for your support as always.

The semantics of the second collection are equivalent to a WeakSet. The space and time consumption of a WeakSet are higher in comparison to a linked list. However, so long as we iterated over it using the C API instead of creating an iterator, it would be consistent. If we dynamically create the WeakSet on demand and free it when empty, the space consumption would be lower. Typical use cases don't involve creating thousands of iterators, or rapidly creating and destroying them, so the performance impact might not be severe.

Regarding the bare weakrefs, if the iterator's destructor hasn't been called, then the pointer is still valid. If it has been called, then it's not present in the list. Unlike Python classes, the destructors of C extension classes are guaranteed to be called. Therefore there are no points during execution at which a node needs to check whether a reference to its neighbor is valid.

Are your concerns based on data integrity, future maintainability, or what?
 
A

Aaron Brady

Is there a link to the patch?

Please see below. It grew somewhat during development.
SNIP.
This feature is a hard sell as it is; I think that adding uncounted references into the mix is only going to make that worse. May I suggest an alternate approach? Internally tag each set or dict with a "version", which is just a C int. Every time the hash table is modified, increment the version. When an iterator is created, store the current version on the iterator. When the iterator is advanced, check that the iterator version matches the dict/set version. If they're not equal, raise an error.

This should add less overhead than the linked list without any concerns about reference counting. It does introduce a small bug in that an error condition could be "missed", if the version is incremented a multiple of 2**32 or 2**64 times between iterations -- but how often is that really likely to occur? Bearing in mind that this error is meant for debugging and not production error handling, you could even make the version a single byte and I'd still be fine with that.

Cheers,

Ian
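A minimal Python model of the version-tag scheme described above (the snapshot in `__iter__` is only there to keep the toy runnable; the real check would live in the C iterator's `tp_iternext`):

```python
class VersionedSet:
    """Toy model: a version tag, bumped on every mutation, is captured
    by each iterator and re-checked on every step."""
    def __init__(self, items=()):
        self._data = set(items)
        self._version = 0               # stands in for the C int

    def add(self, item):
        if item not in self._data:
            self._data.add(item)
            self._version += 1          # any hash-table change bumps the tag

    def discard(self, item):
        if item in self._data:
            self._data.discard(item)
            self._version += 1

    def __iter__(self):
        snapshot, start = set(self._data), self._version
        def steps():
            for item in snapshot:       # re-check the tag before each step
                if self._version != start:
                    raise RuntimeError("set changed during iteration")
                yield item
        return steps()
```

Unlike the size check CPython performs today, this catches a balanced add-plus-remove, since both operations bump the version.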


Hi Ian,

We could use a Python long object for the version index to prevent overflow. Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any. Using the linked list collection, modification operations are expensive in rare cases. Using the version index, iteration is expensive in rare cases.

I was more interested in the linked list for conceptual reasons, so I developed it further. Changelog, diff file, test suite, and links are below. The devs should be aware that a competing patch might be developed. I would be pleased to hear what everybody thinks of it!

Linked list with uncounted references implementation for Win32.

Added:
- 'set_clear_ex' and 'set_clear_internal_ex' methods, differ in invalidation and conditional invalidation behavior and return type. The 'set.clear()' method and 'tp_clear' type field both called the same method.
- 'set_invalidate_iter_linked' method. Iterate over the iterators of a set, mark them invalid, and clear the list.
- 'setiter_unlink_internal' method. Remove the iterator from the set's linked list of iterators.
- 'IterationError', global.
- New fields:
-- PySetObject: setiterobject *iter_linked. Pointer to the first element of the linked list of the iterators of the set.
-- setiterobject: setiterobject *linked_pred, *linked_succ. Predecessor and successor nodes in the linked list of iterators of the same set.
-- setiterobject: char ob_valid. Validation status of the iterator.
- Result is compared with original in 'set_intersection_update' and '_multi' to determine whether to invalidate the list of iterators. Asymptotic running time is unchanged.
- Pending: add 'tp_clear' field to 'PySetIter_Type'?
- Test script included, 'large numbers' test pending.

6 files changed: { setobject.h, setobject.c, exceptions.c, pyerrors.h, python3.def, python33stub.def }. Test script 'set_iterator_test.py' new. Linked list interface and pseudocode 'patch_pseudocode.txt'.
Zip file:
http://home.comcast.net/~castironpi-misc/clpy-0062-set_iterator_patch.zip
Diff file of 3.3.0b2:
http://home.comcast.net/~castironpi-misc/clpy-0062-set_iterator_diff.txt
 
 
I

Ian Kelly

We could use a Python long object for the version index to prevent overflow. Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.

We could use a Python long; I just don't think the extra overhead is
justified in a data structure that is already highly optimized for
speed. Incrementing and testing a C int is *much* faster than doing
the same with a Python long.
 
A

Aaron Brady

We could use a Python long; I just don't think the extra overhead is

justified in a data structure that is already highly optimized for

speed. Incrementing and testing a C int is *much* faster than doing

the same with a Python long.

I think the technique would require two python longs and a bool in the set, and a python long in the iterator.

One long counts the number of existing (open) iterators. Another counts the version. The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.
 
 
D

Dave Angel

I think the technique would require two python longs and a bool in the set, and a python long in the iterator.

One long counts the number of existing (open) iterators. Another counts the version. The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.

I think you're over-engineering the problem. It's a bug if an iterator
is used after some change is made to the set it's iterating over. We
don't need to catch every possible instance of the bug; that's what
testing is for. The point is to "probably" detect it, and for that, all
we need is a counter in the set and a counter in the open iterator.
Whenever changing the set, increment its count. And whenever iterating,
check the two counters. If they don't agree, throw an exception, and
destroy the iterator. I suppose that could be done with a flag, but it
could just as easily be done by zeroing the pointer to the set.

I'd figure a byte or two would be good enough for the counts, but a C
uint would be somewhat faster, and wouldn't wrap as quickly.
 
S

Serhiy Storchaka

May I
suggest an alternate approach? Internally tag each set or dict with a
"version", which is just a C int. Every time the hash table is
modified, increment the version. When an iterator is created, store
the current version on the iterator. When the iterator is advanced,
check that the iterator version matches the dict/set version. If
they're not equal, raise an error.

Oh, I'm surprised that this strategy is not used yet. I was sure that it
is used.
 
A

Aaron Brady

SNIP.

Hi D. Angel,

The serial index constantly reminds me of upper limits. I have the same problem with PHP arrays, though it's not a problem with the language itself.

The linked list doesn't have a counter, it invalidates iterators when a modification is made, therefore it's the "correct" structure in my interpretation. But it does seem more precarious comparatively, IMHO.

Both strategies solve the problem I posed originally, they both involve trade-offs, and it's too late to include either in 3.3.0.
 
 
S

Steven D'Aprano

I think the technique would require two python longs and a bool in the
set, and a python long in the iterator.

One long counts the number of existing (open) iterators. Another counts
the version. The bool keeps track of whether an iterator has been
created since the last modification, in which case the next modification
requires incrementing the version and resetting the flag.

I think that is over-engineered and could be the difference between
having the patch accepted and having it rejected. After all, many people
will argue that the existing solution to the problem is good enough.

Dicts are extremely common, and your patch increases both the memory
usage of every dict, and the overhead of every write operation
(__setitem__, __delitem__, update). Only a very few dicts will actually
need this overhead, for the rest it is waste. It is important to keep
that waste to a minimum or risk having the patch rejected.

An unsigned C int can count up to 4,294,967,295. I propose that you say
that is enough iterators for anyone, and use a single, simple, version
counter in the dict and the iterator. If somebody exceeds that many
iterators to a single dict or set, and the version field overflows by
exactly 2**32 versions, the results are no worse than they are now. You
won't be introducing any new bugs.

Complicating the solution is, in my opinion, unnecessary. Why should
every set and dict carry the cost of incrementing TWO Python longs and a
flag when just a single C int is sufficient for all realistic use-cases?

The opportunity for failure is extremely narrow:

- I must have an iterator over a dict or set;
- and I must have paused the iteration in some way;
- and then I must create exactly 2**32 other iterators to the
same dict or set;
- and at some point modify the dict or set
- and then restart the first iterator

at which point some items returned by the iterator *may* be duplicated or
skipped (depends on the nature of the modifications).
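The wraparound in that failure scenario is plain modular arithmetic. With a hypothetical one-byte version field the miss takes 256 modifications; with a C unsigned int it takes 2**32:

```python
MASK = 0xFF                          # pretend the version is a single byte
start = 5                            # tag captured by a paused iterator
version = (start + 256) & MASK       # 256 modifications later...
assert version == start              # ...the staleness check would pass

# With a 32-bit unsigned version field, the same miss needs 2**32 bumps:
assert (start + 2**32) & 0xFFFFFFFF == start
```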
 
D

Dave Angel

I think that is over-engineered and could be the difference between
having the patch accepted and having it rejected. After all, many people
will argue that the existing solution to the problem is good enough.

Dicts are extremely common, and your patch increases both the memory
usage of every dict, and the overhead of every write operation
(__setitem__, __delitem__, update). Only a very few dicts will actually
need this overhead, for the rest it is waste. It is important to keep
that waste to a minimum or risk having the patch rejected.

An unsigned C int can count up to 4,294,967,295. I propose that you say
that is enough iterators for anyone, and use a single, simple, version
counter in the dict and the iterator. If somebody exceeds that many
iterators to a single dict or set,

I think you have the count confused. It has to be a count of how many
changes have been made to the dict or set, not how many iterators exist.

and the version field overflows by
exactly 2**32 versions, the results are no worse than they are now. You
won't be introducing any new bugs.

Complicating the solution is, in my opinion, unnecessary. Why should
every set and dict carry the cost of incrementing TWO Python longs and a
flag when just a single C int is sufficient for all realistic use-cases?

The opportunity for failure is extremely narrow:

- I must have an iterator over a dict or set;
- and I must have paused the iteration in some way;
- and then I must create exactly 2**32 other iterators to the
same dict or set;
- and at some point modify the dict or set
- and then restart the first iterator

at which point some items returned by the iterator *may* be duplicated or
skipped (depends on the nature of the modifications).

I agree with almost your entire point, except that the place where it
would fail to detect a bug is when somebody has modified the dict
exactly 2**32 times while an iterator is paused. See my own response
of 4:27 PM (my time).
 
S

Steven D'Aprano

On 09/03/2012 09:26 PM, Steven D'Aprano wrote:

I think you have the count confused. it has to be a count of how many
changes have been made to the dict or set, not how many iterators exist.

Oops, yes you are absolutely right. It's a version number, not a count of
iterators.
 
A

Aaron Brady

Oops, yes you are absolutely right. It's a version number, not a count of

iterators.

Hello. We have a number of proposed solutions so far.

1) Collection of iterators
a) Linked list
i) Uncounted references
ii) Counted references
iii) Weak references
b) Weak set
2) Serial index / timestamp
a) No overflow - Python longs
b) Overflow - C ints / shorts / chars
c) Reset index if no iterators left
3) Iterator count
- Raise exception on set modifications, not iteration

Note, "2b" still leaves the possibility of missing a case and letting an error pass silently, as the current behavior does. The rest catch the error 100% of the time.

Anyway, I plan to develop the above patch for the 'dict' class. Would anyone like to take over or help me do it?
 
T

Thomas Rachel

Am 19.08.2012 00:14 schrieb MRAB:
In simple terms, when you create an immutable object it can contain
only references to pre-existing objects, but in order to create a cycle
you need to make an object refer to another which is created later, so
it's not possible to create a cycle out of immutable objects.

Yes, but if I add a list in-between, I can create a refcycle:

a = []
b = (a,)
a.append(b)

So b is a tuple consisting of one list which in turn contains b.

It is not a direct cycle, but an indirect one.

Or would that be detected via the list?


Thomas
 
H

Hans Mulder

Am 19.08.2012 00:14 schrieb MRAB:
In simple terms, when you create an immutable object it can contain
only references to pre-existing objects, but in order to create a cycle
you need to make an object refer to another which is created later, so
it's not possible to create a cycle out of immutable objects.

Yes, but if I add a list in-between, I can create a refcycle:

a = []
b = (a,)
a.append(b)

So b is a tuple consisting of one list which in turn contains b.

It is not a direct cycle, but an indirect one.

It's a cycle and it contains an immutable object, but it's not
a cycle consisting of immutable objects only.

As MRAB was arguing: it is not possible to create a cycle
consisting of immutable objects only (unless you do unspeakable
things at the C level).

-- HansM
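The point above is easy to verify: because the cycle runs through a mutable list, CPython's cycle collector can see it and free it, even though a tuple participates.

```python
import gc

gc.collect()             # clear any pre-existing garbage first
a = []
b = (a,)                 # tuple -> list
a.append(b)              # list -> tuple: an indirect reference cycle
del a, b                 # drop the only outside references
found = gc.collect()     # the collector frees both objects
assert found >= 2        # at least the list and the tuple were unreachable
```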
 
