No trees in the stdlib?

L

Lie Ryan

João Valverde said:
Propose such alternative then. There are none that offer the same
performance. At best they're workarounds.

I don't care about recipes. That's called research.

If people don't find it to be useful, that's fine. Surprising, but fine.

Python devs, based on my observation, tend to choose a data structure
based on the interface and not its implementation. Binary Sorted Tree is
an implementation, its interface can be a sorted dict (sorted key-value
mapping) or a list (not the most natural interface for a tree).

Basically, python already have all the common interfaces, i.e. list,
set, and mapping/dict.

Let's see it like this. In how many ways can a list be implemented?
- Array (python's builtin list)
- Linked List
- Binary Tree
- and the list goes on...

All of them can expose their interface as list, but only array
implementation is available as builtin.
 
J

João Valverde

Paul said:
Again, at least in my case, I'd hope for an immutable structure.
Could you clarify what you mean by immutable? As in... not mutable? As
in without supporting insertions and deletions? That's has the same
performance as using binary search on a sorted list. What's the point of
using a tree for that?
 
P

Paul Rubin

João Valverde said:
Could you clarify what you mean by immutable? As in... not mutable? As
in without supporting insertions and deletions?

Correct.
That's has the same performance as using binary search on a sorted
list. What's the point of using a tree for that?

The idea is you can accomplish the equivalent of insertion or deletion
by allocating a new root, along with the path down to the place you
want to insert, i.e. O(log n) operations. So instead of mutating an
existing tree, you create a new tree that shares most of its structure
with the old tree, and switch over to using the new tree. This
trivially lets you maintain snapshots of old versions of the tree,
implement an "undo" operation, have a background thread do a complex
operation on a snapshot while the foreground thread does any number of
update-and-replace operations, etc.

This is very standard stuff. See:

http://en.wikipedia.org/wiki/Persistent_data_structure

The wikipedia article on AVL trees makes it pretty obvious how an
implementation would work.
 
J

João Valverde

Paul said:
Correct.



The idea is you can accomplish the equivalent of insertion or deletion
by allocating a new root, along with the path down to the place you
want to insert, i.e. O(log n) operations. So instead of mutating an
existing tree, you create a new tree that shares most of its structure
with the old tree, and switch over to using the new tree. This
trivially lets you maintain snapshots of old versions of the tree,
implement an "undo" operation, have a background thread do a complex
operation on a snapshot while the foreground thread does any number of
update-and-replace operations, etc.
Interesting, thanks. The concept is not difficult to understand but I'm
not sure it would be preferable. A copy operation should have the same
cost as a "snapshot", undo is kind of redundant and multithreading...
don't see a compelling use that would justify it. Also the interface is
a mapping so it'd be rather nice to emulate dict's.

Have you considered how the syntax would work in Python by the way? This:

new_tree = old_tree.insert(object)

Just looks wrong. The interface should support non functional idioms too.
 
J

João Valverde

João Valverde said:
Interesting, thanks. The concept is not difficult to understand but
I'm not sure it would be preferable. A copy operation should have the
same cost as a "snapshot", undo is kind of redundant and
multithreading... don't see a compelling use that would justify it.
Also the interface is a mapping so it'd be rather nice to emulate dict's.

Have you considered how the syntax would work in Python by the way? This:

new_tree = old_tree.insert(object)

Heh, that's a poor example for a mapping. But:

bst[key] = object

is even dicier for immutable structures no?
 
P

Paul Rubin

João Valverde said:
Interesting, thanks. The concept is not difficult to understand but
I'm not sure it would be preferable. A copy operation should have the
same cost as a "snapshot",

You mean a deep-copy? That is unnecessarily expensive; with a
functional structure you can snapshot (or copy) by copying a single
pointer.
undo is kind of redundant and multithreading... don't see a
compelling use that would justify it.

Here is one:
http://groups.google.com/group/comp.lang.python/msg/1fbe66701e4bc65b
Have you considered how the syntax would work in Python by the way? This:
new_tree = old_tree.insert(object)
Just looks wrong.

It looks fine to me. Obviously you could support a wrapper with
a mutating slot that holds a pointer to the tree.
 
J

João Valverde

Paul said:
You mean a deep-copy? That is unnecessarily expensive; with a
functional structure you can snapshot (or copy) by copying a single
pointer.

Shallow copy...

I just skimmed that but if someone really needs multithreading for such
intensive processing without wanting a database, fair enough I guess.
It looks fine to me. Obviously you could support a wrapper with
a mutating slot that holds a pointer to the tree.
I didn't get the last part, sorry. But I think you'd have a lot of users
annoyed that the interface is similar to a list yet their objects
mysteriously disappear. To me, tree.insert() implies mutability but I
defer that to others like yourself with more experience in Python than me.
 
N

Nobody

Correct.


The idea is you can accomplish the equivalent of insertion or deletion
by allocating a new root, along with the path down to the place you
want to insert, i.e. O(log n) operations. So instead of mutating an
existing tree, you create a new tree that shares most of its structure
with the old tree, and switch over to using the new tree.

The main issue here is that you need to be a bit smarter when it comes to
"modifying" the tree.

If you want to insert, delete or replace multiple elements, using repeated
insert()s (etc) on the root is sub-optimal, as you will end up repeatedly
duplicating the upper levels. Ideally you want to provide operations which
will add/remove/replace multiple elements in a single traversal.
 
T

Tim Wintle

To answer the question of what I need the BSTs for, without getting
into too many boring details it is to merge and sort IP blocklists,
that is, large datasets of ranges in the form of (IP address, IP
address, string).

As an anecdotal data point (honestly not trying to raise the "Python
is slow" strawman), I implemented the same algorithm in C and Python,
using pyavl. Round numbers were 4 mins vs 4 seconds, against Python
(plus pyavl).

Out of interest, I recently wrote something similar that imported (a
class of) snort rules for blacklisting ip traffic. I could only use the
standard library.

I ended up writing a simple tree using dict-like objects [1]. Afraid I
haven't got a taught CS background to know the name of the structure.

(note insertion wasn't the focus, and I didn't bother writing it to
handle updates/merges - this is a quick script I run every now and then,
so I'm sure it could be done better - I just liked having the standard
dict interface for each node)

I only had three levels of branching, using the first octet to branch at
the root node, the second octet to branch as the second node, and the
final two to branch at the third node's depth (since even then that's
normally sparse relative to the first two nodes).

It works well enough for me - I'm IO bound reading in ip addresses from
logs to check against the blacklist, and there is a fair bit of other
processing going on for each line.

(Obviously I converted the ip addresses to integers before doing all
this to avoid hashing strings etc)


[1]
(As rules could be for any subnet I overloaded some of the dict methods
to check against rules on unusual subnets etc. before checking
individual ips in the final part)
 
T

Terry Reedy

Paul said:
The idea is you can accomplish the equivalent of insertion or deletion
by allocating a new root, along with the path down to the place you
want to insert, i.e. O(log n) operations. So instead of mutating an
existing tree, you create a new tree that shares most of its structure
with the old tree, and switch over to using the new tree.

Now I get what your have been talking about over several posts.
Update and mutate are kind of synonymous in my mind, but the above
explains how they can be different.

This
trivially lets you maintain snapshots of old versions of the tree,
implement an "undo" operation, have a background thread do a complex
operation on a snapshot while the foreground thread does any number of
update-and-replace operations, etc.

This is very standard stuff.

Now for someone raised on arrays, iteration, and procedural programming ;-).

Reading it now.
 
J

João Valverde

João Valverde said:
Shallow copy...


I just skimmed that but if someone really needs multithreading for
such intensive processing without wanting a database, fair enough I
guess.

I didn't get the last part, sorry. But I think you'd have a lot of
users annoyed that the interface is similar to a list yet their
objects mysteriously disappear. To me, tree.insert() implies
mutability but I defer that to others like yourself with more
experience in Python than me.

Rereading this I got what you meant by "wrapper with mutating slot".
But that is (like I think you implied) functionally equivalent to a
mutating data structure, with worse performance because of additional
memory allocation and such. Is it faster to rebalance the tree with a
persistent data structure?
 
J

João Valverde

alex23 said:
And clearly neither has anyone else, hence the absence from the
stdlib. As others have pointed out, there are alternative approaches,
and plenty of recipes on ActiveState, which seem to have scratched
whatever itch there is for the data structure you're advocating.

I can't resist quoting Sedgewick here. Then I'll shut up about it.

http://www.cs.princeton.edu/~rs/talks/LLRB/LLRB.pdf said:
Abstract
The red-black tree model for implementing balanced search trees,
introduced by Guibas and Sedge-
wick thirty years ago, is now found throughout our computational
infrastructure. Red-black trees
are described in standard textbooks and are the underlying data
structure for symbol-table imple-
mentations within C++, Java, Python, BSD Unix, and many other modern
systems.

You'd think so, but no. You should correct him that in Python a balanced
search tree is the useless cross between a dict and a database.
 
P

Paul Rubin

João Valverde said:
Rereading this I got what you meant by "wrapper with mutating slot".
But that is (like I think you implied) functionally equivalent to a
mutating data structure, with worse performance because of additional
memory allocation and such. Is it faster to rebalance the tree with a
persistent data structure?

If you're going to use a self-balancing tree you will have to do some
tree rotations even if the tree is mutable. My guess is that it's
still O(log n) updates, but with a smaller proportionality constant.
 
L

Lawrence D'Oliveiro

João said:
Simple example usage case: Insert string into data structure in sorted
order if it doesn't exist, else retrieve it.

the_set = set( ... )

if str in the_set :
... "retrieval" case ...
else :
the_set.add(str)
#end if

Want sorted order?

sorted(tuple(the_set))

What could be simpler?
 
L

Lawrence D'Oliveiro

Miles said:
With O(n) insertions and removals, though. A decent self-balancing
binary tree will generally do those in O(log n).

For values of n below, say, a million, would you even notice the difference
on modern hardware?
 
L

Lawrence D'Oliveiro

But a dict can't be used to implement a (sorted) table ADT.

for key in sorted(the_dict.keys(), cmp = ... whatever ordering criteria you like ...) :
... do something with the_dict[key] ...
#end for
 
P

Paul Rubin

Lawrence D'Oliveiro said:
For values of n below, say, a million, would you even notice the difference
on modern hardware?

Huh? Yes, of course you'd notice, unless you were doing the operation
just once or something like that.
 
S

Steven D'Aprano

the_set = set( ... )

if str in the_set :
... "retrieval" case ...

Not terribly useful, because there's no way of attaching any data to
retrieve to the key. Apart from the case of interning strings (or other
keys), you'd probably want a dict rather than a set:

if str in the_dict:
return the_dict[str]
else:
the_dict[str] = something_or_other

else :
the_set.add(str)
#end if

Want sorted order?

sorted(tuple(the_set))

What could be simpler?

Dropping the entirely superfluous creation of a tuple:

sorted(the_set)

That's a reasonable approach for small sets of data, but it isn't very
useful if the_set contains 4GB of keys, because you have to duplicate the
keys, then sort them. More effective would be a data structure that was
always sorted. This has the further advantage that it's even simpler than
sorted(the_set):

the_set
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,777
Messages
2,569,604
Members
45,218
Latest member
JolieDenha

Latest Threads

Top