No trees in the stdlib?

S

Stefan Behnel

Hallvard said:
It's purity that they don't return the modified tree/dict/whatever.
They can still return the deleted element and remain pure.

Fair enough.

Stefan
 
P

Paul Rubin

Stefan Behnel said:
I doubt that there are many AVL implementations that do that. Plus, if
deletion doesn't delete, I'd happily consider that a bug.

I've never seen one that DIDN'T do that, but maybe it's just the crowd
I hang out with.

See "Purely Functional Data Structures" by Chris Okasaki for more
detailed descriptions of such methods.
 
A

Aahz

Hash tables (dicts) are useful for many of the same things that trees are
useful for, but they are different data structures with different
strengths and weaknesses, and different uses. To argue that we don't need
trees because we have dicts makes as much sense as arguing that we don't
need dicts because we have lists.

The problem is that trees are like standards: there are so many to choose
from. Each has its own tradeoffs, and because Python dicts and lists can
substitute for many of the traditionals uses of trees, the tradeoffs are
less clear. I think nobody would object to adding trees to the standard
library, but it will certainly require a clear PEP, preferably with a
pointer to an existing PyPI library that has acquired real-world use.

(In particular, WRT the bisect module, although insertion and deletion
are O(N), the constant factor for doing a simple memory move at C speed
swamps bytecode until N gets very large -- and we already have
collections.deque() for some other common use cases.)
 
P

Paul Rubin

(In particular, WRT the bisect module, although insertion and deletion
are O(N), the constant factor for doing a simple memory move at C speed
swamps bytecode until N gets very large -- and we already have
collections.deque() for some other common use cases.)

Again, at least in my case, I'd hope for an immutable structure.

Also, "N very large" is likely to mean no more than a few thousand,
which is small by today's standards. And the C speed difference goes
away completely if the tree implementation is also in C.

Hedgehog Lisp has a good functional AVL tree implementation written in
C, that I might try to wrap with the Python C API sometime. I think
it is LGPL licensed though, not so good for the Python stdlib.
 
J

João Valverde

Aahz said:
The problem is that trees are like standards: there are so many to choose
from. Each has its own tradeoffs, and because Python dicts and lists can
substitute for many of the traditionals uses of trees, the tradeoffs are
less clear. I think nobody would object to adding trees to the standard
library, but it will certainly require a clear PEP, preferably with a
pointer to an existing PyPI library that has acquired real-world use.
Wish I had asked this before this year's GSoC started.

What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)). The particular algorithm to achieve this is a secondary
issue. It's a BST for sure, AVL vs RBT vs something else. It's my fault
for having opened the topic with simply "trees" instead, it would have
avoided this vagueness problem, but I'm genuinely surprised to know
there are no data structures that efficiently support such a common need
in Python. And I really enjoy the using this language.
 
S

Stefan Behnel

João Valverde said:
What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)).
[...]
I'm genuinely surprised to know
there are no data structures that efficiently support such a common need
in Python.

That's because it's simply not that a common need (in the sense that there
isn't a suitable alternative). I know that Trees have their use cases where
they really shine, but in surprisingly many cases a dict, a set, a list or
a combination of them will do just fine. And if you find a case where those
just don't fit, you may need a database anyway.

Stefan
 
A

Aahz

What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)). The particular algorithm to achieve this is a secondary
issue. It's a BST for sure, AVL vs RBT vs something else. It's my fault
for having opened the topic with simply "trees" instead, it would have
avoided this vagueness problem, but I'm genuinely surprised to know
there are no data structures that efficiently support such a common need
in Python. And I really enjoy the using this language.

Why AVL/RBT instead of B*? It's not that simple.... Another problem is
that unless the tree is coded in C, the constant factors are going to
swamp algorithmic complexity for many use cases -- that in turn makes it
more difficult to deploy a PyPI library for real-world testing.

Anyway, I'm *not* trying to discourage you, just explain some of the
roadblocks to acceptance that likely are why it hasn't already happened.

If you're serious about pushing this through, you have two options:

* Write the code and corresponding PEP yourself (which leads to the
second option, anyway)

* Lobby on the python-ideas mailing list
 
C

Carl Banks

It's purity that they don't return the modified tree/dict/whatever.
They can still return the deleted element and remain pure.

Correct, dict.pop does exactly this.


Carl Banks
 
R

Raymond Hettinger

[Tom Reed]
Why no trees in the standard library, if not as a built in?

The sqlite3 module is built on a binary-tree structure.
It can be used with persistent data or kept in-memory.
The gdbm module has similar flexibility (see the F mode).
FWIW, there are some ASPN implementing various types of
trees (red-black, pairing heaps, etc).


Raymond
 
R

Raymond Hettinger

[João Valverde]
What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)).

FWIW, Py3.1 has an OrderedDict() that preserves insertion order.
It has O(1) lookup, deletion, insertion, and popping; and O(n)
iteration. The ASPN Cookbook has equivalent code that runs on
earlier versions of Python.

in Python. And I really enjoy the using this language.

Am glad you like it.


Raymond
 
G

greg

João Valverde said:
What's lacking is an associative array that preserves ordering, doesn't
require a hash function and has fast insertions and deletions in
O(log(n)).

Careful here -- you can't get away from the need for
hashability just by using a tree. Even if you don't
need to actually hash the values, it's still important
that the criterion for ordering between objects doesn't
change while they're in the tree, otherwise they'll
be in the wrong place and won't be found by subsequent
lookups.
> I'm genuinely surprised to know
there are no data structures that efficiently support such a common need
in Python.

Is it really all that common? If it truly were common,
there probably *would* be something for it in the
stdlib by now.

What sort of things are you doing that you want such
a structure for? Maybe we can suggest a way of using
the existing data structures to achieve the same
goal.
 
J

João Valverde

greg said:
Careful here -- you can't get away from the need for
hashability just by using a tree. Even if you don't
need to actually hash the values, it's still important
that the criterion for ordering between objects doesn't
change while they're in the tree, otherwise they'll
be in the wrong place and won't be found by subsequent
lookups.

I'm aware :) Technically it's necessary to define a total ordering on
the set of keys.
Is it really all that common? If it truly were common,
there probably *would* be something for it in the
stdlib by now.
Obviously my experience differs, but those were my expectations.
What sort of things are you doing that you want such
a structure for? Maybe we can suggest a way of using
the existing data structures to achieve the same
goal.
To answer the question of what I need the BSTs for, without getting into
too many boring details it is to merge and sort IP blocklists, that is,
large datasets of ranges in the form of (IP address, IP address,
string). Originally I was also serializing them in a binary format (but
no more after a redesign). I kept the "merge and sort" part as a helper
script, but that is considerably simpler to implement.

Please note that I'm happy with my code, it works well. I intended to
implement it in C all along, even before trying Python. The python code
was a side thing for testing/curiosity/fun. It prompted my original
question but that was really about Python and the standard library
itself, and I don't wish to waste anyone's time.

As an anecdotal data point (honestly not trying to raise the "Python is
slow" strawman), I implemented the same algorithm in C and Python, using
pyavl. Round numbers were 4 mins vs 4 seconds, against Python (plus
pyavl). Even considering I'm a worse Python programmer than C
programmer, it's a lot. I know many will probably think I tried to do "C
in Python" but that's not the case, at least I don' t think so. Anyway
like I said, not really relevant to this discussion.
 
C

CTO

Whynotrees in the standard library, if not as a built in? I searched
the archive but couldn't find a relevant discussion. Seems like a
glaring omission considering the batteries included philosophy,
particularly balanced binary search trees.Nointerest,nogood
implementations, something other reason? Seems like a good fit for the
collections module. Can anyone shed some light?

Thanks.

I've written a graph library (trees being rooted connected acyclic
graphs)
called Graphine that you could try. Obviously not a part of the
standard
library, but it (or networkx) will almost certainly do what you're
looking
for. You can find graphine at graphine.org, or networkx at
networkx.lanl.gov.

Geremy Condra
 
J

João Valverde

João Valverde said:
I'm aware :) Technically it's necessary to define a total ordering on
the set of keys.
Obviously my experience differs, but those were my expectations.
To answer the question of what I need the BSTs for, without getting
into too many boring details it is to merge and sort IP blocklists,
that is, large datasets of ranges in the form of (IP address, IP
address, string). Originally I was also serializing them in a binary
format (but no more after a redesign). I kept the "merge and sort"
part as a helper script, but that is considerably simpler to implement.
Crap, this sentence is totally confusing. I meant kept the merge code as
a helper script and moved the rest to C, see next paragraph.
 
J

João Valverde

Aahz said:
Why AVL/RBT instead of B*? It's not that simple.... Another problem is
that unless the tree is coded in C, the constant factors are going to
swamp algorithmic complexity for many use cases -- that in turn makes it
more difficult to deploy a PyPI library for real-world testing.

I wouldn't consider anything other than C for such a module on
efficiency alone, unless it was a prototype of course. But I have little
knowledge about the Python C API.

About B* trees, again not an expert but I don't see how the added
complexity brings any noticeable advantage to implement the associative
array data structure I mentioned. Simple is better.
Anyway, I'm *not* trying to discourage you, just explain some of the
roadblocks to acceptance that likely are why it hasn't already happened.

If you're serious about pushing this through, you have two options:

* Write the code and corresponding PEP yourself (which leads to the
second option, anyway)

* Lobby on the python-ideas mailing list

Currently I don't have a strong need for this. I just believe it would
be a benefit to a language I like a lot. Lobbying isn't my thing. I'd
rather write code, but neither am I the most qualified person for the
job. It would certainly be interesting and fun and challenging in a good
way and a great way to learn some new stuff. But I would definitely need
mentoring or asking some silly questions on the mailing list. Maybe I'll
seriously consider it some other time.
 
S

Stefan Behnel

João Valverde said:
I wouldn't consider anything other than C for such a module on
efficiency alone, unless it was a prototype of course. But I have little
knowledge about the Python C API.

Cython is your true friend, if only for rapid prototyping.

http://cython.org/

Stefan
 
J

João Valverde

João Valverde said:
Currently I don't have a strong need for this. I just believe it would
be a benefit to a language I like a lot. Lobbying isn't my thing. I'd
rather write code, but neither am I the most qualified person for the
job. It would certainly be interesting and fun and challenging in a
good way and a great way to learn some new stuff. But I would
definitely need mentoring or asking some silly questions on the
mailing list. Maybe I'll seriously consider it some other time.
There's also another issue raise by Paul Rubin I wasn't even aware of,
that the LGPL is not suitable for the standard library. Having to write
a complete BST implementation in C is a drag. There are already good C
libraries available.
 
A

alex23

João Valverde said:
Currently I don't have a strong need for this.

And clearly neither has anyone else, hence the absence from the
stdlib. As others have pointed out, there are alternative approaches,
and plenty of recipes on ActiveState, which seem to have scratched
whatever itch there is for the data structure you're advocating.

While Python's motto is "batteries included" I've always felt there
was an implicit "but not the kitchen sink" following it. Just because
something "could" be useful shouldn't be grounds for inclusion. That's
what pypi & the recipes are for. Ideally, little should be created
wholesale for the stdlib, what should be added are the existing 3rd
party modules that have become so ubiquitous that their presence on
any python platform is just expected.
 
M

Miles Kaufmann

João Valverde said:
To answer the question of what I need the BSTs for, without getting
into too many boring details it is to merge and sort IP blocklists,
that is, large datasets of ranges in the form of (IP address, IP
address, string). Originally I was also serializing them in a binary
format (but no more after a redesign). I kept the "merge and sort"
part as a helper script, but that is considerably simpler to
implement.

...

As an anecdotal data point (honestly not trying to raise the "Python
is slow" strawman), I implemented the same algorithm in C and
Python, using pyavl. Round numbers were 4 mins vs 4 seconds, against
Python (plus pyavl). Even considering I'm a worse Python programmer
than C programmer, it's a lot. I know many will probably think I
tried to do "C in Python" but that's not the case, at least I don' t
think so. Anyway like I said, not really relevant to this discussion.

What format were you using to represent the IP addresses? (Is it a
Python class?) And why wouldn't you use a network address/subnet mask
pair to represent block ranges? (It seems like being able to
represent ranges that don't fit into a subnet's 2^n block wouldn't be
that common of an occurrence, and that it might be more useful to make
those ranges easier to manipulate.)

One of the major disadvantages of using a tree container is that
usually multiple comparisons must be done for every tree operation.
When that comparison involves a call into Python bytecode (for custom
cmp/lt methods) the cost can be substantial. Compare that to Python's
hash-based containers, which only need to call comparison methods in
the event of hash collisions (and that's hash collisions, not hash
table bucket collisions, since the containers cache each object's hash
value). I would imagine that tree-based containers would only be
worth using with objects with comparison methods implemented in C.

Not that I'm trying to be an apologist, or reject your arguments; I
can definitely see the use case for a well-implemented, fast tree-
based container for Python. And so much the better if, when you need
one, there was a clear consensus about what package to use (like PIL
for image manipulation--it won't meet every need, and there are others
out there, but it's usually the first recommended), rather than having
to search out and evaluate a dozen different ones.

-Miles
 
J

João Valverde

Miles said:
What format were you using to represent the IP addresses? (Is it a
Python class?) And why wouldn't you use a network address/subnet mask
pair to represent block ranges? (It seems like being able to
represent ranges that don't fit into a subnet's 2^n block wouldn't be
that common of an occurrence, and that it might be more useful to make
those ranges easier to manipulate.)
I was using a bytes subclass. I'm not free to choose CIDR notation,
range boundaries must be arbitrary.
One of the major disadvantages of using a tree container is that
usually multiple comparisons must be done for every tree operation.
When that comparison involves a call into Python bytecode (for custom
cmp/lt methods) the cost can be substantial. Compare that to Python's
hash-based containers, which only need to call comparison methods in
the event of hash collisions (and that's hash collisions, not hash
table bucket collisions, since the containers cache each object's hash
value). I would imagine that tree-based containers would only be
worth using with objects with comparison methods implemented in C.
I would flip your statement and say one of the advantages of using trees
is that they efficiently keep random input sorted. Obviously no
algorithm can do that with single comparisons. And not requiring a hash
function is a desirable quality for non-hashable objects. There's a
world beyond dicts. :)

I profiled the code and indeed the comparisons dominated the execution
time. Trimming the comparison function to the bare minimum, a single
python operation, almost doubled the program's speed.
Not that I'm trying to be an apologist, or reject your arguments; I
can definitely see the use case for a well-implemented, fast
tree-based container for Python. And so much the better if, when you
need one, there was a clear consensus about what package to use (like
PIL for image manipulation--it won't meet every need, and there are
others out there, but it's usually the first recommended), rather than
having to search out and evaluate a dozen different ones.
Thanks, and I'm not trying to start a religion either. ;)
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,777
Messages
2,569,604
Members
45,217
Latest member
IRMNikole

Latest Threads

Top