Tuples and immutability


Steven D'Aprano

I used to do essentially this, but it was time-prohibitive and produced
harder-to-read graphs - harder to read because the enormous values of
the bad trees were dwarfing the values of the good trees.

A log graph may be the solution to that: graph the log of the time rather
than time itself.
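
For instance, a throwaway sketch with matplotlib (the timing numbers here
are made-up stand-ins, purely to show the effect of the log scale):

import matplotlib.pyplot as plt

sizes = [10**k for k in range(1, 7)]
good = [n * 2e-6 for n in sizes]     # stand-in for O(n log n)-ish timings
bad = [n * n * 1e-9 for n in sizes]  # stand-in for O(n**2)-ish timings

plt.plot(sizes, good, label="balanced tree")
plt.plot(sizes, bad, label="unbalanced tree, sorted input")
plt.xscale("log")
plt.yscale("log")  # log scale keeps the fast curve from being squashed flat
plt.xlabel("number of items")
plt.ylabel("time (seconds)")
plt.legend()
plt.show()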
 

Marko Rauhamaa

Steven D'Aprano said:
If you are in a position to randomize the data before storing it in
the tree, an unbalanced binary tree is a solid contender.

Applications that can assume randomly distributed data are exceedingly
rare and even then, you can't easily discount the possibility of
worst-case ordering.

In fact, Dan himself said unbalanced trees performed so badly he
couldn't test them satisfactorily.


Marko
 

Steven D'Aprano

Marko Rauhamaa said:
Applications that can assume randomly distributed data are exceedingly
rare

Please re-read what I wrote. I didn't say "if your data comes to you
fully randomized". I said "if you are in a position to randomize the data
before storing it". In other words, if you actively randomize the data
yourself.
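
A toy sketch of what I mean (my own illustration, not Dan's benchmark
code):

import random

class Node:
    __slots__ = ("key", "left", "right")
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def insert(root, key):
    # plain iterative insert into an unbalanced binary search tree
    if root is None:
        return Node(key)
    node = root
    while True:
        if key < node.key:
            if node.left is None:
                node.left = Node(key)
                return root
            node = node.left
        else:
            if node.right is None:
                node.right = Node(key)
                return root
            node = node.right

keys = list(range(100000))  # sorted input: the degenerate case for a BST
random.shuffle(keys)        # actively randomize before storing
root = None
for key in keys:
    root = insert(root, key)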

and even then, you can't easily discount the possibility of
worst-case ordering.

Of course you can. If you have a million items, then the odds that those
million items happen to be sorted (the worst-case order) are 1 in a
million factorial. That's a rather small number, small enough that we can
discount it from ever happening, not in a million lifetimes of the
Universe.
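
If you want to put a number on it, math.lgamma gives log(n!) without
having to compute the factorial itself:

import math

n = 10**6
digits = math.lgamma(n + 1) / math.log(10)  # log10 of 1000000!
print("1 in 10**%d" % round(digits))
# -> 1 in 10**5565709, give or take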

Of course, you would be right to point out that one would also like to
avoid *nearly* worst-case ordering. Nevertheless, there are Vastly more
orderings that are sufficiently randomized to avoid degenerate performance
than orderings anywhere near the worst case.

Marko Rauhamaa said:
In fact, Dan himself said unbalanced trees performed so badly he
couldn't test them satisfactorily.

You're misrepresenting what Dan said. What he actually said was that
unbalanced trees perform second only to dicts with random data; only with
sorted data do they perform badly. His exact words were:

"For a series of random keys, it would do quite well (probably second only
to dict), but for a series of sequential keys it would take longer than
anyone would reasonably want to wait"
 

Marko Rauhamaa

Steven D'Aprano said:
Please re-read what I wrote. I didn't say "if your data comes to you
fully randomized". I said "if you are in a position to randomize the
data before storing it". In other words, if you actively randomize the
data yourself.

Yes, you could implement a "hash table" that way.


Marko
 

Roy Smith

Steven D'Aprano said:
If you have a million items, then the odds that those
million items happen to be sorted (the worst-case order) are 1 in a
million factorial. That's a rather small number, small enough that we can
discount it from ever happening, not in a million lifetimes of the
Universe.

If the items are coming from some inherently random process, then I
agree. But, not all data sources are random.

Imagine you had a web site, and wanted to store various bits of data
from the stream of input requests. You decided that the data structure
you were going to use was a balanced tree. If your keys were, say, a
hash of the client's IP address, then it's a pretty good guess they're
random. But, what if the keys were the time the request was received?
Those are obviously not random, and using them as keys in a balanced
tree would give you sub-optimal performance.
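
To make the contrast concrete (a toy illustration, nothing to do with
MongoDB's actual API):

import hashlib
import time

def ip_key(client_ip):
    # hashed IPs arrive in effectively random key order
    return hashlib.sha1(client_ip.encode()).hexdigest()

def time_key():
    # arrival times are monotonically increasing: sorted-order inserts
    return time.time()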

This is not a hypothetical scenario. Songza uses MongoDB as our
database. The indexes are balanced trees. One of our biggest
collections has a multi-component key, one of the components being the
request time truncated to the hour. Insertion time into that collection
has an obvious sawtooth shape, with performance degrading as each hour
progresses and jumping back up as the time rolls over to the next hour.
This is due (we believe :)) to the constant rebalancing of the index
trees.

Almost-sorted data sets are very common. For example, here's a pipeline
I run a lot (from memory, could have gotten some detail wrong):

grep pattern logfile | awk '{print $7}' | sed 's/:[0-9][0-9]$//' | sort | uniq -c

Field 7 is the timestamp for when a request came in. What I'm doing
here is counting how many requests of a certain type came in during each
minute of the day. Logging isn't exactly in chronological order, so I
need to sort the times before I count them. But, it's in *almost*
chronological order; a sort which had pathologically bad behavior on
almost sorted data would be unusable here.
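
(A rough Python equivalent, counting into a dict via collections.Counter
so no sort of the data itself is needed at all; the field positions are
assumed to match the pipeline above:)

from collections import Counter

counts = Counter()
with open("logfile") as f:
    for line in f:
        if "pattern" not in line:
            continue
        timestamp = line.split()[6]           # awk's $7
        minute = timestamp.rsplit(":", 1)[0]  # strip the trailing :SS
        counts[minute] += 1

for minute in sorted(counts):
    print(counts[minute], minute)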
 

Ethan Furman

Steven D'Aprano said:
A log graph may be the solution to that: graph the log of the time rather
than time itself.

I don't think that will solve the problem of not wanting to wait three days for the test to finish. ;)
 
