List as a dynamic array of increasing-sized arrays

MartinBroadhurst

In my experience, the most common data structures used for lists
supporting addition and removal at the tail, and random access, are
linked lists and dynamic arrays.

* Linked lists require following pointers for traversal (3 lookups per
element, poor cache performance), and have O(n) indexed access.
Adding and removing elements cause many small allocations and
deallocations of memory.
* Dynamic arrays have efficient traversal (1 lookup per element, good
cache performance) and indexed access is O(1).
Adding elements requires reallocation when full, however, which
involves copying all elements.

Instead, I should like to propose a structure consisting of a dynamic
array of increasing-sized arrays:

0 -> 0 1 2 3
1 -> 4 5 6 7
2 -> 8 9 10 11 12 13 14 15
3 -> 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31

* Each array after the first two is twice the size of its predecessor.
This means that the structure has the same growth strategy as most
dynamic arrays (double when full), reducing the number of
allocations.
* As new arrays are appended to the end, no copying is required,
except for the pointers when the containing array is resized.
* The regular increase in the size of the arrays makes indexed access
efficient.

I have placed the source code for an implementation here:
http://www.martinbroadhurst.com/c/list.html

To do:
* Allow specification of a starting capacity.
* The algorithm for finding the block for an index should use log2,
not repeated multiplication and addition.
* The block at the tail should not be freed as soon as it becomes
empty, as this means that a list whose population oscillates around a
power of 2 will repeatedly allocate and deallocate that block.
Instead, the block should be freed when some number of elements from
the next block have been removed, introducing a bit of "hysteresis".

Martin
 
MartinBroadhurst

Nice iterators.

Thanks, Armand. I think iterators are more useful, though, when
they're "polymorphic", and can decouple collection types from
consumers.
I wrote a polymorphic iterator with just a simple "get" function that
wraps data structure-specific implementations:
http://www.martinbroadhurst.com/source/iterator.h.html
http://www.martinbroadhurst.com/source/iterator.c.html
If I find the time I'm going to unify this with the list one,
providing reverse traversal, random access and eof/bof. Reverse
traversal and random-access can be optional for data structures that
don't support them.
 
MartinBroadhurst

You've reinvented obstacks.  Good show.

Thanks, Richard. I've heard of obstacks, but was unsure from the GNU
documentation how they are implemented.

Martin
 
typedef struct {
Dynarray *blocks;
unsigned int lastsize; /* Size of last block */
unsigned int lastcount; /* Number of elements in the last block */
unsigned int count;
} List;

You don't need to keep track of the count of elements.
Knowing lastsize and lastcount allows you to calculate it.

MP
 
Gene

In my experience, the most common data structures used for lists
supporting addition and removal at the tail, and random access, are
linked lists and dynamic arrays.
[snip]
I have placed the source code for an implementation here:
http://www.martinbroadhurst.com/c/list.html

This is pretty close to a VList: http://en.wikipedia.org/wiki/VList
Though they were originally defined as immutable, this was primarily
for thread safety reasons. Mutable VLists work fine.
 
MartinBroadhurst

On 23 Oct, 18:40, (e-mail address removed) (Richard Harter) wrote:

Thanks very much for reading it, Richard.
The link in your post isn't right; it should have been
<http://www.martinbroadhurst.com/c/list.c.html>.

My link was to a page containing the source for the dynarray that
contains the list, as well as the list itself. Apologies if this was
confusing.
Your design is good for
arrays, not so good for linked lists.

Yes, absolutely, it's an "iterable stack". Inserting in the middle
would be O(n) theoretically, and in practice worse than for a dynamic
array since there would be several block shifts rather than just one.

Perhaps I misnamed it. I'm not sure if the word "list" has as precise
a meaning in computer science as "stack" or "queue" do. I tend to
think of a list as an ordered collection that supports, at a minimum,
adding at the tail.

One could, however, *embed* a linked list into the structure, by
making it a collection of linked list nodes rather than void
pointers.

I wrote an implementation of a linked list embedded in an ordinary
dynamic array a while ago, my rationale being that it might provide
better locality of reference and more efficient allocation than one
based on individually allocated nodes.

Embedding a linked list into this structure, however, would remove the
copying problem associated with growing dynamic arrays.
Further, when a linked list is embedded in a dynamic array, the next
and previous pointers need to be integer offsets into the array buffer
rather than pointers, since otherwise they would be invalidated by
reallocation. As the "array-of-arrays" structure doesn't reallocate,
previous and next can be pointers.

Martin
 
MartinBroadhurst

Below are the timings in seconds for inserting 50,000 elements into
various sequential data structures. These are on an Intel Celeron
667MHz with 256MB of RAM, running Ubuntu Server 10.10 with glibc
version 2.10.1.

Doubly-linked list 0.022233
Singly-linked list 0.021964
Doubly-linked list in dynamic array 0.014827
Deque 0.012803
Dynamic array 0.006111
Dynamic array of increasing-sized arrays 0.005775

The dynamic array of increasing-sized arrays was a little over 4%
faster across the range 50,000 to 250,000 elements. Dynamic array was
faster for fewer or more elements than this.

Full result is here:

http://www.martinbroadhurst.com/c/list.html#performance

Martin
 
BartC

MartinBroadhurst said:
Below are the timings in seconds for inserting 50,000 elements into
various sequential data structures.
[snip]

You say "inserting" 50000 elements above. Yet the link says the elements are
added to the tail of each list.

I thought the dynamic array timing seemed pretty good.

Try inserting elements at the head of the list, or in the middle.
 
MartinBroadhurst

You say "inserting" 50000 elements above. Yet the link says the elements are
added to the tail of each list.

My mistake, I meant inserting in the sense of inserting into a stack.
I thought the dynamic array timing seemed pretty good.

I also. I had been led to believe that dynamic arrays would suffer
from excessive copying on reallocation by things like this -

http://www.drdobbs.com/architecture...ionid=KB05E2AGD4CNPQE1GHRSKHWATMY32JVN?pgno=5

- but, at least with glibc, and in that size range, that's not the
case.
Try inserting elements at the head of the list, or in the middle.

As I said in my original post, it's designed for adding at the tail. I
wouldn't expect it to compare favourably with a dynamic array or
linked list for insertion in the middle. My thinking was that it would
be a good data structure when you have an unknown amount of sequential
data and simply want to store it and read it back.

I *would* expect the linked list embedded in an array to be good for
insertion, though.

Martin
 
Seebs

They love to smack down the newbie with their superior *knowledge*, but have
they actually *found out*?
Realloc *hardly ever* copies.

This, like everything else you've posted, is a gross oversimplification.

But wait! I can simplify this further: You're rude, you're arrogant,
and you don't seem to be able to disagree in a way suited to someone
who's gotten to his adult teeth yet.

*plonk*

-s
 
MartinBroadhurst

You've invented a "new" data structure, but you've solved a problem that
doesn't exist!

That's pretty much the conclusion I came to on the basis of the
empirical evidence I have.

I had read the first thread you mention, not the second. Things like
this were my motivation in looking for a better structure than a
dynamic array.
They love to smack down the newbie with their superior *knowledge*, but have
they actually *found out*?

They were probably speaking in general terms, or on the basis of their
experience of particular platforms on which they had seen a problem
with dynamic arrays.
Realloc *hardly ever* copies.

My experiments showed that it copied about 20% of the time, but this
was just glibc. For all I know, glibc may be optimised for geometric
expansion. After all, it's being used for std::vector&lt;T&gt;s in C++.
Other allocators may be optimised for quite different scenarios. It is
also important *which* 20% of the reallocs result in a copy. If it's
the larger ones, this will have a greater impact.
1) If you only want to add at the end, use a dynamic array
2) If you need to do random inserts, use a linked list, but (as you've
discovered), embed it in a dynamic array

Again, that would be a valid conclusion from my results, reiterating
though that they were using one particular allocator.

There are, however, other considerations:

1) If your memory is half-used by a dynamic array, and you need to add
one more element, you will run out of memory. A linked list will
handle this fine.
It's easy to be complacent about memory usage when you're on a
platform with virtual memory, but not all platforms have this.

2) Dynamic arrays use up to twice as much memory as is needed at any
one time. Memory used in linked lists is in a constant ratio with the
elements. If the linked list is an "intrusive" one, where nodes
contain a significant amount of data, this ratio will be much more
favourable.

3) Data in a list can be highly structured, with elements containing
pointers to other elements. These pointers will be invalidated by
reallocation.
You could solve this problem by using the array-of-arrays structure I
began this thread by describing, though.

4) Some people favour the "open-work" interface exposed by linked
lists rather than using a more general container. It is handy to be
able to just have a node, and be able to tack another one on without
needing to have access to the container and go through its interface.

5) Linked lists can "recycle" nodes, by keeping nodes from removed
elements in a chain and using them preferentially when a new node is
required. This will mean less allocation and deallocation in a list
that changes frequently.

Martin
 
MartinBroadhurst

2) If you need to do random inserts, use a linked list, but (as you've
discovered), embed it in a dynamic array

You've reminded me of something I did a while ago.
If you have a linked list embedded in an array and you remove an
element, you can move the last element in the buffer into its place.
This means that:

1. You don't need a free node list
2. You don't need to keep track of the used range of the buffer
because used = count
3. Unordered traversal (like when you just want to call a function on
each element in any order) is just array traversal, because the
elements in the buffer are dense

Linked list embedded in an array with compaction:
http://www.martinbroadhurst.com/art...ist-embedded-in-an-array-with-compaction.html

Martin
 
