What data structure for a move-to-front coder?


T

Tom Anderson

Evening all,

I'd like to implement a move-to-front coder, and i'm wondering how to do
it - specifically, what data structure to use. This is not a java-specific
question, and i haven't googled yet, so i'm just chucking this out in case
anyone else finds it interesting.

For those who aren't familiar with move-to-front (from here, MTF) coding
(and unless you've delved into the implementation of bzip2 compression,
you probably aren't), it's a scheme for coding a sequence of symbols from
an alphabet as numbers such that higher-order entropy (unevenness in the
position of symbols in the sequence) is converted into first-order entropy
(unevenness in the frequency of symbols). As an example, this string over
the alphabet {A, B, C, D}:

ABABBACDDDCCCDDDCAABBCBA

Becomes this sequence of numbers:

011101230010010012030212

The first string contains six of each of the letters; the second contains
10 zeroes, 8 ones, 4 twos and 3 threes. If you encoded the former with
Huffman coding, you'd have to allocate each symbol two bits, and it would
take 48 bits. You could encode the latter with one bit for 0, two for 1,
and three each for 2 and 3, which would make 40 bits in total. An amazing
saving, i'm sure you'll all agree. It's even better with bigger alphabets!

The way it works is that you put the symbols of the alphabet into a sort
of stack, in some fixed but arbitrary order. Like this:

A
B
C
D

To encode a symbol, you (a) find the depth of the symbol in the stack and
write that out and then (b) move that symbol to the top of the stack. So
for instance, if you had a C to encode, you'd write out a 2, and then have
a stack like this:

C
A
B
D

And encoding the string above looks like this:

ABABBACDDDCCCDDDCAABBCBA <-- the string

AABABBACDDDCCCDDDCAABBCB <-- top
BBABAABACCCDDDCCCDCCAABC the (before)
CCCCCCCBAAAAAAAAAADDCCAA stack (movement)
DDDDDDDDBBBBBBBBBBBBDDDD <-- bottom

011101230010010012030212 <-- the code

I've described this as a stack with a top, but in the literature it's
something which has a front rather than a top, hence the name.

Anyway, that's the classic MTF coder. It's not the most exciting or
effective compression technique in the world, but it's what i need in my
app right now.

Although in my app, things are slightly more complicated. I don't have a
fixed alphabet: i have a sequence of arbitrary symbols. Initially, my set
of possible symbols is empty, and as symbols come through, i learn about
them and add them to the set. I can still apply MTF to this, though, just
with a growing stack. Like this (again with letters - and this time i
think making the string less compressible, but never mind):

AABAABABCABCBDBADDDBEFADFDEBAFCDDBEFDCDDFEBAA

AABAABABCABCBDBADDDBEFADFDEBAFCDDBEFDCDDFEBA
ABBABABCABCBDBAAADBEFADFDEBAFCCDBEFDCCDFEB
ABCAACCDBBBADBEFAAFDEBAFFCDBEFFFCDFE
AACCCCCADBEEEAFDEBAAFCDBEEEECDF
CADBBBBAFDEBBAFCCBBBBBCD
CCCCCCCCCDEEEAAAAAAAAAC

!0!10111!2221!132002!!44213444550454341023450

Where ! means that the symbol isn't currently in the stack.

So, how would you implement an MTF coder?

Oh, and as an added complication, the symbols in my app are arbitrary
objects. They could be literally anything - Strings, Maps, Dates,
Customers, Tomcat instances, etc. And for an object to match an entry in
the stack, they have to be identical (==) rather than just equal.

The obvious way is with a stack (implemented with a list), as described:

public class MoveToFrontCoder<T> {
private List<T> stack = new ArrayList<T>();

/** @return the code for the object, or -1 if it's new */
public int encode(T obj) {
stack.add(0, obj);
Iterator<T> it = stack.iterator();
it.next(); // discard the one we just added
int idx = 0;
while (it.hasNext()) {
if (it.next() == obj) {
it.remove();
return idx;
}
++idx;
}
return -1;
}
}

Note that i can't just use indexOf because that's defined on top of
equals, not ==. Hence the iterator.

That's simple and works, but its average-case behaviour is O(n). Of
course, the whole point of MTF coding is that the normal case isn't
average (IYSWIM), so you'd hope it would be nearer O(1) in practice.

But is there a way to do it that has a better worse-case (if not
worst-case) behaviour?

I could use a LinkedList in the above structure, and it might be faster,
but it's still O(n).

Could i use an IdentityHashMap from objects to Integers, holding the
position of that object in the stack? But then how do i do updates?

My gut tells me that i could use an IdentityHashMap in which the values
were links in a linked list, which i could then rearrange easily, but i
can't see how i could compute stack positions in O(1) that way. And my gut
mostly does perl, so its advice is not usually worth bothering with.

A priority queue of some sort? Like a heap? But finding the index of an
object in a heap is O(n). Or possibly more - i think it's equivalent to
sorting, so O(n log n).

An IdentityHashMap pointing to nodes in an enfilade? I don't actually
understand those well enough to be sure that they're applicable, but i
think they might be, and Ted Nelson invented them, so they're cool.

Right, time to sleep on it, i think.

tom
 
Ad

Advertisements

B

blue indigo

On Wed, 11 Mar 2009 00:43:51 +0000, Tom Anderson wrote:

[needs a stack-like data structure with efficient move-to-top and indexing]

The best I can come up with on short notice is something like this:

Map<T,Entry> items = new IdentityHashMap<T,Entry>();
Entry headEntry = null;

public int getPosAndBump(T item) {
Entry entry = items.get(item);
if (entry == null) {
entry = new Entry();
entry.next = headEntry;
headEntry = entry;
items.put(item,entry);
while ((entry = entry.next) != null) {
entry.pos++;
}
return 0; // indicates new
}
int result = entry.pos;
if (result == 1) return result;
entry.pos = 0; // will be incremented to 1 below
entry.prev.next = entry.next;
if (entry.next != null) entry.next.prev = entry.prev;
Entry origEntry = entry;
while (entry.prev != null) {
entry.pos++;
entry = entry.prev;
}
entry.prev = origEntry;
origEntry.next = entry;
return result;
}

private static class Entry {
public int pos = 1; // new entries always go to top
public Entry prev = null;
public Entry next;
}

Untested. Addition of a previously-unseen object is O(n). Bumping of a
seen object is technically O(n), but only the items above it in the stack
are traversed, not the items below it. I can't think off the top of my
head how it can be made O(1); for every item its current position has to
be stored if getting that position is to be O(1), and then when an item is
moved to the top, it and the items above it need their position numbers
permuted.

The internal structure is an IdentityHashMap from objects to "entries",
which basically are nodes in a linked list of integers. IdentityHashMap
uses == for comparison and is used as you indicated that ==-comparison was
a requirement. It has O(1) performance if your items have good hash
functions. The entry splicing is simple. Previously-unseen objects are
O(n) because all of the stack item positions have to be bumped by one.

Here n is the "alphabet" size, the total number of unique objects seen so
far, rather than the number of bump operations done.

To get better performance than this (modulo any needed bug-fixes) would
require a cleverer method for representing positions than storing an
int with each item. I don't know of any, nor can I come up with one on
short notice. Others might chime in with something, and a search of
Wikipedia might be useful. Try looking at a bzip2 implementation's source
code if you can find one, and see if they did anything cleverer.

Most likely they didn't as O(alphabet size) performance isn't terrible for
many typical cases, for which alphabet size typically <= 256.
 
Ad

Advertisements

J

Joshua Cranmer

Tom said:
So, how would you implement an MTF coder?

A linked list would be sufficient, I think. Each update would take O(d)
time, where d is the number of elements you have to traverse.
The obvious way is with a stack (implemented with a list), as described:

public class MoveToFrontCoder<T> {
private List<T> stack = new ArrayList<T>();

I would do LinkedList<T> here instead of ArrayList<T>: you add front and
then remove; in LinkedList, these are both O(1) operations, but they're
both O(N) operations in ArrrayList.
That's simple and works, but its average-case behaviour is O(n). Of
course, the whole point of MTF coding is that the normal case isn't
average (IYSWIM), so you'd hope it would be nearer O(1) in practice.

But is there a way to do it that has a better worse-case (if not
worst-case) behaviour?

O(d) (where d is the average depth) is probably good enough. A hashtable
with some linking of entries, like blue indigo's suggestion, could be
faster, but improving on O(d) in practice would be hard. If the speed is
really so important that you are willing to admit much more complex
code, carry on reading.

The goal to a faster one is to have an "essentially" O(1) time to find
the index of the stack given an entry and an O(1) time to change the
amount. Getting close to this looks extraordinarily difficult; my mind
keeps envisioning some sort of union-find structure, but the updates
keep raising problems.
An IdentityHashMap pointing to nodes in an enfilade? I don't actually
understand those well enough to be sure that they're applicable, but i
think they might be, and Ted Nelson invented them, so they're cool.

Looking at Wikipedia's entry, this seems to be along the lines of what
my brain is trying to envision. It looks like it would be quite complex,
though, so I'm not sure that the probably small improvement in runtime
would be worth the agony you spend trying to debug it.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Top