T

#### Tom Anderson

I'd like to implement a move-to-front coder, and i'm wondering how to do

it - specifically, what data structure to use. This is not a java-specific

question, and i haven't googled yet, so i'm just chucking this out in case

anyone else finds it interesting.

For those who aren't familiar with move-to-front (from here, MTF) coding

(and unless you've delved into the implementation of bzip2 compression,

you probably aren't), it's a scheme for coding a sequence of symbols from

an alphabet as numbers such that higher-order entropy (unevenness in the

position of symbols in the sequence) is converted into first-order entropy

(unevenness in the frequency of symbols). As an example, this string over

the alphabet {A, B, C, D}:

ABABBACDDDCCCDDDCAABBCBA

Becomes this sequence of numbers:

011101230010010012030212

The first string contains six of each of the letters; the second contains

10 zeroes, 8 ones, 4 twos and 3 threes. If you encoded the former with

Huffman coding, you'd have to allocate each symbol two bits, and it would

take 48 bits. You could encode the latter with one bit for 0, two for 1,

and three each for 2 and 3, which would make 40 bits in total. An amazing

saving, i'm sure you'll all agree. It's even better with bigger alphabets!

The way it works is that you put the symbols of the alphabet into a sort

of stack, in some fixed but arbitrary order. Like this:

A

B

C

D

To encode a symbol, you (a) find the depth of the symbol in the stack and

write that out and then (b) move that symbol to the top of the stack. So

for instance, if you had a C to encode, you'd write out a 2, and then have

a stack like this:

C

A

B

D

And encoding the string above looks like this:

ABABBACDDDCCCDDDCAABBCBA <-- the string

AABABBACDDDCCCDDDCAABBCB <-- top

BBABAABACCCDDDCCCDCCAABC the (before)

CCCCCCCBAAAAAAAAAADDCCAA stack (movement)

DDDDDDDDBBBBBBBBBBBBDDDD <-- bottom

011101230010010012030212 <-- the code

I've described this as a stack with a top, but in the literature it's

something which has a front rather than a top, hence the name.

Anyway, that's the classic MTF coder. It's not the most exciting or

effective compression technique in the world, but it's what i need in my

app right now.

Although in my app, things are slightly more complicated. I don't have a

fixed alphabet: i have a sequence of arbitrary symbols. Initially, my set

of possible symbols is empty, and as symbols come through, i learn about

them and add them to the set. I can still apply MTF to this, though, just

with a growing stack. Like this (again with letters - and this time i

think making the string less compressible, but never mind):

AABAABABCABCBDBADDDBEFADFDEBAFCDDBEFDCDDFEBAA

AABAABABCABCBDBADDDBEFADFDEBAFCDDBEFDCDDFEBA

ABBABABCABCBDBAAADBEFADFDEBAFCCDBEFDCCDFEB

ABCAACCDBBBADBEFAAFDEBAFFCDBEFFFCDFE

AACCCCCADBEEEAFDEBAAFCDBEEEECDF

CADBBBBAFDEBBAFCCBBBBBCD

CCCCCCCCCDEEEAAAAAAAAAC

!0!10111!2221!132002!!44213444550454341023450

Where ! means that the symbol isn't currently in the stack.

So, how would you implement an MTF coder?

Oh, and as an added complication, the symbols in my app are arbitrary

objects. They could be literally anything - Strings, Maps, Dates,

Customers, Tomcat instances, etc. And for an object to match an entry in

the stack, they have to be identical (==) rather than just equal.

The obvious way is with a stack (implemented with a list), as described:

public class MoveToFrontCoder<T> {

private List<T> stack = new ArrayList<T>();

/** @return the code for the object, or -1 if it's new */

public int encode(T obj) {

stack.add(0, obj);

Iterator<T> it = stack.iterator();

it.next(); // discard the one we just added

int idx = 0;

while (it.hasNext()) {

if (it.next() == obj) {

it.remove();

return idx;

}

++idx;

}

return -1;

}

}

Note that i can't just use indexOf because that's defined on top of

equals, not ==. Hence the iterator.

That's simple and works, but its average-case behaviour is O(n). Of

course, the whole point of MTF coding is that the normal case isn't

average (IYSWIM), so you'd hope it would be nearer O(1) in practice.

But is there a way to do it that has a better worse-case (if not

worst-case) behaviour?

I could use a LinkedList in the above structure, and it might be faster,

but it's still O(n).

Could i use an IdentityHashMap from objects to Integers, holding the

position of that object in the stack? But then how do i do updates?

My gut tells me that i could use an IdentityHashMap in which the values

were links in a linked list, which i could then rearrange easily, but i

can't see how i could compute stack positions in O(1) that way. And my gut

mostly does perl, so its advice is not usually worth bothering with.

A priority queue of some sort? Like a heap? But finding the index of an

object in a heap is O(n). Or possibly more - i think it's equivalent to

sorting, so O(n log n).

An IdentityHashMap pointing to nodes in an enfilade? I don't actually

understand those well enough to be sure that they're applicable, but i

think they might be, and Ted Nelson invented them, so they're cool.

Right, time to sleep on it, i think.

tom