Baeza-Yates intersection Java implementation

Discussion in 'Java' started by Sebastian, Dec 1, 2010.

  1. Sebastian

    Sebastian Guest

    I am trying to implement the Baeza-Yates intersection algorithm, but
    need some help.

    This algorithm is among the fastest known algorithms for intersecting
    sorted sets. For small m, Baeza-Yates is O(m * log(n)), while for n =
    alpha * m it is O(n), i. e. it performs particularly well when the
    one set is sufficiently larger than the other.

    E. g., see "Experimental Analysis of a Fast Intersection Algorithm for
    Sorted Sequences" by Ricardo Baeza-Yates and Alejandro Salinger

    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.7899&rep=rep1&type=pdf

    There is a C implementation by Erik Frey, see
    http://fawx.com/
    http://github.com/erikfrey/themas/tree/master/src/set_intersection/

    Unfortunately, I cannot read C very well. Below is my attempt to come
    up with a Java version, which does not work (it generates stack
    overflows, and my attempts to repair that led to other incorrect
    behavior. Also, lowerBound should probably use a binary search).

    Perhaps it would be worthwhile to put together a working version
    and post it somewhere for everyone's benefit.

    -- Sebastian


    import java.util.ArrayList;
    import java.util.List;
    import java.util.Comparator;

    public class Intersection

    /**
    * Finds the intersection of two sorted random access lists.
    * @param <T> type of the list elements
    * @param list1 a random access list of T sorted by cmp
    * @param list2 a random access list of T sorted by cmp
    * @param cmp a Comparator for T's
    * @return a random access list of T sorted by cmp containing the
    intersection of list1 and list2
    */
    public static <T> List<T> intersect( List<T> list1, List<T> list2,
    Comparator<T> cmp )
    {
    int size1 = list1.size();
    int size2 = list2.size();
    int maxSize = Math.max( size1, size2 );
    return baeza_intersect( list1, 0, (size1 - 1), list2, 0, (size2 -
    1), cmp, new ArrayList<T>( maxSize ) );
    }

    private static <T> List<T> baeza_intersect( List<T> list1, int
    begin1, int end1, List<T> list2, int begin2, int end2,
    Comparator<T> cmp,
    List<T> out )
    {
    if( (end1 - begin1) <= (end2 - begin2) ) {
    if( begin1 > end1 ) {
    return out;
    }
    // binary probe (could use interpolation probe if there were a
    // distance measure on T
    int probe1 = begin1 + ((end1 - begin1) / 2);
    int probe2 = lowerBound( list2, begin2, end2, list1.get( probe1
    ), cmp );
    if( probe1 < end1 || probe1 < end2 ) {
    baeza_intersect( list1, begin1, probe1, list2, begin2, probe2,
    cmp, out );
    }
    if( (probe2 != end2) &&(cmp.compare( list1.get( probe1 ),
    list2.get( probe2 ) ) == 0) ) {
    out.add( list2.get( probe2 ) );
    }
    probe1++;
    probe2++;
    return baeza_intersect( list1, probe1, end1, list2, probe2, end2,
    cmp, out );
    }
    else {
    return baeza_intersect( list2, begin2, end2, list1, begin1, end1,
    cmp, out );
    }
    }

    private static <T> int lowerBound( List<T> list, int begin, int end,
    T value, Comparator<T> cmp )
    {
    // should use binary search?
    int i = begin;

    while( (i < end) && (cmp.compare( list.get( i ), value ) < 0) ) {
    i++;
    }

    return i;
    }
    }
    Sebastian, Dec 1, 2010
    #1
    1. Advertising

  2. Sebastian

    Sebastian Guest

    sorry, there's typo in the OP. The condition enclosing the
    first recursive call should read:

    if( probe1 < end1 || probe2 < end2 ) {

    That gets rid of the stack overflow, but it's still incorrect.

    Consider what happens when the only element in the intersection
    should be the last element of list2, and probe2 == end2. That
    case is not covered, but removing the check probe2 != end2 before
    adding an element to the output may lead to the same element being
    included twice.

    -- Sebastian
    Sebastian, Dec 1, 2010
    #2
    1. Advertising

  3. Sebastian

    Tom Anderson Guest

    On Wed, 1 Dec 2010, Sebastian wrote:

    > I am trying to implement the Baeza-Yates intersection algorithm, but
    > need some help.
    >
    > This algorithm is among the fastest known algorithms for intersecting
    > sorted sets. For small m, Baeza-Yates is O(m * log(n)), while for n = alpha *
    > m it is O(n), i. e. it performs particularly well when the
    > one set is sufficiently larger than the other.
    >
    > E. g., see "Experimental Analysis of a Fast Intersection Algorithm for Sorted
    > Sequences" by Ricardo Baeza-Yates and Alejandro Salinger
    >
    > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.91.7899&rep=rep1&type=pdf
    >
    > There is a C implementation by Erik Frey, see
    > http://fawx.com/
    > http://github.com/erikfrey/themas/tree/master/src/set_intersection/
    >
    > Unfortunately, I cannot read C very well. Below is my attempt to come
    > up with a Java version, which does not work (it generates stack
    > overflows, and my attempts to repair that led to other incorrect
    > behavior. Also, lowerBound should probably use a binary search).


    I hadn't come across this algorithm before, so thanks for bringing it to
    my attention.

    Firstly, you may be interested in this paper:

    http://www.siam.org/proceedings/alenex/2007/alx07_007sandersp.pdf

    Which says that although Baeza-Yates is very clever, it actually isn't
    that fast in practice. The algorithm they give which handily beats it is
    very badly explained (they fall into horrible computer science notation to
    explain what is very much a programmer's algorithm), but quite simple. The
    key point, i think, is that if you stick to a fairly simple
    iterate-over-the-lists approach, you can use a modestly clever compressed
    representation which lets you get the result faster than using an
    ingenious but complex algorithm on an uncompressed representation.

    That algorithm is written in terms of intersecting sets of numbers, but i
    think most of what they do can be applied to sets of comparable objects -
    you can't use the bucketed structure they describe, but you can do
    something pretty similar without compression. Or you could use hashcodes,
    and use exactly their algorithm.

    Which is not to say BY isn't an interesting thing to implement anyway, of
    course.

    Secondly, i think you might have better luck by starting with the theory
    and writing your own implementation than by trying to read Mr Frey's code.
    I'm sure his code is very good, but writing is often easier than
    translating. It's certainly easier to know where your head is at while
    doing it.

    Thirdly, i think you might be interested in the subList method of List.
    Specifically, because it will let you write the recursive method in terms
    of whole lists, rather than having to juggle bounds and offsets and so on.
    I imagine some simple but hard to spot mistake in those bounds and offsets
    is behind the infinite loop which is causing your stack overflow.

    Fourthly, i don't have anything specific to say on your implementation -
    i'm afraid to say i don't have time to get into the details of debugging
    someone else's code right now. I urge you to have a go with subList,
    though, and see if it makes things clearer.

    tom

    --
    I have often thought that the questions which we were unable to ask and
    the speeches that we never had a chance to deliver were probably the
    best ones. -- John Wilkinson MP, on Parliament
    Tom Anderson, Dec 1, 2010
    #3
  4. Sebastian

    Tom Anderson Guest

    On Thu, 2 Dec 2010, bugbear wrote:

    > Tom Anderson wrote:
    >
    >> Firstly, you may be interested in this paper:
    >>
    >> http://www.siam.org/proceedings/alenex/2007/alx07_007sandersp.pdf
    >>
    >> Which says that although Baeza-Yates is very clever, it actually isn't
    >> that fast in practice. The algorithm they give which handily beats it is
    >> very badly explained (they fall into horrible computer science notation
    >> to explain what is very much a programmer's algorithm)

    >
    > I'm not I understand your notion of a conflict between "computer science"
    > and "programmer".


    There are more greek letters and subscripts in computer science.

    tom

    --
    Remember when we said there was no future? Well, this is it.
    Tom Anderson, Dec 2, 2010
    #4
  5. "bugbear" <bugbear@trim_papermule.co.uk_trim> escribió en el mensaje
    news:...
    > Tom Anderson wrote:
    >
    >>
    >> Firstly, you may be interested in this paper:
    >>
    >> http://www.siam.org/proceedings/alenex/2007/alx07_007sandersp.pdf
    >>
    >> Which says that although Baeza-Yates is very clever, it actually isn't
    >> that fast in practice. The algorithm they give which handily beats it is
    >> very badly explained (they fall into horrible computer science notation
    >> to explain what is very much a programmer's algorithm)

    >
    > I'm not I understand your notion of a conflict between "computer science"
    > and "programmer".


    It is not much of a conflict.

    In fact, many programmers began their careers as computer scientists.
    Leonardo Azpurua, Dec 3, 2010
    #5
  6. Sebastian

    Tom Anderson Guest

    On Thu, 2 Dec 2010, Patricia Shanahan wrote:

    > On 12/2/2010 1:09 AM, bugbear wrote:
    >> Tom Anderson wrote:
    >>
    >>> Firstly, you may be interested in this paper:
    >>>
    >>> http://www.siam.org/proceedings/alenex/2007/alx07_007sandersp.pdf
    >>>
    >>> Which says that although Baeza-Yates is very clever, it actually isn't
    >>> that fast in practice. The algorithm they give which handily beats it is
    >>> very badly explained (they fall into horrible computer science notation
    >>> to explain what is very much a programmer's algorithm)

    >>
    >> I'm not I understand your notion of a conflict between "computer science"
    >> and "programmer".

    >
    > Maybe the distinction Tom is making is between algorithms that are
    > primarily of interest to computer scientists studying issues such as the
    > minimum computational complexity for a specific problem, and those
    > algorithms that are useful in practical programs.


    Something along those lines. It's more that the algorithm they describe is
    not very intellectually interesting - it boils down to iterating over a
    list of numbers stored in a two-level array using the obvious pair of
    nested loops. There's no deep insight, clever tricks, exploitation of
    mathematical truths, relation to a more general space of problems, etc.
    Just a couple of loops that actually run pretty fast. It's the kind of
    algorithm an experienced programmer who knew a bit about the memory
    hierarchy could write without any background in computer science (i
    consider memory hierarchy to be electronic engineering!).

    FWIW, the data structure and algorithm in the paper is basically:

    // assume all values in the lists are are positive or zero

    // i will not use any bit compression, just store full-size ints
    // this misses some of the point of the algorithm, but it's fiddly, and
    // doesn't add anything to the explanation

    // utility functions

    final int LOW_BITS = 14; // or whatever

    int lowBits(int i) {
    return i & ((1 << LOW_BITS) - 1); // extract the bottom bits
    }

    int highBits(int i) {
    return i >> LOW_BITS;
    }

    // i am using array chopping and changing for simplicity
    // you probably wouldn't want to implement it like this
    // both these functions are horrific

    int[] subarray(int[] array, int start, int end) {
    int[] subarray = new int[end - start];
    System.arraycopy(array, start, subarray, 0, subarray.length);
    return subarray;
    }

    int[] ensureCapacity(int[] array, int size) {
    if (array.length >= size) return array;
    int[] newArray = new int[size];
    System.arraycopy(array, 0, newArray, 0, array.length);
    return newArray;
    }

    // code to build the structure

    int[] buildDeltaChain(int[] sortedNumbers) {
    int[] chain = new int[sortedNumbers.length];
    int prev = 0;
    for (int i = 0; i < sortedNumbers.length; ++i) {
    int cur = lowBits(sortedNumbers);
    n = cur - prev;
    prev = cur;
    }
    }

    int[][] buildTwoLevelTable(int[] sortedNumbers) {
    int start = 0;
    int[][] deltaChains = new int[0][]; // 50% chance this is the wrong syntax, sorry
    while (start < sortedNumbers.length) {
    int highBits = highBits(sortedNumbers[start]);
    int end = start + 1;
    while (highBits(sortedNumbers[end]) == highBits) ++end;
    int[] deltaChain = buildDeltaChain(subarray(sortedNumbers, start, end));
    deltaChains = ensureCapacity(deltaChains, (highBits + 1));
    deltaChains[highBits] = deltaChain;
    start = end;
    }
    }

    // code to intersect another sorted list with the structure

    final int REMOVED = -1; // we will rub out anything not in the structure by overwriting it with this

    public void intersect(int[] sortedNumbers) {
    int currentChainHighBits = -1;
    int[] currentChain = null;
    int currentChainIndex = -1; // always points to the next element to look at
    int currentChainValue = -1;
    for (int i = 0; i < sortedNumbers.length; ++i) {
    int n = sortedNumbers;
    int highBits = highBits(n);
    if (highBits != currentChainHighBits) {
    // move to the head of a new chain
    currentChainHighBits = highBits;
    currentChain = deltaChains[highBits];
    currentChainIndex = 0;
    currentChainValue = currentChain[0];
    }
    int lowBits = lowBits(n);
    while (currentChainValue < lowBits) {
    ++currentChainIndex;
    if (currentChainIndex >= currentChain.length) {
    currentChainValue = Integer.MAX_VALUE; // this will make following queries on this chain finish immediately
    break;
    }
    currentChainValue = currentChainValue + currentChain[currentChainIndex];
    }
    if (currentChainValue != n) sortedNumbers = REMOVED;
    }
    }

    This is off the top of my head without looking at the paper again, so
    there could well be syntax errors, bugs, and other deviances.

    Thinking about it, you probably want to store all the delta chains in one
    big array, as one big delta chain, and then have the upper level of the
    structure be an index on it: for each value of highBits, it would store
    the lowBits value and index in the chain of the first such element.

    Anyway, my point is that if you look through the above code, there's
    nothing terribly clever - just some fairly workmanlike looping over
    arrays, with some delta encoding thrown in. It's a good and effective
    algorithm, but it doesn't really *teach* us anything.

    tom

    --
    I don't know what a ceilidh is and I dread to ask. I'm picturing some kind
    of mechanical contraption with dildos attached to pistons. -- Jonathan M
    Tom Anderson, Dec 4, 2010
    #6
  7. Sebastian

    BGB Guest

    On 12/2/2010 2:09 AM, bugbear wrote:
    > Tom Anderson wrote:
    >
    >>
    >> Firstly, you may be interested in this paper:
    >>
    >> http://www.siam.org/proceedings/alenex/2007/alx07_007sandersp.pdf
    >>
    >> Which says that although Baeza-Yates is very clever, it actually isn't
    >> that fast in practice. The algorithm they give which handily beats it is
    >> very badly explained (they fall into horrible computer science notation
    >> to explain what is very much a programmer's algorithm)

    >
    > I'm not I understand your notion of a conflict between "computer science"
    > and "programmer".
    >


    many "computer science" people are more math-heads than programmers.

    they may know their way around their derivatives and integrals really
    well, but fall on their face when it comes to writing any code in an
    actual programming language.


    many programmers can write piles of code, but fail to have any
    understanding of all this theory, and despite beating together a
    generally working rigid body physics engine, and later bomb hard core in
    Physics and Calculus classes because of an inability to really get their
    heads around set-theory and continuous systems (or much of anything in
    math-land much beyond algebra or pre-algebra...).


    a computer science person will describe type-systems with elaborate
    notations (with so many greek letters), and promote concepts such as the
    Hindley-Milner type system.

    the actual programmer may write a compiler, generally representing types
    as objects containing the relevant pieces of information (base-type
    number, number of array levels, ...), and figure "well, types come from
    the declarations and flow from inwards to outwards".


    a computer scientist may want to eliminate all type declarations since
    they feel they are ugly and obscure the logic of the program or hurt
    generality or similar.

    the programmer may like the type declarations since it makes it more
    obvious which things go in which arguments and what one gets back, and
    because it makes the compiler start more readily spewing error messages
    if they mess up (vs silently accepting the program and doing something
    strange as a result), ...

    actually, the computer scientist may actually end up thinking using
    Haskell to implement software is a good idea, and a programmer may
    conclude (after looking at or trying to use the language) that in fact
    it would be a very bad idea...

    "them be strange folk who hate the nulls and the assignments... them
    rather use them their unions and infinite recursions instead.".

    ....



    or such...
    BGB, Dec 4, 2010
    #7
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. SPG
    Replies:
    15
    Views:
    12,657
    ACE01
    Nov 7, 2006
  2. Replies:
    1
    Views:
    3,367
    Betty
    Apr 11, 2005
  3. Replies:
    6
    Views:
    1,050
  4. Alain Frisch
    Replies:
    3
    Views:
    417
    Richard Tobin
    May 3, 2005
  5. Daniel Pitts

    Almost Useful: Java Type Intersection

    Daniel Pitts, Nov 23, 2007, in forum: Java
    Replies:
    2
    Views:
    604
    Daniel Pitts
    Nov 23, 2007
Loading...

Share This Page