Suggest improvements

M

Michael Angelo Ravera

I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern. In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].

I am concerned about performance and would be grateful for insights
and suggestions.

inline void getMultIncr(int* i, int *mult, int *incr, int size)
{
// i contains a pointer to the (remainder of the) vector
// *mult is a pointer to the number of elements in the current pattern
// *incr is a pointer to the increment in the current pattern
// size contains the number of elements in the (remainder of the)
vector

        int mark;
        int k;

        *mult = 1;
        *incr = 0;

        if (size == 1) return;

//try to find duplicated values
        mark = i[0];
        for (k=1; k < size; k++)
        {
                if (i[k] != mark) break;
                (*mult)++;
        }
        if (*mult > 1 || size == 2) return;

// now look for possible runs
        *incr = i[1] - i[0];
        if (i[2] - i[1] != *incr) return;

        *mult = 3;
        for (k=3; k < size; k++)
        {
                if (i[k] - i[k-1] != *incr) break;
                (*mult)++;
        }
        return;



}- Hide quoted text -

- Show quoted text -


If you are trying to compress
1 2 3 5 8 8 12 6 5 4
into 1, next two +1, 5, 8 twice, 12, 6, next 2 -1 so that it looks
something like:
1 2+1 5 2*8 12 6 2-1

and you only care about duplicated consecutive values, then you can do
the whole calculation in one pass with a difference and run length
counter

int idx_in;
int idx_out;
int diff;
int prev_diff;
int run_len;
int out_array [];
int in_array [];
int in_size; /* Argument */

for (out_array [0] = in_array [0], idx_in = 1, idx_out = 1,
prev_diff = impossible_value, run_len = 0;
idx_in < in_size; idx_in ++)
{
if (!(diff = in_array [idx_in] - in_array [idx_in - 1]) ||
(diff == prev_diff))
run_len ++;
else
{
if (run_len)
{
if (prev_diff)
{
out_array [idx_out ++] = previous_item_runs (run_len)
/* with or without +1 */
out_array [idx_out ++] = prev_diff;
}
else
out_array [idx_out ++] = previous_item_is_dup (run_len);
run_len = 0;
}
out_array [idx_out ++] = in)array [idx_in];
}
prev_diff = diff;
}

There are a couple of subtleties not handled, but it will be hard to
do much better.

The reason for requiring a run of three but only a duplicate of two is
that you probably need three cells to note a run, but only two to note
a duplicate.
 
D

Dann Corbit

If you are trying to compress
1 2 3 5 8 8 12 6 5 4
into 1, next two +1, 5, 8 twice, 12, 6, next 2 -1 so that it looks
something like:
1 2+1 5 2*8 12 6 2-1

and you only care about duplicated consecutive values, then you can do
the whole calculation in one pass with a difference and run length
counter

If he just wants to compress the data, there are more effective methods
than RLL.
 
M

Michael Angelo Ravera

If he just wants to compress the data, there are more effective methods
than RLL.

Oh, I agree, but, if you want to do compression for something like a
real-time video, this is the kind of problem that you are trying to
solve.
 
T

Tim Rentsch

Gus Gassmann said:
I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern. In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].

I am concerned about performance and would be grateful for insights
and suggestions. [snip]

Two suggestions (given the later information that this is being
done to transmit an array as XML):

1. A better perspective might be to focus on what
are the endpoints of the relevant sub-array, and
leave the size of the sub-array as a secondary
(derived) quantity;

2. The larger problem (of transmitting the entire array)
is still small enough so that it might reasonably be
done as just a single function.

As the saying goes, I have a modest example here:

void
transmit_array_values( int values[], unsigned n ){
int delta;
unsigned i, j;

for( i = 0, j = 1; ALWAYS( j-i == 1 ) && j < n; j++ ){
delta = values[j] - values;
while( j < n && values[j] - values[j-1] == delta ) j++;

if( ALWAYS( j-i > 1 ) && delta == 0 ){
repetition( j-i, values );
i = j;

} else if( j-i > 2 ){
progression( j-i, values, delta );
i = j;

} else if( ALWAYS( j-i == 2 ) ){
singleton( values );
i = --j;

}
}

if( i < n ) singleton( values );
}


Notes --

The variable 'i' holds the start of the (next) relevant sub-array,
the variable 'j' holds one past the end (of the same sub-array);
this makes the size be 'j-i' for the non-singleton cases. Since
the singleton case sends only one value, but j-i == 2, we
decrement j to start the next sub-array at the next array element
(ie, right after the single value just transmitted).

The different kinds of subsequences -- duplicates, runs, single
values -- are handled by 'repetition()', 'progression()',
and 'singleton()', respectively. (Yes, clearly these function
names could be better; I thought these names would serve
their purpose well enough.)

The ALWAYS() macro yields the logical value of its argument
expression, but also requires the value to be true. Sort of
a combination of a truth value and an assertion -- in fact a
plausible definition could be

#define ALWAYS(e) (assert(e), 1)

I think it should be easy enough to look at the function
logic and construct a convincing argument that each of the
ALWAYS() conditions given above is in fact always true.
(Disclaimer: assuming no bad effects from integer overflow,
etc, which of course would mean all bets are off.)

At the end of the outer loop there might be one element
left over that hasn't yet been transmitted, which occurs
exactly when 'i < n'. Hence the last statement.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,901
Latest member
Noble71S45

Latest Threads

Top