Suggest improvements

G

Gus Gassmann

I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern. In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].

I am concerned about performance and would be grateful for insights
and suggestions.

inline void getMultIncr(int* i, int *mult, int *incr, int size)
{
// i contains a pointer to the (remainder of the) vector
// *mult is a pointer to the number of elements in the current pattern
// *incr is a pointer to the increment in the current pattern
// size contains the number of elements in the (remainder of the)
vector

int mark;
int k;

*mult = 1;
*incr = 0;

if (size == 1) return;

//try to find duplicated values
mark = i[0];
for (k=1; k < size; k++)
{
if (i[k] != mark) break;
(*mult)++;
}
if (*mult > 1 || size == 2) return;

// now look for possible runs
*incr = i[1] - i[0];
if (i[2] - i[1] != *incr) return;

*mult = 3;
for (k=3; k < size; k++)
{
if (i[k] - i[k-1] != *incr) break;
(*mult)++;
}
return;
}
 
I

Ike Naar

I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

A duplicated element is a run with increment 0.
So you only have to look for runs.
 
G

Gene

I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern. In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].

I am concerned about performance and would be grateful for insights
and suggestions.

The code looks reasonable. You might get a few percent better with
some tweaks, but just as likely not. It the vector is truly huge, you
might get a boost from dividing the array into N parts and using N
threads to search in parallel with a N core processor. But it's just
as likely the memory system or other bottleneck will prevent a
performance gain. Stitching the results together at the end will be
messy but not too hard.
 
B

Ben Bacarisse

Gus Gassmann said:
I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern.

Good plan though it may have an impact if your secondary concern about
there being multiple decompositions ever comes to the fore.
In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].

I am concerned about performance and would be grateful for insights
and suggestions.

This is not an expensive function and, at first sight, a fast version
and a slow version will likely be within a few percent of each other.
What algorithm is this going to be a part of where such differences
might be significant?
inline void getMultIncr(int* i, int *mult, int *incr, int size)
{

Start with it not being inline, then measure and see if making it inline
helps. Many modern compilers will be better than you are at deciding if
there is any value in inlining a function call.

I don't like to waste the possibility of a return value. I'd return the
"run length" (*mult in your case). It will probably simplify the code
a little.
// i contains a pointer to the (remainder of the) vector
// *mult is a pointer to the number of elements in the current pattern
// *incr is a pointer to the increment in the current pattern
// size contains the number of elements in the (remainder of the)
vector

int mark;
int k;

*mult = 1;
*incr = 0;

if (size == 1) return;

I might try to generalise the code to cope with size == 0 as well. It
often makes calling a function simpler if there are fewer special cases
to test before the call. With a version that returns the run length,
all you need is

if (size <= 1) return size;
//try to find duplicated values
mark = i[0];
for (k=1; k < size; k++)
{
if (i[k] != mark) break;

A loop with an initial break is often better written by adding the
(negated) test into the loop's condition:

for (k = 1; k < size && i[k] == mark; k++) ...
(*mult)++;
}
if (*mult > 1 || size == 2) return;

The size == 2 part looks wrong to me. Given i = {1, 2}; I'd say that

getMultIncr(i, &length, &inc, 2);

should set length == 2 and inc == 1. Your code will return with length
== 1 and inc == 0. However, I can see that there is scope for ambiguity
here -- you might want to call this two singletons though that seems odd
to me.
// now look for possible runs
*incr = i[1] - i[0];
if (i[2] - i[1] != *incr) return;

*mult = 3;
for (k=3; k < size; k++)
{
if (i[k] - i[k-1] != *incr) break;
(*mult)++;
}
return;
}

I see that there is no way to return a length of 2 with an increment, so
maybe you have decided that 1, 2 is really two singletons.

You can almost certainly reduce the number of special cases here.
Duplicate runs can be detected in the same way that you detect
incrementing sequences.

Looks like a fun function. If I get time I'll write one and we can
compare styles.
 
N

Niklas Holsti

Richard said:
It's good to point that out, but I doubt that that was what he
was talking about. The issue is that the boundary between runs
is ambiguous. Thus, his example can also be partitioned as
(1,2) (3,5) (8,8) (12,6) (5,4)

No, because a "run" should have at least three numbers in it, according
to the OP.

I wonder if it would be better to define "runs" and repetitions as
possibly overlapping at their end-points, so that the sequence

1 2 3 5 7 9 4 9 9 8 7

would be decomposed into the run (1 2 3) with increment +1, the run (3 5
7 9) with increment +2, the singleton 4, the repetition (9 9), and the
run (9 8 7) with increment -1. It seems to me that this
endpoint-overlapping decomposition is unique, if the runs and
repetitions are taken as long as possible, and singletons are used only
when necessary.
 
G

Gus Gassmann

It's good to point that out, but I doubt that that was what he
was talking about.  The issue is that the boundary between runs
is ambiguous.  Thus, his example can also be partitioned as
(1,2) (3,5) (8,8) (12,6) (5,4)
 
G

Gus Gassmann

Gus Gassmann said:
I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)
My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern.

Good plan though it may have an impact if your secondary concern about
there being multiple decompositions ever comes to the fore.
In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].
I am concerned about performance and would be grateful for insights
and suggestions.

This is not an expensive function and, at first sight, a fast version
and a slow version will likely be within a few percent of each other.
What algorithm is this going to be a part of where such differences
might be significant?

I am trying to write an xml file containing several sparse vectors. I
could print out every array element as a singleton, but file size
matters, and the mult/incr is already part of the schema. But
processing speed is important, too. I got a bit freaked out when a
90MB file took over 2 hours to process last night. I don't think of
that as viable.
Start with it not being inline, then measure and see if making it inline
helps.  Many modern compilers will be better than you are at deciding if
there is any value in inlining a function call.

I don't like to waste the possibility of a return value.  I'd return the
"run length" (*mult in your case).  It will probably simplify the code
a little.

My preference is for symmetry in this case, unless theer is a
performance penalty. It has been suggested that I should set the
function signature to
I might try to generalise the code to cope with size == 0 as well.  It
often makes calling a function simpler if there are fewer special cases
to test before the call.  With a version that returns the run length,
all you need is

  if (size <= 1) return size;

That seems like an easy addition, although I didn't think it
necessary.

I embed the function call in a loop of the following kind:

for(int k=0; k<idim;)
{
getMultIncr(&i[k],&mult,&incr,idim-k);
//process the return
k += mult;
}

Still, the performance should not be affected, and a litt;e bit of
bulletproofing never hurt. I'll take your suggestion :)
//try to find duplicated values
   mark = i[0];
   for (k=1; k < size; k++)
   {
           if (i[k] != mark) break;

A loop with an initial break is often better written by adding the
(negated) test into the loop's condition:

  for (k = 1; k < size && i[k] == mark; k++) ...

Now we're getting somewhere. This is new to me and looks interesting.
I'll try that.
           (*mult)++;
   }
   if (*mult > 1 || size == 2) return;

The size == 2 part looks wrong to me.  Given i = {1, 2}; I'd say that

  getMultIncr(i, &length, &inc, 2);

should set length == 2 and inc == 1.  Your code will return with length
== 1 and inc == 0.  However, I can see that there is scope for ambiguity
here -- you might want to call this two singletons though that seems odd
to me.
// now look for possible runs
   *incr = i[1] - i[0];
   if (i[2] - i[1] != *incr) return;
   *mult = 3;
   for (k=3; k < size; k++)
   {
           if (i[k] - i[k-1] != *incr) break;
           (*mult)++;
   }
   return;
}

I see that there is no way to return a length of 2 with an increment, so
maybe you have decided that 1, 2 is really two singletons.

Precisely. For the application I have in mind, a run of length 2 (such
as your 1, 2) is inferior to two singletons. However, I did toy with
the idea of returning a run of length 2 with nonzero increment and
intercepting that outside of the function. It has to be treated
separately, so there is the tradeoff of two calls to the function
versus the added processing. I am not sure which way that would go.
Duplicate runs can be detected in the same way that you detect
incrementing sequences.

Looks like a fun function.  If I get time I'll write one and we can
compare styles.

That would be lovely. Thanks.
 
G

Gus Gassmann

No, because a "run" should have at least three numbers in it, according
to the OP.

I wonder if it would be better to define "runs" and repetitions as
possibly overlapping at their end-points, so that the sequence

    1 2 3 5 7 9 4 9 9 8 7

would be decomposed into the run (1 2 3) with increment +1, the run (3 5
7 9) with increment +2, the singleton 4, the repetition (9 9), and the
run (9 8 7) with increment -1. It seems to me that this
endpoint-overlapping decomposition is unique, if the runs and
repetitions are taken as long as possible, and singletons are used only
when necessary.

The decomposition has to be mutually exclusive. So (1 2 3), (3 5 7 9)
does not work. It must be either (1 2 3), (5 7 9) or 1, 2, (3 5 7 9).
In this case, the former would be better, but it is clear that the
greedy algorithm does not always work. For instance in the sequence

1 2 3 3 3 5 7 9

the decomposition

(1 2 3) (3) (3 5 7 9)

is preferred to

(1 2 3) (3 3) (5 7 9)
 
B

BartC

Gus Gassmann said:
I am trying to write an xml file containing several sparse vectors. I
could print out every array element as a singleton, but file size
matters, and the mult/incr is already part of the schema. But
processing speed is important, too. I got a bit freaked out when a
90MB file took over 2 hours to process last night. I don't think of
that as viable.
I embed the function call in a loop of the following kind:

for(int k=0; k<idim;)
{
getMultIncr(&i[k],&mult,&incr,idim-k);
//process the return
k += mult;
}

You haven't shown any code for the file processing.

Is the 90MB the input or output size? How long does the above loop take,
without any processing of the result? How long does it take without
executing the loop at all (ie. just setting up the data)?

Compressing a 90MB input file (is that text or binary?) shouldn't take two
hours, especially with
a variation of run-length-encoding which you seem to be doing.
 
N

Niklas Holsti

Gus said:
The decomposition has to be mutually exclusive.

Yep, I understood this aim from your original post, but I was just
thinking about how to make the decomposition unique and as informative
as possible -- I thought you might be aiming at some intelligent
analysis of the data.

But I see from another post that your aim is data compression, which
explains the need for mutual exclusion of the runs and repetitions.
Apologies if I injected noise in the discussion
 
G

Gus Gassmann

I am trying to write an xml file containing several sparse vectors. I
could print out every array element as a singleton, but file size
matters, and the mult/incr is already part of the schema. But
processing speed is important, too. I got a bit freaked out when a
90MB file took over 2 hours to process last night. I don't think of
that as viable.
I embed the function call in a loop of the following kind:
for(int k=0; k<idim;)
{
   getMultIncr(&i[k],&mult,&incr,idim-k);
   //process the return
   k += mult;
}

You haven't shown any code for the file processing.

Is the 90MB the input or output size? How long does the above loop take,
without any processing of the result? How long does it take without
executing the loop at all (ie. just setting up the data)?

Compressing a 90MB input file (is that text or binary?) shouldn't take two
hours, especially with
a variation of run-length-encoding which you seem to be doing.
--
Maybe I should explain a bit about that process. The files are xml
files and contain vectors in the following format:

<el>1</el>
<el>2</el>
<el>3</el>
....

The schema allows for mult and incr attributes to express this run
more compactly as

<el mult="3" incr="1">1</el>

etc. It is thought (!) that the compressed version of the files is
smaller (true --- we seem to be saving about 40% of the file space),
faster to transmit (should be true, but I don't know if this is a
bottleneck), and faster to parse (this remains to be established).
Obviously the time it takes to compress the file needs to be factored
in as well, so I want to make sure that this part is as efficient as
possible.

Hope this explains things a bit better.
 
B

BartC

Gus Gassmann said:
I am trying to write an xml file containing several sparse vectors. I
could print out every array element as a singleton, but file size
matters, and the mult/incr is already part of the schema. But
processing speed is important, too. I got a bit freaked out when a
90MB file took over 2 hours to process last night. I don't think of
that as viable.
I embed the function call in a loop of the following kind:
for(int k=0; k<idim;)
{
getMultIncr(&i[k],&mult,&incr,idim-k);
//process the return
k += mult;
}

You haven't shown any code for the file processing.

Is the 90MB the input or output size? How long does the above loop take,
without any processing of the result? How long does it take without
executing the loop at all (ie. just setting up the data)?

Compressing a 90MB input file (is that text or binary?) shouldn't take
two
hours, especially with
a variation of run-length-encoding which you seem to be doing.
--
Maybe I should explain a bit about that process. The files are xml
files and contain vectors in the following format:

<el>1</el>
<el>2</el>
<el>3</el>

So the 90MB refers to the input XML in text format? In that case you'd only
have 7-8 million numbers if they were all single digit, and probably nearer
5 million.

Running your code on *250 million* numbers (randomly initialised with data
with an average run of 5) took between 1 and 2.5 seconds at most on my
machine.

So I'm not sure that's your bottleneck.

Adding all the text processing will slow things down considerably, but it
should still be no more than a minute or so.

So I might look elsewhere for the problems.
...

The schema allows for mult and incr attributes to express this run
more compactly as

<el mult="3" incr="1">1</el>

etc. It is thought (!) that the compressed version of the files is
smaller (true --- we seem to be saving about 40% of the file space),

Doesn't save much on this 3-run example. Changing "mult" to "m", and "incr"
to "i", would save quite a bit too..
 
B

Ben Bacarisse

Gus Gassmann said:
Gus Gassmann said:
I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)
My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern.

Good plan though it may have an impact if your secondary concern about
there being multiple decompositions ever comes to the fore.
In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].
I am concerned about performance and would be grateful for insights
and suggestions.

This is not an expensive function and, at first sight, a fast version
and a slow version will likely be within a few percent of each other.
What algorithm is this going to be a part of where such differences
might be significant?

I am trying to write an xml file containing several sparse vectors. I
could print out every array element as a singleton, but file size
matters, and the mult/incr is already part of the schema. But
processing speed is important, too. I got a bit freaked out when a
90MB file took over 2 hours to process last night. I don't think of
that as viable.

Is there any evidence that this code is responsible for more than a tiny
fraction of those 2 hours? One should be wary of guessing when
performance is involved (measure, always measure) but the algorithm is
linear and simple so I'd image that the IO alone takes way longer than
the run-length scan.

Anyway, measure. If your run-length code is only than a few percent of
the run-time don't try to optimise it. Leave it clear and
self-evident. Shaving 5% off 5% of the run-time is rarely worth while.
My preference is for symmetry in this case, unless theer is a
performance penalty. It has been suggested that I should set the
function signature to
std::pair <int,int> getMultIncr(int* i, int size)
but this is outside of my comfort zone. (Me no speaka da C so good...)

That's C++. You probably need to decide if you're using C or C++ since
C++ has its own preferred way to handle the traversal of sequences.

That would be lovely. Thanks.

Here's how I'd do it. First, I'd write a function that does the
mathematically correct thing and then write a function to take care of
the fact that you permit duplicates of 2 to be reported but you only
want increasing or decreasing runs to be three or more in length:

/* Return the length of the longest prefix of the array 'data'
* that is in arithmetic progression. The difference between
* elements (which may be zero) is return in '*diff'.
*
* Only the first 'n_items' of 'data' will be considered.
*/

size_t run_length(int *data, size_t n_items, int *diff)
{
*diff = 0;
if (n_items < 2)
return n_items;
*diff = data[1] - data[0];
size_t rl = 2;
while (rl < n_items && data[rl] == data[rl - 1] + *diff)
++rl;
return rl;
}

/* The required behaviour seems to be asymmetric. Runs that are
* duplicates can be any length but when diff > 0 only runs of
* three or more should be considered.
*/

size_t run_length_gte3(int *data, size_t n_items, int *diff)
{
size_t rl = run_length(data, n_items, diff);
return rl >= 3 || *diff == 0 ? rl : 1;
}
 
J

James Dow Allen

Looks like a fun function.  If I get time I'll write one and we can
compare styles.

It does look fun. And just the right size for quick style
compares. Let me go first! :)

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen
 
J

James Dow Allen

Looks like a fun function. If I get time I'll write one and we can
compare styles.

It does look fun. And just the right size for quick style
compares. Let me go first! :)

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen
 
J

James

Looks like a fun function. If I get time I'll write one and we can
compare styles.

It does look fun. And just the right size for quick style
compares. Let me go first! :)

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen
 
J

James Dow Allen

Looks like a fun function. If I get time I'll write one and we can
compare styles.

(I apologize if we end up with six copies of this. GGroups
doesn't seem to be working.....)

It does look fun. And just the right size for quick style
compares. Let me go first! :)

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen
 
K

Keith Thompson

Gus Gassmann said:
My preference is for symmetry in this case, unless theer is a
performance penalty. It has been suggested that I should set the
function signature to
std::pair <int,int> getMultIncr(int* i, int size)
but this is outside of my comfort zone. (Me no speaka da C so good...)
[...]

std::pair is C++, not C. If you want to discuss it, the folks in
comp.lang.c++ or clc++.moderated will be glad to help.
 
J

James Dow Allen

Looks like a fun function. If I get time I'll write one and we can
compare styles.

It does look fun. And just the right size for quick style
compares. Let me go first! :)

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen
 
D

Dann Corbit

I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern. In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].

I am concerned about performance and would be grateful for insights
and suggestions.

The code looks reasonable. You might get a few percent better with
some tweaks, but just as likely not. It the vector is truly huge, you
might get a boost from dividing the array into N parts and using N
threads to search in parallel with a N core processor. But it's just
as likely the memory system or other bottleneck will prevent a
performance gain. Stitching the results together at the end will be
messy but not too hard.

I am not totally sure what you are doing, but it looks to me like you
are trying to discover partitions in the data.

Here is what I use. Perhaps it can be useful to you somehow. If it
does not do what you want, the idea of what it is doing is simple enough
to modify for your purposes I guess.

GE() is a function that returns true if first item is greater than or
equal to the second item. An etype is a typedef of the data type.

The dimention of ps should be array size divided by two.

typedef struct tag_par {
size_t start;
size_t end;
char ascending;
} partition;

/*
** The purpose of this routine is to discover partitions in the data.
** This is a two state FSM.
** Ascend = 1, Descend = 0.
**
** This is a vastly improved version of my ugly code.
** The large improvement was from Kang Su Gatlin.
**
*/
int parscan(
etype array[],
unsigned long n,
partition ps[],
long max_par
)
{
unsigned long i;
char direction;
long pcount = 0;

ps[pcount].start = 0;
ps[pcount].ascending = GE(array[1], array[0]);
for (i = 1; i < n; i++) {
direction = GE(array, array[i - 1]);
if (ps[pcount].ascending != direction) {
ps[pcount].end = i - 1;
pcount++;
if (pcount > max_par)
return -max_par;
ps[pcount].start = i;
if (i == n - 1)
ps[pcount].ascending = 1;
else
ps[pcount].ascending = GE(array[i + 1], array);
}
}
ps[pcount].end = n - 1;
pcount++;
return pcount;
}
/*
P.S.
This post is better suited to */
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,906
Latest member
SkinfixSkintag

Latest Threads

Top