Suggest improvements

Discussion in 'C Programming' started by Gus Gassmann, Oct 14, 2010.

1. Gus GassmannGuest

I have a large vector i of integers and I need to identify repeated
values and runs (of length at least 3) in it. For instance,
i = 1 2 3 5 8 8 12 6 5 4
has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
singleton 12, and a run of 3 (with increment -1). (The decomposition
might not be unique, but that is a secondary concern.)

My approach has been to write a function to identify the first pattern
only, and to call this function repeatedly starting right after the
last identified pattern. In the above example, I would call the
function successively with
i[0],i[3],i[4],i[6], and i[7].

I am concerned about performance and would be grateful for insights
and suggestions.

inline void getMultIncr(int* i, int *mult, int *incr, int size)
{
// i contains a pointer to the (remainder of the) vector
// *mult is a pointer to the number of elements in the current pattern
// *incr is a pointer to the increment in the current pattern
// size contains the number of elements in the (remainder of the)
vector

int mark;
int k;

*mult = 1;
*incr = 0;

if (size == 1) return;

//try to find duplicated values
mark = i[0];
for (k=1; k < size; k++)
{
if (i[k] != mark) break;
(*mult)++;
}
if (*mult > 1 || size == 2) return;

// now look for possible runs
*incr = i[1] - i[0];
if (i[2] - i[1] != *incr) return;

*mult = 3;
for (k=3; k < size; k++)
{
if (i[k] - i[k-1] != *incr) break;
(*mult)++;
}
return;
}

Gus Gassmann, Oct 14, 2010

2. Ike NaarGuest

On 2010-10-14, Gus Gassmann <> wrote:
> I have a large vector i of integers and I need to identify repeated
> values and runs (of length at least 3) in it. For instance,
> i = 1 2 3 5 8 8 12 6 5 4
> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> singleton 12, and a run of 3 (with increment -1). (The decomposition
> might not be unique, but that is a secondary concern.)

A duplicated element is a run with increment 0.
So you only have to look for runs.

Ike Naar, Oct 14, 2010

3. GeneGuest

On Oct 14, 7:51 am, Gus Gassmann <>
wrote:
> I have a large vector i of integers and I need to identify repeated
> values and runs (of length at least 3) in it. For instance,
> i = 1 2 3 5 8 8 12 6 5 4
> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> singleton 12, and a run of 3 (with increment -1). (The decomposition
> might not be unique, but that is a secondary concern.)
>
> My approach has been to write a function to identify the first pattern
> only, and to call this function repeatedly starting right after the
> last identified pattern. In the above example, I would call the
> function successively with
> i[0],i[3],i[4],i[6], and i[7].
>
> I am concerned about performance and would be grateful for insights
> and suggestions.

The code looks reasonable. You might get a few percent better with
some tweaks, but just as likely not. It the vector is truly huge, you
might get a boost from dividing the array into N parts and using N
threads to search in parallel with a N core processor. But it's just
as likely the memory system or other bottleneck will prevent a
performance gain. Stitching the results together at the end will be
messy but not too hard.

Gene, Oct 14, 2010
4. Ben BacarisseGuest

Gus Gassmann <> writes:

> I have a large vector i of integers and I need to identify repeated
> values and runs (of length at least 3) in it. For instance,
> i = 1 2 3 5 8 8 12 6 5 4
> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> singleton 12, and a run of 3 (with increment -1). (The decomposition
> might not be unique, but that is a secondary concern.)
>
> My approach has been to write a function to identify the first pattern
> only, and to call this function repeatedly starting right after the
> last identified pattern.

Good plan though it may have an impact if your secondary concern about
there being multiple decompositions ever comes to the fore.

> In the above example, I would call the
> function successively with
> i[0],i[3],i[4],i[6], and i[7].
>
> I am concerned about performance and would be grateful for insights
> and suggestions.

This is not an expensive function and, at first sight, a fast version
and a slow version will likely be within a few percent of each other.
What algorithm is this going to be a part of where such differences
might be significant?

> inline void getMultIncr(int* i, int *mult, int *incr, int size)
> {

Start with it not being inline, then measure and see if making it inline
helps. Many modern compilers will be better than you are at deciding if
there is any value in inlining a function call.

I don't like to waste the possibility of a return value. I'd return the
"run length" (*mult in your case). It will probably simplify the code
a little.

> // i contains a pointer to the (remainder of the) vector
> // *mult is a pointer to the number of elements in the current pattern
> // *incr is a pointer to the increment in the current pattern
> // size contains the number of elements in the (remainder of the)
> vector
>
> int mark;
> int k;
>
> *mult = 1;
> *incr = 0;
>
> if (size == 1) return;

I might try to generalise the code to cope with size == 0 as well. It
often makes calling a function simpler if there are fewer special cases
to test before the call. With a version that returns the run length,
all you need is

if (size <= 1) return size;

> //try to find duplicated values
> mark = i[0];
> for (k=1; k < size; k++)
> {
> if (i[k] != mark) break;

A loop with an initial break is often better written by adding the
(negated) test into the loop's condition:

for (k = 1; k < size && i[k] == mark; k++) ...

> (*mult)++;
> }
> if (*mult > 1 || size == 2) return;

The size == 2 part looks wrong to me. Given i = {1, 2}; I'd say that

getMultIncr(i, &length, &inc, 2);

should set length == 2 and inc == 1. Your code will return with length
== 1 and inc == 0. However, I can see that there is scope for ambiguity
here -- you might want to call this two singletons though that seems odd
to me.

> // now look for possible runs
> *incr = i[1] - i[0];
> if (i[2] - i[1] != *incr) return;
>
> *mult = 3;
> for (k=3; k < size; k++)
> {
> if (i[k] - i[k-1] != *incr) break;
> (*mult)++;
> }
> return;
> }

I see that there is no way to return a length of 2 with an increment, so
maybe you have decided that 1, 2 is really two singletons.

You can almost certainly reduce the number of special cases here.
Duplicate runs can be detected in the same way that you detect
incrementing sequences.

Looks like a fun function. If I get time I'll write one and we can
compare styles.

--
Ben.

Ben Bacarisse, Oct 14, 2010
5. Niklas HolstiGuest

Richard Harter wrote:
> On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
> <> wrote:
>
>> On 2010-10-14, Gus Gassmann <> wrote:
>>> I have a large vector i of integers and I need to identify repeated
>>> values and runs (of length at least 3) in it. For instance,
>>> i = 1 2 3 5 8 8 12 6 5 4
>>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
>>> singleton 12, and a run of 3 (with increment -1). (The decomposition
>>> might not be unique, but that is a secondary concern.)

>> A duplicated element is a run with increment 0.
>> So you only have to look for runs.

>
> It's good to point that out, but I doubt that that was what he
> was talking about. The issue is that the boundary between runs
> is ambiguous. Thus, his example can also be partitioned as
> (1,2) (3,5) (8,8) (12,6) (5,4)

No, because a "run" should have at least three numbers in it, according
to the OP.

I wonder if it would be better to define "runs" and repetitions as
possibly overlapping at their end-points, so that the sequence

1 2 3 5 7 9 4 9 9 8 7

would be decomposed into the run (1 2 3) with increment +1, the run (3 5
7 9) with increment +2, the singleton 4, the repetition (9 9), and the
run (9 8 7) with increment -1. It seems to me that this
endpoint-overlapping decomposition is unique, if the runs and
repetitions are taken as long as possible, and singletons are used only
when necessary.

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .

Niklas Holsti, Oct 14, 2010
6. Gus GassmannGuest

On Oct 14, 10:27 am, (Richard Harter) wrote:
> On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
>
> <> wrote:
> >On 2010-10-14, Gus Gassmann <> wrote:
> >> I have a large vector i of integers and I need to identify repeated
> >> values and runs (of length at least 3) in it. For instance,
> >> i = 1 2 3 5 8 8 12 6 5 4
> >> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> >> singleton 12, and a run of 3 (with increment -1). (The decomposition
> >> might not be unique, but that is a secondary concern.)

>
> >A duplicated element is a run with increment 0.
> >So you only have to look for runs.

>
> It's good to point that out, but I doubt that that was what he
> was talking about.  The issue is that the boundary between runs
> is ambiguous.  Thus, his example can also be partitioned as
> (1,2) (3,5) (8,8) (12,6) (5,4)

Gus Gassmann, Oct 14, 2010
7. Gus GassmannGuest

On Oct 14, 10:30 am, Ben Bacarisse <> wrote:
> Gus Gassmann <> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it. For instance,
> > i = 1 2 3 5 8 8 12 6 5 4
> > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> > singleton 12, and a run of 3 (with increment -1). (The decomposition
> > might not be unique, but that is a secondary concern.)

>
> > My approach has been to write a function to identify the first pattern
> > only, and to call this function repeatedly starting right after the
> > last identified pattern.

>
> Good plan though it may have an impact if your secondary concern about
> there being multiple decompositions ever comes to the fore.
>
> > In the above example, I would call the
> > function successively with
> > i[0],i[3],i[4],i[6], and i[7].

>
> > I am concerned about performance and would be grateful for insights
> > and suggestions.

>
> This is not an expensive function and, at first sight, a fast version
> and a slow version will likely be within a few percent of each other.
> What algorithm is this going to be a part of where such differences
> might be significant?

I am trying to write an xml file containing several sparse vectors. I
could print out every array element as a singleton, but file size
matters, and the mult/incr is already part of the schema. But
processing speed is important, too. I got a bit freaked out when a
90MB file took over 2 hours to process last night. I don't think of
that as viable.

> > inline void getMultIncr(int* i, int *mult, int *incr, int size)
> > {

>
> Start with it not being inline, then measure and see if making it inline
> helps.  Many modern compilers will be better than you are at deciding if
> there is any value in inlining a function call.
>
> I don't like to waste the possibility of a return value.  I'd return the
> "run length" (*mult in your case).  It will probably simplify the code
> a little.

My preference is for symmetry in this case, unless theer is a
performance penalty. It has been suggested that I should set the
function signature to
std:air <int,int> getMultIncr(int* i, int size)
but this is outside of my comfort zone. (Me no speaka da C so good...)

> > // i contains a pointer to the (remainder of the) vector
> > // *mult is a pointer to the number of elements in the current pattern
> > // *incr is a pointer to the increment in the current pattern
> > // size contains the number of elements in the (remainder of the)
> > vector

>
> >    int mark;
> >    int k;

>
> >    *mult = 1;
> >    *incr = 0;

>
> >    if (size == 1) return;

>
> I might try to generalise the code to cope with size == 0 as well.  It
> often makes calling a function simpler if there are fewer special cases
> to test before the call.  With a version that returns the run length,
> all you need is
>
>   if (size <= 1) return size;

That seems like an easy addition, although I didn't think it
necessary.

I embed the function call in a loop of the following kind:

for(int k=0; k<idim
{
getMultIncr(&i[k],&mult,&incr,idim-k);
//process the return
k += mult;
}

Still, the performance should not be affected, and a litt;e bit of
bulletproofing never hurt. I'll take your suggestion

> > //try to find duplicated values
> >    mark = i[0];
> >    for (k=1; k < size; k++)
> >    {
> >            if (i[k] != mark) break;

>
> A loop with an initial break is often better written by adding the
> (negated) test into the loop's condition:
>
>   for (k = 1; k < size && i[k] == mark; k++) ...

Now we're getting somewhere. This is new to me and looks interesting.
I'll try that.

> >            (*mult)++;
> >    }
> >    if (*mult > 1 || size == 2) return;

>
> The size == 2 part looks wrong to me.  Given i = {1, 2}; I'd say that
>
>   getMultIncr(i, &length, &inc, 2);
>
> should set length == 2 and inc == 1.  Your code will return with length
> == 1 and inc == 0.  However, I can see that there is scope for ambiguity
> here -- you might want to call this two singletons though that seems odd
> to me.
>
> > // now look for possible runs
> >    *incr = i[1] - i[0];
> >    if (i[2] - i[1] != *incr) return;

>
> >    *mult = 3;
> >    for (k=3; k < size; k++)
> >    {
> >            if (i[k] - i[k-1] != *incr) break;
> >            (*mult)++;
> >    }
> >    return;
> > }

>
> I see that there is no way to return a length of 2 with an increment, so
> maybe you have decided that 1, 2 is really two singletons.

Precisely. For the application I have in mind, a run of length 2 (such
as your 1, 2) is inferior to two singletons. However, I did toy with
the idea of returning a run of length 2 with nonzero increment and
intercepting that outside of the function. It has to be treated
separately, so there is the tradeoff of two calls to the function
versus the added processing. I am not sure which way that would go.

> > You can almost certainly reduce the number of special cases here.

> Duplicate runs can be detected in the same way that you detect
> incrementing sequences.
>
> Looks like a fun function.  If I get time I'll write one and we can
> compare styles.

That would be lovely. Thanks.

Gus Gassmann, Oct 14, 2010
8. Gus GassmannGuest

On Oct 14, 11:02 am, Niklas Holsti <>
wrote:
> Richard Harter wrote:
> > On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
> > <> wrote:

>
> >> On 2010-10-14, Gus Gassmann <> wrote:
> >>> I have a large vector i of integers and I need to identify repeated
> >>> values and runs (of length at least 3) in it. For instance,
> >>> i = 1 2 3 5 8 8 12 6 5 4
> >>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> >>> singleton 12, and a run of 3 (with increment -1). (The decomposition
> >>> might not be unique, but that is a secondary concern.)
> >> A duplicated element is a run with increment 0.
> >> So you only have to look for runs.

>
> > It's good to point that out, but I doubt that that was what he
> > was talking about.  The issue is that the boundary between runs
> > is ambiguous.  Thus, his example can also be partitioned as
> > (1,2) (3,5) (8,8) (12,6) (5,4)

>
> No, because a "run" should have at least three numbers in it, according
> to the OP.
>
> I wonder if it would be better to define "runs" and repetitions as
> possibly overlapping at their end-points, so that the sequence
>
>     1 2 3 5 7 9 4 9 9 8 7
>
> would be decomposed into the run (1 2 3) with increment +1, the run (3 5
> 7 9) with increment +2, the singleton 4, the repetition (9 9), and the
> run (9 8 7) with increment -1. It seems to me that this
> endpoint-overlapping decomposition is unique, if the runs and
> repetitions are taken as long as possible, and singletons are used only
> when necessary.

The decomposition has to be mutually exclusive. So (1 2 3), (3 5 7 9)
does not work. It must be either (1 2 3), (5 7 9) or 1, 2, (3 5 7 9).
In this case, the former would be better, but it is clear that the
greedy algorithm does not always work. For instance in the sequence

1 2 3 3 3 5 7 9

the decomposition

(1 2 3) (3) (3 5 7 9)

is preferred to

(1 2 3) (3 3) (5 7 9)

Gus Gassmann, Oct 14, 2010
9. BartCGuest

"Gus Gassmann" <> wrote in message
news:...
> On Oct 14, 10:30 am, Ben Bacarisse <> wrote:

>> This is not an expensive function and, at first sight, a fast version
>> and a slow version will likely be within a few percent of each other.
>> What algorithm is this going to be a part of where such differences
>> might be significant?

>
> I am trying to write an xml file containing several sparse vectors. I
> could print out every array element as a singleton, but file size
> matters, and the mult/incr is already part of the schema. But
> processing speed is important, too. I got a bit freaked out when a
> 90MB file took over 2 hours to process last night. I don't think of
> that as viable.

> I embed the function call in a loop of the following kind:
>
> for(int k=0; k<idim
> {
> getMultIncr(&i[k],&mult,&incr,idim-k);
> //process the return
> k += mult;
> }

You haven't shown any code for the file processing.

Is the 90MB the input or output size? How long does the above loop take,
without any processing of the result? How long does it take without
executing the loop at all (ie. just setting up the data)?

Compressing a 90MB input file (is that text or binary?) shouldn't take two
hours, especially with
a variation of run-length-encoding which you seem to be doing.

--
Bartc

BartC, Oct 14, 2010
10. Niklas HolstiGuest

Gus Gassmann wrote:
> On Oct 14, 11:02 am, Niklas Holsti <>
> wrote:
>> On 2010-10-14, Gus Gassmann <> wrote:
>>> I have a large vector i of integers and I need to identify repeated
>>> values and runs (of length at least 3) in it. For instance,
>>> i = 1 2 3 5 8 8 12 6 5 4
>>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
>>> singleton 12, and a run of 3 (with increment -1). (The decomposition
>>> might not be unique, but that is a secondary concern.)

...
>> I wonder if it would be better to define "runs" and repetitions as
>> possibly overlapping at their end-points, so that the sequence
>>
>> 1 2 3 5 7 9 4 9 9 8 7
>>
>> would be decomposed into the run (1 2 3) with increment +1, the run (3 5
>> 7 9) with increment +2, the singleton 4, the repetition (9 9), and the
>> run (9 8 7) with increment -1. It seems to me that this
>> endpoint-overlapping decomposition is unique, if the runs and
>> repetitions are taken as long as possible, and singletons are used only
>> when necessary.

>
> The decomposition has to be mutually exclusive.

Yep, I understood this aim from your original post, but I was just
thinking about how to make the decomposition unique and as informative
as possible -- I thought you might be aiming at some intelligent
analysis of the data.

But I see from another post that your aim is data compression, which
explains the need for mutual exclusion of the runs and repetitions.
Apologies if I injected noise in the discussion

--
Niklas Holsti
Tidorum Ltd
niklas holsti tidorum fi
. @ .

Niklas Holsti, Oct 14, 2010
11. Gus GassmannGuest

On Oct 14, 11:48 am, "BartC" <> wrote:
> "Gus Gassmann" <> wrote in message
>
> news:...
>
>
>
>
>
> > On Oct 14, 10:30 am, Ben Bacarisse <> wrote:
> >> This is not an expensive function and, at first sight, a fast version
> >> and a slow version will likely be within a few percent of each other.
> >> What algorithm is this going to be a part of where such differences
> >> might be significant?

>
> > I am trying to write an xml file containing several sparse vectors. I
> > could print out every array element as a singleton, but file size
> > matters, and the mult/incr is already part of the schema. But
> > processing speed is important, too. I got a bit freaked out when a
> > 90MB file took over 2 hours to process last night. I don't think of
> > that as viable.
> > I embed the function call in a loop of the following kind:

>
> > for(int k=0; k<idim
> > {
> >    getMultIncr(&i[k],&mult,&incr,idim-k);
> >    //process the return
> >    k += mult;
> > }

>
> You haven't shown any code for the file processing.
>
> Is the 90MB the input or output size? How long does the above loop take,
> without any processing of the result? How long does it take without
> executing the loop at all (ie. just setting up the data)?
>
> Compressing a 90MB input file (is that text or binary?) shouldn't take two
> hours, especially with
> a variation of run-length-encoding which you seem to be doing.

--
Maybe I should explain a bit about that process. The files are xml
files and contain vectors in the following format:

<el>1</el>
<el>2</el>
<el>3</el>
....

The schema allows for mult and incr attributes to express this run
more compactly as

<el mult="3" incr="1">1</el>

etc. It is thought (!) that the compressed version of the files is
smaller (true --- we seem to be saving about 40% of the file space),
faster to transmit (should be true, but I don't know if this is a
bottleneck), and faster to parse (this remains to be established).
Obviously the time it takes to compress the file needs to be factored
in as well, so I want to make sure that this part is as efficient as
possible.

Hope this explains things a bit better.

Gus Gassmann, Oct 14, 2010
12. BartCGuest

"Gus Gassmann" <> wrote in message
news:...
> On Oct 14, 11:48 am, "BartC" <> wrote:
>> "Gus Gassmann" <> wrote in message

>> > I am trying to write an xml file containing several sparse vectors. I
>> > could print out every array element as a singleton, but file size
>> > matters, and the mult/incr is already part of the schema. But
>> > processing speed is important, too. I got a bit freaked out when a
>> > 90MB file took over 2 hours to process last night. I don't think of
>> > that as viable.
>> > I embed the function call in a loop of the following kind:

>>
>> > for(int k=0; k<idim
>> > {
>> > getMultIncr(&i[k],&mult,&incr,idim-k);
>> > //process the return
>> > k += mult;
>> > }

>>
>> You haven't shown any code for the file processing.
>>
>> Is the 90MB the input or output size? How long does the above loop take,
>> without any processing of the result? How long does it take without
>> executing the loop at all (ie. just setting up the data)?
>>
>> Compressing a 90MB input file (is that text or binary?) shouldn't take
>> two
>> hours, especially with
>> a variation of run-length-encoding which you seem to be doing.

> --
> Maybe I should explain a bit about that process. The files are xml
> files and contain vectors in the following format:
>
> <el>1</el>
> <el>2</el>
> <el>3</el>

So the 90MB refers to the input XML in text format? In that case you'd only
have 7-8 million numbers if they were all single digit, and probably nearer
5 million.

Running your code on *250 million* numbers (randomly initialised with data
with an average run of 5) took between 1 and 2.5 seconds at most on my
machine.

So I'm not sure that's your bottleneck.

Adding all the text processing will slow things down considerably, but it
should still be no more than a minute or so.

So I might look elsewhere for the problems.

> ...
>
> The schema allows for mult and incr attributes to express this run
> more compactly as
>
> <el mult="3" incr="1">1</el>
>
> etc. It is thought (!) that the compressed version of the files is
> smaller (true --- we seem to be saving about 40% of the file space),

Doesn't save much on this 3-run example. Changing "mult" to "m", and "incr"
to "i", would save quite a bit too..

--
Bartc

BartC, Oct 14, 2010
13. Ben BacarisseGuest

Gus Gassmann <> writes:

> On Oct 14, 10:30Â am, Ben Bacarisse <> wrote:
>> Gus Gassmann <> writes:
>> > I have a large vector i of integers and I need to identify repeated
>> > values and runs (of length at least 3) in it. For instance,
>> > i = 1 2 3 5 8 8 12 6 5 4
>> > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
>> > singleton 12, and a run of 3 (with increment -1). (The decomposition
>> > might not be unique, but that is a secondary concern.)

>>
>> > My approach has been to write a function to identify the first pattern
>> > only, and to call this function repeatedly starting right after the
>> > last identified pattern.

>>
>> Good plan though it may have an impact if your secondary concern about
>> there being multiple decompositions ever comes to the fore.
>>
>> > In the above example, I would call the
>> > function successively with
>> > i[0],i[3],i[4],i[6], and i[7].

>>
>> > I am concerned about performance and would be grateful for insights
>> > and suggestions.

>>
>> This is not an expensive function and, at first sight, a fast version
>> and a slow version will likely be within a few percent of each other.
>> What algorithm is this going to be a part of where such differences
>> might be significant?

>
> I am trying to write an xml file containing several sparse vectors. I
> could print out every array element as a singleton, but file size
> matters, and the mult/incr is already part of the schema. But
> processing speed is important, too. I got a bit freaked out when a
> 90MB file took over 2 hours to process last night. I don't think of
> that as viable.

Is there any evidence that this code is responsible for more than a tiny
fraction of those 2 hours? One should be wary of guessing when
performance is involved (measure, always measure) but the algorithm is
linear and simple so I'd image that the IO alone takes way longer than
the run-length scan.

Anyway, measure. If your run-length code is only than a few percent of
the run-time don't try to optimise it. Leave it clear and
self-evident. Shaving 5% off 5% of the run-time is rarely worth while.

>> > inline void getMultIncr(int* i, int *mult, int *incr, int size)
>> > {

>>
>> Start with it not being inline, then measure and see if making it inline
>> helps. Â Many modern compilers will be better than you are at deciding if
>> there is any value in inlining a function call.
>>
>> I don't like to waste the possibility of a return value. Â I'd return the
>> "run length" (*mult in your case). Â It will probably simplify the code
>> a little.

>
> My preference is for symmetry in this case, unless theer is a
> performance penalty. It has been suggested that I should set the
> function signature to
> std:air <int,int> getMultIncr(int* i, int size)
> but this is outside of my comfort zone. (Me no speaka da C so good...)

That's C++. You probably need to decide if you're using C or C++ since
C++ has its own preferred way to handle the traversal of sequences.

<snip>
>> Looks like a fun function. Â If I get time I'll write one and we can
>> compare styles.

>
> That would be lovely. Thanks.

Here's how I'd do it. First, I'd write a function that does the
mathematically correct thing and then write a function to take care of
the fact that you permit duplicates of 2 to be reported but you only
want increasing or decreasing runs to be three or more in length:

/* Return the length of the longest prefix of the array 'data'
* that is in arithmetic progression. The difference between
* elements (which may be zero) is return in '*diff'.
*
* Only the first 'n_items' of 'data' will be considered.
*/

size_t run_length(int *data, size_t n_items, int *diff)
{
*diff = 0;
if (n_items < 2)
return n_items;
*diff = data[1] - data[0];
size_t rl = 2;
while (rl < n_items && data[rl] == data[rl - 1] + *diff)
++rl;
return rl;
}

/* The required behaviour seems to be asymmetric. Runs that are
* duplicates can be any length but when diff > 0 only runs of
* three or more should be considered.
*/

size_t run_length_gte3(int *data, size_t n_items, int *diff)
{
size_t rl = run_length(data, n_items, diff);
return rl >= 3 || *diff == 0 ? rl : 1;
}

--
Ben.

Ben Bacarisse, Oct 14, 2010
14. James Dow AllenGuest

On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
> Gus Gassmann <> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.

> Looks like a fun function.  If I get time I'll write one and we can
> compare styles.

It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen

James Dow Allen, Oct 14, 2010
15. James Dow AllenGuest

On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
> Gus Gassmann <> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.

> Looks like a fun function. If I get time I'll write one and we can
> compare styles.

It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen

James Dow Allen, Oct 14, 2010
16. JamesGuest

On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
> Gus Gassmann <> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.

> Looks like a fun function. If I get time I'll write one and we can
> compare styles.

It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen

James, Oct 14, 2010
17. James Dow AllenGuest

On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
> Gus Gassmann <> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.

> Looks like a fun function. If I get time I'll write one and we can
> compare styles.

(I apologize if we end up with six copies of this. GGroups
doesn't seem to be working.....)

It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen

James Dow Allen, Oct 14, 2010
18. Keith ThompsonGuest

Gus Gassmann <> writes:
[...]
> My preference is for symmetry in this case, unless theer is a
> performance penalty. It has been suggested that I should set the
> function signature to
> std:air <int,int> getMultIncr(int* i, int size)
> but this is outside of my comfort zone. (Me no speaka da C so good...)

[...]

std:air is C++, not C. If you want to discuss it, the folks in
comp.lang.c++ or clc++.moderated will be glad to help.

--
Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
Nokia
"We must do something. This is something. Therefore, we must do this."
-- Antony Jay and Jonathan Lynn, "Yes Minister"

Keith Thompson, Oct 14, 2010
19. James Dow AllenGuest

On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
> Gus Gassmann <> writes:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it.

> Looks like a fun function. If I get time I'll write one and we can
> compare styles.

It does look fun. And just the right size for quick style
compares. Let me go first!

My emphasis was on simplicity. I *think* there's only one
case where non-greedy heuristic is needed to minimize
number of runs; define MINIM to enable its detection.

#include <stdio.h>

int Ivalu[100], Isize;

/*
* Might be just right for return-by-value of
* small struct EXCEPT that since caller can deduce
* increment easily it's SIMPLER to return only run-length.
*
* Remember cc -DMINIM to optimize input like
* 3 3 3 4 5 0
*/
int getrun(int seq[], int siz)
{
int diff, tdiff, j;

if (siz < 2)
return 1;
diff = seq[1] - seq[0];
for (j = 1; j < siz - 1; j++) {
tdiff = seq[j+1] - seq[j];
if (diff != tdiff)
break;
}
if (j == 1)
return diff ? 1 : 2;
#ifdef MINIM
if (diff == 0
&& 1 < j && j < siz - 2
&& tdiff == seq[j+2] - seq[j+1]) {
j -= 1;
}
#endif
return j + 1;
}

int main(int argc, char **argv)
{
int i, r;

for (i = 1; i < argc; i++)
Ivalu[i - 1] = atoi(argv);
Isize = i - 1;

for (i = 0; i < Isize; i += r) {
r = getrun(Ivalu + i, Isize - i);
if (r == 1)
printf(" Singleton %d\n",
Ivalu);
else
printf(" Seq Length = %d Start = %d Increm = %d\n",
r, Ivalu, Ivalu[i+1] - Ivalu);
}
return 0;
}

James Dow Allen

James Dow Allen, Oct 14, 2010
20. Dann CorbitGuest

>
> On Oct 14, 7:51 am, Gus Gassmann <>
> wrote:
> > I have a large vector i of integers and I need to identify repeated
> > values and runs (of length at least 3) in it. For instance,
> > i = 1 2 3 5 8 8 12 6 5 4
> > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
> > singleton 12, and a run of 3 (with increment -1). (The decomposition
> > might not be unique, but that is a secondary concern.)
> >
> > My approach has been to write a function to identify the first pattern
> > only, and to call this function repeatedly starting right after the
> > last identified pattern. In the above example, I would call the
> > function successively with
> > i[0],i[3],i[4],i[6], and i[7].
> >
> > I am concerned about performance and would be grateful for insights
> > and suggestions.

>
> The code looks reasonable. You might get a few percent better with
> some tweaks, but just as likely not. It the vector is truly huge, you
> might get a boost from dividing the array into N parts and using N
> threads to search in parallel with a N core processor. But it's just
> as likely the memory system or other bottleneck will prevent a
> performance gain. Stitching the results together at the end will be
> messy but not too hard.

I am not totally sure what you are doing, but it looks to me like you
are trying to discover partitions in the data.

Here is what I use. Perhaps it can be useful to you somehow. If it
does not do what you want, the idea of what it is doing is simple enough
to modify for your purposes I guess.

GE() is a function that returns true if first item is greater than or
equal to the second item. An etype is a typedef of the data type.

The dimention of ps should be array size divided by two.

typedef struct tag_par {
size_t start;
size_t end;
char ascending;
} partition;

/*
** The purpose of this routine is to discover partitions in the data.
** This is a two state FSM.
** Ascend = 1, Descend = 0.
**
** This is a vastly improved version of my ugly code.
** The large improvement was from Kang Su Gatlin.
**
*/
int parscan(
etype array[],
unsigned long n,
partition ps[],
long max_par
)
{
unsigned long i;
char direction;
long pcount = 0;

ps[pcount].start = 0;
ps[pcount].ascending = GE(array[1], array[0]);
for (i = 1; i < n; i++) {
direction = GE(array, array[i - 1]);
if (ps[pcount].ascending != direction) {
ps[pcount].end = i - 1;
pcount++;
if (pcount > max_par)
return -max_par;
ps[pcount].start = i;
if (i == n - 1)
ps[pcount].ascending = 1;
else
ps[pcount].ascending = GE(array[i + 1], array);
}
}
ps[pcount].end = n - 1;
pcount++;
return pcount;
}
/*
P.S.
This post is better suited to news:comp.programming.
*/

Dann Corbit, Oct 14, 2010