Suggest improvements

Discussion in 'C Programming' started by Gus Gassmann, Oct 14, 2010.

  1. Gus Gassmann

    Gus Gassmann Guest

    I have a large vector i of integers and I need to identify repeated
    values and runs (of length at least 3) in it. For instance,
    i = 1 2 3 5 8 8 12 6 5 4
    has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    singleton 12, and a run of 3 (with increment -1). (The decomposition
    might not be unique, but that is a secondary concern.)

    My approach has been to write a function to identify the first pattern
    only, and to call this function repeatedly starting right after the
    last identified pattern. In the above example, I would call the
    function successively with
    i[0],i[3],i[4],i[6], and i[7].

    I am concerned about performance and would be grateful for insights
    and suggestions.

    inline void getMultIncr(int* i, int *mult, int *incr, int size)
    {
    // i contains a pointer to the (remainder of the) vector
    // *mult is a pointer to the number of elements in the current pattern
    // *incr is a pointer to the increment in the current pattern
    // size contains the number of elements in the (remainder of the)
    vector

    int mark;
    int k;

    *mult = 1;
    *incr = 0;

    if (size == 1) return;

    //try to find duplicated values
    mark = i[0];
    for (k=1; k < size; k++)
    {
    if (i[k] != mark) break;
    (*mult)++;
    }
    if (*mult > 1 || size == 2) return;

    // now look for possible runs
    *incr = i[1] - i[0];
    if (i[2] - i[1] != *incr) return;

    *mult = 3;
    for (k=3; k < size; k++)
    {
    if (i[k] - i[k-1] != *incr) break;
    (*mult)++;
    }
    return;
    }
    Gus Gassmann, Oct 14, 2010
    #1
    1. Advertising

  2. Gus Gassmann

    Ike Naar Guest

    On 2010-10-14, Gus Gassmann <> wrote:
    > I have a large vector i of integers and I need to identify repeated
    > values and runs (of length at least 3) in it. For instance,
    > i = 1 2 3 5 8 8 12 6 5 4
    > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    > singleton 12, and a run of 3 (with increment -1). (The decomposition
    > might not be unique, but that is a secondary concern.)


    A duplicated element is a run with increment 0.
    So you only have to look for runs.
    Ike Naar, Oct 14, 2010
    #2
    1. Advertising

  3. Gus Gassmann

    Gene Guest

    On Oct 14, 7:51 am, Gus Gassmann <>
    wrote:
    > I have a large vector i of integers and I need to identify repeated
    > values and runs (of length at least 3) in it. For instance,
    > i = 1 2 3 5 8 8 12 6 5 4
    > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    > singleton 12, and a run of 3 (with increment -1). (The decomposition
    > might not be unique, but that is a secondary concern.)
    >
    > My approach has been to write a function to identify the first pattern
    > only, and to call this function repeatedly starting right after the
    > last identified pattern. In the above example, I would call the
    > function successively with
    > i[0],i[3],i[4],i[6], and i[7].
    >
    > I am concerned about performance and would be grateful for insights
    > and suggestions.


    The code looks reasonable. You might get a few percent better with
    some tweaks, but just as likely not. It the vector is truly huge, you
    might get a boost from dividing the array into N parts and using N
    threads to search in parallel with a N core processor. But it's just
    as likely the memory system or other bottleneck will prevent a
    performance gain. Stitching the results together at the end will be
    messy but not too hard.
    Gene, Oct 14, 2010
    #3
  4. Gus Gassmann <> writes:

    > I have a large vector i of integers and I need to identify repeated
    > values and runs (of length at least 3) in it. For instance,
    > i = 1 2 3 5 8 8 12 6 5 4
    > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    > singleton 12, and a run of 3 (with increment -1). (The decomposition
    > might not be unique, but that is a secondary concern.)
    >
    > My approach has been to write a function to identify the first pattern
    > only, and to call this function repeatedly starting right after the
    > last identified pattern.


    Good plan though it may have an impact if your secondary concern about
    there being multiple decompositions ever comes to the fore.

    > In the above example, I would call the
    > function successively with
    > i[0],i[3],i[4],i[6], and i[7].
    >
    > I am concerned about performance and would be grateful for insights
    > and suggestions.


    This is not an expensive function and, at first sight, a fast version
    and a slow version will likely be within a few percent of each other.
    What algorithm is this going to be a part of where such differences
    might be significant?

    > inline void getMultIncr(int* i, int *mult, int *incr, int size)
    > {


    Start with it not being inline, then measure and see if making it inline
    helps. Many modern compilers will be better than you are at deciding if
    there is any value in inlining a function call.

    I don't like to waste the possibility of a return value. I'd return the
    "run length" (*mult in your case). It will probably simplify the code
    a little.

    > // i contains a pointer to the (remainder of the) vector
    > // *mult is a pointer to the number of elements in the current pattern
    > // *incr is a pointer to the increment in the current pattern
    > // size contains the number of elements in the (remainder of the)
    > vector
    >
    > int mark;
    > int k;
    >
    > *mult = 1;
    > *incr = 0;
    >
    > if (size == 1) return;


    I might try to generalise the code to cope with size == 0 as well. It
    often makes calling a function simpler if there are fewer special cases
    to test before the call. With a version that returns the run length,
    all you need is

    if (size <= 1) return size;

    > //try to find duplicated values
    > mark = i[0];
    > for (k=1; k < size; k++)
    > {
    > if (i[k] != mark) break;


    A loop with an initial break is often better written by adding the
    (negated) test into the loop's condition:

    for (k = 1; k < size && i[k] == mark; k++) ...

    > (*mult)++;
    > }
    > if (*mult > 1 || size == 2) return;


    The size == 2 part looks wrong to me. Given i = {1, 2}; I'd say that

    getMultIncr(i, &length, &inc, 2);

    should set length == 2 and inc == 1. Your code will return with length
    == 1 and inc == 0. However, I can see that there is scope for ambiguity
    here -- you might want to call this two singletons though that seems odd
    to me.

    > // now look for possible runs
    > *incr = i[1] - i[0];
    > if (i[2] - i[1] != *incr) return;
    >
    > *mult = 3;
    > for (k=3; k < size; k++)
    > {
    > if (i[k] - i[k-1] != *incr) break;
    > (*mult)++;
    > }
    > return;
    > }


    I see that there is no way to return a length of 2 with an increment, so
    maybe you have decided that 1, 2 is really two singletons.

    You can almost certainly reduce the number of special cases here.
    Duplicate runs can be detected in the same way that you detect
    incrementing sequences.

    Looks like a fun function. If I get time I'll write one and we can
    compare styles.

    --
    Ben.
    Ben Bacarisse, Oct 14, 2010
    #4
  5. Richard Harter wrote:
    > On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
    > <> wrote:
    >
    >> On 2010-10-14, Gus Gassmann <> wrote:
    >>> I have a large vector i of integers and I need to identify repeated
    >>> values and runs (of length at least 3) in it. For instance,
    >>> i = 1 2 3 5 8 8 12 6 5 4
    >>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    >>> singleton 12, and a run of 3 (with increment -1). (The decomposition
    >>> might not be unique, but that is a secondary concern.)

    >> A duplicated element is a run with increment 0.
    >> So you only have to look for runs.

    >
    > It's good to point that out, but I doubt that that was what he
    > was talking about. The issue is that the boundary between runs
    > is ambiguous. Thus, his example can also be partitioned as
    > (1,2) (3,5) (8,8) (12,6) (5,4)


    No, because a "run" should have at least three numbers in it, according
    to the OP.

    I wonder if it would be better to define "runs" and repetitions as
    possibly overlapping at their end-points, so that the sequence

    1 2 3 5 7 9 4 9 9 8 7

    would be decomposed into the run (1 2 3) with increment +1, the run (3 5
    7 9) with increment +2, the singleton 4, the repetition (9 9), and the
    run (9 8 7) with increment -1. It seems to me that this
    endpoint-overlapping decomposition is unique, if the runs and
    repetitions are taken as long as possible, and singletons are used only
    when necessary.

    --
    Niklas Holsti
    Tidorum Ltd
    niklas holsti tidorum fi
    . @ .
    Niklas Holsti, Oct 14, 2010
    #5
  6. Gus Gassmann

    Gus Gassmann Guest

    On Oct 14, 10:27 am, (Richard Harter) wrote:
    > On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
    >
    > <> wrote:
    > >On 2010-10-14, Gus Gassmann <> wrote:
    > >> I have a large vector i of integers and I need to identify repeated
    > >> values and runs (of length at least 3) in it. For instance,
    > >> i = 1 2 3 5 8 8 12 6 5 4
    > >> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    > >> singleton 12, and a run of 3 (with increment -1). (The decomposition
    > >> might not be unique, but that is a secondary concern.)

    >
    > >A duplicated element is a run with increment 0.
    > >So you only have to look for runs.

    >
    > It's good to point that out, but I doubt that that was what he
    > was talking about.  The issue is that the boundary between runs
    > is ambiguous.  Thus, his example can also be partitioned as
    > (1,2) (3,5) (8,8) (12,6) (5,4)
    Gus Gassmann, Oct 14, 2010
    #6
  7. Gus Gassmann

    Gus Gassmann Guest

    On Oct 14, 10:30 am, Ben Bacarisse <> wrote:
    > Gus Gassmann <> writes:
    > > I have a large vector i of integers and I need to identify repeated
    > > values and runs (of length at least 3) in it. For instance,
    > > i = 1 2 3 5 8 8 12 6 5 4
    > > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    > > singleton 12, and a run of 3 (with increment -1). (The decomposition
    > > might not be unique, but that is a secondary concern.)

    >
    > > My approach has been to write a function to identify the first pattern
    > > only, and to call this function repeatedly starting right after the
    > > last identified pattern.

    >
    > Good plan though it may have an impact if your secondary concern about
    > there being multiple decompositions ever comes to the fore.
    >
    > > In the above example, I would call the
    > > function successively with
    > > i[0],i[3],i[4],i[6], and i[7].

    >
    > > I am concerned about performance and would be grateful for insights
    > > and suggestions.

    >
    > This is not an expensive function and, at first sight, a fast version
    > and a slow version will likely be within a few percent of each other.
    > What algorithm is this going to be a part of where such differences
    > might be significant?


    I am trying to write an xml file containing several sparse vectors. I
    could print out every array element as a singleton, but file size
    matters, and the mult/incr is already part of the schema. But
    processing speed is important, too. I got a bit freaked out when a
    90MB file took over 2 hours to process last night. I don't think of
    that as viable.

    > > inline void getMultIncr(int* i, int *mult, int *incr, int size)
    > > {

    >
    > Start with it not being inline, then measure and see if making it inline
    > helps.  Many modern compilers will be better than you are at deciding if
    > there is any value in inlining a function call.
    >
    > I don't like to waste the possibility of a return value.  I'd return the
    > "run length" (*mult in your case).  It will probably simplify the code
    > a little.


    My preference is for symmetry in this case, unless theer is a
    performance penalty. It has been suggested that I should set the
    function signature to
    std::pair <int,int> getMultIncr(int* i, int size)
    but this is outside of my comfort zone. (Me no speaka da C so good...)

    > > // i contains a pointer to the (remainder of the) vector
    > > // *mult is a pointer to the number of elements in the current pattern
    > > // *incr is a pointer to the increment in the current pattern
    > > // size contains the number of elements in the (remainder of the)
    > > vector

    >
    > >    int mark;
    > >    int k;

    >
    > >    *mult = 1;
    > >    *incr = 0;

    >
    > >    if (size == 1) return;

    >
    > I might try to generalise the code to cope with size == 0 as well.  It
    > often makes calling a function simpler if there are fewer special cases
    > to test before the call.  With a version that returns the run length,
    > all you need is
    >
    >   if (size <= 1) return size;


    That seems like an easy addition, although I didn't think it
    necessary.

    I embed the function call in a loop of the following kind:

    for(int k=0; k<idim;)
    {
    getMultIncr(&i[k],&mult,&incr,idim-k);
    //process the return
    k += mult;
    }

    Still, the performance should not be affected, and a litt;e bit of
    bulletproofing never hurt. I'll take your suggestion :)

    > > //try to find duplicated values
    > >    mark = i[0];
    > >    for (k=1; k < size; k++)
    > >    {
    > >            if (i[k] != mark) break;

    >
    > A loop with an initial break is often better written by adding the
    > (negated) test into the loop's condition:
    >
    >   for (k = 1; k < size && i[k] == mark; k++) ...


    Now we're getting somewhere. This is new to me and looks interesting.
    I'll try that.

    > >            (*mult)++;
    > >    }
    > >    if (*mult > 1 || size == 2) return;

    >
    > The size == 2 part looks wrong to me.  Given i = {1, 2}; I'd say that
    >
    >   getMultIncr(i, &length, &inc, 2);
    >
    > should set length == 2 and inc == 1.  Your code will return with length
    > == 1 and inc == 0.  However, I can see that there is scope for ambiguity
    > here -- you might want to call this two singletons though that seems odd
    > to me.
    >
    > > // now look for possible runs
    > >    *incr = i[1] - i[0];
    > >    if (i[2] - i[1] != *incr) return;

    >
    > >    *mult = 3;
    > >    for (k=3; k < size; k++)
    > >    {
    > >            if (i[k] - i[k-1] != *incr) break;
    > >            (*mult)++;
    > >    }
    > >    return;
    > > }

    >
    > I see that there is no way to return a length of 2 with an increment, so
    > maybe you have decided that 1, 2 is really two singletons.


    Precisely. For the application I have in mind, a run of length 2 (such
    as your 1, 2) is inferior to two singletons. However, I did toy with
    the idea of returning a run of length 2 with nonzero increment and
    intercepting that outside of the function. It has to be treated
    separately, so there is the tradeoff of two calls to the function
    versus the added processing. I am not sure which way that would go.

    > > You can almost certainly reduce the number of special cases here.

    > Duplicate runs can be detected in the same way that you detect
    > incrementing sequences.
    >
    > Looks like a fun function.  If I get time I'll write one and we can
    > compare styles.


    That would be lovely. Thanks.
    Gus Gassmann, Oct 14, 2010
    #7
  8. Gus Gassmann

    Gus Gassmann Guest

    On Oct 14, 11:02 am, Niklas Holsti <>
    wrote:
    > Richard Harter wrote:
    > > On Thu, 14 Oct 2010 12:59:06 +0000 (UTC), Ike Naar
    > > <> wrote:

    >
    > >> On 2010-10-14, Gus Gassmann <> wrote:
    > >>> I have a large vector i of integers and I need to identify repeated
    > >>> values and runs (of length at least 3) in it. For instance,
    > >>> i = 1 2 3 5 8 8 12 6 5 4
    > >>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    > >>> singleton 12, and a run of 3 (with increment -1). (The decomposition
    > >>> might not be unique, but that is a secondary concern.)
    > >> A duplicated element is a run with increment 0.
    > >> So you only have to look for runs.

    >
    > > It's good to point that out, but I doubt that that was what he
    > > was talking about.  The issue is that the boundary between runs
    > > is ambiguous.  Thus, his example can also be partitioned as
    > > (1,2) (3,5) (8,8) (12,6) (5,4)

    >
    > No, because a "run" should have at least three numbers in it, according
    > to the OP.
    >
    > I wonder if it would be better to define "runs" and repetitions as
    > possibly overlapping at their end-points, so that the sequence
    >
    >     1 2 3 5 7 9 4 9 9 8 7
    >
    > would be decomposed into the run (1 2 3) with increment +1, the run (3 5
    > 7 9) with increment +2, the singleton 4, the repetition (9 9), and the
    > run (9 8 7) with increment -1. It seems to me that this
    > endpoint-overlapping decomposition is unique, if the runs and
    > repetitions are taken as long as possible, and singletons are used only
    > when necessary.


    The decomposition has to be mutually exclusive. So (1 2 3), (3 5 7 9)
    does not work. It must be either (1 2 3), (5 7 9) or 1, 2, (3 5 7 9).
    In this case, the former would be better, but it is clear that the
    greedy algorithm does not always work. For instance in the sequence

    1 2 3 3 3 5 7 9

    the decomposition

    (1 2 3) (3) (3 5 7 9)

    is preferred to

    (1 2 3) (3 3) (5 7 9)
    Gus Gassmann, Oct 14, 2010
    #8
  9. Gus Gassmann

    BartC Guest

    "Gus Gassmann" <> wrote in message
    news:...
    > On Oct 14, 10:30 am, Ben Bacarisse <> wrote:


    >> This is not an expensive function and, at first sight, a fast version
    >> and a slow version will likely be within a few percent of each other.
    >> What algorithm is this going to be a part of where such differences
    >> might be significant?

    >
    > I am trying to write an xml file containing several sparse vectors. I
    > could print out every array element as a singleton, but file size
    > matters, and the mult/incr is already part of the schema. But
    > processing speed is important, too. I got a bit freaked out when a
    > 90MB file took over 2 hours to process last night. I don't think of
    > that as viable.


    > I embed the function call in a loop of the following kind:
    >
    > for(int k=0; k<idim;)
    > {
    > getMultIncr(&i[k],&mult,&incr,idim-k);
    > //process the return
    > k += mult;
    > }


    You haven't shown any code for the file processing.

    Is the 90MB the input or output size? How long does the above loop take,
    without any processing of the result? How long does it take without
    executing the loop at all (ie. just setting up the data)?

    Compressing a 90MB input file (is that text or binary?) shouldn't take two
    hours, especially with
    a variation of run-length-encoding which you seem to be doing.

    --
    Bartc
    BartC, Oct 14, 2010
    #9
  10. Gus Gassmann wrote:
    > On Oct 14, 11:02 am, Niklas Holsti <>
    > wrote:
    >> On 2010-10-14, Gus Gassmann <> wrote:
    >>> I have a large vector i of integers and I need to identify repeated
    >>> values and runs (of length at least 3) in it. For instance,
    >>> i = 1 2 3 5 8 8 12 6 5 4
    >>> has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    >>> singleton 12, and a run of 3 (with increment -1). (The decomposition
    >>> might not be unique, but that is a secondary concern.)

    ...
    >> I wonder if it would be better to define "runs" and repetitions as
    >> possibly overlapping at their end-points, so that the sequence
    >>
    >> 1 2 3 5 7 9 4 9 9 8 7
    >>
    >> would be decomposed into the run (1 2 3) with increment +1, the run (3 5
    >> 7 9) with increment +2, the singleton 4, the repetition (9 9), and the
    >> run (9 8 7) with increment -1. It seems to me that this
    >> endpoint-overlapping decomposition is unique, if the runs and
    >> repetitions are taken as long as possible, and singletons are used only
    >> when necessary.

    >
    > The decomposition has to be mutually exclusive.


    Yep, I understood this aim from your original post, but I was just
    thinking about how to make the decomposition unique and as informative
    as possible -- I thought you might be aiming at some intelligent
    analysis of the data.

    But I see from another post that your aim is data compression, which
    explains the need for mutual exclusion of the runs and repetitions.
    Apologies if I injected noise in the discussion

    --
    Niklas Holsti
    Tidorum Ltd
    niklas holsti tidorum fi
    . @ .
    Niklas Holsti, Oct 14, 2010
    #10
  11. Gus Gassmann

    Gus Gassmann Guest

    On Oct 14, 11:48 am, "BartC" <> wrote:
    > "Gus Gassmann" <> wrote in message
    >
    > news:...
    >
    >
    >
    >
    >
    > > On Oct 14, 10:30 am, Ben Bacarisse <> wrote:
    > >> This is not an expensive function and, at first sight, a fast version
    > >> and a slow version will likely be within a few percent of each other.
    > >> What algorithm is this going to be a part of where such differences
    > >> might be significant?

    >
    > > I am trying to write an xml file containing several sparse vectors. I
    > > could print out every array element as a singleton, but file size
    > > matters, and the mult/incr is already part of the schema. But
    > > processing speed is important, too. I got a bit freaked out when a
    > > 90MB file took over 2 hours to process last night. I don't think of
    > > that as viable.
    > > I embed the function call in a loop of the following kind:

    >
    > > for(int k=0; k<idim;)
    > > {
    > >    getMultIncr(&i[k],&mult,&incr,idim-k);
    > >    //process the return
    > >    k += mult;
    > > }

    >
    > You haven't shown any code for the file processing.
    >
    > Is the 90MB the input or output size? How long does the above loop take,
    > without any processing of the result? How long does it take without
    > executing the loop at all (ie. just setting up the data)?
    >
    > Compressing a 90MB input file (is that text or binary?) shouldn't take two
    > hours, especially with
    > a variation of run-length-encoding which you seem to be doing.

    --
    Maybe I should explain a bit about that process. The files are xml
    files and contain vectors in the following format:

    <el>1</el>
    <el>2</el>
    <el>3</el>
    ....

    The schema allows for mult and incr attributes to express this run
    more compactly as

    <el mult="3" incr="1">1</el>

    etc. It is thought (!) that the compressed version of the files is
    smaller (true --- we seem to be saving about 40% of the file space),
    faster to transmit (should be true, but I don't know if this is a
    bottleneck), and faster to parse (this remains to be established).
    Obviously the time it takes to compress the file needs to be factored
    in as well, so I want to make sure that this part is as efficient as
    possible.

    Hope this explains things a bit better.
    Gus Gassmann, Oct 14, 2010
    #11
  12. Gus Gassmann

    BartC Guest

    "Gus Gassmann" <> wrote in message
    news:...
    > On Oct 14, 11:48 am, "BartC" <> wrote:
    >> "Gus Gassmann" <> wrote in message


    >> > I am trying to write an xml file containing several sparse vectors. I
    >> > could print out every array element as a singleton, but file size
    >> > matters, and the mult/incr is already part of the schema. But
    >> > processing speed is important, too. I got a bit freaked out when a
    >> > 90MB file took over 2 hours to process last night. I don't think of
    >> > that as viable.
    >> > I embed the function call in a loop of the following kind:

    >>
    >> > for(int k=0; k<idim;)
    >> > {
    >> > getMultIncr(&i[k],&mult,&incr,idim-k);
    >> > //process the return
    >> > k += mult;
    >> > }

    >>
    >> You haven't shown any code for the file processing.
    >>
    >> Is the 90MB the input or output size? How long does the above loop take,
    >> without any processing of the result? How long does it take without
    >> executing the loop at all (ie. just setting up the data)?
    >>
    >> Compressing a 90MB input file (is that text or binary?) shouldn't take
    >> two
    >> hours, especially with
    >> a variation of run-length-encoding which you seem to be doing.

    > --
    > Maybe I should explain a bit about that process. The files are xml
    > files and contain vectors in the following format:
    >
    > <el>1</el>
    > <el>2</el>
    > <el>3</el>


    So the 90MB refers to the input XML in text format? In that case you'd only
    have 7-8 million numbers if they were all single digit, and probably nearer
    5 million.

    Running your code on *250 million* numbers (randomly initialised with data
    with an average run of 5) took between 1 and 2.5 seconds at most on my
    machine.

    So I'm not sure that's your bottleneck.

    Adding all the text processing will slow things down considerably, but it
    should still be no more than a minute or so.

    So I might look elsewhere for the problems.

    > ...
    >
    > The schema allows for mult and incr attributes to express this run
    > more compactly as
    >
    > <el mult="3" incr="1">1</el>
    >
    > etc. It is thought (!) that the compressed version of the files is
    > smaller (true --- we seem to be saving about 40% of the file space),


    Doesn't save much on this 3-run example. Changing "mult" to "m", and "incr"
    to "i", would save quite a bit too..

    --
    Bartc
    BartC, Oct 14, 2010
    #12
  13. Gus Gassmann <> writes:

    > On Oct 14, 10:30 am, Ben Bacarisse <> wrote:
    >> Gus Gassmann <> writes:
    >> > I have a large vector i of integers and I need to identify repeated
    >> > values and runs (of length at least 3) in it. For instance,
    >> > i = 1 2 3 5 8 8 12 6 5 4
    >> > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    >> > singleton 12, and a run of 3 (with increment -1). (The decomposition
    >> > might not be unique, but that is a secondary concern.)

    >>
    >> > My approach has been to write a function to identify the first pattern
    >> > only, and to call this function repeatedly starting right after the
    >> > last identified pattern.

    >>
    >> Good plan though it may have an impact if your secondary concern about
    >> there being multiple decompositions ever comes to the fore.
    >>
    >> > In the above example, I would call the
    >> > function successively with
    >> > i[0],i[3],i[4],i[6], and i[7].

    >>
    >> > I am concerned about performance and would be grateful for insights
    >> > and suggestions.

    >>
    >> This is not an expensive function and, at first sight, a fast version
    >> and a slow version will likely be within a few percent of each other.
    >> What algorithm is this going to be a part of where such differences
    >> might be significant?

    >
    > I am trying to write an xml file containing several sparse vectors. I
    > could print out every array element as a singleton, but file size
    > matters, and the mult/incr is already part of the schema. But
    > processing speed is important, too. I got a bit freaked out when a
    > 90MB file took over 2 hours to process last night. I don't think of
    > that as viable.


    Is there any evidence that this code is responsible for more than a tiny
    fraction of those 2 hours? One should be wary of guessing when
    performance is involved (measure, always measure) but the algorithm is
    linear and simple so I'd image that the IO alone takes way longer than
    the run-length scan.

    Anyway, measure. If your run-length code is only than a few percent of
    the run-time don't try to optimise it. Leave it clear and
    self-evident. Shaving 5% off 5% of the run-time is rarely worth while.

    >> > inline void getMultIncr(int* i, int *mult, int *incr, int size)
    >> > {

    >>
    >> Start with it not being inline, then measure and see if making it inline
    >> helps.  Many modern compilers will be better than you are at deciding if
    >> there is any value in inlining a function call.
    >>
    >> I don't like to waste the possibility of a return value.  I'd return the
    >> "run length" (*mult in your case).  It will probably simplify the code
    >> a little.

    >
    > My preference is for symmetry in this case, unless theer is a
    > performance penalty. It has been suggested that I should set the
    > function signature to
    > std::pair <int,int> getMultIncr(int* i, int size)
    > but this is outside of my comfort zone. (Me no speaka da C so good...)


    That's C++. You probably need to decide if you're using C or C++ since
    C++ has its own preferred way to handle the traversal of sequences.

    <snip>
    >> Looks like a fun function.  If I get time I'll write one and we can
    >> compare styles.

    >
    > That would be lovely. Thanks.


    Here's how I'd do it. First, I'd write a function that does the
    mathematically correct thing and then write a function to take care of
    the fact that you permit duplicates of 2 to be reported but you only
    want increasing or decreasing runs to be three or more in length:

    /* Return the length of the longest prefix of the array 'data'
    * that is in arithmetic progression. The difference between
    * elements (which may be zero) is return in '*diff'.
    *
    * Only the first 'n_items' of 'data' will be considered.
    */

    size_t run_length(int *data, size_t n_items, int *diff)
    {
    *diff = 0;
    if (n_items < 2)
    return n_items;
    *diff = data[1] - data[0];
    size_t rl = 2;
    while (rl < n_items && data[rl] == data[rl - 1] + *diff)
    ++rl;
    return rl;
    }

    /* The required behaviour seems to be asymmetric. Runs that are
    * duplicates can be any length but when diff > 0 only runs of
    * three or more should be considered.
    */

    size_t run_length_gte3(int *data, size_t n_items, int *diff)
    {
    size_t rl = run_length(data, n_items, diff);
    return rl >= 3 || *diff == 0 ? rl : 1;
    }

    --
    Ben.
    Ben Bacarisse, Oct 14, 2010
    #13
  14. On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
    > Gus Gassmann <> writes:
    > > I have a large vector i of integers and I need to identify repeated
    > > values and runs (of length at least 3) in it.


    > Looks like a fun function.  If I get time I'll write one and we can
    > compare styles.


    It does look fun. And just the right size for quick style
    compares. Let me go first! :)

    My emphasis was on simplicity. I *think* there's only one
    case where non-greedy heuristic is needed to minimize
    number of runs; define MINIM to enable its detection.

    #include <stdio.h>

    int Ivalu[100], Isize;

    /*
    * Might be just right for return-by-value of
    * small struct EXCEPT that since caller can deduce
    * increment easily it's SIMPLER to return only run-length.
    *
    * Remember cc -DMINIM to optimize input like
    * 3 3 3 4 5 0
    */
    int getrun(int seq[], int siz)
    {
    int diff, tdiff, j;

    if (siz < 2)
    return 1;
    diff = seq[1] - seq[0];
    for (j = 1; j < siz - 1; j++) {
    tdiff = seq[j+1] - seq[j];
    if (diff != tdiff)
    break;
    }
    if (j == 1)
    return diff ? 1 : 2;
    #ifdef MINIM
    if (diff == 0
    && 1 < j && j < siz - 2
    && tdiff == seq[j+2] - seq[j+1]) {
    j -= 1;
    }
    #endif
    return j + 1;
    }

    int main(int argc, char **argv)
    {
    int i, r;

    for (i = 1; i < argc; i++)
    Ivalu[i - 1] = atoi(argv);
    Isize = i - 1;

    for (i = 0; i < Isize; i += r) {
    r = getrun(Ivalu + i, Isize - i);
    if (r == 1)
    printf(" Singleton %d\n",
    Ivalu);
    else
    printf(" Seq Length = %d Start = %d Increm = %d\n",
    r, Ivalu, Ivalu[i+1] - Ivalu);
    }
    return 0;
    }

    James Dow Allen
    James Dow Allen, Oct 14, 2010
    #14
  15. On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
    > Gus Gassmann <> writes:
    > > I have a large vector i of integers and I need to identify repeated
    > > values and runs (of length at least 3) in it.


    > Looks like a fun function. If I get time I'll write one and we can
    > compare styles.


    It does look fun. And just the right size for quick style
    compares. Let me go first! :)

    My emphasis was on simplicity. I *think* there's only one
    case where non-greedy heuristic is needed to minimize
    number of runs; define MINIM to enable its detection.

    #include <stdio.h>

    int Ivalu[100], Isize;

    /*
    * Might be just right for return-by-value of
    * small struct EXCEPT that since caller can deduce
    * increment easily it's SIMPLER to return only run-length.
    *
    * Remember cc -DMINIM to optimize input like
    * 3 3 3 4 5 0
    */
    int getrun(int seq[], int siz)
    {
    int diff, tdiff, j;

    if (siz < 2)
    return 1;
    diff = seq[1] - seq[0];
    for (j = 1; j < siz - 1; j++) {
    tdiff = seq[j+1] - seq[j];
    if (diff != tdiff)
    break;
    }
    if (j == 1)
    return diff ? 1 : 2;
    #ifdef MINIM
    if (diff == 0
    && 1 < j && j < siz - 2
    && tdiff == seq[j+2] - seq[j+1]) {
    j -= 1;
    }
    #endif
    return j + 1;
    }

    int main(int argc, char **argv)
    {
    int i, r;

    for (i = 1; i < argc; i++)
    Ivalu[i - 1] = atoi(argv);
    Isize = i - 1;

    for (i = 0; i < Isize; i += r) {
    r = getrun(Ivalu + i, Isize - i);
    if (r == 1)
    printf(" Singleton %d\n",
    Ivalu);
    else
    printf(" Seq Length = %d Start = %d Increm = %d\n",
    r, Ivalu, Ivalu[i+1] - Ivalu);
    }
    return 0;
    }

    James Dow Allen
    James Dow Allen, Oct 14, 2010
    #15
  16. Gus Gassmann

    James Guest

    On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
    > Gus Gassmann <> writes:
    > > I have a large vector i of integers and I need to identify repeated
    > > values and runs (of length at least 3) in it.


    > Looks like a fun function. If I get time I'll write one and we can
    > compare styles.


    It does look fun. And just the right size for quick style
    compares. Let me go first! :)

    My emphasis was on simplicity. I *think* there's only one
    case where non-greedy heuristic is needed to minimize
    number of runs; define MINIM to enable its detection.

    #include <stdio.h>

    int Ivalu[100], Isize;

    /*
    * Might be just right for return-by-value of
    * small struct EXCEPT that since caller can deduce
    * increment easily it's SIMPLER to return only run-length.
    *
    * Remember cc -DMINIM to optimize input like
    * 3 3 3 4 5 0
    */
    int getrun(int seq[], int siz)
    {
    int diff, tdiff, j;

    if (siz < 2)
    return 1;
    diff = seq[1] - seq[0];
    for (j = 1; j < siz - 1; j++) {
    tdiff = seq[j+1] - seq[j];
    if (diff != tdiff)
    break;
    }
    if (j == 1)
    return diff ? 1 : 2;
    #ifdef MINIM
    if (diff == 0
    && 1 < j && j < siz - 2
    && tdiff == seq[j+2] - seq[j+1]) {
    j -= 1;
    }
    #endif
    return j + 1;
    }

    int main(int argc, char **argv)
    {
    int i, r;

    for (i = 1; i < argc; i++)
    Ivalu[i - 1] = atoi(argv);
    Isize = i - 1;

    for (i = 0; i < Isize; i += r) {
    r = getrun(Ivalu + i, Isize - i);
    if (r == 1)
    printf(" Singleton %d\n",
    Ivalu);
    else
    printf(" Seq Length = %d Start = %d Increm = %d\n",
    r, Ivalu, Ivalu[i+1] - Ivalu);
    }
    return 0;
    }

    James Dow Allen
    James, Oct 14, 2010
    #16
  17. On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
    > Gus Gassmann <> writes:
    > > I have a large vector i of integers and I need to identify repeated
    > > values and runs (of length at least 3) in it.


    > Looks like a fun function. If I get time I'll write one and we can
    > compare styles.


    (I apologize if we end up with six copies of this. GGroups
    doesn't seem to be working.....)

    It does look fun. And just the right size for quick style
    compares. Let me go first! :)

    My emphasis was on simplicity. I *think* there's only one
    case where non-greedy heuristic is needed to minimize
    number of runs; define MINIM to enable its detection.

    #include <stdio.h>

    int Ivalu[100], Isize;

    /*
    * Might be just right for return-by-value of
    * small struct EXCEPT that since caller can deduce
    * increment easily it's SIMPLER to return only run-length.
    *
    * Remember cc -DMINIM to optimize input like
    * 3 3 3 4 5 0
    */
    int getrun(int seq[], int siz)
    {
    int diff, tdiff, j;

    if (siz < 2)
    return 1;
    diff = seq[1] - seq[0];
    for (j = 1; j < siz - 1; j++) {
    tdiff = seq[j+1] - seq[j];
    if (diff != tdiff)
    break;
    }
    if (j == 1)
    return diff ? 1 : 2;
    #ifdef MINIM
    if (diff == 0
    && 1 < j && j < siz - 2
    && tdiff == seq[j+2] - seq[j+1]) {
    j -= 1;
    }
    #endif
    return j + 1;
    }

    int main(int argc, char **argv)
    {
    int i, r;

    for (i = 1; i < argc; i++)
    Ivalu[i - 1] = atoi(argv);
    Isize = i - 1;

    for (i = 0; i < Isize; i += r) {
    r = getrun(Ivalu + i, Isize - i);
    if (r == 1)
    printf(" Singleton %d\n",
    Ivalu);
    else
    printf(" Seq Length = %d Start = %d Increm = %d\n",
    r, Ivalu, Ivalu[i+1] - Ivalu);
    }
    return 0;
    }

    James Dow Allen
    James Dow Allen, Oct 14, 2010
    #17
  18. Gus Gassmann <> writes:
    [...]
    > My preference is for symmetry in this case, unless theer is a
    > performance penalty. It has been suggested that I should set the
    > function signature to
    > std::pair <int,int> getMultIncr(int* i, int size)
    > but this is outside of my comfort zone. (Me no speaka da C so good...)

    [...]

    std::pair is C++, not C. If you want to discuss it, the folks in
    comp.lang.c++ or clc++.moderated will be glad to help.

    --
    Keith Thompson (The_Other_Keith) <http://www.ghoti.net/~kst>
    Nokia
    "We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
    Keith Thompson, Oct 14, 2010
    #18
  19. On Oct 14, 8:30 pm, Ben Bacarisse <> wrote:
    > Gus Gassmann <> writes:
    > > I have a large vector i of integers and I need to identify repeated
    > > values and runs (of length at least 3) in it.


    > Looks like a fun function. If I get time I'll write one and we can
    > compare styles.


    It does look fun. And just the right size for quick style
    compares. Let me go first! :)

    My emphasis was on simplicity. I *think* there's only one
    case where non-greedy heuristic is needed to minimize
    number of runs; define MINIM to enable its detection.

    #include <stdio.h>

    int Ivalu[100], Isize;

    /*
    * Might be just right for return-by-value of
    * small struct EXCEPT that since caller can deduce
    * increment easily it's SIMPLER to return only run-length.
    *
    * Remember cc -DMINIM to optimize input like
    * 3 3 3 4 5 0
    */
    int getrun(int seq[], int siz)
    {
    int diff, tdiff, j;

    if (siz < 2)
    return 1;
    diff = seq[1] - seq[0];
    for (j = 1; j < siz - 1; j++) {
    tdiff = seq[j+1] - seq[j];
    if (diff != tdiff)
    break;
    }
    if (j == 1)
    return diff ? 1 : 2;
    #ifdef MINIM
    if (diff == 0
    && 1 < j && j < siz - 2
    && tdiff == seq[j+2] - seq[j+1]) {
    j -= 1;
    }
    #endif
    return j + 1;
    }

    int main(int argc, char **argv)
    {
    int i, r;

    for (i = 1; i < argc; i++)
    Ivalu[i - 1] = atoi(argv);
    Isize = i - 1;

    for (i = 0; i < Isize; i += r) {
    r = getrun(Ivalu + i, Isize - i);
    if (r == 1)
    printf(" Singleton %d\n",
    Ivalu);
    else
    printf(" Seq Length = %d Start = %d Increm = %d\n",
    r, Ivalu, Ivalu[i+1] - Ivalu);
    }
    return 0;
    }

    James Dow Allen
    James Dow Allen, Oct 14, 2010
    #19
  20. Gus Gassmann

    Dann Corbit Guest

    In article <268c1764-7089-4bc0-96ad-e25267d78835
    @y37g2000vbl.googlegroups.com>, says...
    >
    > On Oct 14, 7:51 am, Gus Gassmann <>
    > wrote:
    > > I have a large vector i of integers and I need to identify repeated
    > > values and runs (of length at least 3) in it. For instance,
    > > i = 1 2 3 5 8 8 12 6 5 4
    > > has a run of 3 (with increment 1), a singleton 5, a duplicated 8, a
    > > singleton 12, and a run of 3 (with increment -1). (The decomposition
    > > might not be unique, but that is a secondary concern.)
    > >
    > > My approach has been to write a function to identify the first pattern
    > > only, and to call this function repeatedly starting right after the
    > > last identified pattern. In the above example, I would call the
    > > function successively with
    > > i[0],i[3],i[4],i[6], and i[7].
    > >
    > > I am concerned about performance and would be grateful for insights
    > > and suggestions.

    >
    > The code looks reasonable. You might get a few percent better with
    > some tweaks, but just as likely not. It the vector is truly huge, you
    > might get a boost from dividing the array into N parts and using N
    > threads to search in parallel with a N core processor. But it's just
    > as likely the memory system or other bottleneck will prevent a
    > performance gain. Stitching the results together at the end will be
    > messy but not too hard.


    I am not totally sure what you are doing, but it looks to me like you
    are trying to discover partitions in the data.

    Here is what I use. Perhaps it can be useful to you somehow. If it
    does not do what you want, the idea of what it is doing is simple enough
    to modify for your purposes I guess.

    GE() is a function that returns true if first item is greater than or
    equal to the second item. An etype is a typedef of the data type.

    The dimention of ps should be array size divided by two.

    typedef struct tag_par {
    size_t start;
    size_t end;
    char ascending;
    } partition;

    /*
    ** The purpose of this routine is to discover partitions in the data.
    ** This is a two state FSM.
    ** Ascend = 1, Descend = 0.
    **
    ** This is a vastly improved version of my ugly code.
    ** The large improvement was from Kang Su Gatlin.
    **
    */
    int parscan(
    etype array[],
    unsigned long n,
    partition ps[],
    long max_par
    )
    {
    unsigned long i;
    char direction;
    long pcount = 0;

    ps[pcount].start = 0;
    ps[pcount].ascending = GE(array[1], array[0]);
    for (i = 1; i < n; i++) {
    direction = GE(array, array[i - 1]);
    if (ps[pcount].ascending != direction) {
    ps[pcount].end = i - 1;
    pcount++;
    if (pcount > max_par)
    return -max_par;
    ps[pcount].start = i;
    if (i == n - 1)
    ps[pcount].ascending = 1;
    else
    ps[pcount].ascending = GE(array[i + 1], array);
    }
    }
    ps[pcount].end = n - 1;
    pcount++;
    return pcount;
    }
    /*
    P.S.
    This post is better suited to news:comp.programming.
    */
    Dann Corbit, Oct 14, 2010
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. David
    Replies:
    4
    Views:
    406
    David
    Nov 29, 2005
  2. Matthew Scott
    Replies:
    0
    Views:
    297
    Matthew Scott
    May 6, 2005
  3. Mark Carter
    Replies:
    2
    Views:
    151
    Steven D'Aprano
    Jan 18, 2013
  4. Joel Goldstick
    Replies:
    0
    Views:
    95
    Joel Goldstick
    Aug 3, 2013
  5. Terry Reedy
    Replies:
    29
    Views:
    224
    Piet van Oostrum
    Oct 1, 2013
Loading...

Share This Page