how to replace a substring in a string using C?

Discussion in 'C Programming' started by Paul, Oct 30, 2005.

  1. Paul

    Paul Guest

    hi, there,

    for example,

    char *mystr="##this is##a examp#le";

    I want to replace all the "##" in mystr with "****". How can I do this?

    I checked all the string functions in C, but did not find one.

    thanks.
     
    Paul, Oct 30, 2005
    #1
    1. Advertising

  2. Paul

    Guest

    I think the best choice is to write your own function. It is easy task.
     
    , Oct 30, 2005
    #2
    1. Advertising

  3. Paul

    Mike Wahler Guest

    "Paul" <> wrote in message
    news:...
    > hi, there,
    >
    > for example,
    >
    > char *mystr="##this is##a examp#le";
    >
    > I want to replace all the "##" in mystr with "****". How can I do this?


    With the above construct, you can't. It's not
    allowed to modify a string literal (more specifically,
    an attempt to do so produces 'undefined behavior'.

    However, if you change the code to store your string
    in an array, it can be modified.

    >
    > I checked all the string functions in C, but did not find one.
    >
    > thanks.


    #include <stdio.h>
    #include <string.h>

    char *replace(char *s, char old, char new)
    {
    char *p = s;

    while(*p)
    {
    if(*p == old)
    *p = new;

    ++p;
    }

    return s;
    }

    int main()
    {
    char mystr[] = "##this is##a examp#le";
    puts(mystr);
    puts(replace(mystr, '#', '*'));
    return 0;
    }

    -Mike
     
    Mike Wahler, Oct 30, 2005
    #3
  4. Paul

    Netocrat Guest

    On Sun, 30 Oct 2005 19:45:30 +0000, Mike Wahler wrote:
    > "Paul" <> wrote in message
    > news:...
    >> hi, there,
    >>
    >> for example,
    >>
    >> char *mystr="##this is##a examp#le";
    >>
    >> I want to replace all the "##" in mystr with "****". How can I do this?


    Your code looked fine Mike but I gathered the OP wanted to deal more
    generally with string rather than character replacement, so here's an
    extended version.

    Output:

    ##this is##a examp#le
    ****this is****a examp#le

    Code:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    char *replace(const char *s, const char *old, const char *new)
    {
    char *ret;
    int i, count = 0;
    size_t newlen = strlen(new);
    size_t oldlen = strlen(old);

    for (i = 0; s != '\0'; i++) {
    if (strstr(&s, old) == &s) {
    count++;
    i += oldlen - 1;
    }
    }

    ret = malloc(i + count * (newlen - oldlen));
    if (ret == NULL)
    exit(EXIT_FAILURE);

    i = 0;
    while (*s) {
    if (strstr(s, old) == s) {
    strcpy(&ret, new);
    i += newlen;
    s += oldlen;
    } else
    ret[i++] = *s++;
    }
    ret = '\0';

    return ret;
    }

    int main(void)
    {
    char mystr[] = "##this is##a examp#le";
    char *newstr = NULL;

    puts(mystr);
    newstr = replace(mystr, "##", "****");
    printf("%s\n", newstr);

    free(newstr);
    return 0;
    }

    --
    http://members.dodo.com.au/~netocrat
     
    Netocrat, Oct 30, 2005
    #4
  5. Paul

    Skarmander Guest

    Netocrat wrote:
    <snip>
    > char *replace(const char *s, const char *old, const char *new)
    > {

    <snip>
    > ret = malloc(i + count * (newlen - oldlen));
    > if (ret == NULL)
    > exit(EXIT_FAILURE);
    >


    Whoa, what's this? Just return NULL. It's massively rude to terminate
    the program in a function like this. It's *likely* the caller can't
    respond to out-of-memory either (and won't check for it if they're
    sloppy), but making the decision here is silly.

    S.
     
    Skarmander, Oct 30, 2005
    #5
  6. Paul

    Netocrat Guest

    On Sun, 30 Oct 2005 20:40:58 +0000, Netocrat wrote:
    [...]
    > ret = malloc(i + count * (newlen - oldlen));


    Off-by-one; should be:
    ret = malloc(i + 1 + count * (newlen - oldlen));

    --
    http://members.dodo.com.au/~netocrat
     
    Netocrat, Oct 30, 2005
    #6
  7. Paul

    Netocrat Guest

    On Sun, 30 Oct 2005 21:46:46 +0100, Skarmander wrote:

    > Netocrat wrote:
    > <snip>
    >> char *replace(const char *s, const char *old, const char *new)
    >> {

    > <snip>
    >> ret = malloc(i + count * (newlen - oldlen));
    >> if (ret == NULL)
    >> exit(EXIT_FAILURE);

    >
    > Whoa, what's this? Just return NULL. It's massively rude to terminate
    > the program in a function like this. It's *likely* the caller can't
    > respond to out-of-memory either (and won't check for it if they're
    > sloppy), but making the decision here is silly.


    In a generic library function I agree 100%. When it's part of a specific
    simple application, a policy to abort on memory failure isn't
    unreasonable. The possibility did occur to me, and the reason I didn't
    take it up is that it would have added extra lines to the (demonstration)
    code. OTOH, this is intended to be a portable function, so your objection
    is sound.

    --
    http://members.dodo.com.au/~netocrat
     
    Netocrat, Oct 30, 2005
    #7
  8. Paul

    Skarmander Guest

    Netocrat wrote:
    > On Sun, 30 Oct 2005 21:46:46 +0100, Skarmander wrote:
    >
    >
    >>Netocrat wrote:
    >><snip>
    >>
    >>>char *replace(const char *s, const char *old, const char *new)
    >>>{

    >>
    >><snip>
    >>
    >>> ret = malloc(i + count * (newlen - oldlen));
    >>> if (ret == NULL)
    >>> exit(EXIT_FAILURE);

    >>
    >>Whoa, what's this? Just return NULL. It's massively rude to terminate
    >>the program in a function like this. It's *likely* the caller can't
    >>respond to out-of-memory either (and won't check for it if they're
    >>sloppy), but making the decision here is silly.

    >
    >
    > In a generic library function I agree 100%. When it's part of a specific
    > simple application, a policy to abort on memory failure isn't
    > unreasonable. The possibility did occur to me, and the reason I didn't
    > take it up is that it would have added extra lines to the (demonstration)
    > code. OTOH, this is intended to be a portable function, so your objection
    > is sound.
    >


    The portability (that is, portability of a function across applications)
    wasn't actually my primary concern; my primary concern was separation of
    concerns. :) Side effects of a function should be minimized to those
    that are documented. A function like `replace' has no business
    terminating the program, unless it's clearly stated to do just that if
    it can't allocate memory.

    A "policy to abort on memory failure" is reasonable only if it really is
    a policy, not something the function happens to do because it's
    acceptable in the (implicit, unstated) context. You shouldn't have to be
    required to know this. Doing this in `main' or a major subroutine
    without drawing attention to it is fine, but not a string replacement
    function, no matter what program or library it occurs in.

    I'm not pretending I'm telling you something new, but especially since
    this is demonstration code, I'm spelling it out for the innocent.
    Encouraging good habits does pay off even if things become a little more
    complicated to understand. Better that than writing code that's easy,
    but contains a trap. We should at least point out the trap, then. As
    we've just done. :)

    S.
     
    Skarmander, Oct 30, 2005
    #8
  9. Paul

    Guest

    Paul wrote:
    > char *mystr="##this is##a examp#le";
    >
    > I want to replace all the "##" in mystr with "****". How can I do this?
    >
    > I checked all the string functions in C, but did not find one.


    That is because the C library doesn't have one. (BTW, The Better
    String Library, http://bstring.sf.net/ does contain a function
    "bfindreplace()" that does this.) You are also pointing mystr to a
    constant string, which you cannot modify at all. So you would either
    have to change the storage or allow the creation of new storage for
    your desired result.

    Ok, there is a big difference in general between the three cases of
    expanding, shrinking, or having the same sized before and after
    strings. The first may require additional storage, while the other two
    never do. In this case you are expanding, so you have to solve the
    problem of requiring additional storage for your result.

    Also you have to define what you mean by "replace all". For example,
    if you want to replace "##" with "***#", then what happens when you
    apply it to "foo###bar"? Does it become "foo***##bar" or does it
    become "foo******#bar"?

    --
    Paul Hsieh
    http://www.pobox.com/~qed/
    http://bstring.sf.net/
     
    , Oct 30, 2005
    #9
  10. Paul

    Jordan Abel Guest

    On 2005-10-30, Netocrat <> wrote:
    > On Sun, 30 Oct 2005 21:46:46 +0100, Skarmander wrote:
    >
    >> Netocrat wrote:
    >> <snip>
    >>> char *replace(const char *s, const char *old, const char *new)
    >>> {

    >> <snip>
    >>> ret = malloc(i + count * (newlen - oldlen));
    >>> if (ret == NULL)
    >>> exit(EXIT_FAILURE);

    >>
    >> Whoa, what's this? Just return NULL. It's massively rude to
    >> terminate the program in a function like this. It's *likely* the
    >> caller can't respond to out-of-memory either (and won't check for
    >> it if they're sloppy), but making the decision here is silly.

    >
    > In a generic library function I agree 100%. When it's part of a
    > specific simple application, a policy to abort on memory failure
    > isn't unreasonable.


    Then I'd do it in a malloc wrapper so that it'd be done everywhere.

    > The possibility did occur to me, and the reason I didn't take it
    > up is that it would have added extra lines to the (demonstration)
    > code. OTOH, this is intended to be a portable function, so your
    > objection is sound.
     
    Jordan Abel, Oct 31, 2005
    #10
  11. On Sun, 30 Oct 2005 20:40:58 GMT, Netocrat <>
    wrote:
    <snip>
    > char *replace(const char *s, const char *old, const char *new)
    > {
    > char *ret;
    > int i, count = 0;
    > size_t newlen = strlen(new);
    > size_t oldlen = strlen(old);
    >
    > for (i = 0; s != '\0'; i++) {
    > if (strstr(&s, old) == &s) {
    > count++;
    > i += oldlen - 1;
    > }
    > }


    This is wasting a good deal of work in strstr() finding (or excluding)
    matches that you throw away. Either something like:
    for( i = 0; s != '\0'; )
    if( strncmp (&s, old, oldlen) == 0 ) ++count, i += oldlen;
    else ++i;
    /* or can probably use memcmp since in practice it will safely
    stop on or near null != nonnull and return nonmatch, but not
    AFAICT guaranteed, and being not str* may alarm some */
    or something like:
    const char * p = s;
    while( (p = strstr (p, old)) != NULL )
    ++count, p += oldlen;

    >
    > ret = malloc(i + count * (newlen - oldlen));


    As already noted, + 1. For my strstr form, strlen(s).

    > if (ret == NULL)
    > exit(EXIT_FAILURE);
    >
    > i = 0;
    > while (*s) {
    > if (strstr(s, old) == s) {


    Similarly to above.

    > strcpy(&ret, new);


    Could be memcpy ( , , newlen); you don't need a null terminator
    each time through the loop as you do one below after the loop.

    > i += newlen;
    > s += oldlen;
    > } else
    > ret[i++] = *s++;
    > }
    > ret = '\0';


    <snip>
    - David.Thompson1 at worldnet.att.net
     
    Dave Thompson, Nov 7, 2005
    #11
  12. Paul

    Netocrat Guest

    On Mon, 07 Nov 2005 04:21:19 +0000, Dave Thompson wrote:
    > On Sun, 30 Oct 2005 20:40:58 GMT, Netocrat <>
    > wrote:
    > <snip>
    >> char *replace(const char *s, const char *old, const char *new)
    >> {
    >> char *ret;
    >> int i, count = 0;
    >> size_t newlen = strlen(new);
    >> size_t oldlen = strlen(old);
    >>
    >> for (i = 0; s != '\0'; i++) {
    >> if (strstr(&s, old) == &s) {
    >> count++;
    >> i += oldlen - 1;
    >> }
    >> }

    >
    > This is wasting a good deal of work in strstr() finding (or excluding)
    > matches that you throw away. Either something like:
    > for( i = 0; s != '\0'; )
    > if( strncmp (&s, old, oldlen) == 0 ) ++count, i += oldlen;
    > else ++i;


    Yes that's the behaviour I intended; the loop evolved from something
    like your second suggestion below (requiring a call to strlen) and I
    neglected to replace strstr with a more optimal function.

    Moving the loop increment into the else statement makes a lot of sense by
    avoiding the subtraction of one when strstr (/strncmp/memcmp) succeeds.

    The comma operator is helpful here to remove braces, although I prefer the
    statements on a new indented line as it requires less eye horizontal eye
    movement, more clearly separates the condition from the subsequent
    statement and thus to me is more readable.

    > /* or can probably use memcmp since in practice it will safely
    > stop on or near null != nonnull and return nonmatch, but not
    > AFAICT guaranteed, and being not str* may alarm some */


    You seem to be correct - "7.21.4.1 The memcmp function" doesn't require it
    to stop as long as it ultimately returns the correct result - but it's in
    the implementation's interests to make it as efficient as possible.

    It looks more suitable to me since strncmp requires an additional test for
    '\0' on every character comparison that in theory reduces its efficiency.

    > or something like:
    > const char * p = s;
    > while( (p = strstr (p, old)) != NULL )
    > ++count, p += oldlen;
    >
    >>
    >> ret = malloc(i + count * (newlen - oldlen));

    >
    > As already noted, + 1. For my strstr form, strlen(s).

    ^^^^^^^^^

    For that reason I've preferred the strncmp (actually memcmp) version. The
    whole point of the looping construct I used was to avoid calling strlen -
    it seems to be a waste to iterate over the string twice. This could be
    partly avoided by doing something like:

    const char *last = s;
    const char *sorg = s;
    size_t orglen;

    s = strstr(s, old);
    while (s != NULL) {
    count++;
    s += oldlen;
    last = s;
    s = strstr(s, old);
    }
    orglen = last - &sorg[0] + strlen(last);
    ret = malloc(orglen + 1 + count * (newlen - oldlen));

    but this is still iterating redundantly over the string to determine
    strlen(last).

    Two arguments in favour of this approach are that firstly it's possible
    that the library functions will be so much faster than explicit iteration
    that despite the redundancy this approach wins.

    Secondly if the final replacement is near the end of the string and
    the library functions are - even slightly - faster, then it's also likely
    to be a faster approach.

    One would have to benchmark each implementation for typical strings and
    replacements to see to what extent these hold.

    >> if (ret == NULL)
    >> exit(EXIT_FAILURE);
    >>
    >> i = 0;
    >> while (*s) {
    >> if (strstr(s, old) == s) {

    >
    > Similarly to above.


    Yes, again strncmp/memcmp are clearly preferable; and again modifying the
    loop to use strstr less wastefully would introduce redundant iteration
    over the string whilst copying, for which arguments similar to the above
    apply.

    >> strcpy(&ret, new);

    >
    > Could be memcpy ( , , newlen); you don't need a null terminator
    > each time through the loop as you do one below after the loop.


    Good call.

    >> i += newlen;
    >> s += oldlen;
    >> } else
    >> ret[i++] = *s++;
    >> }
    >> ret = '\0';

    >
    > <snip>


    Insightful comments as always Dave.

    Something else that no one has pointed out yet is that the type of i and
    count should be size_t rather than int.

    One other minor optimisation I've made to occasionally save a few function
    calls is to check whether newlen and oldlen differ and if not, simply
    calculate the string's length without looking for matches.

    The other minor optimisation is to use pointer arithmetic instead
    of indexing in the second loop. Whether this is any more optimal in
    practice depends on the compiler/hardware but it's more optimal on the
    generic abstract machine I tend to envisage.

    This doesn't apply to the first loop since "i" must exist regardless -
    by 6.5.6#9 if the result of subtracting two pointers does not fit in a
    ptrdiff_t type the behaviour is undefined.

    Here's a fixed-up version incorporating all fixes suggested in the thread
    so far:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    /* Preconditions:
    * s, old and new are valid non-NULL pointers
    * strlen(old) >= 1
    */
    char *replace(const char *s, const char *old, const char *new)
    {
    char *ret, *sr;
    size_t i, count = 0;
    size_t newlen = strlen(new);
    size_t oldlen = strlen(old);

    if (newlen != oldlen) {
    for (i = 0; s != '\0'; ) {
    if (memcmp(&s, old, oldlen) == 0)
    count++, i += oldlen;
    else
    i++;
    }
    } else
    i = strlen(s);

    ret = malloc(i + 1 + count * (newlen - oldlen));
    if (ret == NULL)
    return NULL;

    sr = ret;
    while (*s) {
    if (memcmp(s, old, oldlen) == 0) {
    memcpy(sr, new, newlen);
    sr += newlen;
    s += oldlen;
    } else
    *sr++ = *s++;
    }
    *sr = '\0';

    return ret;
    }

    int main(void)
    {
    char mystr[] = "##this is##an examp#le";
    char *newstr = NULL;

    puts(mystr);

    newstr = replace(mystr, "##", "****");
    if (newstr == NULL)
    puts("replace returned NULL");
    else
    puts(newstr);

    free(newstr);
    return 0;
    }

    --
    http://members.dodo.com.au/~netocrat
     
    Netocrat, Nov 7, 2005
    #12
  13. Paul

    Skarmander Guest

    Netocrat wrote:
    <snip>
    > The other minor optimisation is to use pointer arithmetic instead
    > of indexing in the second loop. Whether this is any more optimal in
    > practice depends on the compiler/hardware but it's more optimal on the
    > generic abstract machine I tend to envisage.
    >


    <snip>
    > sr = ret;
    > while (*s) {
    > if (memcmp(s, old, oldlen) == 0) {
    > memcpy(sr, new, newlen);
    > sr += newlen;
    > s += oldlen;
    > } else
    > *sr++ = *s++;
    > }
    > *sr = '\0';
    >

    The way you've written it, it actually matters far more how memcmp() is
    implemented and how fast repeated single-byte copying is.

    Alternate version that does not operate on individual bytes and which
    may or may not be faster (but likely is):

    const char* t;

    sr = ret;
    while ((t = strstr(s, old)) != NULL) {
    /* Copy non-matching prefix */
    ptrdiff_t l = t - s;
    memcpy(sr, s, l);
    sr += l;
    s += l;

    /* Copy replacement text */
    memcpy(sr, new, newlen);
    sr += newlen;
    s += oldlen;
    }
    /* Copy non-matching tail */
    strcpy(sr, s);

    AFAICT the standard doesn't actually guarantee a (non-negative)
    ptrdiff_t will fit in a size_t, but this does seem to be a fairly sane
    assumption in this case. Even so, on implementations where ptrdiff_t is
    very small and your strings are very large (or pointer arithmetic is
    just plain inefficient), this may not work. You could always write a
    custom strstr() that returns the offset as a size_t.

    S.
     
    Skarmander, Nov 7, 2005
    #13
  14. Paul

    Netocrat Guest

    On Mon, 07 Nov 2005 18:34:03 +0100, Skarmander wrote:
    > Netocrat wrote:
    > <snip>
    >> The other minor optimisation is to use pointer arithmetic instead
    >> of indexing in the second loop. Whether this is any more optimal in
    >> practice depends on the compiler/hardware but it's more optimal on the
    >> generic abstract machine I tend to envisage.


    Aside from the fact that indexing is not possible using the alternative
    you provide, this quote seems to be unrelated to the rest of your post,
    which is really about library functions vs manual iteration, not indexing
    vs pointer arithmetic.

    > <snip>
    >> sr = ret;
    >> while (*s) {
    >> if (memcmp(s, old, oldlen) == 0) {
    >> memcpy(sr, new, newlen);
    >> sr += newlen;
    >> s += oldlen;
    >> } else
    >> *sr++ = *s++;
    >> }
    >> *sr = '\0';
    >>

    > The way you've written it, it actually matters far more how memcmp() is
    > implemented and how fast repeated single-byte copying is.
    >
    > Alternate version that does not operate on individual bytes and which
    > may or may not be faster (but likely is):


    That's the sort of code I had in mind when I wrote:
    [restored snippage]
    >> Yes, again strncmp/memcmp are clearly preferable; and again modifying
    >> the loop to use strstr less wastefully would introduce redundant
    >> iteration over the string whilst copying, for which arguments similar
    >> to the above apply.


    IOW one reason I decided against this implementation is that it iterates
    over the string redundantly - once for strstr to find a match and once for
    memcpy to copy the searched-over part of the string. The relevant
    argument that I was referencing is:
    [restored snippage]
    >> it's possible that the library functions will be so much faster than
    >> explicit iteration that despite the redundancy this approach wins.


    So we agree - it may or may not be faster. I'll grant that for large
    strings it may well be faster, because the implementation can likely
    make use of implementation-specific hardware to copy large blocks of
    memory. However (and this is the second reason that I decided against
    this approach) for large strings there is the problem that...

    > const char* t;
    >
    > sr = ret;
    > while ((t = strstr(s, old)) != NULL) {
    > /* Copy non-matching prefix */
    > ptrdiff_t l = t - s;
    > memcpy(sr, s, l);
    > sr += l;
    > s += l;
    >
    > /* Copy replacement text */
    > memcpy(sr, new, newlen);
    > sr += newlen;
    > s += oldlen;
    > }
    > /* Copy non-matching tail */
    > strcpy(sr, s);
    >
    > AFAICT the standard doesn't actually guarantee a (non-negative)
    > ptrdiff_t will fit in a size_t, but this does seem to be a fairly sane


    ....the standard doesn't even guarantee that taking the difference of two
    pointers is defined. Restoring more snippage:
    >> by 6.5.6#9 if the result of subtracting two pointers does not fit in a
    >> ptrdiff_t type the behaviour is undefined.


    > assumption in this case. Even so, on implementations where ptrdiff_t is
    > very small and your strings are very large (or pointer arithmetic is
    > just plain inefficient), this may not work. You could always write a
    > custom strstr() that returns the offset as a size_t.


    Coming full-circle, such a custom strstr would require the use of indexing...

    However it wouldn't have the advantage of being a library function and in
    that case I _won't_ grant much likelihood that this approach will be
    faster - the redundancy would not be offset by library function speed.

    Another possibility were one convinced that one's library string functions
    were genuinely fast enough to warrant redundant iteration over the string
    would be to pass in a non-const qualified string, and split it into
    segments of size PTRDIFF_MAX, saving the character overwritten by the '\0'
    terminator and restoring it after processing the segment.

    This also requires some special-case handling for the possibility of a
    match running across a segment. Something like this, although as a
    single function, replace is starting to get too long for my liking - even
    though I've moved the initial buffer creation code into a separate
    function:

    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <assert.h>

    /* Preconditions:
    * s and old are valid non-NULL pointers
    * strlen(old) == oldlen
    */
    char *getreplbuff(const char *s, const char *old, size_t oldlen,
    size_t newlen, size_t *len)
    {
    size_t i, count = 0;

    if (newlen != oldlen) {
    for (i = 0; s != '\0'; ) {
    if (memcmp(&s, old, oldlen) == 0)
    count++, i += oldlen;
    else
    i++;
    }
    *len = i;
    } else
    *len = strlen(s);

    return malloc(*len + 1 + count * (newlen - oldlen));
    }

    /* Preconditions:
    * s, old and new are valid non-NULL pointers
    * strlen(old) >= 1
    * strlen(old) <= PTRDIFF_MAX
    */
    char *replace(char *s, const char *old, const char *new)
    {
    char *ret, *sr, *segstart = s;
    const char *t, *last;
    size_t len, newlen = strlen(new);
    size_t trailchars, oldlen = strlen(old);

    assert(strlen(old) <= PTRDIFF_MAX); /* inf loop is possible if this fails */

    if ((ret = getreplbuff(s, old, oldlen, newlen, &len)) == NULL)
    return NULL;

    sr = ret;
    for (;;) {
    char restore;

    if (len > PTRDIFF_MAX) {
    /* Oversize string - create a segment */
    restore = s[PTRDIFF_MAX];
    s[PTRDIFF_MAX] = '\0';
    last = NULL;
    }

    while ((t = strstr(s, old)) != NULL) {
    last = t;

    /* Copy non-matching prefix */
    ptrdiff_t l = t - s;
    memcpy(sr, s, l);
    sr += l;
    s += l;

    /* Copy replacement text */
    memcpy(sr, new, newlen);
    sr += newlen;
    s += oldlen;
    }
    /* Copy non-matching tail */
    strcpy(sr, s);

    /* Quit when there are no more segments */
    if (len <= PTRDIFF_MAX)
    break;

    /* Increment dest string pointer by length of non-matching tail */
    sr += segstart + PTRDIFF_MAX - s;

    /* Restore original char at segment termination */
    segstart[PTRDIFF_MAX] = restore;

    /* Calculate the minimum number of char posns that we need to
    * decrement the start of the next segment by to ensure that
    * we don't miss matches */
    if (last == NULL || ((trailchars =
    (&segstart[PTRDIFF_MAX] - last) - oldlen) >= oldlen))
    trailchars = oldlen - 1;


    /* Set the start of the next segment */
    s = &segstart[PTRDIFF_MAX] - trailchars;
    segstart = s;

    /* Decrement the dest str pointer by the same amount as the segment */
    sr -= trailchars;

    /* Update the number of chars unchecked in the original string */
    len -= PTRDIFF_MAX - trailchars;
    }

    return ret;
    }

    int main(void)
    {
    char mystr[] = "##this is##an examp#le";
    char *newstr = NULL;

    puts(mystr);

    newstr = replace(mystr, "#", "****");
    if (newstr == NULL)
    puts("replace returned NULL");
    else
    puts(newstr);

    free(newstr);
    return 0;
    }

    --
    http://members.dodo.com.au/~netocrat
     
    Netocrat, Nov 7, 2005
    #14
  15. Paul

    Skarmander Guest

    Netocrat wrote:
    > On Mon, 07 Nov 2005 18:34:03 +0100, Skarmander wrote:
    >
    >>Netocrat wrote:
    >><snip>
    >>
    >>>The other minor optimisation is to use pointer arithmetic instead
    >>>of indexing in the second loop. Whether this is any more optimal in
    >>>practice depends on the compiler/hardware but it's more optimal on the
    >>>generic abstract machine I tend to envisage.

    >
    >
    > Aside from the fact that indexing is not possible using the alternative
    > you provide, this quote seems to be unrelated to the rest of your post,
    > which is really about library functions vs manual iteration, not indexing
    > vs pointer arithmetic.
    >

    Yes. I probably highlighted the wrong bit. Not to worry, right bits
    coming up.

    >><snip>
    >>
    >>> sr = ret;
    >>> while (*s) {
    >>> if (memcmp(s, old, oldlen) == 0) {
    >>> memcpy(sr, new, newlen);
    >>> sr += newlen;
    >>> s += oldlen;
    >>> } else
    >>> *sr++ = *s++;
    >>> }
    >>> *sr = '\0';
    >>>

    >>
    >>The way you've written it, it actually matters far more how memcmp() is
    >>implemented and how fast repeated single-byte copying is.
    >>
    >>Alternate version that does not operate on individual bytes and which
    >>may or may not be faster (but likely is):

    >
    >
    > That's the sort of code I had in mind when I wrote:
    > [restored snippage]
    >
    >>>Yes, again strncmp/memcmp are clearly preferable; and again modifying
    >>>the loop to use strstr less wastefully would introduce redundant
    >>>iteration over the string whilst copying, for which arguments similar
    >>>to the above apply.

    >
    >
    > IOW one reason I decided against this implementation is that it iterates
    > over the string redundantly - once for strstr to find a match and once for
    > memcpy to copy the searched-over part of the string.


    This conveniently ignores the fact that you issue a call to memcmp() for
    every byte you're iterating over. Of course this call (or macro, or
    inline function, or intrinsic, whatever) will on average terminate
    immediately after comparing the first character, but the overhead for
    this is not factored in.

    The iteration of my version is no more redundant than the fact that you
    have do a memcmp() on the first byte, then a copy if it doesn't match
    (that's two reads, not necessarily optimizable into one, depending on
    how memcmp() works). Arguments for why two iterations of low-level
    functions may have *less* overhead than doing two things in one
    high-level iteration coming up.

    > The relevant argument that I was referencing is: [restored snippage]
    >
    >>>it's possible that the library functions will be so much faster than
    >>>explicit iteration that despite the redundancy this approach wins.

    >
    >
    > So we agree - it may or may not be faster. I'll grant that for large
    > strings it may well be faster, because the implementation can likely
    > make use of implementation-specific hardware to copy large blocks of
    > memory.


    For small strings it may well be faster too, because the cache then
    becomes your friend. Copying blocks is a *very* efficient procedure on
    most implementations. Compare-and-copy-if-not-equal-repeat, as expressed
    in C, typically is not.

    The interesting aspect of this is that I didn't see the redundancy at
    all, because I don't think in terms of how strstr() and memcpy() are
    implemented. Technically, yes, bytes will have to be read twice, once
    for comparing, once for copying. This doesn't inspire me to write a copy
    loop in C, however. I'd go straight to assembler if I thought that was
    an issue.

    > However (and this is the second reason that I decided against
    > this approach) for large strings there is the problem that...
    >
    >
    >> const char* t;
    >>
    >> sr = ret;
    >> while ((t = strstr(s, old)) != NULL) {
    >> /* Copy non-matching prefix */
    >> ptrdiff_t l = t - s;
    >> memcpy(sr, s, l);
    >> sr += l;
    >> s += l;
    >>
    >> /* Copy replacement text */
    >> memcpy(sr, new, newlen);
    >> sr += newlen;
    >> s += oldlen;
    >> }
    >> /* Copy non-matching tail */
    >> strcpy(sr, s);
    >>
    >>AFAICT the standard doesn't actually guarantee a (non-negative)
    >>ptrdiff_t will fit in a size_t, but this does seem to be a fairly sane

    >
    >
    > ...the standard doesn't even guarantee that taking the difference of two
    > pointers is defined. Restoring more snippage:
    >
    >>>by 6.5.6#9 if the result of subtracting two pointers does not fit in a
    >>>ptrdiff_t type the behaviour is undefined.

    >

    Hold on, I did read it, but let's not exaggerate this. Of course the
    standard doesn't guarantee it, but there will be some meaningful range
    of values for which it holds. Otherwise there would be absolutely no
    point to pointer arithmetic ever, no?

    Please note that ptrdiff_t has to be capable of holding values of at
    least 65535. Yes, conceivably, you could run into trouble if matches are
    over 64K bytes away. But I for one would not switch to manually copying
    bytes because of this potential restriction, unless I were made aware of
    an *actual* restriction.

    The restriction in this case is having a ptrdiff_t which is much smaller
    than your size_t, and strings with matches that are more than ptrdiff_t
    bytes apart. There will undoubtedly be applications on DeathStation 9000
    models that have just this (and maybe some 8086 memory models,
    actually), and for those my code might not work -- granted. So the 100%
    portability prize won't be going to me this year, too bad. :)

    >>assumption in this case. Even so, on implementations where ptrdiff_t is
    >>very small and your strings are very large (or pointer arithmetic is
    >>just plain inefficient), this may not work. You could always write a
    >>custom strstr() that returns the offset as a size_t.

    >
    >
    > Coming full-circle, such a custom strstr would require the use of indexing...
    >
    > However it wouldn't have the advantage of being a library function and in
    > that case I _won't_ grant much likelihood that this approach will be
    > faster - the redundancy would not be offset by library function speed.
    >

    You're still underestimating memcpy(). Though I agree you'd get a big
    hit on a custom strstr(). While writing such a strstr() in assembly
    would be easy, this defeats the point of portable code.

    It's actually unfriendly of the standard library not to give us mem*()
    functions that can operate on pointer segments, or str*() functions that
    return indices. There seems to be no good reason for this impedance
    mismatch, other than the historical "oh, everything ought to fit in an
    int, don't worry" assumption.

    > Another possibility were one convinced that one's library string functions
    > were genuinely fast enough to warrant redundant iteration over the string
    > would be to pass in a non-const qualified string, and split it into
    > segments of size PTRDIFF_MAX, saving the character overwritten by the '\0'
    > terminator and restoring it after processing the segment.
    >

    <snip code>

    OK, now, I agree, this goes too far. This reaches the point of "not
    worth it". *If* things are such that ptrdiff_t isn't large enough, a
    custom strstr() should be used. If that slows things down too much, the
    byte-copying variant should be used. This code is too complex, and
    needlessly constrained on platforms that have no ptrdiff_t problem.

    S.
     
    Skarmander, Nov 7, 2005
    #15
  16. Paul

    Netocrat Guest

    On Tue, 08 Nov 2005 00:32:27 +0100, Skarmander wrote:
    > Netocrat wrote:
    >> On Mon, 07 Nov 2005 18:34:03 +0100, Skarmander wrote:
    >>>Netocrat wrote:

    [...]
    >> [O]ne reason I decided against [an implementation of a string
    >> replacement loop using strstr and strcpy] is that it iterates over the
    >> string redundantly - once for strstr to find a match and once for
    >> memcpy to copy the searched-over part of the string.

    >
    > This conveniently ignores the fact that you issue a call to memcmp() for
    > every byte you're iterating over.


    Yes, if you look at it like that, the first version has redundant
    operations too.

    [...]
    >> So we agree - it may or may not be faster. I'll grant that for large
    >> strings it may well be faster, because the implementation can likely
    >> make use of implementation-specific hardware to copy large blocks of
    >> memory.

    >
    > For small strings it may well be faster too, because the cache then
    > becomes your friend. Copying blocks is a *very* efficient procedure on
    > most implementations. Compare-and-copy-if-not-equal-repeat, as expressed
    > in C, typically is not.


    If I get inspired I'll do some benchmarking and post results.

    [...]
    >>>>by 6.5.6#9 if the result of subtracting two pointers does not fit in a
    >>>>ptrdiff_t type the behaviour is undefined.

    >>

    > Hold on, I did read it, but let's not exaggerate this. Of course the
    > standard doesn't guarantee it, but there will be some meaningful range
    > of values for which it holds. Otherwise there would be absolutely no
    > point to pointer arithmetic ever, no?


    Sure, it just means you can't rely on it for objects of arbitrary size
    unless you do a check that PTRDIFF_MAX >= SIZE_MAX (I know I'm not telling
    you anything new, just clarifying it). It's possible for static objects
    to be larger than SIZE_MAX but a program containing such objects is not
    strictly conforming (it exceeds an environmental limit).

    [...]
    > It's actually unfriendly of the standard library not to give us mem*()
    > functions that can operate on pointer segments,


    I'm not sure what you mean by a pointer segment.

    > or str*() functions that return indices.


    Sometimes such as for this function that would be useful, yes.

    [a string replacement function variant that splits the non-const qualified
    source string into segments of PTRDIFF_MAX]
    > OK, now, I agree, this goes too far. This reaches the point of "not
    > worth it". *If* things are such that ptrdiff_t isn't large enough, a
    > custom strstr() should be used.


    Unless the constraint against const qualified strings and string literals
    were a problem, I'd prefer the segmented function to a custom strstr
    since, as you accept, that one is subject to potential performance
    reduction that the segmented one isn't.

    > If that slows things down too much, the byte-copying variant should be
    > used. This code is too complex, and needlessly constrained on platforms
    > that have no ptrdiff_t problem.


    It is complex although the final result is readable (by me at least) and
    understandable by others through commenting (I hope).

    --
    http://members.dodo.com.au/~netocrat
     
    Netocrat, Nov 8, 2005
    #16
  17. Netocrat wrote:
    > On Sun, 30 Oct 2005 19:45:30 +0000, Mike Wahler wrote:
    > > "Paul" <> wrote in message
    > > news:...
    > >> hi, there,
    > >>
    > >> for example,
    > >>
    > >> char *mystr="##this is##a examp#le";
    > >>
    > >> I want to replace all the "##" in mystr with "****". How can I do this?

    >
    > Your code looked fine Mike but I gathered the OP wanted to deal more
    > generally with string rather than character replacement, so here's an
    > extended version.


    There are a couple of problems here. First, this code will malfunction
    on many 16 bit platforms if you pass it strings with length between 32K
    and 64K (ie INT_MAX < length < SIZE_MAX). Your signed 'i' variable
    will overflow (undefined behavior), and most likely wrap to -32768,
    which would cause the subscripting to access memory it should not be
    meddling with.

    Second, this code performs poorly even though it may look at a glance
    to be somewhat optimized. The first red flag is the extensive use of
    subscripting. The second red flag is the unusual modification of a
    loop variable ('i') within a loop, even though it is also modified by
    the loop ('i++'). The third is the unusual use of an equality operator
    (==) with strstr(), and the fourth is the character-by-character
    parsing (which can be quite slow).

    >
    > Output:
    >
    > ##this is##a examp#le
    > ****this is****a examp#le
    >
    > Code:
    >
    > #include <stdio.h>
    > #include <string.h>
    > #include <stdlib.h>
    >
    > char *replace(const char *s, const char *old, const char *new)
    > {
    > char *ret;
    > int i, count = 0;
    > size_t newlen = strlen(new);
    > size_t oldlen = strlen(old);
    >
    > for (i = 0; s != '\0'; i++) {
    > if (strstr(&s, old) == &s) {
    > count++;
    > i += oldlen - 1;
    > }
    > }


    Note that strstr() can be called for nearly every character.

    >
    > ret = malloc(i + count * (newlen - oldlen));
    > if (ret == NULL)
    > exit(EXIT_FAILURE);


    As other posters have noted, just return NULL to the caller.

    >
    > i = 0;
    > while (*s) {
    > if (strstr(s, old) == s) {
    > strcpy(&ret, new);
    > i += newlen;
    > s += oldlen;
    > } else
    > ret[i++] = *s++;
    > }
    > ret = '\0';


    Again, strstr() can be called for nearly every character.

    >
    > return ret;
    > }
    >

    <snip>

    To give an idea of the impact of string parsing missteps can make on
    program performance, I fed "Much Ado About Nothing" (the official play
    of comp.lang.c) through your replace function, then through mine. The
    entire file (121K) was loaded into a buffer, then fed 10 times to the
    replace function as such:

    /* ... */
    fread(str, 1, size, fp);
    str[size] = '\0';
    for(i = 0; i < 10; i++) {
    newstr = replace(str, "DON PEDRO", "NETOCRAT");
    free(newstr);
    }
    /* ... */

    The somewhat naieve implementation I whipped up quickly is:

    char *replace(const char *str, const char *old, const char *new)
    {
    char *ret, *r;
    const char *p, *q;
    size_t len_str = strlen(str);
    size_t len_old = strlen(old);
    size_t len_new = strlen(new);
    size_t count;

    for(count = 0, p = str; (p = strstr(p, old)); p += len_old)
    count++;

    ret = malloc(count * (len_new - len_old) + len_str + 1);
    if(!ret)
    return NULL;

    for(r = ret, p = str; (q = strstr(p, old)); p = q + len_old) {
    count = q - p;
    memcpy(r, p, count);
    r += count;
    strcpy(r, new);
    r += len_new;
    }
    strcpy(r, p);
    return ret;
    }

    Here are the timings on my machine. Both were compiled with the same
    flags on gcc 4 (-Wall -O2 -ansi -pedantic)

    [mark@icepick ~]$ time ./netocrat

    real 0m39.901s
    user 0m39.877s
    sys 0m0.008s

    [mark@icepick ~]$ time ./mark

    real 0m0.033s
    user 0m0.026s
    sys 0m0.007s

    Even though the somewhat naieve program traverses the input string
    multiple times, it's still a 99.9% execution time decrease.


    Mark F. Haigh
     
    Mark F. Haigh, Nov 8, 2005
    #17
  18. Paul

    Netocrat Guest

    On Tue, 08 Nov 2005 05:20:04 -0800, Mark F. Haigh wrote:
    > Netocrat wrote:

    [...]
    > There are a couple of problems here. First, [i should be size_t not int].


    Well spotted - no one else caught that one (I only picked it up after Dave
    Thompson's critique).

    > Second, this code performs poorly even though it may look at a glance to
    > be somewhat optimized.


    Yes; I've since posted code that's an improvement - it doesn't use
    subscripting in the replacement loop and it uses memcmp rather than
    strstr - but I realise that both you and Skarmander are correct that
    character-by-character operations should be avoided. As Skarmander points
    out my attempt to avoid redundant iteration did not in fact do that since
    memcmp is called on every character.

    [...]
    > To give an idea of the impact of string parsing missteps can make on
    > program performance, I fed "Much Ado About Nothing" (the official play
    > of comp.lang.c) through your replace function, then through mine. The
    > entire file (121K) was loaded into a buffer, then fed 10 times to the
    > replace function as such:
    >
    > /* ... */
    > fread(str, 1, size, fp);
    > str[size] = '\0';
    > for(i = 0; i < 10; i++) {
    > newstr = replace(str, "DON PEDRO", "NETOCRAT"); free(newstr);
    > }
    > /* ... */


    I've written a benchmark program, which I've made available here:
    http://members.dodo.com.au/~netocrat/c/source/replacebench.c

    It incorporates my fixed-up function, your function below and Skarmander's
    suggested snippet (which has an identical algorithm to yours although the
    expression is slightly different, and it uses memcpy where you've used
    strcpy). It also incorporates an optimised version which avoids calling
    strlen on the entire original string where possible - I posted a similar
    but less readable snippet in response to Dave T's post and as you and S
    have pointed out, it's a better approach.

    Inspired by the appropriateness of the play, I downloaded a copy. Here
    are some typical results (using the same compilation options as you except
    with -std=c99 and gcc version 3.3.6):

    Filename : /data/doc-dl/literature/Much_Ado_About_Nothing.txt
    Iterations : 1000
    Old : Leonato
    New : Haigh
    Function : replace_net
    Pre-length : 139007
    Post-length: 138883
    Duration : 17.280000 seconds

    Filename : /data/doc-dl/literature/Much_Ado_About_Nothing.txt
    Iterations : 1000
    Old : Leonato
    New : Haigh
    Function : replace_haigh
    Pre-length : 139007
    Post-length: 138883
    Duration : 1.250000 seconds

    Filename : /data/doc-dl/literature/Much_Ado_About_Nothing.txt
    Iterations : 1000
    Old : Leonato
    New : Haigh
    Function : replace_opt
    Pre-length : 139007
    Post-length: 138883
    Duration : 1.160000 seconds

    The fixed-up version of my original function is still about 15 times
    slower, but it's a lot better than it was.

    The optimisation of not calling strlen on the entire string cuts about 7%
    off the running time.

    > The somewhat naieve implementation I whipped up quickly is:


    I like the clarity with which you've used looping expressions.

    > char *replace(const char *str, const char *old, const char *new) {
    > char *ret, *r;
    > const char *p, *q;
    > size_t len_str = strlen(str);

    ^^^^^^^^^^^^^^

    This can be avoided (I know you've already qualified this as a somewhat
    naive implementation).

    > size_t len_old = strlen(old);
    > size_t len_new = strlen(new);
    > size_t count;
    >


    If len_old == len_new, there's no need to run this loop and you can simply
    malloc len_str + 1 bytes.

    > for(count = 0, p = str; (p = strstr(p, old)); p += len_old)
    > count++;
    >
    > ret = malloc(count * (len_new - len_old) + len_str + 1);
    > if(!ret)
    > return NULL;
    >
    > for(r = ret, p = str; (q = strstr(p, old)); p = q + len_old) {
    > count = q - p;


    As discussed elsethread, under C99 this invokes undefined behaviour when q
    - p > PTRDIFF_MAX. I vaguely recall discussions on the semantics under
    C89, but can't recall the conclusion. Also under C99 count would be
    better declared ptrdiff_t but it appears that you're writing to C89.

    > memcpy(r, p, count);
    > r += count;
    > strcpy(r, new);


    As Dave T pointed out memcpy is more appropriate here; but as compiled
    under gcc on my platform strcpy's very slightly faster than using memcpy -
    go figure.

    > r += len_new;
    > }
    > strcpy(r, p);
    > return ret;
    > }
    >
    > Here are the timings on my machine. Both were compiled with the same
    > flags on gcc 4 (-Wall -O2 -ansi -pedantic)

    [results omitted]
    > Even though the somewhat naieve program traverses the input string
    > multiple times, it's still a 99.9% execution time decrease.


    Here's the optimised C99 version of the function used for the benchmark
    results above. The #error seems to me to be the best way of dealing with
    the possibility of UB - it alerts the programmer to it at compile time and
    encourages him/her to examine the comments in the code whilst removing it.

    #include <string.h>
    #include <stdlib.h>
    #include <stddef.h>
    #include <stdint.h>

    # if PTRDIFF_MAX < SIZE_MAX
    /* If the following line causes compilation to fail,
    * comment it out after taking note of its message.
    */
    # error The replace function might invoke undefined \
    behaviour for strings with length greater than \
    PTRDIFF_MAX - see comments in the function body
    # endif

    char *replace(const char *str, const char *old, const char *new)
    {
    char *ret, *r;
    const char *p, *q;
    size_t oldlen = strlen(old);
    size_t count, retlen, newlen = strlen(new);

    if (oldlen != newlen) {
    for (count = 0, p = str; (q = strstr(p, old)) != NULL; p = q + oldlen)
    count++;
    /* this is undefined if p - str > PTRDIFF_MAX */
    retlen = p - str + strlen(p) + count * (newlen - oldlen);
    } else
    retlen = strlen(str);

    ret = malloc(retlen + 1);

    for (r = ret, p = str; (q = strstr(p, old)) != NULL; p = q + oldlen) {
    /* this is undefined if q - p > PTRDIFF_MAX */
    ptrdiff_t l = q - p;
    memcpy(r, p, l);
    r += l;
    memcpy(r, new, newlen);
    r += newlen;
    }
    strcpy(r, p);

    return ret;
    }

    --
    http://members.dodo.com.au/~netocrat
     
    Netocrat, Nov 9, 2005
    #18
  19. Paul

    pete Guest

    Netocrat wrote:

    > > size_t count;


    > > count = q - p;

    >
    > As discussed elsethread,
    > under C99 this invokes undefined behaviour when q - p > PTRDIFF_MAX.
    > I vaguely recall discussions on the semantics under
    > C89, but can't recall the conclusion. Also under C99 count would be
    > better declared ptrdiff_t but it appears that you're writing to C89.


    Whether or not to declare count ptrdiff_t,
    is not really an 89 vs 99 issue.
    The type of the expression, (q - p), is ptrdiff_t.
    If there's going to be undefined behavior,
    it's going to be from that expression.

    While there is no PTRDIFF_MAX macro in C89,
    there is still a maximum limit to the ptrdiff_t type,
    there's just no portable way to tell what is.
    I suppose you can assume that it's at least 127.

    --
    pete
     
    pete, Nov 10, 2005
    #19
  20. Paul

    Netocrat Guest

    On Thu, 10 Nov 2005 13:12:40 +0000, pete wrote:

    [when is the positive result of subtracting pointers UB in C89?]
    > While there is no PTRDIFF_MAX macro in C89,
    > there is still a maximum limit to the ptrdiff_t type,
    > there's just no portable way to tell what is.
    > I suppose you can assume that it's at least 127.


    That's clarified things. I don't have access to a C89 draft right now and
    wasn't sure that the ptrdiff_t type even existed in C89. So for C89 it's
    not possible to perform a specific compile-time check for UB. It seems
    like the best way to go about it is this:

    #if (__STDC_VERSION__ >= 199901L)
    #include <stdint.h>
    #if PTRDIFF_MAX < SIZE_MAX
    # error Under C99 and above the replace function might \
    invoke undefined behaviour for strings with length greater \
    than PTRDIFF_MAX - see comments in the function body
    #endif
    #else
    #ifdef __STDC__
    # error Under C89 the replace function might invoke \
    undefined behaviour for strings with length greater than the \
    maximum positive value of the ptrdiff_t type, but there is no \
    portable way to determine this value (check your compiler \
    documentation) - see comments in the function body
    #endif
    #endif

    --
    http://members.dodo.com.au/~netocrat
     
    Netocrat, Nov 10, 2005
    #20
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Dominique Deleris
    Replies:
    4
    Views:
    498
  2. Replies:
    7
    Views:
    463
    Robbie Hatley
    May 31, 2007
  3. Alun
    Replies:
    3
    Views:
    4,663
    Masudur
    Feb 18, 2008
  4. Replies:
    3
    Views:
    234
    Sherm Pendley
    Aug 3, 2005
  5. SM
    Replies:
    4
    Views:
    229
Loading...

Share This Page