Challenge: tightest code to find-replace a string

Discussion in 'C Programming' started by DFS, Jun 6, 2014.

  1. I tried it on s/a/b/ in input "aaa" and got "bab". From the symptom,
    that looks easy to fix but maybe not (I find your code hard to reason

    Do you really find this sort of test driver to be simpler than one I
    posted? I am trying to imagine your testing procedure. Do you have the
    source file the data file open in an editor and you keep re-compiling?
    How you verify the output? How you build-up a set of regression tests?
    Ben Bacarisse, Jun 9, 2014
    1. Advertisements

  2. it mustn't crash or fail to return on any input. But it probably doesn't matter
    if replace 'abcabc' 'x' called on abcabcabc returns xabc xx or even abcabcabc.
    The requirement for a function can be something like "makes the space invader
    crash realistically". That's not reducible to a pure mathematical specification.
    Malcolm McLean, Jun 9, 2014
    1. Advertisements

  3. Yes, the requirement for *some* functions can be that vague.

    But a string replacement function *can* be specified precisely,
    and there's no excuse for not doing so. And if it doesn't behave
    consistently, I'm not interested in using it; I'll just use a
    different function that's properly defined.

    Are you perhaps assuming that it doesn't matter because
    search-and-replace is performed interactively, and therefore the
    user can immediately see the results? Are you familiar with sed?
    Keith Thompson, Jun 9, 2014
  4. I'm familiar with the name but I've never used it.
    If you're implementing see presumably you have to implement the quirks
    as well, to avoid breaking existing scripts.
    But if "search and replace" is called on overlapping input, then almost
    always it indicates that the caller didn't expect the pattern to be applied
    to that situation. There's no ideal behaviour with a "pattern / replacement"
    interface. Probably you should pass an extra parameter in to indicate whether
    to be greedy, non-greedy, or flag up an error.
    Malcolm McLean, Jun 9, 2014
  5. DFS

    Richard Bos Guest

    On the contrary. If my word processor gave me anything but "xabc" under
    those circumstances, I'd start using another.

    Richard Bos, Jun 9, 2014
  6. DFS

    BartC Guest

    Yes; the next char after a successfully matched sequence is output directly,
    instead of being tested for the start of a new sequence.

    A carefully placed goto fixed that. (Alternatively, replacing
    'outchar(fout,c)' in the following by 'pendcount=0; pendchar=c;' seemed to

    if (*mp==0) { /* Full match obtained ... */

    Probably there are other issues; I just wanted to see what the code might
    look like (ie. not that elegant).
    I deliberately used a non-nested loop with some state (inside=0 or 1) to
    know what it was up to.
    I had several versions of the replace_in_file() call, of which all but one
    were commented out. For this simple program, I just stuck with the same
    input file, although could easily have been switched too within the source.
    (For a change, I used actual C to develop this, and it was the braces that
    caused most problems.)
    By visual examination. I had a last line that was system("ed " outfile);
    which displayed the output.
    If this was a serious project rather than just a diversion (from my compiler
    projects), then I'd use a different approach. Maybe then it's worth the
    effort of setting up the input data from the command line parameters, which
    I've always found fiddly.

    (Or more likely, I would have the test cases in a table of strings within
    the program. The inchar() and outchar() routines lend themselves to being
    adapted for string input and output.)
    BartC, Jun 9, 2014
  7. DFS

    Stefan Ram Guest

    You are responding to:

    |... called on abcabcabc returns xabc or even abcabcabc.

    Malcolm wrote:

    |... called on abcabcabc returns xabc xx or even abcabcabc.

    Malcolm failed to clearly indicate the boundaries of the string.
    Still, taking » xx« to be a part of the string, is the only way
    to parse the English sentence, as long as there is no comma after
    the »xabc«.
    Stefan Ram, Jun 9, 2014
  8. DFS

    Stefan Ram Guest

    #include <stdio.h>
    #include <string.h> /* for memset */
    #include <stdlib.h>

    /* streaming search-and-replace */

    /* this is an internal helper function for the following srp function. */

    static void srp_write( char const * from, size_t length, void( *put )( char ) )
    { char a; while( a = *from++, length-- )put( a ); }

    Copies from input to output replacing occurences of from-string by to-string.

    @param put this must be a putchar-like function representing the
    output to write the result to.

    @param get a getchar-like function representing the
    input to read from.

    @param from the search string. It might contain
    0-characters (untested). The function will search
    for that string in the input.

    @param from_length the length of the search string.
    The function will dynamically allocate two buffers
    with approximately as many bytes as this from_length.

    @param to the replacement string. It might contain
    0-characters (untested). The function will replace
    instances of the search string by this replacement string.

    @param to_length the lenght of the replacement string.

    @return 0 for success, other values to indicate lack of
    dynamic memory. */

    int srp
    ( void( *put )( char ),
    int( *get )(),
    char const * const restrict from, size_t const from_length,
    char const * const restrict to, size_t const to_length )
    { /* algorithm was invented by Stefan Ram while writing this function */
    char * const b = malloc( 2 + from_length ); if( !b )return 1;
    char * const s = malloc( 2 + from_length ); if( !s ){ free( b ); return 2; }
    int a; size_t o = 0; int t = 0;
    while( a =( t > 0 ? s[ --t ]: get() ), a > 0 )
    { if( o >= from_length ){ srp_write( to, to_length, put ); o = 0; }
    { if( a == o[ from ]){ o[ b ]=( char )a; ++o; }
    else { if( o ){ put( b[ 0 ]); s[ t++ ]=( char )a;
    for( size_t i = o - 1; i > 0; -- i )
    s[ t++ ]= i[ b ]; o =( size_t )0; } else put(( char )a ); }}}
    if( o ){ if( o >= from_length )srp_write( to, to_length, put );
    else srp_write( b, o, put ); } free( s ); free( b ); return 0; }
    Stefan Ram, Jun 9, 2014
  9. DFS

    Stefan Ram Guest

    It seems that this was an relict from a previous version and is
    not necessary anymore.
    To make this process NUL characters ('\0'), the condition
    »a > 0« should be changed into »a >= 0«. As I have not tested
    NUL character processing, it might still not work, but this is
    one step into this direction.
    Stefan Ram, Jun 9, 2014
  10. DFS

    Stefan Ram Guest

    The new version now needs just a single buffer:

    int srp( void( *put )( char ), int( *get )(),
    char const * const restrict from, size_t const from_length,
    char const * const restrict to, size_t const to_length )
    { /* algorithm was invented by Stefan Ram while writing this function */
    char * const s = malloc( 2 + from_length ); if( !s )return 1;
    int a; size_t o = 0; int t = 0;
    while( a =( t > 0 ? s[ --t ]: get() ), a >= 0 )
    { if( o >= from_length ){ srp_write( to, to_length, put ); o = 0; }
    { if( a == o[ from ])++o; else { if( o )
    { put( from[ 0 ]); s[ t++ ]=( char )a;
    for( size_t i = o - 1; i > 0; -- i )
    s[ t++ ]= i[ from ]; o =( size_t )0; } else put(( char )a ); }}}
    if( o ){ if( o >= from_length )srp_write( to, to_length, put );
    else srp_write( from, o, put ); } free( s ); return 0; }

    In the meantime, a single test with a NUL character in the
    input and the search string was passed.
    Stefan Ram, Jun 9, 2014
  11. It's a non-interactive tool, common (in fact, nearly universal) on
    UNIX-like systems. The name is derived from the phrase "stream editor".
    It's commonly used in scripts (i.e., programs written in the language
    implemented by a shell).
    I've been using sed for decades. As far as I know, it has no flag to
    indicate different semantics for overlapping input, and I've never
    encountered a need for such a flag.

    The POSIX specification is here:

    The relevant command in this context would be:

    sed 's/foo/bar/g'

    In the documentation, a BRE is a "Basic Regular Expression"; in this
    case, we're considering a simple string. The 'g' suffix causes it to
    replace all non-overlapping instances of "foo"; by default it only
    replaces the first on each line.

    An example:

    $ echo banana | sed 's/ana/Anana/g'

    As far as I can tell, no other output would be consistent with the
    specification (there are two occurrences of "ana", but they overlap so
    only the first is replaced) -- and I would consider any other output

    Though I can't think of a concrete example, it's likely I've used sed on
    text with overlapping occurrences of a substitution pattern and relied
    on the way it behaves.

    The problem you perceive with overlapping input is not a problem
    in practice. There is no real ambiguity in the behavior, and
    therefore no need for extra flags to resolve it.
    Keith Thompson, Jun 9, 2014
  12. DFS

    Stefan Ram Guest

    I should add that the search string's length must be
    greater than 0. (This is a part of the interface

    And here is a collection of test cases, including the
    length of each string, in the order: input, replace_from,
    replace_to, and expected_output:

    test( "aaa", 3, "a", 1, "b", 1, "bbb", 3 );
    test( "abcabcabc", 9, "abcabc", 6, "x", 1, "xabc", 4 );
    test( "alpha", 5, "alpha", 5, "epsilon", 7, "epsilon", 7 );
    test( "ababac", 6, "abac", 4, "ABAC", 4, "abABAC", 6 );
    test( "alphabebetagamma", 16, "bebe", 4, "epsilon", 7, "alphaepsilontagamma", 19 );
    test( "alphabebetagamma", 16, "beta", 4, "epsilon", 7, "alphabeepsilongamma", 19 );
    test( "alphabebebetagamma", 18, "beta", 4, "epsilon", 7, "alphabebeepsilongamma", 20 );
    test( "abcdefghi", 9, "def", 3, "DEF", 3, "abcDEFghi", 9 );
    test( "__01234__", 9, "01234", 5, "!", 1, "__!__", 5 );
    test( "__012301234__", 13, "01234", 5, "!", 1, "__0123!__", 9 );
    test( "__0123012301234__", 17, "01234", 5, "!", 1, "__01230123!__", 13 );
    test( "__0123401234__", 14, "01234", 5, "!", 1, "__!!__", 6 );
    test( "0123401234__", 12, "01234", 5, "!", 1, "!!__", 4 );
    test( "__0123401234", 12, "01234", 5, "!", 1, "__!!", 4 );
    test( "alpha", 5, "alpha", 5, "", 0, "", 0 );
    test( "aalpha", 6, "alpha", 5, "", 0, "a", 1 );
    test( "aaalpha", 7, "alpha", 5, "", 0, "aa", 2 );
    test( "alphaX", 6, "alpha", 5, "", 0, "X", 1 );
    test( "aalphaX", 7, "alpha", 5, "", 0, "aX", 2 );
    test( "aaalphaX", 8, "alpha", 5, "", 0, "aaX", 3 );
    test( "", 0, "alpha", 5, "", 0, "", 0 );
    test( "alpha", 5, "a", 1, "x", 1, "xlphx", 5 );
    test( "abcd\0fghi", 9, "d\0f", 3, "DEF", 3, "abcDEFghi", 9 );

    The last test case includes a NUL character.
    Stefan Ram, Jun 9, 2014
  13. "DFS" wrote in message news:lmrer6$4l3$...
    Can I get the most recent requirements as a summary?

    AFAICT, you want a one-to-many transformation.

    A simple example:

    s = "abcdefgabcdefghijklmnop"

    A possible transformation could be:

    "abcdefg" => "aaaAAaa"


    s_t = "aaaAAaaaaaAAaahijklmnop"


    FWIW, this can be easily rendered into a case-sensitive function.
    Chris M. Thomasson, Jun 9, 2014
  14. It's slightly easier to write a greedy matcher. So probably that's what
    they did, and wrote the specifications afterwards.
    No it's not. If you call a pattern on an input with overlapping
    sequences, almost certainly you haven't thought through the output
    you actually want. The fix is usually to extend the pattern.
    Malcolm McLean, Jun 9, 2014
  15. DFS

    Stefan Ram Guest

    In the special case that I gave, the replacements where also
    overlapping. Only then is it possible to do both replacements
    at the same time.

    Overlapping strings occur rarely. It might make sense when
    information is stored in a compressed manner and strings are
    obtained from a large buffer by offset and length. It would
    be funny if one could find such as case in the DNA for
    protein encoding.

    However, my point was something else:

    Before the implementation of an interface can be
    tackled in source code, its specification in the
    English language must be complete and clear.

    This means that even requirements that might seem
    »self-evident« to some parties still needs to be stated
    in a univocal manner.

    Dennis Ritchie, Donald E. Knuth, were or are they not all
    masters of the English language, too?

    »I've found that some of the best [Software ]developers
    of all are English majors. They'll often graduate with
    no programming experience at all, and certainly without
    a clue about the difference between DRAM and EPROM.

    But they can write. That's the art of conveying
    information concisely and clearly. Software development
    and writing are both the art of knowing what you're going
    to do, and then lucidly expressing your ideas.«

    Paul Potts

    »Besides a mathematical inclination, an exceptionally
    good mastery of one's native tongue is the most vital
    asset of a competent programmer.«

    Edsgar Dijkstra

    »While sloppy writing does not invariably mean sloppy
    thinking, we've generally found the correlation to be
    strong -- and we have no use for sloppy thinkers.
    If you can't yet write competently, learn to.«

    Eric Raymond

    »The narrative measures of conjunction use, event
    content, perspective shift, and mental state reference
    were significantly predictive of later Math scores.« and math.pdf

    »I have never, ever, ever seen a great software developer
    who does not have amazing attention to detail.«
    Stefan Ram, Jun 9, 2014
  16. DFS

    Stefan Ram Guest

    And also:

    Whenever a part of requirements for a program
    is given in the subject line of a Usenet post,
    it should be repeated in the body of the message.
    Stefan Ram, Jun 9, 2014
  17. DFS

    Stefan Ram Guest

    The chromosome of the phi X 174 virus encodes nine different
    proteins. But the virus DNA actually seemed to be too small
    to encode them all!

    In 1977, Fredrick Sanger discovered that the same parts of
    DNA of this virus was encoding more than one protein.

    An online encyclopedia uses the term:

    »its highly overlapping genome«
    Stefan Ram, Jun 9, 2014
  18. "Chris M. Thomasson" wrote in message

    WRT "infinitely" overlapping search string in the to be transformed string:

    s = "01"


    "01" => "0101"

    s[0] = "01"
    s[1] = "0101"
    s[2] = "01010101"
    s[3] = "0101010101010101"

    You would have to set an iteration threshold to help the
    program avoid an infinite loop, and an ever expanding

    It expands like a L-system fractal.
    Chris M. Thomasson, Jun 9, 2014
  19. DFS

    BartC Guest

    Stefan Ram kindly posted a set of test cases (see elsewhere in the thread).

    They all passed in my code (after converting to operating with string i/o),
    except two.

    One involved an embedded nul character. The other was an odd one, because it
    worked with the file version. But I've had to change the code dealing with
    the remaining characters on exiting the main loop, to make that work.

    The new code is here:

    This contains one test case. Stefan's list of test cases can be pasted
    directly into the main() function. (I didn't want to distribute someone
    else's code on that site.)

    (BTW if you think my code is difficult to follow, have a look at his
    find-replace code, and see if you still have the same opinion...)
    BartC, Jun 9, 2014
  20. DFS

    Stefan Ram Guest

    In C, one can write: »while(1);«.

    Not having an iteration threshold in C means that the
    language is simple and fast, albeit not without risks.

    One can as well decide to specify requirements in this
    spirit of C: »If the client calls me to have an infinite
    loop then this be his will.« So, there is no need for an
    iteration threshold, just to avoid infinite loops.
    (In a given context, there still might be other valid
    reasons to specify such a threshold.)
    Stefan Ram, Jun 9, 2014
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.