Challenge: tightest code to find-replace a string

Discussion in 'C Programming' started by DFS, Jun 6, 2014.

  1. DFS

    DFS Guest

    I think it's fine.


    changes == replacements

    huh?

    A lot of find-replace programs count and report how many replacements
    were made.



    An incorrect result.

    After the first find-replace of 'abcabc' by 'defdef', there is no more
    'abcabc':

    "012abcabcabc789" would become "012defdefabc789"

    and the program would increment the counter and continue on.

    This is how my original code handles it, and how the other
    implementations found in text apps (LibreOffice Writer, gEdit, Leafpad,
    etc) I looked at handle it.


    Another Stefan Ram fail post, like your first one in this thread.
     
    DFS, Jun 7, 2014
    #21
    1. Advertisements

  2. DFS

    DFS Guest


    So are you talking about a separate function to count the number of
    potential replacements before making them?

    I've only ever seen find-replace functions report how many replacements
    were actually made, after the fact.



    Above you said my original code "reports rather than counts these
    matches." and "I would never write a function with this spec."

    Maybe I'm confused, but it looks like you did exactly that. Your
    program makes the replacements, then reports how many were made.

    From your earlier statements, I was expecting a separate function to
    count the number of matches.





    I definitely like the functionality in main() and the replace-string()
    code, but there might be bugs:

    original input data
    replace 46 with ----

    1: 14513111664214260256543011122553234523520226455552
    2: 41602561064325541006060354620223361346535061545034
    3: 63164621623130051346620535103421535300201464252314
    4: 30013144611120401561305220534605456101542562311260
    5: 30501506124251042546364005110661421500320026101445
    6: 35355334213621124600100142264440253516210400362562
    7: 65140560414014522562466550406113020500531011441421
    8: 60543325410345553336424511333322104440166124450061
    9: 44310321435636412163052026304311532342515351020026
    10: 10536502643531635353214012163164121056142415600245

    should return:

    1: 14513111664214260256543011122553234523520226455552
    2: 4160256106432554100606035----202233613----535061545034
    3: 6316----216231300513----620535103421535300201----4252314
    4: 3001314----1112040156130522053----05456101542562311260
    5: 305015061242510425----364005110661421500320026101445
    6: 3535533421362112----00100142264440253516210400362562
    7: 65140560414014522562----6550406113020500531011441421
    8: 60543325410345553336424511333322104440166124450061
    9: 44310321435636412163052026304311532342515351020026
    10: 10536502643531635353214012163164121056142415600245



    * running your code with stopper EOF removes most occurrences of
    standalone 4, but not all. Looks like repeating characters (eg 44)
    aren't handled correctly, so it missed the first '46' in line 4, which
    probably lead to it undercounting the replacements by one.

    output:
    1: 15131116621260256530111225532352352022655552
    2: 1602561063255100606035----202233613----5350615503
    3: 6316----216231300513----62053510321535300201----425231
    4: 30013146111200156130522053----055610152562311260
    5: 3050150612251025----360051106612150032002610145
    6: 353553321362112----0010012264025351621000362562
    7: 6510560101522562----6550061130205005310114121
    8: 605332510355533362511333322104016612450061
    9: 431032135636121630520263031153232515351020026
    10: 10536502635316353532101216316121056121560025


    * running it with EOF stopper reports all replacements were made in line 1

    [dfs@home files]$ ./find_replace_BenB 46 ---- random.txt randomOut.txt
    Line 1: 9 replacements
    9 replacements


    * running it with the '\n' stopper miscounts the lines, and removes the
    line breaks.

    [dfs@home files]$ ./find_replace_BenB 46 ---- random.txt randomOut.txt
    Line 1: 0 replacements
    Line 2: 2 replacements
    Line 3: 3 replacements
    Line 4: 1 replacements
    Line 5: 1 replacements
    Line 6: 1 replacements
    Line 7: 1 replacements
    Line 8: 0 replacements
    Line 9: 0 replacements
    Line 10: 0 replacements
    Line 11: 0 replacements
    9 replacements

    output:
    151311166212602565301112255323523520226555521602561063255100606035----202233613----53506155036316----216231300513----62053510321535300201----42523130013146111200156130522053----0556101525623112603050150612251025----360051106612150032002610145353553321362112----00100122640253516210003625626510560101522562----655006113020500531011412160533251035553336251133332210401661245006143103213563612163052026303115323251535102002610536502635316353532101216316121056121560025
     
    DFS, Jun 7, 2014
    #22
    1. Advertisements

  3. DFS

    BartC Guest

    I /think/ the point he's trying to make is this:

    Suppose you were trying to replace all "46" strings in a file by "74". With
    an input file of:

    "1466272"

    your code would produce:

    "1746272"

    However, there is now a new "46" in the output! And if changing "46" to
    "64", then an input of:

    "4466" would produce "4646", twice as many as you started with! Processing
    this again gives you "6464", and ocne more, produces "6644". In this
    example, if will eventually stabilise.

    But if the rule is change every "A" to "AA", then an input file of just "A"
    could produce an infinitely long file of "A"s, if you went back and
    reprocessed.

    Or, you might want to eliminate all "BART"s by replacing with "", but this
    input:

    "BABARTRT" will still give you "BART"!

    So the problem is that find and replace hasn't been rigidly defined. What
    you probably intend is that once a substring S in the source has been
    replaced by T, then the process starts afresh at the character following the
    original S; T is not examined at all. This does mean that all S substrings
    will be eliminated from the output.
     
    BartC, Jun 8, 2014
    #23
  4. DFS

    Ian Collins Guest

    Or just write:

    const size_t findSize = strlen(find);
    const char* stop = strstr(start, find);

    while( stop )
    {
    fwrite(start, 1, stop - start, outFile);
    fputs(replace, outFile);
    start = stop + findSize;
    k++;
    stop = strstr(start, find);
    }

    fputs(start, outFile);
     
    Ian Collins, Jun 8, 2014
    #24
  5. DFS

    Richard Bos Guest

    Very limited. Stop posting in a vacuum.

    Richard
     
    Richard Bos, Jun 8, 2014
    #25
  6. Depends.
    If you're providing a "search and replace" tool for use on human-readable text, it hardly
    matters how it handles the overlapping cases, because it will rarely be called for such,
    and any reasonable behaviour is likely to be accepted.
     
    Malcolm McLean, Jun 8, 2014
    #26
  7. No. I was thinking of a direct replacement for your posted code that
    was built with more flexible components.

    fwrite(match, 1, mp - match, fo);
    I forgot the above as pointed out by MM.
    The replace reports nothing and can therefore be used to more flexibly.
    One such use is to duplicate the output you want (I can't imagine why
    you want that output, but that's another matter).

    Yes, I forgot a crucial line. I did not re-post as Malcolm reported it
    within a few hours.

    <snip>
     
    Ben Bacarisse, Jun 8, 2014
    #27
  8. DFS

    DFS Guest


    Added that line, but it's still not working right
    (replace 46 with ----)

    original input

    1: 14513111664214260256543011122553234523520226455552
    2: 41602561064325541006060354620223361346535061545034
    3: 63164621623130051346620535103421535300201464252314
    4: 30013144611120401561305220534605456101542562311260
    5: 30501506124251042546364005110661421500320026101445
    6: 35355334213621124600100142264440253516210400362562
    7: 65140560414014522562466550406113020500531011441421
    8: 60543325410345553336424511333322104440166124450061
    9: 44310321435636412163052026304311532342515351020026
    10: 10536502643531635353214012163164121056142415600245


    run with stopper = EOF
    * now it leaves all the 46s in there
    * still says all replacements were made in line 1
    * still doesn't handle repeating numbers correctly (ie 446), so it
    misses one of the 46s. See Line 4.

    14513111664214260256543011122553234523520226455552
    4160256106432554100606035----46202233613----46535061545034
    6316----46216231300513----46620535103421535300201----464252314
    3001314461112040156130522053----4605456101542562311260
    305015061242510425----46364005110661421500320026101445
    3535533421362112----4600100142264440253516210400362562
    65140560414014522562----466550406113020500531011441421
    60543325410345553336424511333322104440166124450061
    44310321435636412163052026304311532342515351020026
    10536502643531635353214012163164121056142415600245



    run with stopper = \n
    * now it leaves all the 46s in there
    * still miscounts the # of lines
    * still doesn't handle repeating numbers correctly (ie 446), so it
    misses one of the 46s.

    145131116642142602565430111225532345235202264555524160256106432554100606035----46202233613----46535061545036316----46216231300513----46620535103421535300201----464252313001314461112040156130522053----4605456101542562311260305015061242510425----463640051106614215003200261014453535533421362112----460010014226444025351621040036256265140560414014522562----466550406113020500531011441421605433254103455533364245113333221044401661244500614431032143563641216305202630431153234251535102002610536502643531635353214012163164121056142415600245
     
    DFS, Jun 8, 2014
    #28
  9. DFS

    DFS Guest


    I think the point he's trying to make is he doesn't understand how
    find-replace works in real life.



    Yep.

    Run it again and it's 1774272.




    Why would I need to "rigidly define" find-replace (on c.l.c on Usenet of
    all places) when it's been done thousands of times over the last 40
    years, and everyone knows what it means?



    Has there EVER been a find-replace that worked differently?




    You may have to run it again.

    1st run: 99.5% fixed
    2nd run: 99.995 fixed
    etc

    In the real world, I've never seen a find-replace implementation that
    does more than one iteration at a time.
     
    DFS, Jun 8, 2014
    #29
  10. DFS

    DFS Guest


    A voice of reason?

    You may have trouble fitting in on clc.
     
    DFS, Jun 8, 2014
    #30
  11. It matters very much if the handling of overlapping cases can trigger an
    infinite loop.

    And even for use on human-readable text, I fail to see how a precise and
    unambiguous definition of the behavior is unimportant.
     
    Keith Thompson, Jun 8, 2014
    #31
  12. Don't you think that at least some of the people who've implemented
    find-replace thousands of times were working from an unambiguous
    specification?
     
    Keith Thompson, Jun 8, 2014
    #32
  13. Yes, you are right. I should not have been so hasty as to accept
    Malcolm's suggestion. The algorithm more fiddly than that. When the
    match fails, you can't just print what you saw and reset mp = match.
    The test and what do is a bit more subtle.

    Having posted too fast twice already, I am reluctant go for a third, but
    what the hell...

    int replace_string(const char *match, const char *repl, int stopper,
    FILE *fi, FILE *fo)
    {
    int nmatches = 0, c;
    const char *mp = match;
    while ((c = fgetc(fi)) != EOF && c != stopper)
    if (c == *mp) {
    if (!*++mp) {
    ++nmatches;
    fputs(repl, fo);
    mp = match;
    }
    }
    else {
    if (strspn(match, (char[]){c, 0}) < mp - match) {
    fwrite(match, 1, (size_t)(mp - match), fo);
    mp = match;
    }
    fputc(c, fo);
    }
    return nmatches;
    }

    <snip>
     
    Ben Bacarisse, Jun 9, 2014
    #33
  14. DFS

    Siri Crews Guest

    echo $string | sed s/original/changed/

    It's a poor workman that blames his tools.
     
    Siri Crews, Jun 9, 2014
    #34
  15. DFS

    Ike Naar Guest

    Sorry but this fails to match the pattern "abac" in input text "ababac".
     
    Ike Naar, Jun 9, 2014
    #35
  16. DFS

    BartC Guest

    (My original post seems to have disappeared; I can't see it anyway. This is
    another attempt.)
    In real life, it isn't as simple as you seem to think it is either.

    For example, global find & replace in even a simple text editor might ask
    you if you wanted to restrict matching to whole words, and what to do about
    matching case. (And if case isn't matched exactly, and you were translating
    "abc" to "xyz", you'd need to decide whether "ABC" in the input should be
    replaced by "xyz" or "XYZ".)
    You've also chosen to bring lots of other aspects into the same challenge:
    using files (I would have routines working from memory to memory, as that is
    a more typical context of how it might be used); combining the match method
    with the replacement (a separate match would allow fuzzier matching or to
    look for patterns); doing all replacements in one place instead of one at a
    time; etc.

    I've re-stated your task as follows:

    * Copy all bytes from one file to another, except:

    * Any byte-sequence S in the input, should be suppressed, and the
    byte-sequence T should be written to the output instead.

    * After such a replacement, the matching process continues with the byte
    following S in the input; overlapping is ignored. At the end of the whole
    process, there could still be S sequences in the input.

    * S must consist of one or more bytes (otherwise nothing is matched so is a
    straight copy); T can be zero or more bytes, and can be shorter or longer
    than S. The input file can be zero or more bytes. S can match all bytes in
    the input file (so the entire file is replaced by T).

    * The process should be line-oriented, so any S containing end-of-line
    cannot be matched (I think T can contain end-of-line). (You also allow an
    arbitrary upper limit on a line-length, so that any lines exceeding that
    limit might not be processed properly. This means another implementation
    might give different results if they use a different limit, or don't have
    one at all.)

    * Replacements must be reported on a line-by-line basis, and the total for
    the file.
    If you have a very large file such as "4646464646...." and wanted to changed
    all "46" to "64", then it will need a lot of iterations! (I think each one
    will only fix one character, or one at each end.)
     
    BartC, Jun 9, 2014
    #36
  17. Eh? I don't think I'm blaming any tools. Maybe you are making a more
    general remark that I've not understood.
     
    Ben Bacarisse, Jun 9, 2014
    #37
  18. You know, I thought it would! I could not shake the feeling that it was
    more complicated than I'd originally assumed but hope triumphed over
    experience and I posted again!

    I am now pretty sure that you need something like the table from the KMP
    algorithm to know where to reset the failed match. That's not
    surprising and I am disappointed that I did not see that right away. In
    this case, you'd generate the required element on the fly, and it may
    still be worth it, but I'm not going to have a go now!
     
    Ben Bacarisse, Jun 9, 2014
    #38
  19. DFS

    BartC Guest

    I tried something based on your code (but then rewritten). It worked fine,
    until I tried it on "ababac", which didn't work.

    I did manage it eventually, but it needed some ugly extra logic as shown
    below. (The problem, with this stream model, is that doesn't have a record
    of what's been read so can't back up; it has to make use of portions of
    match, plus remember the last char read (pendchar).)

    However this has been little tested beyond the OP's test file, and the
    ababac example.

    (The original used a regular while (c = inchar()... loop, but I needed
    while (1) to make it work. Sorry I can't get away from those types of loops.
    The code isn't 'tight' either, but I deliberately didn't use any built-in
    string searching such as strstr or strspan.)


    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int inchar(FILE* f) {
    int c;
    c=fgetc(f);
    if (c==EOF) return 0;
    return c;
    }

    void outchar(FILE* f,int c){
    fputc(c,f);
    }

    void outstrn(FILE* f,char* s,int n){
    fprintf(f,"%.*s",n,s); /* (probably an uncounted version would do) */
    }

    int replace_in_file(char *match, char *repl, FILE* fin, FILE* fout) {
    int nmatches=0,c;
    int inside;
    char *mp;
    int repllen=strlen(repl);
    char* pendstr;
    char pendchar;
    int pendcount;

    mp = match;
    inside=0;
    pendstr="";
    pendcount=0;
    pendchar=0;

    while(1) {
    if (pendcount) { /* Ugly hack needed for ababac problem */
    c=*pendstr;
    ++pendstr;
    --pendcount;
    } else if (pendchar) {
    c=pendchar;
    pendchar=0;
    } else {
    c=inchar(fin);
    if (c==0) break;
    }

    if (c!=*mp) {
    if (!inside) { /* Outside of a match string; pass it thru */
    outchar(fout,c);
    } else { /* Currently inside possible matched seq */
    if (*mp==0) { /* Full match obtained, replace that sequence
    */
    outstrn(fout,repl,repllen);
    ++nmatches;
    outchar(fout,c);
    } else { /* Partial match only; must back up */
    outchar(fout,*match);
    pendchar=c;
    pendstr=match+1;
    pendcount=mp-match-1;
    }
    inside=0;
    mp=match;

    }
    } else { /* Char match */
    inside=1;
    ++mp;
    }
    }

    if (inside && *mp==0) {
    outstrn(fout,repl,repllen);
    }
    return nmatches;
    }

    int main(int argc, char **argv) {
    #define infile "kkk1"
    #define outfile "kkk2"
    FILE *fin,*fout;
    int i,c;

    fin=fopen(infile,"r");
    fout=fopen(outfile,"w");
    if (!fin || !fout) exit(1);

    replace_in_file("abac","ABAC",fin,fout);
    fclose(fin);
    fclose(fout);
    }
     
    BartC, Jun 9, 2014
    #39
  20. [...]

    I'd rephrase that. You say that S *must* consist of one or more bytes,
    then describe what happens if S is empty.

    Instead, I'd say:

    If S consists of zero bytes, the input is copied to the output with
    no change.

    Some of the requirements you state follow from other requirements, such
    as "... and can be shorter or longer than S". You might want to
    distinguish such statements from actual requirements, say by putting
    them in parentheses.
     
    Keith Thompson, Jun 9, 2014
    #40
    1. Advertisements

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments (here). After that, you can post your question and our members will help you out.