Challenge: tightest code to find-replace a string

D

DFS

I think this is not a sufficient specification of requirements.

I think it's fine.


For example, it mentions »changes« in line two, but before line
two, it was not said that anything should be changed at all. So
it is not clear what »changes« refers to.

changes == replacements

And »to count« something is not behavior that is visible from the
outside.

huh?

A lot of find-replace programs count and report how many replacements
were made.


BTW: When given the task

»replace all the occurences of "abcabc" by "defdef" in
"012abcabcabc789" would
"012defdefdef789" be a correct result? the only correct result?«


An incorrect result.

After the first find-replace of 'abcabc' by 'defdef', there is no more
'abcabc':

"012abcabcabc789" would become "012defdefabc789"

and the program would increment the counter and continue on.

This is how my original code handles it, and how the other
implementations found in text apps (LibreOffice Writer, gEdit, Leafpad,
etc) I looked at handle it.


Another Stefan Ram fail post, like your first one in this thread.
 
D

DFS

It reports rather than counts these matches. I would never write a
function with this spec. because it destroys its usefulness in other
contexts. A function should do one thing well.


So are you talking about a separate function to count the number of
potential replacements before making them?

I've only ever seen find-replace functions report how many replacements
were actually made, after the fact.


I'd write a string match/replace function that returns the number of
matches. If I needed the counts reported by line, I'd write a wrapper
that adds those.

int replace_string(const char *match, const char *repl, int stopper,
FILE *fi, FILE *fo)
{
int nmatches = 0, c;
const char *mp = match;
while ((c = fgetc(fi)) != EOF && c != stopper)
if (c == *mp) {
if (!*++mp) {
++nmatches;
fputs(repl, fo);
}
}
else {
mp = match;
fputc(c, fo);
}
return nmatches;
}

Called with stopper == EOF it processes a whole file. Note how removing
the line buffer actually simplifies the code, whilst also removing an
unnecessary restriction. It's not uncommon for this to happen (there
was a recent thread about this).

Called with stopper == '\n' it processes a line and so this wrapper
prints the report:

void replace_string_report(const char *match, const char *repl,
FILE *fi, FILE *fo)
{
int total_matches = 0, lineno = 0;
while (!feof(fi)) {
int nm = replace_string(match, repl, '\n', fi, fo);
printf("\nLine %d: %d replacements\n", ++lineno, nm);
total_matches += nm;
}
printf("%d replacements\n", total_matches);
}


Above you said my original code "reports rather than counts these
matches." and "I would never write a function with this spec."

Maybe I'm confused, but it looks like you did exactly that. Your
program makes the replacements, then reports how many were made.

From your earlier statements, I was expecting a separate function to
count the number of matches.



Here's the driver for testing.

int main(int argc, char **argv)
{
if (argc > 2) {
FILE *fin = argc > 3 ? fopen(argv[3], "r") : stdin;
FILE *fout = argc > 4 ? fopen(argv[4], "w") : stdout;
if (fin && fout)
replace_string_report(argv[1], argv[2], fin, fout);
}
}

Functions that mix tasks that can be logically separated are best
avoided. Functions with hard-wired file names and strings are, well,
let's just say, sub-optimal. Students used to say "but it's because I'm
just testing" but a simple driver like the one above makes testing
much easier than having the files and strings hard wired.

<snip>



I definitely like the functionality in main() and the replace-string()
code, but there might be bugs:

original input data
replace 46 with ----

1: 14513111664214260256543011122553234523520226455552
2: 41602561064325541006060354620223361346535061545034
3: 63164621623130051346620535103421535300201464252314
4: 30013144611120401561305220534605456101542562311260
5: 30501506124251042546364005110661421500320026101445
6: 35355334213621124600100142264440253516210400362562
7: 65140560414014522562466550406113020500531011441421
8: 60543325410345553336424511333322104440166124450061
9: 44310321435636412163052026304311532342515351020026
10: 10536502643531635353214012163164121056142415600245

should return:

1: 14513111664214260256543011122553234523520226455552
2: 4160256106432554100606035----202233613----535061545034
3: 6316----216231300513----620535103421535300201----4252314
4: 3001314----1112040156130522053----05456101542562311260
5: 305015061242510425----364005110661421500320026101445
6: 3535533421362112----00100142264440253516210400362562
7: 65140560414014522562----6550406113020500531011441421
8: 60543325410345553336424511333322104440166124450061
9: 44310321435636412163052026304311532342515351020026
10: 10536502643531635353214012163164121056142415600245



* running your code with stopper EOF removes most occurrences of
standalone 4, but not all. Looks like repeating characters (eg 44)
aren't handled correctly, so it missed the first '46' in line 4, which
probably lead to it undercounting the replacements by one.

output:
1: 15131116621260256530111225532352352022655552
2: 1602561063255100606035----202233613----5350615503
3: 6316----216231300513----62053510321535300201----425231
4: 30013146111200156130522053----055610152562311260
5: 3050150612251025----360051106612150032002610145
6: 353553321362112----0010012264025351621000362562
7: 6510560101522562----6550061130205005310114121
8: 605332510355533362511333322104016612450061
9: 431032135636121630520263031153232515351020026
10: 10536502635316353532101216316121056121560025


* running it with EOF stopper reports all replacements were made in line 1

[dfs@home files]$ ./find_replace_BenB 46 ---- random.txt randomOut.txt
Line 1: 9 replacements
9 replacements


* running it with the '\n' stopper miscounts the lines, and removes the
line breaks.

[dfs@home files]$ ./find_replace_BenB 46 ---- random.txt randomOut.txt
Line 1: 0 replacements
Line 2: 2 replacements
Line 3: 3 replacements
Line 4: 1 replacements
Line 5: 1 replacements
Line 6: 1 replacements
Line 7: 1 replacements
Line 8: 0 replacements
Line 9: 0 replacements
Line 10: 0 replacements
Line 11: 0 replacements
9 replacements

output:
151311166212602565301112255323523520226555521602561063255100606035----202233613----53506155036316----216231300513----62053510321535300201----42523130013146111200156130522053----0556101525623112603050150612251025----360051106612150032002610145353553321362112----00100122640253516210003625626510560101522562----655006113020500531011412160533251035553336251133332210401661245006143103213563612163052026303115323251535102002610536502635316353532101216316121056121560025
 
B

BartC

DFS said:
On 06/06/2014 02:30 PM, Stefan Ram wrote:



An incorrect result.

After the first find-replace of 'abcabc' by 'defdef', there is no more
'abcabc':

"012abcabcabc789" would become "012defdefabc789"

I /think/ the point he's trying to make is this:

Suppose you were trying to replace all "46" strings in a file by "74". With
an input file of:

"1466272"

your code would produce:

"1746272"

However, there is now a new "46" in the output! And if changing "46" to
"64", then an input of:

"4466" would produce "4646", twice as many as you started with! Processing
this again gives you "6464", and ocne more, produces "6644". In this
example, if will eventually stabilise.

But if the rule is change every "A" to "AA", then an input file of just "A"
could produce an infinitely long file of "A"s, if you went back and
reprocessed.

Or, you might want to eliminate all "BART"s by replacing with "", but this
input:

"BABARTRT" will still give you "BART"!

So the problem is that find and replace hasn't been rigidly defined. What
you probably intend is that once a substring S in the source has been
replaced by T, then the process starts afresh at the character following the
original S; T is not examined at all. This does mean that all S substrings
will be eliminated from the output.
 
I

Ian Collins

Noob said:
This doesn't "feel" very idiomatic.

Perhaps

while ((stop = strstr(start, find)) != NULL)

or even

while (stop = strstr(start, find))

The second one raises warnings with most compilers.

while ((stop = strstr(start, find)))

may shut them up.

Or just write:

const size_t findSize = strlen(find);
const char* stop = strstr(start, find);

while( stop )
{
fwrite(start, 1, stop - start, outFile);
fputs(replace, outFile);
start = stop + findSize;
k++;
stop = strstr(start, find);
}

fputs(start, outFile);
 
M

Malcolm McLean

I think it's fine.
Depends.
If you're providing a "search and replace" tool for use on human-readable text, it hardly
matters how it handles the overlapping cases, because it will rarely be called for such,
and any reasonable behaviour is likely to be accepted.
 
B

Ben Bacarisse

DFS said:
So are you talking about a separate function to count the number of
potential replacements before making them?

No. I was thinking of a direct replacement for your posted code that
was built with more flexible components.

fwrite(match, 1, mp - match, fo);
I forgot the above as pointed out by MM.
Above you said my original code "reports rather than counts these
matches." and "I would never write a function with this spec."

Maybe I'm confused, but it looks like you did exactly that. Your
program makes the replacements, then reports how many were made.

The replace reports nothing and can therefore be used to more flexibly.
One such use is to duplicate the output you want (I can't imagine why
you want that output, but that's another matter).

I definitely like the functionality in main() and the replace-string()
code, but there might be bugs:

Yes, I forgot a crucial line. I did not re-post as Malcolm reported it
within a few hours.

<snip>
 
D

DFS

Yes, I forgot a crucial line. I did not re-post as Malcolm reported it
within a few hours.

<snip>


Added that line, but it's still not working right
(replace 46 with ----)

original input

1: 14513111664214260256543011122553234523520226455552
2: 41602561064325541006060354620223361346535061545034
3: 63164621623130051346620535103421535300201464252314
4: 30013144611120401561305220534605456101542562311260
5: 30501506124251042546364005110661421500320026101445
6: 35355334213621124600100142264440253516210400362562
7: 65140560414014522562466550406113020500531011441421
8: 60543325410345553336424511333322104440166124450061
9: 44310321435636412163052026304311532342515351020026
10: 10536502643531635353214012163164121056142415600245


run with stopper = EOF
* now it leaves all the 46s in there
* still says all replacements were made in line 1
* still doesn't handle repeating numbers correctly (ie 446), so it
misses one of the 46s. See Line 4.

14513111664214260256543011122553234523520226455552
4160256106432554100606035----46202233613----46535061545034
6316----46216231300513----46620535103421535300201----464252314
3001314461112040156130522053----4605456101542562311260
305015061242510425----46364005110661421500320026101445
3535533421362112----4600100142264440253516210400362562
65140560414014522562----466550406113020500531011441421
60543325410345553336424511333322104440166124450061
44310321435636412163052026304311532342515351020026
10536502643531635353214012163164121056142415600245



run with stopper = \n
* now it leaves all the 46s in there
* still miscounts the # of lines
* still doesn't handle repeating numbers correctly (ie 446), so it
misses one of the 46s.

145131116642142602565430111225532345235202264555524160256106432554100606035----46202233613----46535061545036316----46216231300513----46620535103421535300201----464252313001314461112040156130522053----4605456101542562311260305015061242510425----463640051106614215003200261014453535533421362112----460010014226444025351621040036256265140560414014522562----466550406113020500531011441421605433254103455533364245113333221044401661244500614431032143563641216305202630431153234251535102002610536502643531635353214012163164121056142415600245
 
D

DFS

I /think/ the point he's trying to make is this:


I think the point he's trying to make is he doesn't understand how
find-replace works in real life.


Suppose you were trying to replace all "46" strings in a file by "74". With
an input file of:

"1466272"

your code would produce:

"1746272"

However, there is now a new "46" in the output!


Yep.

Run it again and it's 1774272.



And if changing "46" to
"64", then an input of:

"4466" would produce "4646", twice as many as you started with! Processing
this again gives you "6464", and ocne more, produces "6644". In this
example, if will eventually stabilise.

But if the rule is change every "A" to "AA", then an input file of just "A"
could produce an infinitely long file of "A"s, if you went back and
reprocessed. > Or, you might want to eliminate all "BART"s by replacing with "", but this
input:

"BABARTRT" will still give you "BART"!

So the problem is that find and replace hasn't been rigidly defined.


Why would I need to "rigidly define" find-replace (on c.l.c on Usenet of
all places) when it's been done thousands of times over the last 40
years, and everyone knows what it means?


What
you probably intend is that once a substring S in the source has been
replaced by T, then the process starts afresh at the character following
the original S;T is not examined at all.


Has there EVER been a find-replace that worked differently?




This does mean that all S substrings
will be eliminated from the output.

You may have to run it again.

1st run: 99.5% fixed
2nd run: 99.995 fixed
etc

In the real world, I've never seen a find-replace implementation that
does more than one iteration at a time.
 
D

DFS

Depends.
If you're providing a "search and replace" tool for use on human-readable text, it hardly
matters how it handles the overlapping cases, because it will rarely be called for such,
and any reasonable behaviour is likely to be accepted.


A voice of reason?

You may have trouble fitting in on clc.
 
K

Keith Thompson

Malcolm McLean said:
Depends.
If you're providing a "search and replace" tool for use on
human-readable text, it hardly matters how it handles the overlapping
cases, because it will rarely be called for such, and any reasonable
behaviour is likely to be accepted.

It matters very much if the handling of overlapping cases can trigger an
infinite loop.

And even for use on human-readable text, I fail to see how a precise and
unambiguous definition of the behavior is unimportant.
 
K

Keith Thompson

DFS said:
On 06/08/2014 04:57 AM, BartC wrote: [...]
So the problem is that find and replace hasn't been rigidly defined.

Why would I need to "rigidly define" find-replace (on c.l.c on Usenet of
all places) when it's been done thousands of times over the last 40
years, and everyone knows what it means?

Don't you think that at least some of the people who've implemented
find-replace thousands of times were working from an unambiguous
specification?
 
B

Ben Bacarisse

DFS said:
Added that line, but it's still not working right
(replace 46 with ----)

Yes, you are right. I should not have been so hasty as to accept
Malcolm's suggestion. The algorithm more fiddly than that. When the
match fails, you can't just print what you saw and reset mp = match.
The test and what do is a bit more subtle.

Having posted too fast twice already, I am reluctant go for a third, but
what the hell...

int replace_string(const char *match, const char *repl, int stopper,
FILE *fi, FILE *fo)
{
int nmatches = 0, c;
const char *mp = match;
while ((c = fgetc(fi)) != EOF && c != stopper)
if (c == *mp) {
if (!*++mp) {
++nmatches;
fputs(repl, fo);
mp = match;
}
}
else {
if (strspn(match, (char[]){c, 0}) < mp - match) {
fwrite(match, 1, (size_t)(mp - match), fo);
mp = match;
}
fputc(c, fo);
}
return nmatches;
}

<snip>
 
S

Siri Crews

echo $string | sed s/original/changed/

It's a poor workman that blames his tools.
 
I

Ike Naar

Having posted too fast twice already, I am reluctant go for a third, but
what the hell...

int replace_string(const char *match, const char *repl, int stopper,
FILE *fi, FILE *fo)
{
int nmatches = 0, c;
const char *mp = match;
while ((c = fgetc(fi)) != EOF && c != stopper)
if (c == *mp) {
if (!*++mp) {
++nmatches;
fputs(repl, fo);
mp = match;
}
}
else {
if (strspn(match, (char[]){c, 0}) < mp - match) {
fwrite(match, 1, (size_t)(mp - match), fo);
mp = match;
}
fputc(c, fo);
}
return nmatches;
}

Sorry but this fails to match the pattern "abac" in input text "ababac".
 
B

BartC

DFS said:
On 06/08/2014 04:57 AM, BartC wrote:

(My original post seems to have disappeared; I can't see it anyway. This is
another attempt.)
I think the point he's trying to make is he doesn't understand how
find-replace works in real life.

In real life, it isn't as simple as you seem to think it is either.

For example, global find & replace in even a simple text editor might ask
you if you wanted to restrict matching to whole words, and what to do about
matching case. (And if case isn't matched exactly, and you were translating
"abc" to "xyz", you'd need to decide whether "ABC" in the input should be
replaced by "xyz" or "XYZ".)
Why would I need to "rigidly define" find-replace (on c.l.c on Usenet of
all places) when it's been done thousands of times over the last 40 years,
and everyone knows what it means?

You've also chosen to bring lots of other aspects into the same challenge:
using files (I would have routines working from memory to memory, as that is
a more typical context of how it might be used); combining the match method
with the replacement (a separate match would allow fuzzier matching or to
look for patterns); doing all replacements in one place instead of one at a
time; etc.

I've re-stated your task as follows:

* Copy all bytes from one file to another, except:

* Any byte-sequence S in the input, should be suppressed, and the
byte-sequence T should be written to the output instead.

* After such a replacement, the matching process continues with the byte
following S in the input; overlapping is ignored. At the end of the whole
process, there could still be S sequences in the input.

* S must consist of one or more bytes (otherwise nothing is matched so is a
straight copy); T can be zero or more bytes, and can be shorter or longer
than S. The input file can be zero or more bytes. S can match all bytes in
the input file (so the entire file is replaced by T).

* The process should be line-oriented, so any S containing end-of-line
cannot be matched (I think T can contain end-of-line). (You also allow an
arbitrary upper limit on a line-length, so that any lines exceeding that
limit might not be processed properly. This means another implementation
might give different results if they use a different limit, or don't have
one at all.)

* Replacements must be reported on a line-by-line basis, and the total for
the file.
This does mean that all S substrings
will [not] be eliminated from the output.

You may have to run it again.

1st run: 99.5% fixed
2nd run: 99.995 fixed
etc

If you have a very large file such as "4646464646...." and wanted to changed
all "46" to "64", then it will need a lot of iterations! (I think each one
will only fix one character, or one at each end.)
 
B

Ben Bacarisse

Siri Crews said:
echo $string | sed s/original/changed/ g (for the problem in hand)

It's a poor workman that blames his tools.

Eh? I don't think I'm blaming any tools. Maybe you are making a more
general remark that I've not understood.
 
B

Ben Bacarisse

Ike Naar said:
Having posted too fast twice already, I am reluctant go for a third, but
what the hell...

int replace_string(const char *match, const char *repl, int stopper,
FILE *fi, FILE *fo)
{
int nmatches = 0, c;
const char *mp = match;
while ((c = fgetc(fi)) != EOF && c != stopper)
if (c == *mp) {
if (!*++mp) {
++nmatches;
fputs(repl, fo);
mp = match;
}
}
else {
if (strspn(match, (char[]){c, 0}) < mp - match) {
fwrite(match, 1, (size_t)(mp - match), fo);
mp = match;
}
fputc(c, fo);
}
return nmatches;
}

Sorry but this fails to match the pattern "abac" in input text
"ababac".

You know, I thought it would! I could not shake the feeling that it was
more complicated than I'd originally assumed but hope triumphed over
experience and I posted again!

I am now pretty sure that you need something like the table from the KMP
algorithm to know where to reset the failed match. That's not
surprising and I am disappointed that I did not see that right away. In
this case, you'd generate the required element on the fly, and it may
still be worth it, but I'm not going to have a go now!
 
B

BartC

I am now pretty sure that you need something like the table from the KMP
algorithm to know where to reset the failed match. That's not
surprising and I am disappointed that I did not see that right away. In
this case, you'd generate the required element on the fly, and it may
still be worth it, but I'm not going to have a go now!

I tried something based on your code (but then rewritten). It worked fine,
until I tried it on "ababac", which didn't work.

I did manage it eventually, but it needed some ugly extra logic as shown
below. (The problem, with this stream model, is that doesn't have a record
of what's been read so can't back up; it has to make use of portions of
match, plus remember the last char read (pendchar).)

However this has been little tested beyond the OP's test file, and the
ababac example.

(The original used a regular while (c = inchar()... loop, but I needed
while (1) to make it work. Sorry I can't get away from those types of loops.
The code isn't 'tight' either, but I deliberately didn't use any built-in
string searching such as strstr or strspan.)


#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int inchar(FILE* f) {
int c;
c=fgetc(f);
if (c==EOF) return 0;
return c;
}

void outchar(FILE* f,int c){
fputc(c,f);
}

void outstrn(FILE* f,char* s,int n){
fprintf(f,"%.*s",n,s); /* (probably an uncounted version would do) */
}

int replace_in_file(char *match, char *repl, FILE* fin, FILE* fout) {
int nmatches=0,c;
int inside;
char *mp;
int repllen=strlen(repl);
char* pendstr;
char pendchar;
int pendcount;

mp = match;
inside=0;
pendstr="";
pendcount=0;
pendchar=0;

while(1) {
if (pendcount) { /* Ugly hack needed for ababac problem */
c=*pendstr;
++pendstr;
--pendcount;
} else if (pendchar) {
c=pendchar;
pendchar=0;
} else {
c=inchar(fin);
if (c==0) break;
}

if (c!=*mp) {
if (!inside) { /* Outside of a match string; pass it thru */
outchar(fout,c);
} else { /* Currently inside possible matched seq */
if (*mp==0) { /* Full match obtained, replace that sequence
*/
outstrn(fout,repl,repllen);
++nmatches;
outchar(fout,c);
} else { /* Partial match only; must back up */
outchar(fout,*match);
pendchar=c;
pendstr=match+1;
pendcount=mp-match-1;
}
inside=0;
mp=match;

}
} else { /* Char match */
inside=1;
++mp;
}
}

if (inside && *mp==0) {
outstrn(fout,repl,repllen);
}
return nmatches;
}

int main(int argc, char **argv) {
#define infile "kkk1"
#define outfile "kkk2"
FILE *fin,*fout;
int i,c;

fin=fopen(infile,"r");
fout=fopen(outfile,"w");
if (!fin || !fout) exit(1);

replace_in_file("abac","ABAC",fin,fout);
fclose(fin);
fclose(fout);
}
 
K

Keith Thompson

BartC said:
* S must consist of one or more bytes (otherwise nothing is matched so is a
straight copy); T can be zero or more bytes, and can be shorter or longer
than S. The input file can be zero or more bytes. S can match all bytes in
the input file (so the entire file is replaced by T).
[...]

I'd rephrase that. You say that S *must* consist of one or more bytes,
then describe what happens if S is empty.

Instead, I'd say:

If S consists of zero bytes, the input is copied to the output with
no change.

Some of the requirements you state follow from other requirements, such
as "... and can be shorter or longer than S". You might want to
distinguish such statements from actual requirements, say by putting
them in parentheses.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,479
Members
44,899
Latest member
RodneyMcAu

Latest Threads

Top