substring finding problem!

blmblm · Feb 25, 2010

I don't think it is.

Ah, but that's because you never took any CS classes, in which
you would have been exposed to this sort of thing ....

Then again, my experience teaching undergrad CS suggests that
instructors can lead the students to formalism but can't always
get them to like it.

I would have specced it as:
echo "$STRING" | sed -e "s/$OLD/$NEW/g"

With the caveat that you have to overlook questions like "what if $OLD has
slashes in it".

Yeah. That works too -- for people who know what sed is and does.

But yeah, that seems to be the same spec. And really, the overlapping
strings question is just plain stupid. The output of a replacement isn't
rescanned, and the replacement of a given substring removes its content
overlapping string. Looking for "aba" in "dababa", you find "d*aba*ba",
you replace it with whatever, and what's left is "ba". Nothing special
or complicated.

Sure. I just thought it might be fun to try to come up with a
semi-formal specification that *doesn't* involve narratives about
what the program is doing. I like that sort of thing, but I guess
not everyone does. "Whatever." [*]

[*] Normally I would write <shrug> here, but that usage now has
unfortunate associations.

blmblm · Feb 25, 2010

<0.0ddf9db924489b8a3417.20100222180827GMT.87wry5b0tg.fsf@bsb.me.uk>, [...]

I've been posting total times only to save space and haven't been
looking carefully at whether code that's faster overall is also
faster for each individual test.

Results with gcc 4.4.1, as described above .... :

bacarisse (O2) 1.74 seconds
bacarisse (O3) 1.75 seconds
blmblm (O2) 2.52 seconds
blmblm (O3) 2.48 seconds
harter-1 (O2) 2.58 seconds
harter-1 (O3) 2.49 seconds
harter-2 (O2) 2.27 seconds
harter-2 (O3) 2.24 seconds
io_x (O2) 9.82 seconds
nilges (O2) 2.36 seconds
nilges (O3) 2.35 seconds
thomasson (O2) 1.69 seconds
thomasson (O3) 1.68 seconds
willem (O2) 2.77 seconds
willem (O3) 4.16 seconds [ can this be right?! ]

Click to expand...

FWIW, I always try to "prime" the program by making several dummy runs the
same data set. After that, I time several iterations of it on the same data
set, and report the average.

Yes .... What my little test program actually does is repeat each
sequence of tests N times (N is a command-line parameter) and report
all N results; my thinking is that if each set of N timings is pretty
consistent (meaning all times are roughly equal -- and they have been),
then the total time should be reasonably meaningful.

blmblm · Feb 25, 2010

Ben Bacarisse <[email protected]> wrote:

My results posted thus far were with gcc 4.0.1 (and glibc 2.3.5).
When I repeat the experiment, on a newer/faster system, with
gcc 4.4.1 (and glibc 2.10.1), I also find that -O3 is *almost*
always faster (and the one exception -- perhaps I need to rerun
the test, in case there was something else going on on the machine
that skewed results). Results below.

Click to expand...

If that odd -O3 result goes away, it means we can largely ignore
-O2/-O3 differences. That halves the data.

No, this is with my full test suite (such as it is):

4004-byte input, 2 short replacements (2/2)
4020-byte input, 10 short replacements (2/2)
4100-byte input, 50 short replacements (2/2)
4500-byte input, 50 slightly-longer replacements (10/10)

Click to expand...

Ah, I see. I've made up data files with these properties and re-run
my tests. I've scaled the per call times up by 20,000 (I remember you
used 20,000 calls) and I've aggregated the times for these four tests
for all of the methods I can time.

(Differences in input length are because I was lazy in coding
Results with gcc 4.4.1, as described above .... :

bacarisse (O2) 1.74 seconds
bacarisse (O3) 1.75 seconds
blmblm (O2) 2.52 seconds
blmblm (O3) 2.48 seconds
harter-1 (O2) 2.58 seconds
harter-1 (O3) 2.49 seconds
harter-2 (O2) 2.27 seconds
harter-2 (O3) 2.24 seconds
io_x (O2) 9.82 seconds
nilges (O2) 2.36 seconds
nilges (O3) 2.35 seconds
thomasson (O2) 1.69 seconds
thomasson (O3) 1.68 seconds
willem (O2) 2.77 seconds
willem (O3) 4.16 seconds [ can this be right?! ]

Click to expand...

Looks wacky to me! Is it repeatable?

*Yes*. Weird, isn't it?! And I don't observe anything like that
on the older/slower system.

Here are my times (also gcc 4.4.1 and libc 2.10.1). I seem to have a
faster machine. The first number are your times (for reference) and
the second are mine (in seconds). The third column is the ratio of
the two. You can see that there is more going on than just the speed
of the machine.

Indeed. Should we compare hardware configurations? What would be
relevant to include? information that in Linux can be found by
listing /proc/cpuinfo?

bacarisse (O2) 1.74 0.426 4.08
bacarisse (O3) 1.75 0.400 4.38
blmblm (O2) 2.52 0.540 4.67
blmblm (O3) 2.48 0.501 4.95
harter-1 (O2) 2.58 0.857 3.01
harter-1 (O3) 2.49 0.803 3.10
harter-2 (O2) 2.27 0.780 2.91
harter-2 (O3) 2.24 0.722 3.10
io_x No data
nilges (O2) 2.36 0.881 2.68
nilges (O3) 2.35 0.861 2.73
thomasson (O2) 1.69 0.380 4.45
thomasson (O3) 1.68 0.364 4.62
willem (O2) 2.77 0.813 3.41
willem (O3) 4.16 0.885 4.70

If we are now measuring the same things, it seems that some code is
favoured by my system (yours for example) and some does not do so
well. I suspect interactions with the various caches but that is a
huge guess.

Oh, caching is almost always a good guess ....

My other tests show that the rank can be switched round a lot by using
shorter strings. To get a more balanced view, the tests would have to
range over various string lengths, but I doubt that there can be any
idea of the "best" way to do this based on speed. Clear simple code
will always win out for me in these situations, despite my love of
generating (ultimately pointless) data like this.

I *am* getting the impression that you also might be having too much
fun with this little problem.

Agreed, anyway, that in almost all circumstances clear simple code is
to be preferred. Indeed -- my original naive solution is almost as
fast as the "improved" version, as long as I use the C-library versions
of the string.h functions, and it took less time to write and debug
than the more complicated solution.

blmblm · Feb 25, 2010

spinoza1111 said:
spinoza1111 said:

"Stefan Ram" <[email protected]> wrote in message

Click to expand...

news:[email protected]...

Click to expand...

[ snip ]

And note that "using strstr" has its own dangers. IT FINDS OVERLAPPING
STRINGS. If you use it to construct a table of replace points you're
gonna have an interesting bug-o-rama:

Click to expand...

replace("banana", "ana", "ono")

Click to expand...

IF you restart one position after the find point, and not at its end.

Click to expand...

Why would you do that, though? only if you *wanted* to detect

Click to expand...

Search me. But that's what the code I was discussing actually did.

What code is that? I've traced back through predecessor posts, and
the only one that comes close to including code is the one in which
Chris Thomasson references his code in

http://clc.pastebin.com/f62504e4c

which on a quick skim doesn't seem to me to be looking for
overlapping strings.

overlapping strings, and -- if you did detect them, what would
you do with them? I can't think of any sensible definition of
"replace" that does anything with overlapping strings [*], so

Click to expand...

replace(banana, ana, ono) could equal

bonona going left to right without overlap
banono going right to left without overlap
bonono going both ways with overlap

There's a semi-sane answer here in the last case, but isn't
that because there are some factors at work that won't be
generally true? What about replace(banana, ana, xno)?
Should that be bxnono or bxnxno?

The third case could arise in natural language processing.

Suppose that in some language, the ana sound is transformed into the
ono sound to transform present into past tense (weirder things
happen), and suppose speakers do this to ALL occurences of the three
tones a, voiced n, a. When the sounds are adjacent they are
nonetheless distinct in speech but not in writing.

Now, the response of most garden-variety "break room" programmers is
"that's bullshit, and can never happen". But we know that in
programming, many strange things can happen, and that as Hamlet
admonished Horatio, we must "as a stranger give it welcome". Many more
strange things can happen outside programming, and programmers, even
of the Hooters or break room ilk, better realize this when programming
is used to solve problems.

"Whatever." I'm not convinced yet that it's possible to come
up with a sensible specification for what it means to replace
overlapping occurrences of selected text. Absent such a
specification -- eh, whatever.

when I wrote my first solution to this problem, I of course used
strstr and of course started scanning for the next match after
the end of the previous match.

[*] Chris Thomasson's reply to your post points out the ambiguities.

Moral: don't let the library do your thinking for you.

Click to expand...

Mostly I'm replying to this rather old post, though, because it
seems as good a place as any to attempt a more-or-less formal
specification of the problem, which I'm not sure we have, and
which might be interesting. (Apologies if I missed one somewhere.)

Here's my proposed specification, in which "is not a substring of"
and "concat" have what I hope are obvious meanings, and names
beginning s_ denote strings:

replace(s_input, s_old, s_new) yields

if s_old is not a substring of s_input

s_input

else

concat(s_input_prefix, s_new, replace(s_input_tail, s_old, s_new))

where s_input_prefix and s_input_tail are such that

s_input = concat(s_input_prefix, s_old, s_input_tail)

and

s_old is not a substring of s_input_prefix

Click to expand...

This is fine as long as we understand your concat as NOT specifying
left to right or right to left direction.

How could it imply a direction? String concatenation seems to me to
be a pretty straightforward and well-defined operation, which could
even be written as an associative binary operator, no?

The bonono problem would
have to be handled by preprocessing (translating banana into banaana,
perhaps using a rule that "vowels" (in the language) can always be
duplicated because their sounds don't break.

Well, now you seem to moving in a direction that might eventually
lead to a sensible specification. "Carry on", maybe.

Somewhat more formal than the "scans left to right and ignores
overlapping strings", though whether it's actually clearer might
be a matter of opinion.

And then perhaps we could replace the "does not use string.h"
constraint with something more meaningful [*], though I'm not
sure what.

Click to expand...

How about "does not use string.H and gets rid of the Obscene
Excrescence: the use of Nul to terminate a string, thereby creating a
new C that is almost fit for use".

Well, even my first naive solution to this problem didn't use
string.H .... But we've already had the discussion about case
sensitivity, so no need to do that again.

In my opinion the functions declared in string.h include some
that are very useful in writing a replace() function as specified
here -- I think of them as useful abstractions for the problem
domain. Could one define similar functions if strings were *not*
represented as null-terminated contiguous sequences of characters ....

Well, certainly one could, but whether they'd be unacceptably
inefficient might depend on how strings were represented.
C's approach to representing strings allows one to have multiple
strings in a single character array and to easily regard a suffix
of a string as a string in its own right, both of which strike me
as useful in context. One could (I think) get a similar effect
by defining strings to be objects consisting of a length and a
pointer into an array of characters. If strings were represented
as a length immediately followed by a sequence of characters --
not so much.

I'm not sure I'm explaining this very well or thinking it through
carefully, but perhaps it will advance the discussion a bit anyway.

[*] My objection to this constraint is that any minimally competent
programmer should be able to write functions that implement the
same API, so just avoiding use of the library functions doesn't
seem to me to make the problem more interesting.

Click to expand...

No. The API locks us into bad thoughts.

I'd say "how so?" but I'm not optimistic about getting an answer
I'd find useful.

Nick Keighley · Feb 25, 2010

And note that "using strstr" has its own dangers. IT FINDS OVERLAPPING
STRINGS. If you use it to construct a table of replace points you're
gonna have an interesting bug-o-rama:

Click to expand...

replace("banana", "ana", "ono")

Click to expand...

IF you restart one position after the find point, and not at its end.

Click to expand...

Why would you do that, though? only if you *wanted* to detect
overlapping strings, and -- if you did detect them, what would
you do with them? I can't think of any sensible definition of
"replace" that does anything with overlapping strings [*], so
when I wrote my first solution to this problem, I of course used
strstr and of course started scanning for the next match after
the end of the previous match.

[*] Chris Thomasson's reply to your post points out the ambiguities.

Moral: don't let the library do your thinking for you.

Click to expand...

sometimes a sensible behaviour drops out quite neatly.
I used to think that was a sign of good code if the naive code also
covered the "corner-cases" naturally.

Mostly I'm replying to this rather old post, though, because it
seems as good a place as any to attempt a more-or-less formal
specification of the problem, which I'm not sure we have, and
which might be interesting. (Apologies if I missed one somewhere.)

well I missed it to! A programming competition seemed to break out
without it being clear (to me at least) what the program was supposed
to do. By the time it had been sort of clarified I'd seen a fair
amount of code.

I'm just sorry I didn't think of doing it recursivly! That was
impressive.

Here's my proposed specification, in which "is not a substring of"
and "concat" have what I hope are obvious meanings, and names
beginning s_ denote strings:

replace(s_input, s_old, s_new) yields

if s_old is not a substring of s_input

s_input

else

concat(s_input_prefix, s_new, replace(s_input_tail, s_old, s_new))

where s_input_prefix and s_input_tail are such that

s_input = concat(s_input_prefix, s_old, s_input_tail)

and

s_old is not a substring of s_input_prefix

Somewhat more formal than the "scans left to right and ignores
overlapping strings", though whether it's actually clearer might
be a matter of opinion.

and of course like most formal specifications it's damn nearly code!
Though you do get a better class of primitive.

And then perhaps we could replace the "does not use string.h"
constraint with something more meaningful [*], though I'm not
sure what.

[*] My objection to this constraint is that any minimally competent
programmer should be able to write functions that implement the
same API, so just avoiding use of the library functions doesn't
seem to me to make the problem more interesting.

there were some proposals for better APIs, ones that avoided
repeatedly rescanning the string. The virtue of using string.h is that
you get some abstraction. Pages of inlined string.h-like functions is
hard to comprehend, hard to debug and hard to test.

Nick Keighley · Feb 25, 2010

Nick Keighley said:
Nick Keighley said:

On 17 Feb, 21:14, Richard Heathfield <[email protected]> wrote:

Click to expand...

a former clc regular once posted

Click to expand...

***
The fscanf equivalent of fgets is so simple
that it can be used inline whenever needed:-
char s[NN + 1] = "", c;
int rc = fscanf(fp, "%NN[^\n]%1[\n]", s, &c);
if (rc == 1) fscanf("%*[^\n]%*c);
if (rc == 0) getc(fp);
***

Click to expand...

That sounds suspiciously like either Dan Pop, who had a bee in his
bonnet regarding scanf() (amongst others), or like someone employing
irony at said Mr. Pop.

it was Mr.Pop. Perhaps being ironical himself?

Oh, and there's at least one bug in the above (most likely inserted by
me).

It's a very bad idea. sscanf() can be a reasonable way to read simple
_validated_ input. Unvalidated, none of that family is useable.

yes but if they're at the "write a program to convert farenheit to
centigrade" scanf() might be just good enough. I think giving them a
read_number() primitive would be a better idea though.

Seebs · Feb 25, 2010

As best I can tell, he doesn't read anything you're posting here
unless someone else quotes it -- or at least he doesn't reply,
so your taunting .... ah well, I suppose now that I've quoted
you he may choose to reply. Or not.

Exactly. He's plonked, because that way I see only the funny parts.

I would be very surprised to hear that anyone who talks about
code as he does would *not* be aware of this particular trick.
Inventing it for oneself might require "thinking outside the box";
being aware that it exists -- not so much.

Yeah. I still remember:
i << 3 + i + i
but that doesn't mean I'd ever use it.

It appears that he meant, not matrix multiplication, but multiplication of
every element in a matrix by a fixed value. For some reason, he seems to
think that multiplying everything in a matrix by a compile-time constant is
a likely operation. For some other reason, probably a less plausible one,
he seems to think that the shift-as-multiply trick actually buys you
something.

*But this is not matrix multiplication* -- or at least, not as I
define the term. Curiously enough, however, Wikipedia seems to
agree with you that "matrix multiplication" can mean multiplying
a matrix by either another matrix or a scalar. So my attempt to
nitpick here may be off-base. Of course, elsethread you don't
seem to find Wikipedia a credible source. "Whatever."

Interesting. I've never heard the term used that way.

gcc seems to be reasonably smart about generating shifts rather
than multiplies if n is something that's known at compile time,
even though (as far as I know) one must use the "pow" function
to express exponentiation in C.
Yup.

And I have to say, your code certainly did a great job of undermining the
idea that strings were predictably null-terminated. At least in your code.

....

Wait, does this imply that he thinks the << thing is some kind of secret
that not everyone knows? As an idle curiousity, I've asked my coworkers
to see whether any of them *haven't* seen that.

-s

Seebs · Feb 25, 2010

Sure. I just thought it might be fun to try to come up with a
semi-formal specification that *doesn't* involve narratives about
what the program is doing. I like that sort of thing, but I guess
not everyone does. "Whatever." [*]

Actually, that's a good point. It's clearer in that it tells you the
answer without requiring you to think it through. Thanks to spinoza1111,
we are now aware that at least some people can implement a solution to
the problem without ever thinking about how it works enough to realize
that there is no question of overlapping strings. Clarifying it explicitly
is probably beneficial.

-s

Seebs · Feb 25, 2010

Agreed, anyway, that in almost all circumstances clear simple code is
to be preferred. Indeed -- my original naive solution is almost as
fast as the "improved" version, as long as I use the C-library versions
of the string.h functions, and it took less time to write and debug
than the more complicated solution.

Idle curiousity: How's mine do? I haven't checked to see what the official
interface is, but I'm pretty sure this is adequately obvious. It presumably
suffers from double-scanning, but I don't know how much that matters. It
doesn't do a lot of mallocs.

char *
rep(const char *in, const char *out, const char *target) {
char *result = 0;
const char *s;
char *t, *u;
size_t inlen, outlen, targetlen, resultlen;
int count;

if (!in || !out || !target || !*in)
return 0;
inlen = strlen(in);
outlen = strlen(out);
targetlen = strlen(target);

for (count = 0, t = strstr(target, in); t && *t; t = strstr(t, in)) {
++count;
t += inlen;
}
resultlen = targetlen + (outlen * count) - (inlen * count);
result = malloc(resultlen + 1);
if (!result)
return result;
u = result;
*u = '\0';
s = target;
for (t = strstr(target, in); t && *t; t = strstr(t, in)) {
memcpy(u, s, t - s);
u += t - s;
memcpy(u, out, outlen + 1);
u += outlen;
t += inlen;
s = t;
}
strcpy(u, s);
#ifdef DEBUG
fprintf(stderr, "replaced %d occurrences, new length %d, actual %d\n",
count, (int) resultlen, (int) strlen(result));
#endif

return result;
}

-s

spinoza1111 · Feb 25, 2010

Exactly. He's plonked, because that way I see only the funny parts.

No, you're too cowardly to talk to me. You haven't plonked jackshit.

Yeah. I still remember:
i << 3 + i + i
but that doesn't mean I'd ever use it.

It appears that he meant, not matrix multiplication, but multiplication of
every element in a matrix by a fixed value. For some reason, he seems to
think that multiplying everything in a matrix by a compile-time constant is
a likely operation. For some other reason, probably a less plausible one,
he seems to think that the shift-as-multiply trick actually buys you
something.

Yes, I do, Mr No Compsci.

Interesting. I've never heard the term used that way.

Your lack of intellectual curiosity isn't an argument.

And I have to say, your code certainly did a great job of undermining the
idea that strings were predictably null-terminated. At least in your code.

...

Wait, does this imply that he thinks the << thing is some kind of secret
that not everyone knows? As an idle curiousity, I've asked my coworkers
to see whether any of them *haven't* seen that.

Sure, if you ask them. But you weren't able to answer my initial
question, where you had to call a possibility to mind. Which argues
against your debugging skills.

spinoza1111 · Feb 25, 2010

spinoza1111 said:
spinoza1111 said:

[ snip ]
And note that "using strstr" has its own dangers. IT FINDS OVERLAPPING
STRINGS. If you use it to construct a table of replace points you're
gonna have an interesting bug-o-rama:
replace("banana", "ana", "ono")
IF you restart one position after the find point, and not at its end.
Why would you do that, though? only if you *wanted* to detect

Click to expand...

Click to expand...

Search me. But that's what the code I was discussing actually did.

Click to expand...

What code is that? I've traced back through predecessor posts, and
the only one that comes close to including code is the one in which
Chris Thomasson references his code in

http://clc.pastebin.com/f62504e4c

which on a quick skim doesn't seem to me to be looking for
overlapping strings.

Click to expand...

My code handled string overlap after the bug was pointed out to me,
BEFORE any other code. I'm too sick of the shit that goes on here to
make a collection of all solutions and find what probably are many
failures, but one of my contributions was to pass on the test case.

There's a lot of claims and counterclaims here and at least two
discussants are complete shitheads. However, we KNOW that other
posters used the test suite I created AFTER my code worked with that
test data.

overlapping strings, and -- if you did detect them, what would
you do with them? I can't think of any sensible definition of
"replace" that does anything with overlapping strings [*], so

Click to expand...

Click to expand...

replace(banana, ana, ono) could equal

Click to expand...

bonona going left to right without overlap
banono going right to left without overlap
bonono going both ways with overlap

Click to expand...

There's a semi-sane answer here in the last case, but isn't

Click to expand...

HOW.DARE.YOU. How DARE you start talking about sanity? It isn't
collegial, and it is libel and completely insensitive. It's talking
like those thugs and shitheads here, Seebach and Heathfield.

My late uncle was a personal physician of Lyndon Johnson who was
misdiagnosed as having a mental disorder and died alone in a Veteran's
Administration hospital.

I saw John Nash mistreated at Princeton when he was almost fully
recovered, and I was assigned to him when other programmers refused to
work with him. I saw his points dismissed by an arrogant John Horton
Conway at the John von Neumann center in 1989. When he won the Nobel
prize, of course, people clamored to kiss his ass, but it was a very
small group of people who DON'T reason, as Seebach seems to, from
quirks! Who supported him during his dark years! These were not the
sort of people that call ideas "insane", any more than they call
people "insane".

HOW.DARE.YOU.

that because there are some factors at work that won't be
generally true? What about replace(banana, ana, xno)?
Should that be bxnono or bxnxno?

Click to expand...

The fact that there is a group of answers does not make the question a
question of a crazy man! In fact, it makes it a good scientific
question, albeit over the heads of the creeps here.

Jesus H. ****, I'm glad I no longer have to work with programmers!

"Whatever." I'm not convinced yet that it's possible to come
up with a sensible specification for what it means to replace
overlapping occurrences of selected text. Absent such a
specification -- eh, whatever.

Click to expand...

I gave it to you: a hypothetical but possible natural language in
which adjacent lexemes must be split and modified. And what's this
"whatever"?

when I wrote my first solution to this problem, I of course used
strstr and of course started scanning for the next match after
the end of the previous match.
[*] Chris Thomasson's reply to your post points out the ambiguities.
Moral: don't let the library do your thinking for you.
Mostly I'm replying to this rather old post, though, because it
seems as good a place as any to attempt a more-or-less formal
specification of the problem, which I'm not sure we have, and
which might be interesting. (Apologies if I missed one somewhere.)
Here's my proposed specification, in which "is not a substring of"
and "concat" have what I hope are obvious meanings, and names
beginning s_ denote strings:
replace(s_input, s_old, s_new) yields
if s_old is not a substring of s_input
s_input
else
concat(s_input_prefix, s_new, replace(s_input_tail, s_old, s_new))
where s_input_prefix and s_input_tail are such that
s_input = concat(s_input_prefix, s_old, s_input_tail)
and
s_old is not a substring of s_input_prefix

Click to expand...

Click to expand...

This is fine as long as we understand your concat as NOT specifying
left to right or right to left direction.

Click to expand...

How could it imply a direction? String concatenation seems to me to
be a pretty straightforward and well-defined operation, which could
even be written as an associative binary operator, no?

Click to expand...

Well, my greater experience with object oriented development in C
Sharp and VB has taught me that given either an adequate OO language,
or sufficient intelligence and patience, concat can work either way
without much drama. In C, the direction has to be a crummy parameter
that is easy to get wrong.

Well, now you seem to moving in a direction that might eventually
lead to a sensible specification. "Carry on", maybe.

Click to expand...

Don't patronize me.

Somewhat more formal than the "scans left to right and ignores
overlapping strings", though whether it's actually clearer might
be a matter of opinion.
And then perhaps we could replace the "does not use string.h"
constraint with something more meaningful [*], though I'm not
sure what.

Click to expand...

Click to expand...

How about "does not use string.H and gets rid of the Obscene
Excrescence: the use of Nul to terminate a string, thereby creating a
new C that is almost fit for use".

Click to expand...

Well, even my first naive solution to this problem didn't use
string.H .... But we've already had the discussion about case
sensitivity, so no need to do that again.

In my opinion the functions declared in string.h include some
that are very useful in writing a replace() function as specified
here -- I think of them as useful abstractions for the problem
domain. Could one define similar functions if strings were *not*
represented as null-terminated contiguous sequences of characters ....

Well, certainly one could, but whether they'd be unacceptably
inefficient might depend on how strings were represented.
C's approach to representing strings allows one to have multiple
strings in a single character array and to easily regard a suffix
of a string as a string in its own right, both of which strike me
as useful in context. One could (I think) get a similar effect
by defining strings to be objects consisting of a length and a
pointer into an array of characters. If strings were represented
as a length immediately followed by a sequence of characters --
not so much.

I'm not sure I'm explaining this very well or thinking it through
carefully, but perhaps it will advance the discussion a bit anyway.

[*] My objection to this constraint is that any minimally competent
programmer should be able to write functions that implement the
same API, so just avoiding use of the library functions doesn't
seem to me to make the problem more interesting.

Click to expand...

Click to expand...

No. The API locks us into bad thoughts.

Click to expand...

I'd say "how so?" but I'm not optimistic about getting an answer
I'd find useful.

Click to expand...

I've already told you. Strings terminated ON THE RIGHT with a Nul is a
bad thought for two reasons:

* It prevents Nul from occuring in a string
* It mandates Eurocentric left to right processing

My respect o meter in your case is diminishing.

As to a length code limiting string length, do the math. 2^63 - 1 or
2^64 - 1 is a big number, and run length codes can be used, especially
in OO programming. Furthermore, even if the string is longer, you can
still process it with an unknown length, which OO programming handles
quite nicely.

Click to expand...

Seebs · Feb 25, 2010

Hmmm, I was wondering how long it would take. You oil up to women
posters here until they find you out and then you lose it with them too.
It becomes clearer and clearer how Richard Heathfield's list so aptly
describes you.

More generally, I don't recall ever seeing "how dare you" not be coming from
someone who was consistently and demonstrably more abusive than the people
he/she/it was complaining about.

This gets back to the general question of heuristics in evaluating code or
coders. There are things that tell you that Something Is Wrong. They're
not totally reliable, but they're good bets.

-s

Ben Bacarisse · Feb 25, 2010

Seebs said:
Idle curiousity: How's mine do? I haven't checked to see what the official
interface is, but I'm pretty sure this is adequately obvious. It presumably
suffers from double-scanning, but I don't know how much that matters. It
doesn't do a lot of mallocs.

It all depends on the length of the string and probably on the quality
of the library. Yours is ps_replace. I put it next to my first one,
bb_replace, because they are structurally the same (as is Willem's
first which I have not timed since I expected it to be the same as
mine).

replace(`cat d4004`, "[]", "xx"):
blm_replace 4086300 calls in 4.998s is 1.223e-06 s/call 1.223 Âµs/call
rh_replace2 647200 calls in 5.000s is 7.726e-06 s/call 7.726 Âµs/call
cmt_replace 5229000 calls in 5.000s is 9.562e-07 s/call 956.2 ns/call
w_replace 617700 calls in 5.001s is 8.096e-06 s/call 8.096 Âµs/call
en_replace 627100 calls in 4.999s is 7.972e-06 s/call 7.972 Âµs/call
fast_replace 1424600 calls in 5.000s is 3.51e-06 s/call 3.51 Âµs/call
bb_replace 3141500 calls in 5.000s is 1.592e-06 s/call 1.592 Âµs/call
ps_replace 3138400 calls in 5.000s is 1.593e-06 s/call 1.593 Âµs/call
replace(`cat d4004`, "{}", "xx"):
blm_replace 4630500 calls in 4.999s is 1.08e-06 s/call 1.08 Âµs/call
rh_replace2 648700 calls in 5.001s is 7.709e-06 s/call 7.709 Âµs/call
cmt_replace 5258400 calls in 4.999s is 9.507e-07 s/call 950.7 ns/call
w_replace 620100 calls in 5.001s is 8.064e-06 s/call 8.064 Âµs/call
en_replace 640500 calls in 4.999s is 7.806e-06 s/call 7.806 Âµs/call
fast_replace 1300000 calls in 5.000s is 3.846e-06 s/call 3.846 Âµs/call
bb_replace 2985300 calls in 5.000s is 1.675e-06 s/call 1.675 Âµs/call
ps_replace 2991000 calls in 5.000s is 1.672e-06 s/call 1.672 Âµs/call
replace(`cat wap.txt`, "and", "xx"):
blm_replace 200 calls in 5.143s is 0.02571 s/call 25.71 ms/call
rh_replace2 500 calls in 5.540s is 0.01108 s/call 11.08 ms/call
cmt_replace 200 calls in 4.537s is 0.02269 s/call 22.69 ms/call
w_replace 500 calls in 5.205s is 0.01041 s/call 10.41 ms/call
en_replace 400 calls in 4.612s is 0.01153 s/call 11.53 ms/call
fast_replace 600 calls in 5.214s is 0.008689 s/call 8.689 ms/call
bb_replace 200 calls in 8.728s is 0.04364 s/call 43.64 ms/call
ps_replace 100 calls in 4.397s is 0.04397 s/call 43.97 ms/call
replace(`cat wap.txt`, "ZZZ", "xx"):
blm_replace 300 calls in 6.698s is 0.02233 s/call 22.33 ms/call
rh_replace2 700 calls in 4.446s is 0.006351 s/call 6.351 ms/call
cmt_replace 300 calls in 6.549s is 0.02183 s/call 21.83 ms/call
w_replace 700 calls in 4.673s is 0.006675 s/call 6.675 ms/call
en_replace 800 calls in 5.167s is 0.006459 s/call 6.459 ms/call
fast_replace 1100 calls in 4.465s is 0.004059 s/call 4.059 ms/call
bb_replace 200 calls in 8.509s is 0.04255 s/call 42.55 ms/call
ps_replace 200 calls in 8.541s is 0.04271 s/call 42.71 ms/call
replace("abzzefzzijlmzzpqrzzuvzzyz", "zz", "xx"):
blm_replace 7714500 calls in 4.999s is 6.48e-07 s/call 648 ns/call
rh_replace2 31172100 calls in 5.000s is 1.604e-07 s/call 160.4 ns/call
cmt_replace 8842600 calls in 5.000s is 5.654e-07 s/call 565.4 ns/call
w_replace 26458800 calls in 5.000s is 1.89e-07 s/call 189 ns/call
en_replace 12170600 calls in 5.000s is 4.108e-07 s/call 410.8 ns/call
fast_replace 22489800 calls in 5.000s is 2.223e-07 s/call 222.3 ns/call
bb_replace 6505100 calls in 5.000s is 7.686e-07 s/call 768.6 ns/call
ps_replace 6331000 calls in 5.000s is 7.898e-07 s/call 789.8 ns/call
replace("abcdefghijlmnopqrstuvwxyz", "zz", "xx"):
blm_replace 43166400 calls in 4.999s is 1.158e-07 s/call 115.8 ns/call
rh_replace2 43100600 calls in 5.000s is 1.16e-07 s/call 116 ns/call
cmt_replace 22713000 calls in 5.000s is 2.201e-07 s/call 220.1 ns/call
w_replace 53059200 calls in 5.000s is 9.423e-08 s/call 94.23 ns/call
en_replace 35975500 calls in 5.000s is 1.39e-07 s/call 139 ns/call
fast_replace 49671300 calls in 5.000s is 1.007e-07 s/call 100.7 ns/call
bb_replace 31664100 calls in 5.000s is 1.579e-07 s/call 157.9 ns/call
ps_replace 31432800 calls in 5.000s is 1.591e-07 s/call 159.1 ns/call

The tests are explained elsewhere, but I will repeat the details if
you like. The first two are modelled on B L Massingill's test data
(for comparison).

<snip code>

spinoza1111 · Feb 25, 2010

[snip]

There's a semi-sane answer here in the last case, but isn't

Click to expand...

Click to expand...

HOW.DARE.YOU. How DARE you start talking about sanity? It isn't
collegial, and it is libel and completely insensitive. It's talking
like those thugs and shitheads here, Seebach and Heathfield.

Click to expand...

Hmmm, I was wondering how long it would take. You oil up to women
posters here until they find you out and then you lose it with them too.
It becomes clearer and clearer how Richard Heathfield's list so aptly
describes you.

Click to expand...

No, I treat all posters, male or female, with decency and respect
until they foolishly or ignorantly or thoughtlessly use their relative
anonymity to start introducing off-topic personal abuse.

You see, it's been impossible for anyone to narrate this type of
relation, since the 1960s, as anything but the weakness they feel in
themselves.

I don't give a flying **** about blm's gender, and I ain't lookin' for
a date. Are you? Had she been a male, acting with relative restraint
and collegiality as she has, I would have said the same thing. Had
this male then started in, calling good questions "insane" (as John
Conway treated Nash) I would have responded in exactly the same way.

Men have learned self-protectively to narrate lives without honor, but
I missed the class.

I hope you also realise that by screaming that you want Mr Heathfield
"out of this ng", IOW by appointing yourself as clc monitor, you become
the chief "reg", and your brown-nosers "Richard", "Kenny", and Twinky
become your co-conspirators.

Click to expand...

Gee, I have fans? I wasn't aware of that. I don't think Richard, Kenny
or da Twink count. I think they are instead independent spirits who
think for themselves.

Nick Keighley · Feb 26, 2010

HOW.DARE.YOU. How DARE you start talking about sanity?

Click to expand...

this is talking about the /solution/ not the person. Just as when I
referred to another posters solution to a problemn as "value based" I
wasn't validating his morals and ethics. A sane solution is one that
isn't crazy.

Are you paranoid?

<snip naming dropping rant>

I could sware you told me there were no bad books. And yet there are
bad thoughts? double plus ungood.

I've already told you. Strings terminated ON THE RIGHT with a Nul is a
bad thought for two reasons:

* It prevents Nul from occuring in a string

Click to expand...

a count preceeded string is bad because it limits the maximum size of
the string. I suppose a chain of blocks doesn't have this limitation.

* It mandates Eurocentric left to right processing

Click to expand...

nope. You are confusing representation and presentation. The nul isn't
at the right hand end it's at the largest address. If the display
device chooses to print r-to-l instead of l-to-r it makes not he
blindest bit of difference.

[...]

As to a length code limiting string length, do the math. 2^63 - 1 or
2^64 - 1 is a big number, and run length codes can be used, especially
in OO programming. Furthermore, even if the string is longer, you can
still process it with an unknown length, which OO programming handles
quite nicely.

Click to expand...

oh, 2^64 is a vast number (16x10^18 or 16 quintillion). It also takes
8 bytes to store.

spinoza1111 · Feb 26, 2010

this is talking about the /solution/ not the person. Just as when I
referred to another posters solution to a problemn as "value based" I
wasn't validating his morals and ethics. A sane solution is one that
isn't crazy.

Are you paranoid?

Click to expand...

You say you're talking about the "solution", but a spectre is haunting
c.l.c: normalized deviance and the strong possibility that because of
normalized deviance, most people here are batshit, and I'm not, which
could mean the opposite as well.

<snip naming dropping rant>

I could sware you told me there were no bad books. And yet there are
bad thoughts? double plus ungood.

Click to expand...

Sure. Freedom of speech is completely consistent with criticism of
published thought, but NOT by poseurs. Seebach, in my opinion, is a
poseur. Therefore, my freedom of speech enables me to call him a
poseur.

Your own thought seems to proceed in true regular guy mode: Chomsky 3
substitution of strings, but I can with perfect consistence say that
"because there are no bad books, the only proper critic of a book is
someone at least as well-qualified as the author in a free society: we
need not credit, and should not support, the 'free' expression of
thought by clearly unqualified people. In particular, public sites
such as Amazon should be strictly moderated."

"Once a book has made it to publication, we can be reasonably certain
that it's of minimal quality. But this is not possible in a public
access site. In view of the damage done to people at such sites, they
need to be controlled."

"The only way we can do this is by means of formal certification from
accredited universities, or its equivalent in the form of
certification exams administered by a government or nonprofit agency,
since the corporation is corrupted of necessity by its fiduciary duty
to make a profit before all else."

a count preceeded string is bad because it limits the maximum size of
the string. I suppose a chain of blocks doesn't have this limitation.

Click to expand...

I suppose not. However, the length may be inexpressible if it exceeds
what in C is called "long long" precision. OO systems handle this
cleanly: it gives C the willies.

nope. You are confusing representation and presentation. The nul isn't
at the right hand end it's at the largest address. If the display
device chooses to print r-to-l instead of l-to-r it makes not he
blindest bit of difference.

Click to expand...

What on EARTH does the word "print" mean? This is an issue I've raised
elsethread: that programmers, as ancillary and supernumerary people
now that data processing is a cost center, are collectively in
developed countries using a language from a dreamtime.

I'm aware that left to right can be reversed by layering software,
something at which C programmers suck because C sucks at it.

Your solution forces the wogs to get wog devices that print backwards.
However, it still forces developers to think eurocentrically with the
result that the non-Latin output is necessarily a second choice.

[...]

As to a length code limiting string length, do the math. 2^63 - 1 or
2^64 - 1 is a big number, and run length codes can be used, especially
in OO programming. Furthermore, even if the string is longer, you can
still process it with an unknown length, which OO programming handles
quite nicely.

Click to expand...

oh, 2^64 is a vast number (16x10^18 or 16 quintillion). It also takes
8 bytes to store.

Click to expand...

Boo hoo. Like I said, people are thinking in comic book terms:

Alice: But Dr Nilges, you fool! A long long takes eight bytes to
store!
Nilges: (Evil laugh) I care not! Storage is almost free! I vill use 8
bytes per string and rule zee verld! Nyahh ha ha! Nice jugs! Nyah ha
ha!
Timmy: Mommy I'm scared!
Ruff: Woof woof!

After the waste of spirit in an expense of shame that is normalized
deviance, I'm game for Weird Science. Comic book thinking, like Peter
Seebach's invalid reasoning about quirkiness, is an excellent way of
producing The Usual Crap that we see here.

spinoza1111 · Feb 26, 2010

More generally, I don't recall ever seeing "how dare you" not be coming from
someone who was consistently and demonstrably more abusive than the people
he/she/it was complaining about.

Somehow, that doesn't surprise me. No sound ever comes from the gates
of Eden, and my guess is that you grew up in some sort of addictive
and enabling system, one in which nobody ever raised their voices or
spoke truth to power.

A lot of people do.

And sure, you might as well use this factoid against me: some of my
biggest online fans, but not all, are homeless men accessing the
Internet at the public library, along with a striking number of women
of color.

Why don't you use that fact against me? I need a laugh. The fact is
that some of my most brilliant coworkers from Silicon Valley of the
1980s are dying in motels without health insurance, and chances R you
don't give a rat's ass.

This gets back to the general question of heuristics in evaluating code or
coders. There are things that tell you that Something Is Wrong. They're
not totally reliable, but they're good bets.

Something is Wrong, all right. It's that you shit on people and pay
your way onto standards boards without any qualifications that are
independently verifiable of whatever makes your current employer a
quick buck. You've used the freedom and anonymity of the Internet to
make yourself seem more qualified than you are.

Furthermore, it's unethical to evaluate PEOPLE based on heuristics.
The only ethical procedure is to send them private email, or respond
to their private email and use this medium (or Skype video
conferencing) to verify what you so pompously call "heuristics", a big
word you don't understand, poser. But you're too much of a sissy,
girlie-man, wimp, coward and boor to do this, we have found, since I
sent you email asking for such a conference.

You could have saved clc a lot of bandwidth, you who so seem to value
machines and foolish programming languages and ideas over people and
their reputations, by installing Skype (it's free) and contacting
Edward.G.Nilges at a mutually agreeable time.

But you prefer, being addicted to a false but threatened self-image of
competence, discoursing with shadows.

More generally, Skype could be used to set up a public videoconference
for people to attend, I believe. I think clc needs this to clear the
air, and if possible, to (1) agree on a better behavioral charter and
(2) agree to disagree on undecidable issues.

But given your refusal to work with McGraw Hill, I don't think you
have the guts for this.

spinoza1111 · Feb 26, 2010

HOW.DARE.YOU. How DARE you start talking about sanity?

Click to expand...

this is talking about the /solution/ not the person. Just as when I
referred to another posters solution to a problemn as "value based" I
wasn't validating his morals and ethics. A sane solution is one that
isn't crazy.

Are you paranoid?

<snip naming dropping rant>

I could sware you told me there were no bad books. And yet there are
bad thoughts? double plus ungood.

I've already told you. Strings terminated ON THE RIGHT with a Nul is a
bad thought for two reasons:

Click to expand...

* It prevents Nul from occuring in a string

Click to expand...

a count preceeded string is bad because it limits the maximum size of
the string. I suppose a chain of blocks doesn't have this limitation.

* It mandates Eurocentric left to right processing

Click to expand...

nope. You are confusing representation and presentation. The nul isn't
at the right hand end it's at the largest address. If the display
device chooses to print r-to-l instead of l-to-r it makes not he
blindest bit of difference.

[...]

As to a length code limiting string length, do the math. 2^63 - 1 or
2^64 - 1 is a big number, and run length codes can be used, especially
in OO programming. Furthermore, even if the string is longer, you can
still process it with an unknown length, which OO programming handles
quite nicely.

Click to expand...

oh, 2^64 is a vast number (16x10^18 or 16 quintillion). It also takes
8 bytes to store.

Click to expand...

Nit picking, it takes nine. You're thinking of 2**64-1 at least in
twos complement, which Herb Schildt knows is in common use. But I
understand your point.

Of course, a length of a string is never negative (barring something
Richard Heathfield's code or linked list might do to strings: the mind
boggles) so if it's unsigned long long, then yes, that's 8 bytes,
idn't it.

Nit picking and brain farting merrily along, you could use run length
encoding or something more complicated to reduce the length of the
length to the sum of the length of segments.

In OO code, you can in fact store the length in whatever is
appropriate for the string. You can do the same with unions in C, but
many managers get the willies when programmers code the word union,
har har.

Ben Bacarisse · Feb 26, 2010

Ben Bacarisse <[email protected]> wrote:

Indeed. Should we compare hardware configurations?

I'll do this by email unless anyone else says they are still
interested in this (the differences in times, rather than the times
themselves).

<snip>

blmblm · Feb 26, 2010

[ snip ]

well I missed it to! A programming competition seemed to break out
without it being clear (to me at least) what the program was supposed
to do. By the time it had been sort of clarified I'd seen a fair
amount of code.

I'm just sorry I didn't think of doing it recursivly! That was
impressive.

Hm! All of my solutions have involved recursion in some form, though
perhaps not as elegantly as some of the more-cryptic-to-me solutions
posted.

and of course like most formal specifications it's damn nearly code!
Though you do get a better class of primitive.

"A better class of primitive"? for code, you mean?

And then perhaps we could replace the "does not use string.h"
constraint with something more meaningful [*], though I'm not
sure what.

[*] My objection to this constraint is that any minimally competent
programmer should be able to write functions that implement the
same API, so just avoiding use of the library functions doesn't
seem to me to make the problem more interesting.

Click to expand...

there were some proposals for better APIs, ones that avoided
repeatedly rescanning the string.

?

But here's another question .... Should the question of whether the
string is rescanned even be part of the API? I guess it *could* be;
I mean, I've encountered APIs that make guarantees about performance,
at the big-O level of abstraction.

The virtue of using string.h is that
you get some abstraction. Pages of inlined string.h-like functions is
hard to comprehend, hard to debug and hard to test.

Agreed!

Fibonacci	0	May 13, 2023
Blue J Ciphertext Program	2	Nov 22, 2023
RSA implementation issues in public key pem loader function	0	May 21, 2025
Adding adressing of IPv6 to program	1	Feb 16, 2023
Secure Keyboard v2.0 Modern C++ Virtual Keyboard for Windows (Glassmorphism UI, Clipboard Auto-Clear)	0	Mar 26, 2026
Need help finding Segmentation fault C++	0	Apr 16, 2022
Need Helping adding Square root code to an existing calculator. (Absolute begginer?)	0	Jan 12, 2025
Python code problem	2	Apr 23, 2023

substring finding problem!

blmblm

blmblm

blmblm

blmblm

Nick Keighley

Nick Keighley

Seebs

Seebs

Seebs

spinoza1111

spinoza1111

Seebs

Ben Bacarisse

spinoza1111

Nick Keighley

spinoza1111

spinoza1111

spinoza1111

Ben Bacarisse

blmblm

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads