automated sub-string search/replace algorithm...

Keith Thompson · Mar 1, 2010

Eric Sosman said:
Read Bentley and McIlroy's "Engineering a Sort Function."
(Note that their final version of qsort() sometimes fails to
satisfy 7.20.5p2. I don't see the reason for 7.20.5p2, and the
Rationale is unenlightening, but it's there nonetheless.)

C99 7.20.5p2 says:

The implementation shall ensure that the second argument of
the comparison function (when called from bsearch), or both
arguments (when called from qsort), are pointers to elements
of the array. The first argument when called from bsearch
shall equal key.

That seems like a reasonable and necessary requirement. What am I
missing?

Eric Sosman · Mar 1, 2010

[...]
for each i in array
create random index k between 0 and array size
swap array with array[k]

Click to expand...

Note that this scrambling algorithm is "random," but biased:
Some rearrangements are more likely than others.

Click to expand...

Can you explain what is biased? The bias from the rng as a quality of
implementation, or the method of iterating over the array?

(Drifting away from C here; maybe comp.programming would
be a better forum. Follow-ups set.)

The latter. There could be additional problems with the
generator, but that's another matter: The shuffle as given is
biased even with a perfectly uniform generator.

Consider the case of an array holding the three elements
A,B,C. You'll call the generator three times, getting one of
0,1,2 each time. Three outcomes for the first call, three for
the second, three for the third, 3**3 = 27 possible shuffles
in all, and if the generator is good they'll be equiprobable.

Each of the 27 shuffles leads to some permutation of A,B,C,
and there are 3! = 6 such permutations. But 27 is not divisible
by 6, so it is not possible to have every permutation produced
by the same number of shuffles. The best you might hope for would
be three permutations produced by four shuffles each and the other
three by five shuffles, so if the shuffles are equiprobable
three permutations will have probability 4/27 ~= 14.8% and the
others will have probability 5/27 ~= 18.5%.

More generally, the method is biased whenever N**N is not
divisible by N! -- that is, whenever N > 2. Fortunately, a very
simple modification to your algorithm rescues the situation; you
may find it instructive to figure it out for yourself. (Hint:
The only change required is in the second line.)

Eric Sosman · Mar 1, 2010

C99 7.20.5p2 says:

The implementation shall ensure that the second argument of
the comparison function (when called from bsearch), or both
arguments (when called from qsort), are pointers to elements
of the array. The first argument when called from bsearch
shall equal key.

That seems like a reasonable and necessary requirement. What am I
missing?

For "convenient" element sizes, Bentley & McIlroy will
copy the pivot element to an outside-the-array temporary and
call the comparator with a pointer to the temporary and a
pointer to an in-the-array element. IIRC, this makes some of
their array rearrangements simpler.

As far as I can see, the only thing a comparator can do with
the 7.20.5p2 guarantee that it cannot do with B&Mc is calculate
the array indices of the two arguments, or use <,<=,>=,> on the
pointers, or equivalent. But I can't imagine a comparator that
could do so usefully, given that the array may get rearranged both
before and after each comparator call. So I don't understand why
7.20.5p2 is a requirement. For bsearch(), yes, just possibly, but
for qsort()? I don't get it.

Eric Sosman · Mar 1, 2010

On Mar 1, 8:51 am, Malcolm McLean<[email protected]>
wrote:

Click to expand...

On a side note, if you know that qsort has issues with getting fed
almost sorted data, can't you just do a O(n) random permutation step
on the data to mitigate those problems?

Click to expand...

for each i in array
create random index k between 0 and array size
swap array with array[k]

Click to expand...

I have a recursive (possibly naive) implementation of quicksort that
I've been using, but I haven't exercised it on data sets large enough
to generate stack overflows (yet). However, I do run into the use
case where quicksort is passed almost sorted data. I'm just wondering
if this is a viable (if hackish) method to encourage better average
performance for almost sorted data sets.

Click to expand...

to write shuffle (untested)

void shuffle(void *data, size_t N, size_t width)
{
size_t i;
size_t target;

for(i=0;i<N-1;i++)
[...]
Note deliberate bug in this code. Anyone spot it?
[...]

Click to expand...

I didn't notice the bug from just observation of the code.

Consider the case N==0. McLean is size_t-phobic, and
never passes up an opportunity to prove how bad size_t is by
finding ways to misuse it.

(There might be other bugs, too, but I stopped reading
after the first blatant gaffe.)

Chris M. Thomasson · Mar 1, 2010

Ben Bacarisse said:
No. The bug is the last one reported: that an attempt to match in an
empty string ends up by having the code use an indeterminate pointer.

Ahhh! Well, the piece of shi% test does not create empty source strings, so
no wonder why it failed to catch it. Stupid me!!!!!

DAMN IT!!! GRRR!

Okay... I will fix the random string generation so that it can create zero
length strings and post it here when I get some time.

BTW, thanks a lot Ben! I appreciate all of your input.

Ben Bacarisse · Mar 1, 2010

ImpalerCore said:
On Mar 1, 8:51 am, Malcolm McLean <[email protected]>
wrote:

Click to expand...

On a side note, if you know that qsort has issues with getting fed
almost sorted data, can't you just do a O(n) random permutation step
on the data to mitigate those problems?

Click to expand...

for each i in array
create random index k between 0 and array size
swap array with array[k]

Click to expand...

I have a recursive (possibly naive) implementation of quicksort that
I've been using, but I haven't exercised it on data sets large enough
to generate stack overflows (yet). However, I do run into the use
case where quicksort is passed almost sorted data. I'm just wondering
if this is a viable (if hackish) method to encourage better average
performance for almost sorted data sets.

Click to expand...

to write shuffle (untested)

void shuffle(void *data, size_t N, size_t width)
{
size_t i;
size_t target;

for(i=0;i<N-1;i++)
{
target = (size_t) ((rand()/ (RAND_MAX + 1.0)) * (N-i) + i);
memswap( (unsigned char *)data + (i*width), (unsigned char *)data
+(target *width), width);
}

}

Note deliberate bug in this code. Anyone spot it?
Another snag is that memswap isn't a standard library function. In
fact it's hard to provide an efficient, portable implementation.

Click to expand...

I didn't notice the bug from just observation of the code.

rand() - number between 0 and RAND_MAX
RAND_MAX + 1.0 - convert to floating point
x = rand() / (RAND_MAX + 1.0) - should be floating point range between
0 <= x < 1

You can't be sure of this. If RAND_MAX is very large the calculation
can overcome the precision of the floating point system and you can
get x == 1.0. All you need is a 64 bit int and a conventional double.

if i == 0
target = (x) * N, range 0 <= target < N

if i == N-2
target = (x) * 2 + N-2, range N-2 <= target < N

It seems ok, so I'm obviously missing the error.

Click to expand...

The other two bugs (if they can be called that) are (a) the bias
introduced by the random number calculation (small, but it might
matter) and (b) the fact that you can't shuffle no data (N == 0).

It's possible that none of these is the one planted, in which case
I've missed it too.

<snip>

Chris M. Thomasson · Mar 1, 2010

Chris M. Thomasson said:
Ahhh! Well, the piece of shi% test does not create empty source strings,
so no wonder why it failed to catch it. Stupid me!!!!!

DAMN IT!!! GRRR!

Okay... I will fix the random string generation so that it can create zero
length strings and post it here when I get some time.

BTW, thanks a lot Ben! I appreciate all of your input.

Here is the updated test:

http://clc.pastebin.com/tqdPpa8w

Now, this makes Edwards code seg-fault. Also, it reports an error in your
algorithm on the following input:

ben_replace("", "", "xxx");

It spits out a zero-length string. This fails because my testing code
expects it to result in a string equal to "xxx". Did you design it that way
on purpose? My reasoning is that if you have an empty source and empty
comparand, then you got a match.

Any thoughts?

Seebs · Mar 1, 2010

Any thoughts?

Mine returns a null pointer for an empty needle, because there's no obvious
reason it should be "xxx" rather than "xxxxxx", because immediately after
the first empty string, there's a second. And come to think of it, a third
right after that. And a fourth...

So IMHO, an empty needle should be rejected as invalid. I don't think it
makes sense to special case an empty source string, because it isn't any
emptier than the space between a and b in "ab".

(And I don't think we want to go to "but the empty string is identified by
the NUL at the end", because that's still a special case -- in other cases,
we don't look for or match the NUL.)

-s

Chris M. Thomasson · Mar 1, 2010

Seebs said:
Mine returns a null pointer for an empty needle, because there's no
obvious
reason it should be "xxx" rather than "xxxxxx", because immediately after
the first empty string, there's a second. And come to think of it, a
third
right after that. And a fourth...

So IMHO, an empty needle should be rejected as invalid. I don't think it
makes sense to special case an empty source string, because it isn't any
emptier than the space between a and b in "ab".

I really need to think about this. One reason why I did it is because in
Microsoft Word, if I do a search for any character (e.g., the "^?" code) and
replace it with "xxx" on a completely empty document, the result is "xxx".
Then if I repeat the action the result is "xxxxxxxxxxxx". The first action
is identical to how my replace works... However, the second one differs
because my replace will treat a non-empty source and a empty comparand as a
non-match. This is inconsistent behavior.

Do you think it would be pointless/stupid to try and emulate the behavior of
MS Word?

Interesting... Humm...

Chris M. Thomasson · Mar 1, 2010

Chris M. Thomasson said:
I really need to think about this. One reason why I did it is because in
Microsoft Word, if I do a search for any character (e.g., the "^?" code)
and replace it with "xxx" on a completely empty document, the result is
"xxx". Then if I repeat the action the result is "xxxxxxxxxxxx". The first
action is identical to how my replace works... However, the second one
differs because my replace will treat a non-empty source and a empty
comparand as a non-match. This is inconsistent behavior.

Do you think it would be pointless/stupid to try and emulate the behavior
of MS Word?

Interesting... Humm...

Ahh fuc% it! I will not special case it. I will alter my replace algorithm
and my test in order to change it's expectations on empty comparands.

Nick · Mar 1, 2010

Chris M. Thomasson said:
I really need to think about this. One reason why I did it is because
in Microsoft Word, if I do a search for any character (e.g., the "^?"
code) and replace it with "xxx" on a completely empty document, the
result is "xxx". Then if I repeat the action the result is
"xxxxxxxxxxxx". The first action is identical to how my replace
works... However, the second one differs because my replace will treat
a non-empty source and a empty comparand as a non-match. This is
inconsistent behavior.

Do you think it would be pointless/stupid to try and emulate the
behavior of MS Word?

Interesting... Humm...

My view is that an empty match string should return the target string
unaltered, whatever it is. Or should be an error.

You can treat both empty as a special case I suppose, but my argument is
that replace("abc","","xx") should clearly return "abc" (or be an
error). So by extension, although replace("","","xx") could return
"xx", it ought to be "". Or an error - I'm coming rapidly to that
conclusion.

Keith Thompson · Mar 1, 2010

Seebs said:
Mine returns a null pointer for an empty needle, because there's no obvious
reason it should be "xxx" rather than "xxxxxx", because immediately after
the first empty string, there's a second. And come to think of it, a third
right after that. And a fourth...

So IMHO, an empty needle should be rejected as invalid. I don't think it
makes sense to special case an empty source string, because it isn't any
emptier than the space between a and b in "ab".

(And I don't think we want to go to "but the empty string is identified by
the NUL at the end", because that's still a special case -- in other cases,
we don't look for or match the NUL.)

It might make sense to look at existing practice.

In sed and vim, an empty needle isn't even supported; "s/foo/bar/"
replaces "foo" by "bar", but "s//bar/" replaces the previous search
pattern by "bar". This doesn't apply to the current problem, so
there's no help there.

In Perl, an empty needle is found once at each possible position.
So, for example, this:

$s = "hello";
$s =~ s//FOO/g;

sets $s to "FOOhFOOeFOOlFOOlFOOoFOO", which I suppose makes as much
sense as anything.

Seebs · Mar 1, 2010

Do you think it would be pointless/stupid to try and emulate the behavior of
MS Word?

So far as I know, yes.

-s

Ben Bacarisse · Mar 1, 2010

Nick said:
My view is that an empty match string should return the target string
unaltered, whatever it is. Or should be an error.

I don't like the idea of signalling an error when the error is
trivially testable by the caller. In a language with exceptions, I'd
raise one but, in C, there is little to be gained by returning NULL
(say) for anything but a failure that can't be tested for in advance
(such as running out of memory).

The caller can always wrap the function to do what they want in any of
the cases that one might be tempted to signal as errors (NULL pointers
or empty strings).

Ben Bacarisse · Mar 1, 2010

Keith Thompson said:
Seebs <[email protected]> writes:

It might make sense to look at existing practice.

In sed and vim, an empty needle isn't even supported; "s/foo/bar/"
replaces "foo" by "bar", but "s//bar/" replaces the previous search
pattern by "bar". This doesn't apply to the current problem, so
there's no help there.

In Perl, an empty needle is found once at each possible position.
So, for example, this:

$s = "hello";
$s =~ s//FOO/g;

sets $s to "FOOhFOOeFOOlFOOlFOOoFOO", which I suppose makes as much
sense as anything.

In one way this makes more sense for an regular expression replace
(such a Perl's) than for a string replace like the one under
discussion. The reason is that an RE match is usually defined as
being the "longest" one. It is therefore reasonable that an empty RE
should match all of the infinite number of consecutive match positions
at each character. Conversely, a string replacer should try to
replace them all, one by one.

Ike Naar · Mar 1, 2010

So IMHO, an empty needle should be rejected as invalid. I don't think it
makes sense to special case an empty source string, because it isn't any
emptier than the space between a and b in "ab".

Perhaps an empty needle need not be rejected if the replacement
is also empty; in that case the result string should be the same
as the source string (replacing an arbitrary number of ""-s by
""-s is a no-op).

Seebs · Mar 1, 2010

Perhaps an empty needle need not be rejected if the replacement
is also empty; in that case the result string should be the same
as the source string (replacing an arbitrary number of ""-s by
""-s is a no-op).

True, but I think that feels like even more of a special case to me.

I think I'd rather just say "you have to have something in mind to
search for".

-s

bartc · Mar 1, 2010

Seebs said:
Mine returns a null pointer for an empty needle, because there's no
obvious
reason it should be "xxx" rather than "xxxxxx", because immediately after
the first empty string, there's a second. And come to think of it, a
third
right after that. And a fourth...

If the replacement algorithm starts with a quick check like this:

'if inputstring == matchstring then return replacementstring',

Then ("abc","abc","xyz) results in "xyz";

("ab","ab","xyz") results in "xyz";

("a","a","xyz") results in "xyz"; and you then expect:

("","","xyz") to also result in "xyz" (in other words, (S,S,U) always gives
U))

So IMHO, an empty needle should be rejected as invalid. I don't think it
makes sense to special case an empty source string, because it isn't any
emptier than the space between a and b in "ab".

Searching for empty strings is meaningless, but comparing them isn't (imo).

Ben Bacarisse · Mar 2, 2010

Perhaps an empty needle need not be rejected if the replacement
is also empty; in that case the result string should be the same
as the source string (replacing an arbitrary number of ""-s by
""-s is a no-op).

.... but maybe it should take an unbounded amount of time to do it

Smileys aside, good point.

Nick Keighley · Mar 2, 2010

Ahhh! Well, the piece of shi% test does not create empty source strings, so
no wonder why it failed to catch it. Stupid me!!!!!

DAMN IT!!! GRRR!

Okay... I will fix the random string generation so that it can create zero
length strings and post it here when I get some time.

BTW, thanks a lot Ben! I appreciate all of your input.- Hide quoted text -

this rather confirms the misgivings about random testing...

automated test for sub-string search algorithms...	3	Feb 20, 2010
Fast String search algorithm	10	Dec 16, 2011
Search algorithm for HTML files	2	Sep 29, 2010
Solution to "One Algorithm Problem"	8	Apr 20, 2009
Boyer-Moore string search algorithm in ruby	2	Jul 22, 2007
Search and replace algorithm for string	13	Aug 2, 2005
Synchronization Algorithm Verificator for C++0x	31	Aug 2, 2008
GURU NEEDED : macro SQUARE(x) for any type x	12	Jan 14, 2011

automated sub-string search/replace algorithm...

Keith Thompson

Eric Sosman

Eric Sosman

Eric Sosman

Chris M. Thomasson

Ben Bacarisse

Chris M. Thomasson

Seebs

Chris M. Thomasson

Chris M. Thomasson

Nick

Keith Thompson

Seebs

Ben Bacarisse

Ben Bacarisse

Ike Naar

Seebs

bartc

Ben Bacarisse

Nick Keighley

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads