substring finding problem!

fedora · Feb 14, 2010

Hi group!

Reading all posts about Spinozas efforts to create string substitute
program, i wanted to code mine too and specs was that not to use the
<string.h> library. But already problem in finding all places where
substring occurs in a string. i'm looking for long time but not able to see
where error is. Any help on where i made mistake is appreciated. TIA

Code is :-

#include<stdlib.h>
#include<stdio.h>
#include<string.h>

unsigned strLength(char *s)
{
unsigned idx;
for (idx = 0; s[idx] != '\0'; idx++)
;
return idx;
}

char *strSubstr(char *s, char *t, unsigned ls, unsigned lt)
{
char *substr = 0;
unsigned end = ls - lt, i, j;

for (i = 0; i <= end; i++) {
if (s == t[0] && s[i + (lt-1)] == t[lt-1]) {
for (j = 1; j < lt && s[i + j] == t[j]; j++)
;
if (j == lt) {
substr = s + i;
i = end;
}
}
}
return substr;
}

unsigned findSubstr(char *s, char *t, unsigned ls, unsigned lt, char ***sp)
{
unsigned n, m, lu;
char *u;

for (n = 0, u = s, lu = ls;
lu >= lt && (u = strSubstr(u, t, lu, lt));
n++, u += lt, lu = ((s+ls) - u))
;
if (sp && (*sp = malloc(n * sizeof **sp))) {
for (m = 0, u = s, lu = ls; m < n; m++, u += lt, lu = ((s+ls) - u))
sp[0][m] = strSubstr(u, t, lu, lt);
}
return n;
}

int main()
{
char **p;
unsigned found, i;

printf("heee e\n");
found = findSubstr("heee", "e", strlen("heee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hee e\n");
found = findSubstr("hee", "e", strlen("hee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hhhh h\n");
found = findSubstr("hhhh", "h", strlen("hhhh"), strlen("h"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

}

Output :-

heee e
3 times
0x400a66, e
0x400a66, e
0x400a67, e
hee e
2 times
0x400a84, e
0x400a84, e
hhhh h
4 times
0x400a90, h
0x400a91, h
0x400a92, h
0x400a93, h

It is correct when substr starts as first character of string, but if not
then always it is repeated twice...

fedora · Feb 14, 2010

Richard said:
fedora said:

Hi group!

Reading all posts about Spinozas efforts to create string substitute
program, i wanted to code mine too and specs was that not to use the
<string.h> library. But already problem in finding all places where
substring occurs in a string. i'm looking for long time but not able to
see where error is. Any help on where i made mistake is appreciated.
TIA

Code is :-

#include<stdlib.h>
#include<stdio.h>
#include<string.h>

Click to expand...

Better:

#include <stdlib.h>
#include <stdio.h>

#include said:

unsigned strLength(char *s)
{
unsigned idx;
for (idx = 0; s[idx] != '\0'; idx++)
;
return idx;
}

Click to expand...

To find the length of a string, use strlen unless you have a compelling
reason not to. There is no evidence above of a compelling reason not to
use strlen.

Not using string.h was the rule i thought.

char *strSubstr(char *s, char *t, unsigned ls, unsigned lt)
{
char *substr = 0;
unsigned end = ls - lt, i, j;

for (i = 0; i <= end; i++) {
if (s == t[0] && s[i + (lt-1)] == t[lt-1]) {
for (j = 1; j < lt && s[i + j] == t[j]; j++)
;
if (j == lt) {
substr = s + i;
i = end;

Click to expand...

What are you trying to do in this function?

It returns same value as strstr in string.h. It gives pointer to 1st found
occurance of string t in string s or null ptr other wise. ls is length of s
and lt is length of t, same value that strlen gives.

unsigned findSubstr(char *s, char *t, unsigned ls, unsigned lt, char
***sp)
{
unsigned n, m, lu;
char *u;

for (n = 0, u = s, lu = ls;
lu >= lt && (u = strSubstr(u, t, lu, lt));
n++, u += lt, lu = ((s+ls) - u))
;
if (sp && (*sp = malloc(n * sizeof **sp))) {
for (m = 0, u = s, lu = ls; m < n; m++, u += lt, lu = ((s+ls) - u))
sp[0][m] = strSubstr(u, t, lu, lt);
}
return n;
}

Click to expand...

To find a substring, use strstr unless you have a compelling reason not
to. There is no evidence above of a compelling reason not to use strstr.

If you want help debugging your code, your first step is to choose more
meaningful names for your objects, so that we can more easily see what
you think you're doing.

Click to expand...

findSubstr returns the no. of times t occurs in s (not over lapping) and if
sp is not null ptr, it sets *sp to point to list of pointers that point to
each start of t in s. i got this method from some one else in another
thread.. I think Ben Bacarrisse.

Thanks for your comments! I'll post another version using strlen and better
var names shotly.

fedora · Feb 14, 2010

Posting program again with longer var names. i prefer short names since long
names run out of 80x25 screen. Also now using strlen from string.h.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

char *strSubstr(char *str, char *subStr, unsigned lstr, unsigned lsubStr)
{
char *substr = 0;
unsigned lastIdx = lstr - lsubStr, firstIdx, subStrIdx;

// locate first char of subStr in str
for (firstIdx = 0; firstIdx <= lastIdx; firstIdx++) {
if (str[firstIdx] == subStr[0] &&
str[firstIdx + (lsubStr-1)] == subStr[lsubStr-1]) {

// check if complete subStr occurs at this pos in str
for (subStrIdx = 1; subStrIdx < lsubStr &&
str[firstIdx + subStrIdx] == subStr[subStrIdx]; subStrIdx++)
;
if (subStrIdx == lsubStr) {
// subStr found, so return its start and break frm loop
substr = str + firstIdx;
firstIdx = lastIdx;
}
}
}
return substr;
}

unsigned findSubstr(
char *str,
char *subStr,
unsigned lstr,
unsigned lsubStr,
char ***sp)
{
unsigned found, ctr, lu;
char *u;

// find how many times subStr is in str
for (found = 0, u = str, lu = lstr;
lu >= lsubStr && (u = strSubstr(u, subStr, lu, lsubStr));
found++, u += lsubStr, lu = ((str + lstr) - u))
;

// alloc space and copy the start of all subStr in str
if (sp && (*sp = malloc(found * sizeof **sp))) {
for (ctr = 0, u = str, lu = lstr;
ctr < found;
ctr++, u += lsubStr, lu = ((str + lstr) - u))
sp[0][ctr] = strSubstr(u, subStr, lu, lsubStr);
}
return found;
}

int main()
{
char **p;
unsigned found, i;

printf("heee e\n");
found = findSubstr("heee", "e", strlen("heee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hee e\n");
found = findSubstr("hee", "e", strlen("hee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hhhh h\n");
found = findSubstr("hhhh", "h", strlen("hhhh"), strlen("h"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

}

Output is :-

heee e
3 times
0x400a46, e
0x400a46, e
0x400a47, e
hee e
2 times
0x400a64, e
0x400a64, e
hhhh h
4 times
0x400a70, h
0x400a71, h
0x400a72, h
0x400a73, h

So if start of substring is not at first position in string, then always its
repeated twice in output. And i cant see where the error is for this. Any
help is appreciatad. TIA

spinoza1111 · Feb 14, 2010

Hi group!

Reading all posts about Spinozas efforts to create string substitute

It is not an effort. I produced one in two hours and worked
collaboratively with a couple of posters other than the regs to find
the bugs in a few more hours. It now works, as far as I know, and I am
adding a stress test and further improvements.

It is being deliberately renarrated as an "effort" by the regs owing
to their inability to solve the problem.

program, i wanted to code mine too and specs was that not to use the
<string.h> library. But already problem in finding all places where
substring occurs in a string. i'm looking for long time but not able to see
where error is. Any help on where i made mistake is appreciated. TIA

Code is :-

#include<stdlib.h>
#include<stdio.h>
#include<string.h>

unsigned strLength(char *s)
{
unsigned idx;
for (idx = 0; s[idx] != '\0'; idx++)
;
return idx;

}

char *strSubstr(char *s, char *t, unsigned ls, unsigned lt)
{
char *substr = 0;
unsigned end = ls - lt, i, j;

for (i = 0; i <= end; i++) {
if (s == t[0] && s[i + (lt-1)] == t[lt-1]) {
for (j = 1; j < lt && s[i + j] == t[j]; j++)
;
if (j == lt) {
substr = s + i;
i = end;
}
}
}
return substr;

}

unsigned findSubstr(char *s, char *t, unsigned ls, unsigned lt, char ***sp)
{
unsigned n, m, lu;
char *u;

for (n = 0, u = s, lu = ls;
lu >= lt && (u = strSubstr(u, t, lu, lt));
n++, u += lt, lu = ((s+ls) - u))
;
if (sp && (*sp = malloc(n * sizeof **sp))) {
for (m = 0, u = s, lu = ls; m < n; m++, u += lt, lu = ((s+ls) - u))
sp[0][m] = strSubstr(u, t, lu, lt);
}
return n;

}

int main()
{
char **p;
unsigned found, i;

printf("heee e\n");
found = findSubstr("heee", "e", strlen("heee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hee e\n");
found = findSubstr("hee", "e", strlen("hee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hhhh h\n");
found = findSubstr("hhhh", "h", strlen("hhhh"), strlen("h"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

}

Output :-

heee e
3 times
0x400a66, e
0x400a66, e
0x400a67, e
hee e
2 times
0x400a84, e
0x400a84, e
hhhh h
4 times
0x400a90, h
0x400a91, h
0x400a92, h
0x400a93, h

It is correct when substr starts as first character of string, but if not
then always it is repeated twice...

Ben Bacarisse · Feb 14, 2010

fedora said:
Posting program again with longer var names. i prefer short names since long
names run out of 80x25 screen.

Long is not the same as good! Choosing good names is very hard and to
a large extent it is not culturally neutral. For example, when
searching for a sub-string, I'd call the first location "anchor"
because I am used to the term "anchored search".

Also now using strlen from string.h.

That's a good idea, but there is not reason to abandon the other
string functions. If you want an exercise, then you could try to
write a fast search and replace. That will, most likely, lead to you
looking for an alternative to using strstr rather than simply
re-writing it (your strSubstr is very similar to strstr) but you will
learn practical things along the way, such as how to find where you
code is spending time.

Note, there will never be a "fastest" version because what is fast
will depend on all sorts of variables such as the quality of your C
implementation and the kind of search and replace calls you do. For
example, my simple version is still the fastest I can write for very
long strings with only a few replacements because the strstr in glibc
is very good at searching long strings.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>

char *strSubstr(char *str, char *subStr, unsigned lstr, unsigned lsubStr)
{
char *substr = 0;
unsigned lastIdx = lstr - lsubStr, firstIdx, subStrIdx;

You should use size_t for sizes like this -- it is the type returned
by strlen, but because it is an unsigned type you will need to think
about your method. You can't subtract the lengths like you do.

// locate first char of subStr in str
for (firstIdx = 0; firstIdx <= lastIdx; firstIdx++) {
if (str[firstIdx] == subStr[0] &&
str[firstIdx + (lsubStr-1)] == subStr[lsubStr-1]) {

I am not sure it is worth doing this second test.

// check if complete subStr occurs at this pos in str
for (subStrIdx = 1; subStrIdx < lsubStr &&
str[firstIdx + subStrIdx] == subStr[subStrIdx]; subStrIdx++)
;
if (subStrIdx == lsubStr) {
// subStr found, so return its start and break frm loop
substr = str + firstIdx;
firstIdx = lastIdx;

There is a statement to do that: break. Altering the loop variable to
break out of the loop makes the code harder to change.

}
}
}
return substr;
}

unsigned findSubstr(
char *str,
char *subStr,
unsigned lstr,
unsigned lsubStr,
char ***sp)
{
unsigned found, ctr, lu;
char *u;

// find how many times subStr is in str
for (found = 0, u = str, lu = lstr;
lu >= lsubStr && (u = strSubstr(u, subStr, lu, lsubStr));
found++, u += lsubStr, lu = ((str + lstr) - u))
;

You have picked up an odd style. Using lots of variables in a for
loop is not very clear. How about:

size_t matches = 0, remaining_len = lstr;
char *u = str;
while (u = strSubstr(u, subStr, remaining_len, lsubStr)) {
matches++;
u += lsubStr;
remaining_len = str + lstr - u;
}

This is untested (and almost certainly wrong) but it shows another way
to write such loops.

Note that I've removed the lu >= lsubStr test. In effect this is the
first thing that strSubstr tests so there is no obvious need to
repeat it here.

// alloc space and copy the start of all subStr in str
if (sp && (*sp = malloc(found * sizeof **sp))) {
for (ctr = 0, u = str, lu = lstr;
ctr < found;
ctr++, u += lsubStr, lu = ((str + lstr) - u))
sp[0][ctr] = strSubstr(u, subStr, lu, lsubStr);

I'd rewrite that rather packing all the work into the for loop
controls. Loops are clearer when you can see what is being done in
the loop.

}
return found;
}

int main()
{
char **p;
unsigned found, i;

printf("heee e\n");
found = findSubstr("heee", "e", strlen("heee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hee e\n");
found = findSubstr("hee", "e", strlen("hee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hhhh h\n");
found = findSubstr("hhhh", "h", strlen("hhhh"), strlen("h"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

}

A general point. Lots of people seem to pack tests into main. I
never do. I make main use its arguments so I have a general-purpose
test program. The tests I want to run often can them be put into
script but I can quickly test new cases simply by typing a command.

<snip>

fedora · Feb 14, 2010

Ben said:
Long is not the same as good! Choosing good names is very hard and to
a large extent it is not culturally neutral. For example, when
searching for a sub-string, I'd call the first location "anchor"
because I am used to the term "anchored search".

Good point

i know my longer names are terrible but i wanted to post the
code quickly. I think the code is readable with shorter names but giving
names with good meaning is very hard!

That's a good idea, but there is not reason to abandon the other
string functions. If you want an exercise, then you could try to
write a fast search and replace. That will, most likely, lead to you
looking for an alternative to using strstr rather than simply
re-writing it (your strSubstr is very similar to strstr) but you will
learn practical things along the way, such as how to find where you
code is spending time.

Note, there will never be a "fastest" version because what is fast
will depend on all sorts of variables such as the quality of your C
implementation and the kind of search and replace calls you do. For
example, my simple version is still the fastest I can write for very
long strings with only a few replacements because the strstr in glibc
is very good at searching long strings.

Yes. my aim was to write a prog for the Spinoza contest ongoing but am still
stuck with the same error.

You should use size_t for sizes like this -- it is the type returned
by strlen, but because it is an unsigned type you will need to think
about your method. You can't subtract the lengths like you do.

Can you explain in which case (for length of str and subStr) the subtraction
will be wrong? i considered both strings of same length but cant see any
error.

lsubStr has to be <= lstr. i just assume that condition from calling code!

// locate first char of subStr in str
for (firstIdx = 0; firstIdx <= lastIdx; firstIdx++) {
if (str[firstIdx] == subStr[0] &&
str[firstIdx + (lsubStr-1)] == subStr[lsubStr-1]) {

Click to expand...

I am not sure it is worth doing this second test.

I just thought if last position also matched then chance that it is the sub-
striong is much higher, so testing for both 1st & last positions would
decrease the no. of times the inner loop has to run...

// check if complete subStr occurs at this pos in str
for (subStrIdx = 1; subStrIdx < lsubStr &&
str[firstIdx + subStrIdx] == subStr[subStrIdx]; subStrIdx++)
;
if (subStrIdx == lsubStr) {
// subStr found, so return its start and break frm loop
substr = str + firstIdx;
firstIdx = lastIdx;

Click to expand...

There is a statement to do that: break. Altering the loop variable to
break out of the loop makes the code harder to change.

i read somewhere that break and continue are almost as bad as go to and
loops should terminate by their test expressions for readable elegant code.

You have picked up an odd style. Using lots of variables in a for
loop is not very clear. How about:

size_t matches = 0, remaining_len = lstr;
char *u = str;
while (u = strSubstr(u, subStr, remaining_len, lsubStr)) {
matches++;
u += lsubStr;
remaining_len = str + lstr - u;
}

This is untested (and almost certainly wrong) but it shows another way
to write such loops.

Ok! it's just so hard to be sure that all these lenghts and offsets will
always behave right for all cases of string and substrings! i thought
through my loops for cases of equal lengths etc, and i just cant spot where
mistake is...

Note that I've removed the lu >= lsubStr test. In effect this is the
first thing that strSubstr tests so there is no obvious need to
repeat it here.

hmm okay. any minute change and i'm not sure where all it'd affect the code.
pointers+strings is tricky!

// alloc space and copy the start of all subStr in str
if (sp && (*sp = malloc(found * sizeof **sp))) {
for (ctr = 0, u = str, lu = lstr;
ctr < found;
ctr++, u += lsubStr, lu = ((str + lstr) - u))
sp[0][ctr] = strSubstr(u, subStr, lu, lsubStr);

Click to expand...

I'd rewrite that rather packing all the work into the for loop
controls. Loops are clearer when you can see what is being done in
the loop.

}
return found;
}

int main()
{
char **p;
unsigned found, i;

printf("heee e\n");
found = findSubstr("heee", "e", strlen("heee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hee e\n");
found = findSubstr("hee", "e", strlen("hee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hhhh h\n");
found = findSubstr("hhhh", "h", strlen("hhhh"), strlen("h"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

}

Click to expand...

A general point. Lots of people seem to pack tests into main. I
never do. I make main use its arguments so I have a general-purpose
test program. The tests I want to run often can them be put into
script but I can quickly test new cases simply by typing a command.

<snip>

Yeah, i wrote a version where main will accept strings from user in a loop
and feed them to findSubstr and return the results and keep looping till
CTRL-C but that version is sometimes giving segfault and sometimes working
okay for same string pairs!!

right now i cant understand my own code. now i want to rewrite everything
but in very simple statements and loops so i can figure out where i'm going
wrong.

Ben Bacarisse · Feb 14, 2010

fedora said:
Yes. my aim was to write a prog for the Spinoza contest ongoing but am still
stuck with the same error.

Sorry, I missed you have an error you were stuck on. The problem is
that the first and second loops in findSubstr don't do the same
thing. The first correctly advances u by the match length *from that
last match*. The second loop advances u by only the match length
(from it's last value).

The fix (using your style):

if (sp && (*sp = malloc(found * sizeof **sp))) {
for (ctr = 0, u = str, lu = lstr;
ctr < found;
ctr++, lu = ((str + lstr) - u)) {
sp[0][ctr] = strSubstr(u, subStr, lu, lsubStr);
u = sp[0][ctr] + lsubStr;
}
}

This could be written much more clearly. I don't want to bang on
about the same thing all the time, but bugs are the compiler's way of
telling you that you code need to be clearer!

Can you explain in which case (for length of str and subStr) the subtraction
will be wrong? i considered both strings of same length but cant see any
error.

lsubStr has to be <= lstr. i just assume that condition from calling
code!

Then there is no error! I think, though, that this is a rather string
condition to place on the calling code. That the pointers are not
NULL seems a reasonable condition; even that the match string is not
zero length; but that you can't search for "ab" in "a" seems rather
too restrictive.

BTW, you can document these "caller contract" restrictions in the code
by including assert calls:

assert(str && subStr && lsubStr <= lstr && lsubStr > 0);

(#include said:
// locate first char of subStr in str
for (firstIdx = 0; firstIdx <= lastIdx; firstIdx++) {
if (str[firstIdx] == subStr[0] &&
str[firstIdx + (lsubStr-1)] == subStr[lsubStr-1]) {

Click to expand...

I am not sure it is worth doing this second test.

Click to expand...

I just thought if last position also matched then chance that it is the sub-
striong is much higher, so testing for both 1st & last positions would
decrease the no. of times the inner loop has to run...

.... at the expense of more code. I'd put this in only after testing
that, in general, it pays off.

// check if complete subStr occurs at this pos in str
for (subStrIdx = 1; subStrIdx < lsubStr &&
str[firstIdx + subStrIdx] == subStr[subStrIdx]; subStrIdx++)
;
if (subStrIdx == lsubStr) {
// subStr found, so return its start and break frm loop
substr = str + firstIdx;
firstIdx = lastIdx;

Click to expand...

There is a statement to do that: break. Altering the loop variable to
break out of the loop makes the code harder to change.

Click to expand...

i read somewhere that break and continue are almost as bad as go to and
loops should terminate by their test expressions for readable elegant code.

Yes, some people are of that opinion. I am not, but I doubt that even
people with that opinion would advocate terminating the loop by
setting the loop variable. It is likely that they'd re-write the loop
in some new way but maybe if there is anyone of that opinion here they
could chip in. I don't like speaking for views I don't hold!

Ok! it's just so hard to be sure that all these lenghts and offsets will
always behave right for all cases of string and substrings! i thought
through my loops for cases of equal lengths etc, and i just cant spot where
mistake is...

hmm okay. any minute change and i'm not sure where all it'd affect the code.
pointers+strings is tricky!

Ah, you need to free yourself from that fear. Experience helps, but
striving to write the clearest code you can makes it much simpler to
be sure of your code. Do you ever reason about your code? For
example, do you assert the negation of a loop at the end to see what
it really means for the code that follows? Over the years, I've found
more bugs doing this than by any other method.

// alloc space and copy the start of all subStr in str
if (sp && (*sp = malloc(found * sizeof **sp))) {
for (ctr = 0, u = str, lu = lstr;
ctr < found;
ctr++, u += lsubStr, lu = ((str + lstr) - u))
sp[0][ctr] = strSubstr(u, subStr, lu, lsubStr);

Click to expand...

I'd rewrite that rather packing all the work into the for loop
controls. Loops are clearer when you can see what is being done in
the loop.

}
return found;
}

int main()
{
char **p;
unsigned found, i;

printf("heee e\n");
found = findSubstr("heee", "e", strlen("heee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hee e\n");
found = findSubstr("hee", "e", strlen("hee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

printf("hhhh h\n");
found = findSubstr("hhhh", "h", strlen("hhhh"), strlen("h"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

}

Click to expand...

A general point. Lots of people seem to pack tests into main. I
never do. I make main use its arguments so I have a general-purpose
test program. The tests I want to run often can them be put into
script but I can quickly test new cases simply by typing a command.

<snip>

Click to expand...

Yeah, i wrote a version where main will accept strings from user in a loop
and feed them to findSubstr and return the results and keep looping till
CTRL-C but that version is sometimes giving segfault and sometimes working
okay for same string pairs!!

I would not use input, I'd use argc and argv. For example, here is
what I wrote to investigate your bug:

int main(int argc, char **argv)
{
if (argc == 3) {
char **p;
const char *s = argv[1], *m = argv[2];
size_t mlen = strlen(m);
unsigned i, found = findSubstr(s, m, strlen(s), mlen, &p);

printf("In \"%s\" find \"%s\"\n%u times:\n", s, m, found);
for (i = 0; i < found; i++) {
int off = p - s;
printf("\t%d, %.*s<%.*s>%s\n",
off, off, s, (int)mlen, p, s + off + mlen);
}
}
return 0;
}

right now i cant understand my own code. now i want to rewrite everything
but in very simple statements and loops so i can figure out where i'm going
wrong.

Click to expand...

It will come. BTW, kudos for your (void *) cast when printing with %p!

bartc · Feb 14, 2010

Reading all posts about Spinozas efforts to create string substitute
program, i wanted to code mine too and specs was that not to use the
<string.h> library. But already problem in finding all places where
substring occurs in a string. i'm looking for long time but not able to
see
where error is. Any help on where i made mistake is appreciated. TIA

#include<string.h>

You won't need this then...

char *strSubstr(char *s, char *t, unsigned ls, unsigned lt)

unsigned findSubstr(char *s, char *t, unsigned ls, unsigned lt, char
***sp)

Some comments about what each of these do wouldn't be amiss.

printf("heee e\n");
found = findSubstr("heee", "e", strlen("heee"), strlen("e"), &p);
printf("%u times\n", found);
for (i = 0; i < found; i++)
printf("\t%p, %c\n", (void*)p, p[0]);

....
You're repeating code here that's best in a loop or in a function.

I've put together some code that also counts substrings, and also avoids
string.h, although it allows them to overlap so the results may not be the
same (so that "AA" is 3 substrings of "AAAA", not 2):

#include <stdio.h>

/* return how many times t occurs in s */
int findsubstrings(char *s, char*t){
int count=0;
char *p,*q;

if (*s==0 || *t==0) return 0;

while (*s) {
p=s;
q=t;
while (*p && *q && *p++==*q)++q;
if (*q==0) ++count;
++s;
}
return count;
}

void test(char *s,char *t){
printf("\"%s\" occurs %d times in \"%s\"\n",t,findsubstrings(s,t),s);
}

int main(void){
test("sisisisisisis","sis");
}

Seebs · Feb 14, 2010

Not using string.h was the rule i thought.

No, that was just Nilges being contrary to the point of stupidity.

That said, it could be a good exercise. You've already had one of the
key insights: The thing replacing strstr() should be written as a function
which is called by other functions, so as not to overcomplicate a single
gigantic function.

-s

Seebs · Feb 14, 2010

Another insight is that if one is not using strstr then its
replacement should be more helpful that strstr is.

That's a point.

The trouble is that strstr(x, y); returns NULL when y is not in x. It
scans x and then tells you almost nothing. If I were re-doing this
I'd at least make my strstr replacement act like GNU's strchrnul
(there is no strstrnul in GNU's library). I.e. str_str_nul should
return a pointer to the end of its first argument string when the
search fails. At the least this would allow one to avoid re-scanning
just to find the length[1].

And you'd replace the if (ptr) with if (*ptr), which would be fine. Hmm,
I like that.

I'd argue that strtsr should have been defined this way from the
start, but such is the C library.

Hmm.

Here's my thought: The C library functions we currently have are defined
in a way that's simple and well-defined; "return a pointer to X", and if
you can't, don't return a valid pointer. I think that's easier to specify
or discuss than "return a pointer to X, or possibly a pointer to Y".

[1] At the expense of limiting the code to strings no longer than
PTRDIFF_MAX characters. I think it is quite fiddly to avoid this
restriction so I am not too bothered by that.

Yeah.

-s

Ben Bacarisse · Feb 14, 2010

Seebs said:
No, that was just Nilges being contrary to the point of stupidity.

That said, it could be a good exercise. You've already had one of the
key insights: The thing replacing strstr() should be written as a function
which is called by other functions, so as not to overcomplicate a single
gigantic function.

Another insight is that if one is not using strstr then its
replacement should be more helpful that strstr is.

The trouble is that strstr(x, y); returns NULL when y is not in x. It
scans x and then tells you almost nothing. If I were re-doing this
I'd at least make my strstr replacement act like GNU's strchrnul
(there is no strstrnul in GNU's library). I.e. str_str_nul should
return a pointer to the end of its first argument string when the
search fails. At the least this would allow one to avoid re-scanning
just to find the length[1].

Even when the search string /is/ present, the final call always scans
that tail with no valuable data being returned.

I'd argue that strtsr should have been defined this way from the
start, but such is the C library.

[1] At the expense of limiting the code to strings no longer than
PTRDIFF_MAX characters. I think it is quite fiddly to avoid this
restriction so I am not too bothered by that.

Chris M. Thomasson · Feb 14, 2010

Here is my humble little entry that took me around a half an hour or so to
create:

http://clc.pastebin.com/f62504e4c

If you want to avoid using `string.h' then you are going to have to implment
the following functions:
_________________________________________________
#define xstrstr strstr
#define xstrlen strlen
#define xstrcmp strcmp
#define xmemcpy memcpy
_________________________________________________

I personally don't see any need to do that unless you want to go through a
learning experience. Or perhaps if you just "know" that those functions are
very poorly implemented on your platform. Anyway, this code pre-computes all
of the substring matches and stores them in a linked-list. This gets around
having to scan the source string twice. It's fairly good at reducing the
number of list nodes by allowing a single node to hold multiple offsets into
the source string. So, in the code as-is, `malloc()/free()' is completely
avoided on list nodes _if_ there are less than or equal to 256 matches.

Any questions?

;^)

Stefan Ram · Feb 14, 2010

Chris M. Thomasson said:
Or perhaps if you just "know" that those functions are
very poorly implemented on your platform.

Often, one can indeed assume that »stdlib.h:rand()« is
implemented in such a manner that it is not very random
in the less significant bits - although this knowledge
has nothing to do with the language C but only with the
culture of C implementations.

Chris M. Thomasson · Feb 14, 2010

Stefan Ram said:
Often, one can indeed assume that »stdlib.h:rand()« is
implemented in such a manner that it is not very random
in the less significant bits - although this knowledge
has nothing to do with the language C but only with the
culture of C implementations.

good point. Humm... I would hope that `strstr()' does not commonly use a
naive algorithm to search for substrings.

spinoza1111 · Feb 15, 2010

No, that was just Nilges being contrary to the point of stupidity.

That said, it could be a good exercise. You've already had one of the
key insights: The thing replacing strstr() should be written as a function
which is called by other functions, so as not to overcomplicate a single
gigantic function.

Hee hee. Sure is taking you tomatoes a long time to "ketchup" with me.
Peter, shouldn't you be checking my code like I told you? If you could
find a bug in the latest version I posted (and only that version, dear
heart), it would be a real feather in your little cap. In fact, I'll
send you a check for 25 Hong Kong dollars.

spinoza1111 · Feb 15, 2010

Here is my humble little entry that took me around a half an hour or so to
create:

http://clc.pastebin.com/f62504e4c

If you want to avoid using `string.h' then you are going to have to implment
the following functions:
_________________________________________________
#define xstrstr strstr
#define xstrlen strlen
#define xstrcmp strcmp
#define xmemcpy memcpy
_________________________________________________

I personally don't see any need to do that unless you want to go through a
learning experience. Or perhaps if you just "know" that those functions are
very poorly implemented on your platform. Anyway, this code pre-computes all
of the substring matches and stores them in a linked-list. This gets around
having to scan the source string twice. It's fairly good at reducing the
number of list nodes by allowing a single node to hold multiple offsets into
the source string. So, in the code as-is, `malloc()/free()' is completely
avoided on list nodes _if_ there are less than or equal to 256 matches.

Any questions?

Yeah, Chris. I have a question. Why did you call it an "entry" when
this (to me, anyway) implied that it was a contest entry to the
Spinoza challenge? Please don't be corrupted by the dishonesty and
brutality of these newsgroups.

Sure, you do say that I have to implement FOUR (4) non-trivial library
functions.

But by saying "it took me a half hour" I read an implied, perhaps
unintended slight at the approximately six hours I took ... where only
in dysfunctional corporations is it a bad thing to take a little extra
care and a little extra time in anticipation of difficulty.

A week late, in this thread, Seebach, Bacarisse et al. seem to be
running into confusion trying to help the OP meet the original
challenge. But I note nobody harassing them or the original poster,
targeting them for abuse.

In a sense, "I have only myself to blame" for this, because a year or
so ago, I jumped all over Seebach for his attacks on Schildt. I feel I
was right to do so, *et je ne regrette rien*. Nonetheless, I'm tired
of his lies.

If your "slight" was unintended, I apologize.

Without knowing as much about C as the regs, esp. postmodern C and the
standards (I'll be the first to concede this), I've left them in the
dust as regards my challenge. I've completed my solution, although I
am refining it by adding a stress test and may post a proof elsethread
proving that the code matches the algorithm statement I've posted
elsethread, and in so doing I may find a bug.

This is because "knowing C" is different from "knowing how to program"
and given the serious design flaws of C, there could be "knowing too
much about C".

But as far as I can see, no-ones mastered the problem to the same
extent, Seebach perhaps least of all, because he wasted too much time
last week attacking me. He may redeem himself by helping the OP of
this thread, but he's never even tried to write his own solution.

spinoza1111 · Feb 15, 2010

good point. Humm... I would hope that `strstr()' does not commonly use a
naive algorithm to search for substrings.

I'd say that it better. It cannot use Boyer Moore or Knuth Morris
Pratt IF they use tables, and I believe they do, since that implies
(as far as I can tell) state (like malloc) or else an extra parameter
in the call.

cf. Donald Hennessy's books from Morgan Kaufman on computer
architecture: I'm willing to bet that strstr uses a straightforward
algorithm, that runs best on RISC, and RISC-influenced chips
(including modern Intel chips, influenced as they are), without
microcoded or hardwired special-purpose scan instructions...just
optimized character operations.

But take this cum grano salis. Although I took Computer Architecture
in grad school (me got an A) and worked on an architecture team in
Silicon Valley, I'm an English teacher today, and may not be *au
courant*.

And note that "using strstr" has its own dangers. IT FINDS OVERLAPPING
STRINGS. If you use it to construct a table of replace points you're
gonna have an interesting bug-o-rama:

replace("banana", "ana", "ono")

IF you restart one position after the find point, and not at its end.

Moral: don't let the library do your thinking for you.

Tom St Denis · Feb 15, 2010

To find the length of a string, use strlen unless you have a compelling
reason not to. There is no evidence above of a compelling reason not to
use strlen.

In production code perhaps, if a student is trying to learn comp.sci
through expressing ideas in C they are actually BETTER served by
writing their own versions of algorithms we take for granted.

Tom

spinoza1111 · Feb 15, 2010

Yeah, Chris. I have a question. Why did you call it an "entry" when
this (to me, anyway) implied that it was a contest entry to the
Spinoza challenge? Please don't be corrupted by the dishonesty and
brutality of these newsgroups.

Sure, you do say that I have to implement FOUR (4) non-trivial library
functions.

But by saying "it took me a half hour" I read an implied, perhaps
unintended slight at the approximately six hours I took ... where only
in dysfunctional corporations is it a bad thing to take a little extra
care and a little extra time in anticipation of difficulty.

A week late, in this thread, Seebach, Bacarisse et al. seem to be
running into confusion trying to help the OP meet the original
challenge. But I note nobody harassing them or the original poster,
targeting them for abuse.

In a sense, "I have only myself to blame" for this, because a year or
so ago, I jumped all over Seebach for his attacks on Schildt. I feel I
was right to do so, *et je ne regrette rien*. Nonetheless, I'm tired
of his lies.

If your "slight" was unintended, I apologize.

Without knowing as much about C as the regs, esp. postmodern C and the
standards (I'll be the first to concede this), I've left them in the
dust as regards my challenge. I've completed my solution, although I
am refining it by adding a stress test and may post a proof elsethread
proving that the code matches the algorithm statement I've posted
elsethread, and in so doing I may find a bug.

This is because "knowing C" is different from "knowing how to program"
and given the serious design flaws of C, there could be "knowing too
much about C".

But as far as I can see, no-ones mastered the problem to the same
extent, Seebach perhaps least of all, because he wasted too much time
last week attacking me. He may redeem himself by helping the OP of
this thread, but he's never even tried to write his own solution.

Update: Willem posted an exciting, if apparently buggy, solution in
the thread where Peter mis-spelled "efficiency" in the name, and I
just had a chance to look at it. It uses recursion in place of any
data structure whatsoever. And in the same thread another poster seems
to claim he did a working (if for me hard to read) solution on day
one: I have asked him to use my test suite.

And yes. A solution WITH BUGS can be more intelligent than a clean
one. A solution that TAKES A LONG TIME can likewise be better than one
that doesn't. I was also prepared to concede victory to Willem despite
his one character identifiers because of the beauty of his idea: use
recursion and not a data structure.

Corporate thinking is applied Positivism and reductionism. Instead of
the intolerable to many effort of thinking, everything becomes a
reductionistic saw or maxim applied without feeling in the manner of
the performance review: so and so uses one character identifiers, or
likes to take more time, or is "verbose", or doesn't have comments, or
has too many comments, so therefore he's "reduced" to the incompetent
cipher, the infinitesimal that everyone in capitalism feels himself to
be, and fears himself to be.

Whereas I was, while setting up a test of Willem's code, rooting for
him. I was willing to hand the gold medal over to him, one character
identifiers and all, because of the beauty of his idea.

"The Good" in capitalist society is always reduced in the Positivist
spirit to something else because of the cash nexus and its alienation
of us from our selves and each other, which leads people around in a
(recursive) ring, chasing goals that in all cases are subgoals of
another goal...on the model of the toxic derivative which is found to
point back to itself.

Whereas if we could just say that The Good is a simple, recognizable
thing, whether a piece of code or a symphony...if we could just trust
our common humanity...imagine there's no Heaven.

But this I know. There are people in this newsgroup driven mad by
reductionism, who have been told that if they follow an external,
alienated code, whether a bunch of cargo-cult programming maxims or
"Jesus", they will be "saved", and it is these people who are starting
the fights when they are confronted with their own alienation.

spinoza1111 · Feb 15, 2010

In production code perhaps, if a student is trying to learn comp.sci
through expressing ideas in C they are actually BETTER served by
writing their own versions of algorithms we take for granted.

I think you're talking to a "wall, wondrous high, and covered with
serpent shapes" (the Wanderer): I don't think Heathfield wants people
to learn, at least as free, critical human beings. I think he wants
them to listen, and repeat rote maxims. If he's a teacher, as he seems
to want to be, he's the gym coach in the History Boys:

"Jesus didn't ask to be excused!"

"Actually, sir, he did."

Fibonacci	0	May 13, 2023
Blue J Ciphertext Program	2	Nov 22, 2023
RSA implementation issues in public key pem loader function	0	May 21, 2025
Adding adressing of IPv6 to program	1	Feb 16, 2023
Secure Keyboard v2.0 Modern C++ Virtual Keyboard for Windows (Glassmorphism UI, Clipboard Auto-Clear)	0	Mar 26, 2026
Need help finding Segmentation fault C++	0	Apr 16, 2022
Need Helping adding Square root code to an existing calculator. (Absolute begginer?)	0	Jan 12, 2025
Python code problem	2	Apr 23, 2023

substring finding problem!

fedora

fedora

fedora

spinoza1111

Ben Bacarisse

fedora

Ben Bacarisse

bartc

Seebs

Seebs

Ben Bacarisse

Chris M. Thomasson

Stefan Ram

Chris M. Thomasson

spinoza1111

spinoza1111

spinoza1111

Tom St Denis

spinoza1111

spinoza1111

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads