An Example for Discussion

S

Stan Milam

I've been away from C programming since 1999 and I've recently taken it
up again. Below is a string function I wrote this morning that will be
going into my home-grown library of string functions. I brought the
idea of it from a function of the same name in PL/SQL. I thought it
would be handy in my C library. Mine differs in that the source string
argument is modified and returned from the function.

Working with C strings has always been problematic because you can crash
a program faster than you can say "Jack's you're uncle!" if you are not
careful. Some of my string functions I take the strcpy() approach and
copy in the modified string into a destination pointer that you can only
hope is pointing to something big enough and in your program's data
space. With others I modify the source string in place and return a
pointer to it. With this function I felt it safe to modify the string
in place because at worst one character would be changed to another and
at best the character would be removed from the source string, so there
is no chance of overflowing the buffer. Of course, someone will try to
use a string constant for the sorce string (i.e. "This is a string").
I thought I would post the code and see what the experts had to say. For
reference "strtools.h" is my header file for my homegrown string
functions, and chrsubst() is used to substitute characters in a string,
and str_rmchars() removes a set of characters from a string.

/**FUNCTION************************************************************/
/* Name: */
/* translate(). */
/* */
/* Synopsis: */
/* #include <strings.h> */
/* #include "strtools.h" */
/* char *translate(char *src, char *fromstr, char *tostr); */
/* */
/* Description: */
/* The translate function will translate characters that appear */
/* in both the src and fromstr strings. If the fromstr character */
/* has a corresponding value in the tostr string all characters */
/* in the src string will be translated to that of the */
/* corresponding tostr character. If there is no corresponding */
/* character in the tostr string the character will be removed */
/* from the source string. */
/* */
/* Arguments: */
/* char *source - The string to be examined and translated. */
/* char *fromstr- The characters to translate in the source. */
/* char *tostr - The corresponding characters the source */
/* characters are translated to. */
/* */
/* Return Value: */
/* The starting address of the source string. */
/* */
/************************************************************FUNCTION**/

char *
translate ( char *p_source, char *p_from_string, char *p_to_string )
{
unsigned x_sub = 0, l_len = strlen( p_to_string );

while ( *p_from_string ) {

if ( strchr( p_source, *p_from_string ) != NULL ) {
if ( x_sub < l_len )
chrsubst( p_source, *p_from_string, p_to_string[x_sub] );
else {
char l_ch[] = "";
l_ch[0] = *p_from_string;
str_rmchars( p_source, l_ch );
}
}

x_sub++;
p_from_string++;

}

return p_source;
}
 
J

Jens.Toerring

Stan Milam said:
I've been away from C programming since 1999 and I've recently taken it
up again. Below is a string function I wrote this morning that will be
going into my home-grown library of string functions. I brought the
idea of it from a function of the same name in PL/SQL. I thought it
would be handy in my C library. Mine differs in that the source string
argument is modified and returned from the function.
Working with C strings has always been problematic because you can crash
a program faster than you can say "Jack's you're uncle!" if you are not
careful. Some of my string functions I take the strcpy() approach and
copy in the modified string into a destination pointer that you can only
hope is pointing to something big enough and in your program's data
space. With others I modify the source string in place and return a
pointer to it. With this function I felt it safe to modify the string
in place because at worst one character would be changed to another and
at best the character would be removed from the source string, so there
is no chance of overflowing the buffer. Of course, someone will try to
use a string constant for the sorce string (i.e. "This is a string").
I thought I would post the code and see what the experts had to say. For
reference "strtools.h" is my header file for my homegrown string
functions, and chrsubst() is used to substitute characters in a string,
and str_rmchars() removes a set of characters from a string.
/**FUNCTION************************************************************/
/* Name: */
/* translate(). */
/* */
/* Synopsis: */
/* #include <strings.h> */
/* #include "strtools.h" */
/* char *translate(char *src, char *fromstr, char *tostr); */
/* */
/* Description: */
/* The translate function will translate characters that appear */
/* in both the src and fromstr strings. If the fromstr character */
/* has a corresponding value in the tostr string all characters */
/* in the src string will be translated to that of the */
/* corresponding tostr character. If there is no corresponding */
/* character in the tostr string the character will be removed */
/* from the source string. */
/* */
/* Arguments: */
/* char *source - The string to be examined and translated. */
/* char *fromstr- The characters to translate in the source. */
/* char *tostr - The corresponding characters the source */
/* characters are translated to. */
/* */
/* Return Value: */
/* The starting address of the source string. */
/* */
/************************************************************FUNCTION**/
char *
translate ( char *p_source, char *p_from_string, char *p_to_string )
{
unsigned x_sub = 0, l_len = strlen( p_to_string );

Since strlen() returns a size_t type I would think that it's
prudent to keep that type for both 'x_sub' and 'l_len'.
while ( *p_from_string ) {
if ( strchr( p_source, *p_from_string ) != NULL ) {
if ( x_sub < l_len )
chrsubst( p_source, *p_from_string, p_to_string[x_sub] );
else {
char l_ch[] = "";

Here you create an array with 1 element.
l_ch[0] = *p_from_string;

Now 'l_ch' isn't a string anymore (the final '\0' is missing) but
just an array (with on char)
str_rmchars( p_source, l_ch );

Since you pass an array as the second argument but you can't determine
in the called function how long it is (you neither pass the length nor
has what you pass to it a final '\0') this smells extremely fishy, at
least if I assume correctly that you wouldn't pass it an array of chars
when the function only operates on a single char.
}



return p_source;
}

A few remarks concerning efficiency:

a) You can stop with your main loop immediately when 'x_sub' is equal
to l_len' - all left in this case is calling str_rmchar with what
'p_from_string' then points to (you will have to test if the chars
are in 'p_source' within the function anyway in order to find them,
so it doesn't make too much sense to do that also in the callin
function).

b) Since you also need to find in chrsubstr() where the function to be
replaced is located you automatically test if it's there at all, so
also for this you don't need the strstr() call in the caller. Actu-
ally, what chrsubst() might be that simple that you can avoid the
fucntion call overhead.

So I guess you can reduce the whole function to (untested!)

char *
translate( char *p_source, char *p_from_string, char *p_to_string )
{
size_t x_sub = 0,
from_len = strlen( p_from_string ),
to_len = strlen( p_to_string );

if ( from_len > to_len )
{
str_rmchars( p_source, p_from_string + to_len );
from_len = to_len;
}

for ( ; x_sub < from_len; x_sub++, p_from_string++ )
{
char *where = p_source;
while ( ( where = strchr( where, *p_from_string ) ) != NULL )
*where++ = p_to_string[ x_sub ];
}

return p_source;
}

again assuming that str_rmchar() removes all chars that are passed via
the second argument.
Regards, Jens
 
A

Arthur J. O'Dwyer

I've been away from C programming since 1999 and I've recently taken it up
again. Below is a string function I wrote this morning that will be going
into my home-grown library of string functions. I brought the idea of it
from a function of the same name in PL/SQL.

This being a C group, and not an Oracle group, one shouldn't expect the
readers to know the PL/SQL library. I went and tracked it down. It turns
out that the 'TRANSLATE' function in PL/SQL does basically what the 'tr'
command in *nix does by default: given a list of "from" characters and a
matching list of "to" characters, replace each "from" with "to" in
the given string.
Working with C strings has always been problematic because you can crash a
program faster than you can say "Jack's you're uncle!" if you are not
careful.

True enough.
I thought I would post the code and see what the experts had to say. For
reference "strtools.h" is my header file for my homegrown string functions,
and chrsubst() is used to substitute characters in a string, and
str_rmchars() removes a set of characters from a string.

That's not very descriptive. Moreover, I think your aggressive
"modularization" ends up hurting both clarity and performance --- and
not just because of the weird function names!

(All criticism above and below is intended in the best of spirits, of
course. I hope it's useful to you. Now, on with the dissection!)

/**FUNCTION************************************************************/
/* Name: */
/* translate(). */

Seriously, if the reader can't tell this much from the code, do you
/really/ want him reading it? This entire comment block is much too
verbose; it's almost longer than the function, and it /will/ be longer
than the function once I get through with it!
/* */
/* Synopsis: */
/* #include <strings.h> */

You never explained what the above header is. Is it a typo, perhaps?
And why should the reader care what dependencies this function has? That's
the implementor's job --- to make sure that the source file that defines
/* #include "strtools.h" */
/* char *translate(char *src, char *fromstr, char *tostr); */

This is doubly weird: first, because again the reader ought to be able
to read a prototype himself without seeing it twice; and second, because
the prototype above uses completely different variable names from the
one you actually use in the source code. They should match; there's no
reason they shouldn't.
/* */
/* Description: */
/* The translate function will translate characters that appear */
/* in both the src and fromstr strings. If the fromstr character */
/* has a corresponding value in the tostr string all characters */
/* in the src string will be translated to that of the */
/* corresponding tostr character. If there is no corresponding */
/* character in the tostr string the character will be removed */
/* from the source string. */

This is both dense and vague. I didn't understand what it was trying
to say until I read the appropriate page in the SQL/PL manual. Surely
there's a better way to describe the procedure! (In fact, part of your
problem is that the reader can't immediately tell from the source code
what's going on. Once we clarify the code, the algorithm will be
obvious, and thus will need less prose explanation.)
/* Arguments: */
/* char *source - The string to be examined and translated. */
/* char *fromstr- The characters to translate in the source. */
/* char *tostr - The corresponding characters the source */
/* characters are translated to. */
/* */
/* Return Value: */
/* The starting address of the source string. */
/* */
/************************************************************FUNCTION**/

char *
translate ( char *p_source, char *p_from_string, char *p_to_string )

Hungarian notation. And done wrong. Since when does "p" stand for
"string"? None of those three parameters is a "pointer" to any data
type in the abstract sense; they're all strings. If you were going to
use any Hungarian prefix, you'd be using "s": 's_source', 's_from',
and so on. ('s_from_string' would be a stupid name, as I'm sure you can
see.)
Drop the obfuscatory notation.
{
unsigned x_sub = 0, l_len = strlen( p_to_string );

while ( *p_from_string ) {

(I disagree with your whitespace style, too, but that's a personal
religious issue. I mention it only because I'll be using my own style
later on, so if you want to see my suggested style, you should scroll
down now.)
if ( strchr( p_source, *p_from_string ) != NULL ) {
if ( x_sub < l_len )
chrsubst( p_source, *p_from_string, p_to_string[x_sub] );

I have no idea what this is doing, but I bet it scans through the
entire string 'p_source', replacing each instance of '*p_from_string'
with 'p_to_string[x_sub]'. And you do this 'l_len' times. That's a
lot of looping for little reason. And I've already mentioned the poor
naming scheme: 'chrsubst'!?
else {
char l_ch[] = "";

This is equivalent to
char l_ch[1] = {0};
The following two lines lead me to believe that perhaps this is another
typo, that you meant 'char l_ch[2]' instead. Otherwise, why bother making
a copy of a single character?
l_ch[0] = *p_from_string;
str_rmchars( p_source, l_ch );
}
}

x_sub++;
p_from_string++;

Any time you see a loop, in C, which ends with an increment or a
pointer-chasing assignment, that should send up warning flags that
something hasn't been thought out properly. Here, it indicates that
you're missing a 'for' loop. See below.
}

return p_source;
}


So that was the code. Now here's my overall analysis.
You end a 'while' loop with an increment. That's the sign of a 'for'
loop struggling to get out. So we rewrite the loop:

[...]
unsigned x_sub;

for (x_sub=0; *p_from_string; ++x_sub, ++p_from_string) {
if ( strchr( p_source, *p_from_string ) != NULL ) { [...]
}
}
[...]

The next thing I noticed was that now we have a 'for' loop with two
increments in the increment clause. That's a sign that we have dependent
loop variables that need refactoring --- and indeed, we are really using
'x_sub' as a loop counter, and merely incrementing 'p_from_string' to keep
it pointing at the 'x_sub'th element of that array. So we merge the two
dependent variables; that yields
char *translate(char *p_source, char *p_from_string, char *p_to_string)
{
unsigned l_len = strlen( p_to_string );
unsigned x_sub;

for (x_sub=0; p_from_string[x_sub]; ++x_sub) {
if (strchr(p_source, p_from_string[x_sub]) != NULL) {
if ( x_sub < l_len )
chrsubst(p_source, p_from_string[x_sub], p_to_string[x_sub]);
else {
char l_ch[1];
l_ch[0] = p_from_string[x_sub];
str_rmchars(p_source, l_ch);
}
}
}

return p_source;
}

Suddenly the parallel structure in the call to 'chrsubst' stands out!
We have de-obfuscated a critical part of the algorithm: We now clearly
see that characters from 'p_from_string' are matched only with the
corresponding characters from 'p_to_string'.

Next, I prefer to remove one level of indentation by changing that
first gigantic 'if' to a conditional 'continue'; at the same time, to
conserve even more space, I'll switch to a saner naming convention.
The two input-only parameters are properly const-qualified. This yields
what I consider the "clean" version of the procedure you wrote:

/*
Translates characters that appear in the |from| string to their
corresponding characters in the |to| string. If |to| is shorter
than |from|, the unmatched |from| characters are deleted from |src|.
Overwrites |src| with the output string, and returns the new |src|.
*/
char *translate(char *src, const char *from, const char *to)
{
size_t len_to = strlen(to);
size_t i;

for (i=0; from != '\0'; ++i)
{
if (strchr(src, from) == NULL)
continue;
if (i < len_to) {
chrsubst(src, from, to);
}
else {
char ch[1]; /* Should probably be char ch[2] = {0}; */
ch[0] = from;
str_rmchars(src, ch);
}
}

return src;
}

But we're not done! Remember, I wanted to get rid of those weird
library-routine calls for two reasons: first, for clarity, and second,
for efficiency. There's no reason to be scanning so many times through
the input string! What's more, we can really elucidate the algorithm
much better with "simple" code like this:

char *translate(char *src, const char *from, const char *to)
{
size_t in, out;
size_t len_to = strlen(to);
for (in=out=0; src[in] != '\0'; ++in)
{
/* Is the current character in the |from| array? */
const char *t = strchr(from, src[in]);
/* No; just echo it to the output string */
if (t == NULL)
src[out++] = src[in];
/* Yes; if it has a match in |to|, then echo that entry. */
else if (t - from < len_to)
src[out++] = to[t - from];
}

src[out] = '\0';
return src;
}


A version of this routine using the 'strpbrk' library function
might be interesting to see, also, though I doubt it would be any
cleaner or quicker.

Anyway, I hope you get the idea --- "clean and simple" beats pretty
much any other strategy, any day. I highly recommend "The Elements of
Programming Style" (Kernighan & Plauger) to anyone who wants to learn
to write (or critique!) source code.

HTH,
-Arthur
 
S

Stan Milam

Arthur said:
On Fri, 4 Mar 2005, Stan Milam wrote:

This being a C group, and not an Oracle group, one shouldn't expect the
readers to know the PL/SQL library. I went and tracked it down. It turns
out that the 'TRANSLATE' function in PL/SQL does basically what the 'tr'
command in *nix does by default: given a list of "from" characters and a
matching list of "to" characters, replace each "from" with "to" in
the given string.


I have used the 'tr' command very little in my UNIX career, but I found
the PL/SQL translate function very handy. I thought it would be fun to
implement one in C now that I am writing with it again (along with
PL/SQL). Pretty much all the more-or-less standard C functions were
conceived to solve some problem. And a lot of them have funny names
too. But who could beat IEBR14 (I think that is what it was called) and
IEBGENR from the IBM mainframe world?
That's not very descriptive. Moreover, I think your aggressive
"modularization" ends up hurting both clarity and performance --- and
not just because of the weird function names!

Well, I thought it much better than what is many times provided by other
posters, which is nothing.
(All criticism above and below is intended in the best of spirits, of
course. I hope it's useful to you. Now, on with the dissection!)
OK, fair enough.
Seriously, if the reader can't tell this much from the code, do you
/really/ want him reading it? This entire comment block is much too
verbose; it's almost longer than the function, and it /will/ be longer
than the function once I get through with it!

There is a method to my madness. I write the documentation when I write
the function. This is pretty much the documentation. I use a shell
script to extract these comment blocks an reformat them for printing.
You never explained what the above header is. Is it a typo, perhaps?
And why should the reader care what dependencies this function has? That's
the implementor's job --- to make sure that the source file that defines
this function #includes <strings.h> (or <string.h>) at the top. There's
no reason to repeat any of this stuff here.



This is doubly weird: first, because again the reader ought to be able
to read a prototype himself without seeing it twice; and second, because
the prototype above uses completely different variable names from the
one you actually use in the source code. They should match; there's no
reason they shouldn't.



No this is what is in the user docs. They do not need to know what I
called them internally.




This is both dense and vague. I didn't understand what it was trying
to say until I read the appropriate page in the SQL/PL manual. Surely
there's a better way to describe the procedure! (In fact, part of your
problem is that the reader can't immediately tell from the source code
what's going on. Once we clarify the code, the algorithm will be
obvious, and thus will need less prose explanation.)

Thanks. I'll rewrite it. I've been reading too much IBM documentation!
Hungarian notation. And done wrong. Since when does "p" stand for
"string"? None of those three parameters is a "pointer" to any data
type in the abstract sense; they're all strings. If you were going to
use any Hungarian prefix, you'd be using "s": 's_source', 's_from',
and so on. ('s_from_string' would be a stupid name, as I'm sure you can
see.)
Drop the obfuscatory notation.

Ah, this is a holdover from my PL/SQL days. PL/SQL does not have
pointers so that did not even occur to me. No, it is not Hungarian
notation, it is Milam notation. I've tried Hungarian and I have come to
truly despise it. The p_ prefix simply means it is a PARAMETER
variable, and the l_ prefix means it is a LOCAL variable. I use the
x_sub, y_sub etc... name for subscripts. Just a glance and I know what
I am dealing with.
(I disagree with your whitespace style, too, but that's a personal
religious issue. I mention it only because I'll be using my own style
later on, so if you want to see my suggested style, you should scroll
down now.)

I like mine. It has been a long time evolving. But thanks for the offer.
if ( strchr( p_source, *p_from_string ) != NULL ) {
if ( x_sub < l_len )
chrsubst( p_source, *p_from_string, p_to_string[x_sub] );


I have no idea what this is doing, but I bet it scans through the
entire string 'p_source', replacing each instance of '*p_from_string'
with 'p_to_string[x_sub]'. And you do this 'l_len' times. That's a
lot of looping for little reason. And I've already mentioned the poor
naming scheme: 'chrsubst'!?

When I wrote that function I was working with a compiler that only
allowed a length of 8 on external function names. So, that's been
awhile and I never changed the name. There are a lot of hold-over names
in software where the operating environment dictated constraints. I'll
create an alias name for chrsubst and mark it as deprecated.
else {
char l_ch[] = "";


This is equivalent to
char l_ch[1] = {0};
The following two lines lead me to believe that perhaps this is another
typo, that you meant 'char l_ch[2]' instead. Otherwise, why bother making
a copy of a single character?

Hey, you are right! Thanks! (I slap myself on the head and say "What
were you thinking?")
l_ch[0] = *p_from_string;
str_rmchars( p_source, l_ch );
}
}

x_sub++;
p_from_string++;


Any time you see a loop, in C, which ends with an increment or a
pointer-chasing assignment, that should send up warning flags that
something hasn't been thought out properly. Here, it indicates that
you're missing a 'for' loop. See below.

Why should there be warning flags? Has the while keyword fallen into
disfavor? I see no advantage of one over the other. It is a matter of
preference.
}

return p_source;
}



So that was the code. Now here's my overall analysis.
You end a 'while' loop with an increment. That's the sign of a 'for'
loop struggling to get out. So we rewrite the loop:

[...]
unsigned x_sub;

for (x_sub=0; *p_from_string; ++x_sub, ++p_from_string) {
if ( strchr( p_source, *p_from_string ) != NULL ) {

[...]

See my comments above. This is like saying we should all write at the
same level, use the style. C is a language and it allows enough
latitude to express ourselves - however we feel on a given day.

[...]

The next thing I noticed was that now we have a 'for' loop with two
increments in the increment clause. That's a sign that we have dependent
loop variables that need refactoring --- and indeed, we are really using
'x_sub' as a loop counter, and merely incrementing 'p_from_string' to
keep it pointing at the 'x_sub'th element of that array. So we merge the
two
dependent variables; that yields
char *translate(char *p_source, char *p_from_string, char *p_to_string)
{
unsigned l_len = strlen( p_to_string );
unsigned x_sub;

for (x_sub=0; p_from_string[x_sub]; ++x_sub) {
if (strchr(p_source, p_from_string[x_sub]) != NULL) {
if ( x_sub < l_len )
chrsubst(p_source, p_from_string[x_sub],
p_to_string[x_sub]);
else {
char l_ch[1];
l_ch[0] = p_from_string[x_sub];
str_rmchars(p_source, l_ch);
}
}
}

return p_source;
}

Yes that is an improvement.
Suddenly the parallel structure in the call to 'chrsubst' stands out!
We have de-obfuscated a critical part of the algorithm: We now clearly
see that characters from 'p_from_string' are matched only with the
corresponding characters from 'p_to_string'.

Next, I prefer to remove one level of indentation by changing that
first gigantic 'if' to a conditional 'continue'; at the same time, to
conserve even more space, I'll switch to a saner naming convention.
The two input-only parameters are properly const-qualified. This yields
what I consider the "clean" version of the procedure you wrote:

I'll go for that.
/*
Translates characters that appear in the |from| string to their
corresponding characters in the |to| string. If |to| is shorter
than |from|, the unmatched |from| characters are deleted from |src|.
Overwrites |src| with the output string, and returns the new |src|.
*/
char *translate(char *src, const char *from, const char *to)
{
size_t len_to = strlen(to);
size_t i;

for (i=0; from != '\0'; ++i)
{
if (strchr(src, from) == NULL)
continue;
if (i < len_to) {
chrsubst(src, from, to);
}
else {
char ch[1]; /* Should probably be char ch[2] = {0}; */
ch[0] = from;
str_rmchars(src, ch);
}
}

return src;
}

But we're not done! Remember, I wanted to get rid of those weird
library-routine calls for two reasons: first, for clarity, and second,
for efficiency. There's no reason to be scanning so many times through
the input string! What's more, we can really elucidate the algorithm
much better with "simple" code like this:

char *translate(char *src, const char *from, const char *to)
{
size_t in, out;
size_t len_to = strlen(to);
for (in=out=0; src[in] != '\0'; ++in)
{
/* Is the current character in the |from| array? */
const char *t = strchr(from, src[in]);
/* No; just echo it to the output string */
if (t == NULL)
src[out++] = src[in];
/* Yes; if it has a match in |to|, then echo that entry. */
else if (t - from < len_to)
src[out++] = to[t - from];
}

src[out] = '\0';
return src;
}


A version of this routine using the 'strpbrk' library function
might be interesting to see, also, though I doubt it would be any
cleaner or quicker.

Anyway, I hope you get the idea --- "clean and simple" beats pretty
much any other strategy, any day. I highly recommend "The Elements of
Programming Style" (Kernighan & Plauger) to anyone who wants to learn
to write (or critique!) source code.

HTH,
-Arthur


Thanks. I have been sitting on the bench for a couple of years. I have
another life besides being a programmer and I decided to live it for a
while. The problem has been my skills are a bit rusty and I've got to
get back "In The Zone."

Regards,
Stan Milam.
 
S

Stan Milam

Arthur J. O'Dwyer wrote:

But we're not done! Remember, I wanted to get rid of those weird
library-routine calls for two reasons: first, for clarity, and second,
for efficiency. There's no reason to be scanning so many times through
the input string! What's more, we can really elucidate the algorithm
much better with "simple" code like this:

char *translate(char *src, const char *from, const char *to)
{
size_t in, out;
size_t len_to = strlen(to);
for (in=out=0; src[in] != '\0'; ++in)
{
/* Is the current character in the |from| array? */
const char *t = strchr(from, src[in]);
/* No; just echo it to the output string */
if (t == NULL)
src[out++] = src[in];
/* Yes; if it has a match in |to|, then echo that entry. */
else if (t - from < len_to)
src[out++] = to[t - from];
}

src[out] = '\0';
return src;
}

You're right, we are not done. Your version does not do the same thing
as mine. Here are the differences:

1. In mine if the char is in the "to" and "from" strings, but not the
"source" string then leave the source character alone. So far we agree.

2. In mine when the character is in the "source" and "from" strings
then *all instances* of the character are replaced with the
corresponding character from the "to" string. Hence the call to
chrsubst() in my code. Yours does too but it is not readily obvious what
you are doing. With mine I took the what I call the straight-ahead
approach and once this condition is met I change them all. But I like
yours very much as you only change the characters when you have to with
no function call.

3. In mine when the character is found in both the "source" and "from"
strings but there is no corresponding character in the "to" string *all
instances* of the character will be squeezed out of the "source" string.
This is what str_rmchars() does. A casual glance at yours tells me
yours does not squeeze out the characters at all.
A version of this routine using the 'strpbrk' library function
might be interesting to see, also, though I doubt it would be any
cleaner or quicker.

Anyway, I hope you get the idea --- "clean and simple" beats pretty
much any other strategy, any day. I highly recommend "The Elements of
Programming Style" (Kernighan & Plauger) to anyone who wants to learn
to write (or critique!) source code.

HTH,
-Arthur

I really did not this thread to be about functionality or writing style,
but rather the legitimacy of modifying the source string. Also, one of
my programming tenants is to not re invent the wheel, hence the calls to
my existing routines. I have heard the argument about the overhead of
setting up and calling a function before. My answer has always been "so
what?, Programmer time is much more expensive than computer time." As an
example many years ago I was asked by my employer to review C code for a
rather large program. I was horrified to see the contractors had
written in-line, data specific quick sorts and binary search routines.
When asked why they did not use qsort() and bsearch() their reply was
they did not want to incur the overhead of calling a library routine.
They also wrote their own versions of standard library calls like
strcpy() because they thought they could do it better than the vendor
when the vendor docs clearly stated that when optimization was turned on
the string functions would be turned into in-line assembly code. I was
livid. These guys had spent many hours at significant dollar rates to
write, test, and document code that did not need to be written.

Arthur, you have helped tremendously. Thanks.

Regards,
Stan Milam.
 
C

Chris Torek

Arthur J. O'Dwyer wrote:

[snipped to just the code]
char *translate(char *src, const char *from, const char *to)
{
size_t in, out;
size_t len_to = strlen(to);
for (in=out=0; src[in] != '\0'; ++in)
{
/* Is the current character in the |from| array? */
const char *t = strchr(from, src[in]);
/* No; just echo it to the output string */
if (t == NULL)
src[out++] = src[in];
/* Yes; if it has a match in |to|, then echo that entry. */
else if (t - from < len_to)
src[out++] = to[t - from];
}

src[out] = '\0';
return src;
}

You're right, we are not done. Your version does not do the same thing
as mine. Here are the differences:

1. In mine if the char is in the "to" and "from" strings, but not the
"source" string then leave the source character alone. So far we agree.

I am not sure why you list this as a "difference"... :)
2. In mine when the character is in the "source" and "from" strings
then *all instances* of the character are replaced with the
corresponding character from the "to" string.

This happens in Arthur's as well. He loops over all characters in
the string -- the array whose first element "src" points to -- and
applies the "replace or delete character" algorithm to each character.
... Yours does too

So this is not a difference either. (Or is it? See below.)
but it is not readily obvious what you are doing.

It seems quite clear to me (thought I would probably have used
somewhat different variable names).
With mine I took the what I call the straight-ahead
approach and once this condition is met I change them all.

This has a potential problem. Suppose we replace "a" with "b", and
"b" with "c", i.e.:

char buf[] = "abcdedcba";

(void) translate(buf, "ab", "bc");

If you first replace *all* the 'a's with 'b's, we change buf[]'s
contents from "abcdedcba" to "bbcdedcbb". Next, we replace *all*
the 'b's with 'c's, producing "cccdedccc".

I suspect that anyone making that call would *want* "bccdedccb"
in the buf[] array. To obtain that as the result, the translation
must not be "re-applied" to translated characters. To me, the
easiest way to guarantee this is to make a clear distinction
between "as-yet-untranslated" characters and "after-translation"
characters. In Arthur's code, this is precisely the same as the
array index variable "in": characters at src[in] and beyond are
not-yet-translated; other characters are.

Of course, if "cccdedccc" *is* the desired result, Arthur's code
will not achieve it.
3. In mine when the character is found in both the "source" and "from"
strings but there is no corresponding character in the "to" string *all
instances* of the character will be squeezed out of the "source" string.

Arthur's does this as well, because the "translate single character"
algorithm (embedded within the loop) is:

if (t == NULL)
src[out++] = src[in];
/* Yes; if it has a match in |to|, then echo that entry. */
else if (t - from < len_to)
src[out++] = to[t - from];
/* else do nothing, which effectively deletes the character */

Note that, initially, out==in. If every "from" character has a
corresponding "to" character, so that (t - from) is always a
valid index in "to", this "out==in" condition will remain true
throughout the entire loop, so that the antepenultimate line
in the function:

src[out] = '\0';

does not change src[out] at all. On the other hand, if some
"from" characters are to be deleted -- e.g., from[] = "abc"
while to[] = "bc", which will change buf[] = "abcdedcba" to
"bcdedba\0a\0" -- then, in this case, "out" sometimes does not
increment when "in" does, so that eventually out < in. This
is why the src[out] = '\0' line is included.
I really did not this thread to be about functionality or writing style,
but rather the legitimacy of modifying the source string.

There is nothing ultimately *wrong* with this, although if one
is to do it, I, at least, would not call it a "source string"
at all, but rather a "buffer" or something similar. If one is
to modify the array, this must be made clear to callers, so
that they do not attempt things like this:

char *result = translate("hello", "h", "j");

This code has undefined behavior, because translate() attempts to
replace the "h" in the anonymous array created by the string
literal "hello", which may be in read-only storage and/or shared
with other "hello"s in the program.

On the other hand, if you choose *not* to modify the original
array, you are faced with the question of where to put the
storage space for the result. (This is a FAQ; see 7.5a and 7.5b.)
Also, one of my programming tenants is to not re invent the wheel,
hence the calls to my existing routines.

While this is a good tenet, there are times when the existing wheels
are unsuitable, for whatever reason (too large, too small, too
triangular, etc. :) ). The problem of translating "abcdedcba" to
"bccdedccb" (rather than "cccdedccc") is, I think, one case of such
unsuitable wheels.
 
A

Arthur J. O'Dwyer

Arthur said:
char *translate(char *src, const char *from, const char *to)
{
size_t in, out;
size_t len_to = strlen(to);
for (in=out=0; src[in] != '\0'; ++in)
{
/* Is the current character in the |from| array? */
const char *t = strchr(from, src[in]);
/* No; just echo it to the output string */
if (t == NULL)
src[out++] = src[in];
/* Yes; if it has a match in |to|, then echo that entry. */
else if (t - from < len_to)
src[out++] = to[t - from];
}

src[out] = '\0';
return src;
}

Stan Milam said:
You're right, we are not done. Your version does not do the same thing
as mine. Here are the differences: [...]
With mine I took the what I call the straight-ahead
approach and once this condition is met I change them all.

This has a potential problem. Suppose we replace "a" with "b", and
"b" with "c", i.e.:

char buf[] = "abcdedcba";

(void) translate(buf, "ab", "bc");

If you first replace *all* the 'a's with 'b's, we change buf[]'s
contents from "abcdedcba" to "bbcdedcbb". Next, we replace *all*
the 'b's with 'c's, producing "cccdedccc".

I suspect that anyone making that call would *want* "bccdedccb"
in the buf[] array.

Good catch! I recall that many of the examples in Kernighan & Plauger
turned out to have hidden bugs which were revealed by the refactoring
process. In this case, it looks to me like a hidden bug existed, and
was unintentionally corrected /without/ being revealed --- at least,
not to me! It's a good thing Chris Torek was on the job. ;-)
Of course, if "cccdedccc" *is* the desired result, Arthur's code
will not achieve it.

Right.

[A very good explanation of #3 snipped.]

Thanks,
-Arthur
 
S

Stan Milam

Chris said:
With mine I took the what I call the straight-ahead
approach and once this condition is met I change them all.


This has a potential problem. Suppose we replace "a" with "b", and
"b" with "c", i.e.:

char buf[] = "abcdedcba";

(void) translate(buf, "ab", "bc");

If you first replace *all* the 'a's with 'b's, we change buf[]'s
contents from "abcdedcba" to "bbcdedcbb". Next, we replace *all*
the 'b's with 'c's, producing "cccdedccc".

I suspect that anyone making that call would *want* "bccdedccb"
in the buf[] array. To obtain that as the result, the translation
must not be "re-applied" to translated characters. To me, the
easiest way to guarantee this is to make a clear distinction
between "as-yet-untranslated" characters and "after-translation"
characters. In Arthur's code, this is precisely the same as the
array index variable "in": characters at src[in] and beyond are
not-yet-translated; other characters are.

Of course, if "cccdedccc" *is* the desired result, Arthur's code
will not achieve it.

Yes, that is the desired behavior.
3. In mine when the character is found in both the "source" and "from"
strings but there is no corresponding character in the "to" string *all
instances* of the character will be squeezed out of the "source" string.


Arthur's does this as well, because the "translate single character"
algorithm (embedded within the loop) is:

if (t == NULL)
src[out++] = src[in];
/* Yes; if it has a match in |to|, then echo that entry. */
else if (t - from < len_to)
src[out++] = to[t - from];
/* else do nothing, which effectively deletes the character */

Note that, initially, out==in. If every "from" character has a
corresponding "to" character, so that (t - from) is always a
valid index in "to", this "out==in" condition will remain true
throughout the entire loop, so that the antepenultimate line
in the function:

src[out] = '\0';

does not change src[out] at all. On the other hand, if some
"from" characters are to be deleted -- e.g., from[] = "abc"
while to[] = "bc", which will change buf[] = "abcdedcba" to
"bcdedba\0a\0" -- then, in this case, "out" sometimes does not
increment when "in" does, so that eventually out < in. This
is why the src[out] = '\0' line is included.

I really did not this thread to be about functionality or writing style,
but rather the legitimacy of modifying the source string.


There is nothing ultimately *wrong* with this, although if one
is to do it, I, at least, would not call it a "source string"
at all, but rather a "buffer" or something similar. If one is
to modify the array, this must be made clear to callers, so
that they do not attempt things like this:

char *result = translate("hello", "h", "j");

This code has undefined behavior, because translate() attempts to
replace the "h" in the anonymous array created by the string
literal "hello", which may be in read-only storage and/or shared
with other "hello"s in the program.

On the other hand, if you choose *not* to modify the original
array, you are faced with the question of where to put the
storage space for the result. (This is a FAQ; see 7.5a and 7.5b.)

Yep, this is what I was talking about.
While this is a good tenet, there are times when the existing wheels
are unsuitable, for whatever reason (too large, too small, too
triangular, etc. :) ). The problem of translating "abcdedcba" to
"bccdedccb" (rather than "cccdedccc") is, I think, one case of such
unsuitable wheels.

Nah, just using the tools you already have.
 
C

CBFalconer

Stan said:
Chris said:
With mine I took the what I call the straight-ahead
approach and once this condition is met I change them all.

This has a potential problem. Suppose we replace "a" with "b",
and "b" with "c", i.e.:

char buf[] = "abcdedcba";

(void) translate(buf, "ab", "bc");

If you first replace *all* the 'a's with 'b's, we change buf[]'s
contents from "abcdedcba" to "bbcdedcbb". Next, we replace *all*
the 'b's with 'c's, producing "cccdedccc".

I suspect that anyone making that call would *want* "bccdedccb"
in the buf[] array. To obtain that as the result, the translation
must not be "re-applied" to translated characters. To me, the
easiest way to guarantee this is to make a clear distinction
between "as-yet-untranslated" characters and "after-translation"
characters. In Arthur's code, this is precisely the same as the
array index variable "in": characters at src[in] and beyond are
not-yet-translated; other characters are.

Of course, if "cccdedccc" *is* the desired result, Arthur's code
will not achieve it.

Yes, that is the desired behavior.

If you replace all the a's and b's etc. with identifiers (words),
and specify that i/o is done with streams (no backtracking) you
have the behaviour of id2id (my utility for mass reorganization of
identifiers in source code). This strictly eliminates any such
'reapplication'. It also has the advantage of needing no presized
overflowable buffers. See id2id-20 in:

<http://cbfalconer.home.att.net/download/>
 
A

Arthur J. O'Dwyer

Stan said:
Chris said:
char buf[] = "abcdedcba";
(void) translate(buf, "ab", "bc");

If you first replace *all* the 'a's with 'b's, we change buf[]'s
contents from "abcdedcba" to "bbcdedcbb". Next, we replace *all*
the 'b's with 'c's, producing "cccdedccc".

I suspect that anyone making that call would *want* "bccdedccb"[.] [...]
Of course, if "cccdedccc" *is* the desired result, Arthur's code
will not achieve it.

Yes, that is the desired behavior.

If you replace all the a's and b's etc. with identifiers (words),
and specify that i/o is done with streams (no backtracking) you
have the behaviour of id2id (my utility for mass reorganization of
identifiers in source code). This strictly eliminates any such
'reapplication'.

*If*. :) Anyway, you misread Stan's reply; he says that "cccdedccc",
the result of reapplication, *IS* the desired behavior. (Now, I
personally think he's just being defensive, since I can't see any
sensible application for that result. But perhaps he knows of one,
and I'm just being re-defensive. ;)

-Arthur
 
S

Stan Milam

Arthur said:
Stan said:
Chris Torek wrote:

char buf[] = "abcdedcba";
(void) translate(buf, "ab", "bc");

If you first replace *all* the 'a's with 'b's, we change buf[]'s
contents from "abcdedcba" to "bbcdedcbb". Next, we replace *all*
the 'b's with 'c's, producing "cccdedccc".

I suspect that anyone making that call would *want* "bccdedccb"[.]
[...]
Of course, if "cccdedccc" *is* the desired result, Arthur's code
will not achieve it.


Yes, that is the desired behavior.


If you replace all the a's and b's etc. with identifiers (words),
and specify that i/o is done with streams (no backtracking) you
have the behaviour of id2id (my utility for mass reorganization of
identifiers in source code). This strictly eliminates any such
'reapplication'.


*If*. :) Anyway, you misread Stan's reply; he says that "cccdedccc",
the result of reapplication, *IS* the desired behavior. (Now, I
personally think he's just being defensive, since I can't see any
sensible application for that result. But perhaps he knows of one,
and I'm just being re-defensive. ;)

-Arthur

Actually, I came up with an application for this behavior today. But
then I suppose I am being re-re-defensive! :)

By the way, I rewrote the function today using a lot of your insights. I
removed the calls to chrsubst() and str_rmchars(), used the for() loop
and continue statements you advocated. Actually, I wrote two new
versions. The second removed the call to strlen() for the *p_to string
and incremented the pointer each time it was used to translate a
character. At the top I am testing to see if it is '\0'. This test
decides whether the character is translated or squeezed out of the
target string. I also rewrote the description. Note to self: don't try
doing this stuff when I am tired and sleepy - it just gets messy.

Regards,
Stan Milam.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,777
Messages
2,569,604
Members
45,233
Latest member
AlyssaCrai

Latest Threads

Top