Good point. You present a string which results in a linked list of
addresses of characters as a potential worst case, and you're right.
This could occur in genetics applications with which Malcolm seems to
work.
Indeed.
And there are many things we can do about this problem. Sure, each
linked list node could in general hold not a pointer and a length, but
an array of pointers and lengths of fixed size as a cache.
Or we could throw out linked lists and return instead an array of
pointers and lengths. We could use run length encoding in this array,
whose element would look like this:
struct changes { int repeatMe; char *p; int len };
But soft. The problem is becoming one of representing strings in the
best way possible. The string in your example is itself "inefficient"
since it could be represented more compactly using a repeat count. You
pass my solution weird data, yes, its performance suffers.
Okay; fair enough Edward. BTW, I created an "automated" testing framework
for sub-string replacement/search algorithms:
http://groups.google.com/group/comp.lang.c/browse_frm/thread/6c021e75...
As you can see I am using it to test your algorithm and a "newer" one of
mine that I have not formally posted to this news group yet. Both of our
algorithms are passing 10,000,000 iterations of the test. So far, this is
pretty good news!
I cannot seem to find a bug in our algorithms; cool!
:^)
And yes, we're somewhat responsible for not having surprise
performance issues for unexpected data. Those of us who are real
programmers lose beauty sleep over this, unlike Richard Heathfield,
who doesn't seem to care that in his "reusable" linked list a large
amount of data will be a performance hog.
But now you're talking about how best to represent strings, which is
best done not in C but in an OO language, because inside the object we
can represent the string how the hell we want.
If you examine the newer solution you might come to the conclusion that my
technique of amortizing the amount of calls to `realloc()' is "kind of
okay". However, it just might be a performance nightmare because of the
chance that the piece of shi% native allocator might need to relocate the
storage and perform a full-blown O(N) copy on each damn call. Therefore, I
conclude that I will probably get a very big performance enchantment if I
represent the final destination string as a singly linked-list. This way, I
can amortize the number of calls to `malloc()' and guarantee that I will not
ever have to do a horrible O(N) copy. I think this is a case in favor of
more exotic representations of strings.
My algorithm builds a limited amount of matches (e.g., 4096 in the code
as-is) and then flushes all of those matches to the destination string.
Expansion of the destination string only happens once per-4096 matches.
Pretty good. However, `realloc()' sucks if it has to fuc%ing copy!!!!!!!!!
GRRRRR!!!
Therefore, instead of calling `realloc()' every 4096 matches... I would
instead create a new segment of the destination string using `malloc()' and
link it to the previous via linked-list. BAM! Problem solved...
Here is a snippet of code that encompasses my newer replacement algorithm:
/* Chris M. Thomasson's Sub-String Replacement Algorithm
__________________________________________________________________*/
#include <string.h>
#define xstrstr strstr
#define xstrlen strlen
#define xstrcmp strcmp
#define xmemcpy memcpy
#define MATCH_MAX 4096
char*
chris_replace(char const* src,
char const* cmp,
char const* xchg)
{
size_t cmp_size = xstrlen(cmp);
size_t xchg_size = xstrlen(xchg);
char* dest = NULL;
size_t src_offset = 0;
size_t dest_head = 0;
size_t total_count = 0;
char const* head = src;
char const* tail = src;
char const* matches[MATCH_MAX];
if (! cmp_size)
{
/* special case handler */
size_t src_size = xstrlen(src);
if (! src_size)
{
if ((dest = malloc(xchg_size + 1)))
{
memcpy(dest, xchg, xchg_size);
dest[xchg_size] = '\0';
}
}
else
{
if ((dest = malloc(src_size + 1)))
{
memcpy(dest, src, src_size);
dest[src_size] = '\0';
}
}
return dest;
}
while (head)
{
size_t i;
size_t dest_size;
size_t src_size;
size_t count = 0;
char* dest_tmp;
char const* prev_head = head;
/* acquire matches */
while ((head = xstrstr(head, cmp)))
{
matches[count++] = head;
tail = head;
head += cmp_size;
if (count == MATCH_MAX) break;
}
/* calculate sizes and expand dest buffer */
total_count += count;
src_size = (! head) ? (tail - src) + xstrlen(tail)
: (head - src);
dest_size = (src_size - (total_count * cmp_size)) +
(total_count * xchg_size);
if (! (dest_tmp = realloc(dest, dest_size + 1)))
{
free(dest);
return NULL;
}
dest = dest_tmp;
src_size -= src_offset;
/* flush matches to dest buffer */
for (i = 0; i < count; ++i)
{
size_t offset = matches
- prev_head;
size_t stride = offset + cmp_size;
memcpy(dest + dest_head, prev_head, offset);
src_offset += stride;
src_size -= stride;
prev_head += stride;
dest_head += offset;
memcpy(dest + dest_head, xchg, xchg_size);
dest_head += xchg_size;
}
if (src_size)
{
memcpy(dest + dest_head, prev_head, src_size);
dest_head += src_size;
}
dest[dest_size] = '\0';
}
return dest;
}
I know that I flaked out on eluding `string.h', but I wanted to focus on a
specific portion of the algorithm before I integrate my non-naive sub-string
search algorithm into it.
When I get some more time, I will convert this to a linked-list and alter my
testing function to be able to verify a result string in a linked-list
against the expected NULL terminated string.
By the time I add the linked-list destination string ability, and the
non-naive sub-string search... Well, it should outperform every solution
posted here so far. BTW, your soultion, and my previous soultion, are no
good for extremely large input. Both of ours might build very large
linked-lists. Mine will be amortized, but it still not good enough! This
newer soultion works well with huge input. Of course, the linked-list
verison I am going to code will be even better...
Stay tuned!
;^)
Why not at least do a simple stack-based region allocator for
linked-list
nodes and switch over to `malloc()' when that is exhausted? Or attempt
to
amortize everything by letting a single list node hold multiple matches?
[...]