is there an alternative to strstr

  • Thread starter Ramprasad A Padmanabhan
  • Start date
T

Tom St Denis

Dan Pop said:
In <[email protected]> David Rubin

Doesn't seem to work very well when you have to search inside a string,
does it?

in theory if they're all the same size it could work. Though yeah you're
right for general strings bsearch isn't too helpful.

Tom
 
D

Dan Pop

In said:
In my program I am checking wether a emailid exists in a list

I have in the complete_list a string like
"<[email protected]> <[email protected]> <[email protected]> ... <[email protected]>" that is a
string of emailids sorted

Now I want to find out if another ID "<[email protected]>" exists in this list

How is it possible to optimize the search given that the complete list
string is sorted in order

To exploit this fact, you need a different data format, a plain string is
not particularly suitable for a binary search.

If you can store the email addresses in an array of strings, one per
string, and keep them still sorted, the bsearch() function from the
standard library is likely to be faster than a strstr() in a plain
string. The comparison function can be a simple wrapper around strcmp().

If you're performing this search often enough on large lists of email
addresses, it may be worth considering changing the format in which
you keep the email addresses, so that bsearch becomes an option.
Compare the performance of the two approaches on real data sets to
see which is better.

OTOH, if you build the list of addresses on the fly and the input is NOT
already sorted, you may consider putting them in a binary tree structure.
Except for pathological cases (input data already sorted) the search is
simple and fast and new data can be added with very little overhead (see
also the example in K&R2).

Dan
 
D

David Rubin

Dan said:
Doesn't seem to work very well when you have to search inside a string,
does it?

No, but it's not a stretch to see how one *could* use bsearch given the
format of the data being searched.

/david
 
D

Dan Pop

In said:
No, but it's not a stretch to see how one *could* use bsearch given the
format of the data being searched.

I strongly suspect the *real* data consists of variable length email
addresses...

Dan
 
T

tom_usenet

In my program I am checking wether a emailid exists in a list

I have in the complete_list a string like
"<[email protected]> <[email protected]> <[email protected]> ... <[email protected]>" that is a
string of emailids sorted

Now I want to find out if another ID "<[email protected]>" exists in this list


How is it possible to optimize the search given that the complete list
string is sorted in order

You need an array of char*s that point to the start of each string.
Then it's easy. e.g.

char **index;
int nStrings = 0;
size_t length = strlen(complete_list);
/*count strings*/
for (int i = 0; i != length; ++i)
if(complete_list == '<')
++nStrings;

/*make index*/
index = malloc(nStrings * sizeof *index);
nStrings = 0;
for (int i = 0; i != length; ++i)
if(complete_list == '<')
index[nStrings++] = complete_list + i;

Now you can use bsearch on your index array, being careful to make
your comparison function compare correctly (only up to the " ").

Tom
 
T

Thomas Stegen CES2000

Dan said:
I strongly suspect the *real* data consists of variable length email
addresses...

But they all seem to be delimited by < and > so the compare function
could just look for those. Would have to look both backwards and
forwards though in case one ends up in the middle of an address.

There might be some other issues though. Like what to do if you end
up directly on a < or > or maybe even between them. Might not be an
issue though.

This function makes a few simplifications and might do strange things
if there can be whitespace before the first address. It is more a show
of concept. And I hope I get the return values right :)

And I know it is not a good idea to post untested code, but
I am in a hurry.

int compare(void *p1, void *p2)
{
char *s = p1;
char *target = p2;

while(*s != '<')
{
s--;
}

while((*s != '>') && (*target != '\0'))
{
if(*s > *target)
{
return 1;
}
if(*s < *target)
{
return -1;
}
s++;
target++;
}
return 0;
}
 
D

Dan Pop

In said:
But they all seem to be delimited by < and > so the compare function
could just look for those. Would have to look both backwards and
forwards though in case one ends up in the middle of an address.

bsearch can operate on constant size objects only. The compare function
is a non-issue here. As another poster showed, you need to index the
string if you want to use bsearch, with a properly crafted compare
function.

The trivial case is when all addresses have the same length: you can
get the item count from (strlen(addresses) + 1) / (addrlen + 1) and use
a wrapper around memcmp() as the compare function.

Dan
 
G

Glen Herrmannsfeldt

David Rubin said:

A hash table might be a better solution, and it doesn't require the list to
be ordered.

Note, though, that it should be a list, such as an array of strings, for
bsearch, and it is a little easier even for hsearch.

Though with strtok on a copy of the string it shouldn't be too hard to do
it.

-- glen
 
R

Ramprasad A Padmanabhan

Dan said:
To exploit this fact, you need a different data format, a plain string is
not particularly suitable for a binary search.

If you can store the email addresses in an array of strings, one per
string, and keep them still sorted, the bsearch() function from the
standard library is likely to be faster than a strstr() in a plain
string. The comparison function can be a simple wrapper around strcmp().

If you're performing this search often enough on large lists of email
addresses, it may be worth considering changing the format in which
you keep the email addresses, so that bsearch becomes an option.
Compare the performance of the two approaches on real data sets to
see which is better.

OTOH, if you build the list of addresses on the fly and the input is NOT
already sorted, you may consider putting them in a binary tree structure.
Except for pathological cases (input data already sorted) the search is
simple and fast and new data can be added with very little overhead (see
also the example in K&R2).

Dan

Ok I have put the email ids in a sorted array.
Can I use bsearch on an array with differrent sized elements. Because in
a real scenario the email ids will have different sizes

Thanks
Ram
 
T

Thomas Stegen CES2000

Ramprasad A Padmanabhan wrote:

Ok I have put the email ids in a sorted array.
Can I use bsearch on an array with differrent sized elements. Because in
a real scenario the email ids will have different sizes

If by array you mean something like char *ids[100] or something
similar, then there is no problem using bsearch to search for
a given string in the array. The problems discussed elsethread
only apply when you store all the addresses in the same string
which is not the case if you use char *ids[100]. Then you have
100 different strings (or pointers to characters to be precise).

But they do need to be sorted. You can either use qsort or, as
Dan Pop suggested, a binary search tree. Note that sorting using
a binary search tree is very much equivalent to sorting using the
quicksort algorithm (assuming of course the qsort is implemented
using qsort) since both will be O(n log n). Then looking up the
tree or using bsearch on the array is also very much the same,
both will be O(log n).

If you need to continually add addresses to your program a tree
might be better, if you only add addresses once then the qsort/
bsearch combo is probably faster since there is some overhead
in the tree approach.
 
P

Paul Hsieh

Ramprasad A Padmanabhan said:
In my program I am checking wether a emailid exists in a list

I have in the complete_list a string like
"<[email protected]> <[email protected]> <[email protected]> ... <[email protected]>" that is a
string of emailids sorted

Now I want to find out if another ID "<[email protected]>" exists in this list

How is it possible to optimize the search given that the complete list
string is sorted in order.

Well to be optimized for performance, you are going to need to
preprocess your list of strings to allow for indexing and possibly
using bsearch().

However, if you are more concerned with optimizing for "safety" and
"simplicity", you could try using bstrlib (http://bstring.sf.net/):

#include <stdio.h>
#include "bstrlib.h"

struct cbsScanEmailAddress {
bstring db, newID;
int ret;
};

static int testID (void * parm, int ofs, int len) {
struct cbsScanEmailAddress * s = (struct cbsScanEmailAddress *)parm;
struct tagbstring t;

blk2tbstr (t, s->db->data + ofs, len);
s->ret = bstrcmp (&t, s->newID);
return -(s->ret >= 0);
}

int scanForEmailAddress (const bstring dbIDs, const bstring newID) {
struct cbsScanEmailAddress s;

s.db = dbIDs;
s.newID = newID;
bsplitcb (dbIDs, ' ', 0, testID, &s);
return s.ret == 0;
}

struct tagbstring dbID = bsStatic ("<[email protected]> <[email protected]> " \
"<[email protected]> <[email protected]>");
struct tagbstring newID1 = bsStatic ("<[email protected]>");
struct tagbstring newID2 = bsStatic ("<[email protected]>");

int main () {
printf ("%d\n", scanForEmailAddress (&dbID, &newID1));
printf ("%d\n", scanForEmailAddress (&dbID, &newID2));
return 0;
}
 
G

Glen Herrmannsfeldt

Ramprasad A Padmanabhan said:
In my program I am checking wether a emailid exists in a list

I have in the complete_list a string like
"<[email protected]> <[email protected]> <[email protected]> ... <[email protected]>" that is a
string of emailids sorted

Now I want to find out if another ID "<[email protected]>" exists in this list


How is it possible to optimize the search given that the complete list
string is sorted in order

As a completely different idea, you could search it with the Boyer-Moore
algorithm. I don't believe strstr usually does that, as the startup
overhead is a little high, and strstr is often used on short strings.

-- glen
 
A

Al Bowers

Ramprasad said:
Dan Pop wrote:


Ok I have put the email ids in a sorted array.
Can I use bsearch on an array with differrent sized elements. Because in
a real scenario the email ids will have different sizes

No bsearch and qsort is not suitable for different sized elements.
However you can use an array of char pointers and use
qsort to sort the array of char pointers and use bsearch to find the key
in the array.

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

typedef struct EMAILLIST
{
char **email;
size_t email_count;
} EMAILLIST;

int AddEmailAddr(const char *s, EMAILLIST *p);
char **SearchEmailList(const char *key, EMAILLIST *p);
void FreeEmailList(EMAILLIST *p);
int cmp(const void *v1, const void *v2);

int main(void)
{
EMAILLIST my = {NULL};
char **tmp, *s;
char emails[] = "<[email protected]> <[email protected]> "
"<[email protected]> <[email protected]>";
s = strtok(emails," ");
while(s)
{
AddEmailAddr(s,&my);
s = strtok(NULL," ");
}
AddEmailAddr("<[email protected]>",&my);
if((tmp = SearchEmailList("<[email protected]>",&my)) != NULL)
printf("Email \"%s\" found in the list\n",*tmp);
printf("TEST: my.email_count = %u\n",my.email_count);
FreeEmailList(&my);
return 0;
}

int AddEmailAddr(const char *s, EMAILLIST *p)
{
char **tmp;
size_t num = p->email_count;

if(p->email != NULL)
if(SearchEmailList(s,p) != NULL)
{
printf("Email Dupe, Email \"%s\" not added\n",s);
return 0;
}
if((tmp = realloc(p->email,
(num+1)*(sizeof *p->email))) == NULL)
{
printf("Memory Allocation Error, Email \"%s\" not added\n",s);
return 0;
}
p->email = tmp;
if((p->email[num] = malloc(strlen(s)+1)) == NULL)
{
printf("Memory Allocation Error, Email \"%s\" not added\n",s);
return 0;
}
strcpy(p->email[num],s);
qsort(p->email,++p->email_count,sizeof *p->email,cmp);
return 1;
}

void FreeEmailList(EMAILLIST *p)
{
size_t i;

for(i = 0;i < p->email_count;i++)
free(p->email);
free(p->email);
p->email = NULL;
p->email_count = 0;
return;
}

int cmp(const void *v1, const void *v2)
{
const char **s1 = (const char **)v1;
const char **s2 = (const char **)v2;

return strcmp(*s1,*s2);
}

char **SearchEmailList(const char *key, EMAILLIST *p)
{
if(p->email == NULL) return NULL;
return bsearch(&key,p->email,p->email_count,sizeof(char *),cmp);
}
 
D

Dan Pop

As a completely different idea, you could search it with the Boyer-Moore
algorithm. I don't believe strstr usually does that, as the startup
overhead is a little high, and strstr is often used on short strings.

Would you really expect a newbie who has trouble *using* bsearch to be
able to implement Boyer-Moore?

Although likely to be faster than strstr, it's not likely to be faster
than bsearch, but it has the merit that it works on the original data
format.

Dan
 
G

Glen Herrmannsfeldt

Dan Pop said:
In <o8xnb.51291$HS4.234455@attbi_s01> "Glen Herrmannsfeldt"

(snip)
Would you really expect a newbie who has trouble *using* bsearch to be
able to implement Boyer-Moore?

It isn't all that hard to do, though knowing the language you are writing in
does help. Also, already written versions may exist.
Although likely to be faster than strstr, it's not likely to be faster
than bsearch, but it has the merit that it works on the original data
format.

If the strings keep the '<' and '>' on them, it might be pretty fast, once
the initialization is done. It does save the time of parsing the input
data into separate strings, which should also count in the time, as should
the Boyer-Moore table building.

-- glen
 
J

Jarno A Wuolijoki

Would you really expect a newbie who has trouble *using* bsearch to be
able to implement Boyer-Moore?

Although likely to be faster than strstr, it's not likely to be faster
than bsearch, but it has the merit that it works on the original data
format.

Of course, if that is an absolute requirement, you can do binary
search on the original data as well. (just not with bsearch())


char *findaddr(char *needle, size_t needlen,
char *haystack, size_t hslen)
{
char *segbeg=haystack, *segend=haystack+hslen;

for (;;) {
char *midbeg, *midend;
int cmpres;

while (segbeg<segend && *segbeg!='<') segbeg++;
while (segbeg<segend && *(segend-1)!='>') segend--;
if (segbeg==segend) return 0;

midbeg=segbeg+(segend-segbeg)/2;
while (*midbeg!='<') midbeg--;

midend=midbeg;
while (*midend!='>') midend++;
midend++;

cmpres=compare(midbeg, midend-midbeg, needle, needlen);
if (cmpres==0) return midbeg;
if (cmpres<0) segbeg=midend; else segend=midbeg;
}
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top