searching data for a large set of substrings

C

C3

I have to process some data in C that is given to me as a char * array. I
have a fairly large number of substrings (well, they're not actually
printable, but let's treat them as strings) that I have to search for in the
data.

I need to keep a count of how often each of these substrings occurs in my
original data and then print it out at the end.

This is a fairly mundane task, but since I have so many substrings, it's a
pain having to #define them all. Can anybody suggest an efficient way of
doing this task?

BTW, I have to use C. If I could use Perl, I'd be in heaven.


cheers,
 
P

pete

C3 said:
I have to process some data in C
that is given to me as a char * array. I
have a fairly large number of substrings (well, they're not actually
printable, but let's treat them as strings)
that I have to search for in the data.

I need to keep a count of how often each of these
substrings occurs in my original data and
then print it out at the end.

This is a fairly mundane task,
but since I have so many substrings, it's a
pain having to #define them all.
Can anybody suggest an efficient way of
doing this task?

BTW, I have to use C. If I could use Perl, I'd be in heaven.


/* BEGIN new.c output */

The substring "string", occurs 4 times.
The substring "C", occurs 3 times.
The substring "kibo", occurs 0 times.
The substring "hen", occurs 1 times.
The substring "of", occurs 5 times.

/* END new.c output */

/* BEGIN new.c */

#include <stdio.h>
#include <string.h>

int main(void)
{
struct substring {
char *string;
long unsigned count;
} substring[] = {{"string"},{"C"},{"kibo"},{"hen"},{"of"}},
*str_ptr;
char *data[] = {
"I have to process some data in C that is given to me "
"as a char * array.\n I have a fairly large number of "
"substrings\n(well, they're not actually printable, "
"but let's treat them as strings)\nthat I have to search "
"for in the data.",
"I need to keep a count of how often each of these "
"substrings occurs in my original data and then print it "
"out at the end.",
"This is a fairly mundane task, but since I have so many "
"substrings, it's a pain having to #define them all.\n"
"Can anybody suggest an efficient way of doing this task?\n",
"BTW, I have to use C.\n"
"If I could use Perl, I'd be in heaven.\ncheers,"
};
size_t nstring, ndata;
char **data_ptr, *ptr;
long unsigned count;

nstring = sizeof substring / sizeof *substring;
for (str_ptr = substring; nstring-- != 0; ++str_ptr) {
count = 0;
data_ptr = data;
ndata = sizeof data / sizeof *data;
while (ndata-- != 0) {
ptr = *data_ptr;
ptr = strstr(*data_ptr, str_ptr -> string);
while (ptr != NULL) {
++count;
ptr = strstr(ptr + 1, str_ptr -> string);
}
++data_ptr;
}
str_ptr -> count = count;
}
puts("/* BEGIN new.c output */\n");
nstring = sizeof substring / sizeof *substring;
for (str_ptr = substring; nstring-- != 0; ++str_ptr) {
printf("The substring \"%s\", occurs %lu times.\n",
str_ptr -> string, str_ptr -> count);
}
puts("\n/* END new.c output */");
return 0;
}

/* END new.c */
 
C

C3

This has the guts of what I'm after, but I need to give it a fairly long
list of substrings (byte sequences), which can't be done dynamically due to
the struct.

I have already read my substrings into char[] arrays, now I just need to do
the substring matching. Any help you could give me would be appreciated.

cheers,
 
A

Al Bowers

C3 said:
This has the guts of what I'm after, but I need to give it a fairly long
list of substrings (byte sequences), which can't be done dynamically due to
the struct.

I have already read my substrings into char[] arrays, now I just need to do
the substring matching. Any help you could give me would be appreciated.

Write a new function that uses function strstr to do the matching. This
new function will have parameters that will point to the substr array,
a parameter that will describe the size of the array, and a pointer to
the data string that you wish to search. Depending on how you want
to store the results of the search, other parameters may be needed.

Here is an example of a function, PrintResults, that will simply print
the results of the search. (Case issue avoided for simplicity)

#include <stdio.h>
#include <string.h>

void PrintResults(char *substr[], size_t cnt,const char *string);

int main(void)
{
char *substr[] = {"george","bill","adam","a"};
char *teststr = "this is george and this is adam.\n"
"there are many georges and some \n"
"adams, but george is the best because\n"
"he is the mostest. george is the king\n"
"of the jungle";

printf("In the string:\n\n\"%s\"\n\nSearch found:\n",teststr);
PrintResults(substr,(sizeof substr)/(sizeof *substr),teststr);
return 0;
}

void PrintResults(char *substr[], size_t cnt,const char *string)
{
size_t i;
unsigned subcnt;
const char *s1, *s2;

if(!substr || !string || *string == '\0') return;
for(i = 0; i < cnt;i++)
{
subcnt = 0;
if(*substr != '\0')
for(s1 = string;(s2 = strstr(s1,substr)); s1 = (s2+1))
subcnt++;
printf("%u occurrances of \"%s\"\n",subcnt,substr);
}
return;
}
 
C

CBFalconer

C3 said:
This has the guts of what I'm after, but I need to give it a
fairly long list of substrings (byte sequences), which can't be
done dynamically due to the struct.

I have already read my substrings into char[] arrays, now I just
need to do the substring matching. Any help you could give me
would be appreciated.

What is 'this'? What is 'the struct'? You can compare strings
with strcmp().
 
R

Robert Harris

C3 said:
This has the guts of what I'm after, but I need to give it a fairly long
list of substrings (byte sequences), which can't be done dynamically due to
the struct.

I have already read my substrings into char[] arrays, now I just need to do
the substring matching. Any help you could give me would be appreciated.

cheers,
Look at lex (which generates C code)
 
C

C3

Unfortunately, it appears this won't work for another reason. My substrings
may contain the null character. I have the length, so it shouldn't be a
problem.
 
M

Mac

Unfortunately, it appears this won't work for another reason. My substrings
may contain the null character. I have the length, so it shouldn't be a
problem.

I was going to ask about that. I'm glad you figured it out yourself.

I don't know how big the data sets are or how many substrings you have,
but you could just go through the data one character at a time, and for
each byte call memcmp() once for each substring.

char *data, **substring;
long i, j, nsubstrings, *count, data_length, *substring_length;

/* allocate space and populate substrings and so on */
....

for (j=0; j<data_length; j++)
{
for(i=0; i<nsubstrings; i++)
if (memcmp(data[j], substring, substring_length) == 0))
++count;
}

This is just a sketch of an idea to be taken for what it is worth.

In particular, it might be better to use size_t for some of the types.

--Mac
 
C

C3

Thanks for the suggestion. I guess there's no Perlesque q&d way to do what I
need.

cheers,
 
C

C3

Thank you. I've always believed this to be the case, for oh so very obvious
reasons. How did you know? :)

cheers,
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,483
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top