How to find a string in a stream of binary data?

Angus Comber

Hello

My code below opens a Word document in binary mode and places the data into
a buffer. I then want to search this buffer for a string. I tried using
strstr, but I think it stops looking when it reaches the first null character
or some control character in the data. What C function should I use to
search in a BYTE data buffer?


Code:

#include <stdio.h>

char szPath[MAX_PATH] = "";
strcpy(szPath, "E:\\MyPath\\ahl.doc");
FILE* stream;

FILE *file = fopen(szPath, "rb"); // Open the file
fseek(file, 0, SEEK_END); // Seek to the end
long file_size = ftell(file); // Get the current position
rewind (file); // rewind to start of file

// allocate memory to contain the whole file.
BYTE* byBuffer = (BYTE*) malloc (file_size);
// if (buffer == NULL) exit (2);

// copy the file into the buffer.
fread (byBuffer,1,file_size, file);

const char* szFind = "selected";
//strstr(StrToLookIn, StrToFind);
char* szResult = strstr((char*)byBuffer, szFind);

fclose(file); // Close the file

free( byBuffer );


Angus Comber
(e-mail address removed)
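
A note on why strstr gives up early: it treats its first argument as a C string, so the search ends at the first zero byte, and a buffer filled by fread is not guaranteed to contain a terminator at all. The small helper below is only a sketch to make the effect visible; strstr_offset is a name made up for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search with strstr() and report the offset of
   the match, or -1 if strstr() finds nothing.
   'buf' must contain at least one zero byte, because strstr() treats
   the first zero byte as the end of the data. */
long strstr_offset(const char *buf, const char *needle)
{
    const char *hit = strstr(buf, needle);

    return hit ? (long)(hit - buf) : -1L;
}
```

With `const char data[] = "abc\0selected";` the call `strstr_offset(data, "selected")` returns -1 even though those bytes start at offset 4, because everything after the embedded zero is invisible to strstr.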
 
Allin Cottrell

Angus said:
My code below opens a Word document in binary mode and places the data into
a buffer. I then want to search this buffer for a string. I tried using
strstr but think it stops looking when it reaches first null character or
some control character in data. What C function should I use to be able to
search in a BYTE data buffer?

Code:

#include <stdio.h>

You need to include more than this (e.g. stdlib.h for malloc).
char szPath[MAX_PATH] = "";
strcpy(szPath, "E:\\MyPath\\ahl.doc");
FILE* stream;

FILE *file = fopen(szPath, "rb"); // Open the file
fseek(file, 0, SEEK_END); // Seek to the end
long file_size = ftell(file); // Get the current position

Hmm, you're assuming C99, where you can use C++-style comments,
and can introduce new variables at any point in the code? For
portability, you're best to stick with C90.
rewind (file); // rewind to start of file

// allocate memory to contain the whole file.
BYTE* byBuffer = (BYTE*) malloc (file_size);

Look at the archive of this newsgroup, and you'll see many good
reasons _not_ to cast the return from malloc, if you're writing
in C.
// if (buffer == NULL) exit (2);

exit(2)?? What sort of code is that?

Well, anyway, here is an ISO/ANSI C program that (I think) will
do what you want, not necessarily with greatest efficiency.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the offset of the first occurrence of s in buf, or -1. */
long find_string_in_buf (const unsigned char *buf, size_t len,
                         const char *s)
{
    size_t slen = strlen(s);
    size_t i, j;

    if (slen > len) {
        return -1; /* target is longer than the buffer */
    }

    /* the last possible starting position is len - slen */
    for (i = 0; i + slen <= len; i++) {
        for (j = 0; j < slen; j++) {
            if (buf[i+j] != (unsigned char) s[j]) {
                break;
            }
        }
        if (j == slen) {
            return (long) i;
        }
    }

    return -1;
}

int main (int argc, char **argv)
{
    const char *targ = "selected";
    const char *fname;
    FILE *fp;
    long pos;
    size_t file_size;
    unsigned char *buf;
    long loc;

    if (argc < 2) {
        fputs("Please supply a filename\n", stderr);
        exit(EXIT_FAILURE);
    }

    fname = argv[1];

    fp = fopen(fname, "rb");
    if (fp == NULL) {
        fprintf(stderr, "Couldn't open '%s'\n", fname);
        exit(EXIT_FAILURE);
    }

    /* ftell returns -1L on failure, so check before converting */
    if (fseek(fp, 0, SEEK_END) != 0 || (pos = ftell(fp)) <= 0) {
        fprintf(stderr, "'%s' is empty or its size can't be determined\n",
                fname);
        exit(EXIT_FAILURE);
    }
    file_size = (size_t) pos;
    rewind(fp);

    buf = malloc(file_size);
    if (buf == NULL) {
        fputs("Out of memory\n", stderr);
        exit(EXIT_FAILURE);
    }

    if (fread(buf, 1, file_size, fp) != file_size) {
        fprintf(stderr, "Couldn't read '%s'\n", fname);
        exit(EXIT_FAILURE);
    }

    loc = find_string_in_buf(buf, file_size, targ);
    if (loc < 0) {
        printf("The target string '%s' was not found\n", targ);
    } else {
        printf("The target string '%s' was found at byte %ld\n",
               targ, loc);
    }

    fclose(fp);
    free(buf);

    return 0;
}
 
E. Robert Tisdale

Allin said:
Hmm, you're assuming C99, where you can use C++-style comments,
and can introduce new variables at any point in the code?
For portability, you're best to stick with C90.

When, if ever, would you recommend moving on to C 99?
 
Mike Wahler

#include <stdio.h>
#include <string.h>

#define BUFFER_SIZE 200 /* adjust to your needs */


/* nz_strstr()                                      */
/*                                                  */
/* behaves as 'strstr()' but handles input with     */
/* embedded zero characters                         */
/*                                                  */
/* 'data' and 'to_find' must be zero-terminated     */
char *nz_strstr(char *data, const char *to_find)
{
    char *result = 0;

    while(!(result = strstr(data, to_find)))
        data += strlen(data) + 1;

    return result;
}

int main(int argc, char **argv)
{
    char buffer[BUFFER_SIZE] = "Hello world\0 this is\0 a test";
    char what[] = "test";
    char *p = nz_strstr(buffer, what);

    printf("string '%s' ", what);

    if(p)
        printf("found at offset %lu\n", (unsigned long)(p - buffer));
    else
        printf("not found\n");

    return 0;
}

-Mike
 
Allin Cottrell

E. Robert Tisdale said:
When, if ever, would you recommend moving on to C 99?

When the more commonly used C compilers offer support for C99
that is comparable to the support they currently offer for C90.

At present, the most widely used C compilers can be made C90-
conforming, if you know the right options to use. At present,
no commonly used C compiler can be made C99-conforming, with
any combination of options.
 
Mike Wahler

Mike Wahler said:
/* nz_str() */
/* */
/* behaves as 'strstr()' but handles input with */
/* embedded zero characters */
/* */
/* 'data' and 'to_find' must be zero-terminated */

Correction: 'data' must be terminated by at least
*two* consecutive zero characters.

char *nz_strstr(char *data, const char *to_find)
{
    char *result = 0;

    while(!(result = strstr(data, to_find)))
        data += strlen(data) + 1;

... because of the "+ 1" used to step *over* the zeros.


BTW, Angus, referring to your original code, there's no such
type as 'BYTE' in C.

-Mike
 
Richard Heathfield

Angus said:
Hello

My code below opens a Word document in binary mode and places the data into
a buffer.

<OT>
You might get a bit more joy out of Word docs if you do some research into
"structured storage" or "compound documents".
</OT>

I then want to search this buffer for a string. I tried using
strstr but think it stops looking when it reaches first null character or
some control character in data. What C function should I use to be able
to search in a BYTE data buffer?

Look up the Boyer-Moore search algorithm on the Net, and implement it in C.
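
As one possible starting point for that suggestion, here is a sketch of the Horspool simplification of Boyer-Moore. It uses bad-character shifts only and is not the full algorithm with the good-suffix rule. The name bmh_search and its signature are assumptions made for illustration. Both buffers are passed with explicit lengths, so zero bytes are treated like any other byte.

```c
#include <limits.h>
#include <stddef.h>

/* Boyer-Moore-Horspool search over raw bytes (hypothetical helper).
   Returns a pointer to the first match, or NULL if there is none. */
const unsigned char *bmh_search(const unsigned char *hay, size_t hay_len,
                                const unsigned char *needle,
                                size_t needle_len)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i, pos;

    if (needle_len == 0)
        return hay;             /* empty pattern matches at the start */
    if (needle_len > hay_len)
        return NULL;

    /* Default shift: a byte not in the pattern skips the whole pattern. */
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = needle_len;

    /* For every pattern byte except the last, shift so that its
       rightmost occurrence lines up with the window's last byte. */
    for (i = 0; i + 1 < needle_len; i++)
        skip[needle[i]] = needle_len - 1 - i;

    pos = 0;
    while (pos <= hay_len - needle_len) {
        /* Compare the window against the pattern, right to left. */
        i = needle_len;
        while (i > 0 && hay[pos + i - 1] == needle[i - 1])
            i--;
        if (i == 0)
            return hay + pos;

        /* Shift according to the byte under the window's last position. */
        pos += skip[hay[pos + needle_len - 1]];
    }

    return NULL;
}
```

For the original question, the call would look something like `bmh_search(buf, file_size, (const unsigned char *)"selected", 8)`. The shift table pays off mainly for longer patterns; for a short target in a small file, a plain loop is likely just as good.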
 
jacob navia

Allin Cottrell said:
E. Robert Tisdale wrote:

At present, the most widely used C compilers can be made C90-
conforming, if you know the right options to use. At present,
no commonly used C compiler can be made C99-conforming, with
any combination of options.

The freely available lcc-win32 compiler implements most of C99
http://www.cs.virginia.edu/~lcc-win32
 
David Resnick

Allin Cottrell said:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

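/* Return the offset of the first occurrence of s in buf, or -1. */
long find_string_in_buf (const unsigned char *buf, size_t len,
                         const char *s)
{
    size_t slen = strlen(s);
    size_t i, j;

    if (slen > len) {
        return -1; /* target is longer than the buffer */
    }

    /* the last possible starting position is len - slen */
    for (i = 0; i + slen <= len; i++) {
        for (j = 0; j < slen; j++) {
            if (buf[i+j] != (unsigned char) s[j]) {
                break;
            }
        }
        if (j == slen) {
            return (long) i;
        }
    }

    return -1;
}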


A perhaps simpler implementation is as follows. Seems like using memcmp
is a win here.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* look for the query in binary input data of indicated length */
const char *find_str_in_data(const char *data, size_t data_len,
                             const char *query)
{
    const char *p = data;
    size_t query_len = strlen(query);

    if (query_len == 0) {
        return data; /* an empty query matches at the start */
    }

    /* stop once fewer than query_len bytes remain */
    while ((size_t)(data + data_len - p) >= query_len) {
        if (*p == query[0] && memcmp(p, query, query_len) == 0) {
            return p;
        }
        p++;
    }

    return NULL;
}

int main(void)
{
    const char *query = "foo";
    const char data1[] = { 'a', 'b', '\0', 'f', '\0', 'f', 'o', 'o' };
    const char data2[] = { 'a', 'b', '\0', 'f', '\0', 'f', 'o', 'o', 'q'};
    const char data3[] = { 'f', 'o' };
    const char *result;

    result = find_str_in_data(data1, sizeof data1, query);
    printf("query '%s'%s found in data1\n", query, result ? "" : " NOT");
    result = find_str_in_data(data2, sizeof data2, query);
    printf("query '%s'%s found in data2\n", query, result ? "" : " NOT");
    result = find_str_in_data(data3, sizeof data3, query);
    printf("query '%s'%s found in data3\n", query, result ? "" : " NOT");

    return EXIT_SUCCESS;
}

-David
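
A variation on the memcmp idea, offered as a sketch and not as a replacement: let memchr locate candidate first bytes, and take the query with an explicit length so that the pattern itself may contain zero bytes. The name find_bytes and its signature are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: find 'query' (query_len bytes) inside 'data'
   (data_len bytes). Returns a pointer to the first match or NULL. */
const void *find_bytes(const void *data, size_t data_len,
                       const void *query, size_t query_len)
{
    const unsigned char *p = data;
    const unsigned char *q = query;
    size_t remaining = data_len;

    if (query_len == 0)
        return data;            /* empty query matches at the start */

    while (remaining >= query_len) {
        /* Only positions that leave room for the whole query can match,
           so memchr looks at remaining - query_len + 1 bytes. */
        const unsigned char *hit = memchr(p, q[0], remaining - query_len + 1);

        if (hit == NULL)
            return NULL;
        if (memcmp(hit, q, query_len) == 0)
            return hit;

        /* Resume one byte past the failed candidate. */
        remaining -= (size_t)(hit - p) + 1;
        p = hit + 1;
    }

    return NULL;
}
```

Passing the length separately means a call such as `find_bytes(data1, sizeof data1, "f\0f", 3)` can look for a pattern that strlen would cut short.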
 
Mike Wahler

Mike Wahler said:
Correction: 'data' must be terminated by at least
*two* consecutive zero characters.

... because of the "+ 1" used to step *over* the zeros.

And *another* problem! (surprised nobody caught it):

This will have undefined behavior if the searched-for string isn't
found: it'll run off the end of the 'data' array.

It needs a 'size' parameter to check against, or at least
a 'dummy' copy of the searched-for item pasted onto the
end as a 'sentinel'.

-Mike
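
One way the 'size' parameter might look, as a sketch only: nz_strstr_n is a hypothetical variant of nz_strstr. It assumes the caller guarantees that the last byte of the buffer is zero, which keeps every strstr and strlen call inside the array.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bounded variant of nz_strstr().
   'data' holds 'size' bytes, and the last of them must be '\0'
   (an assumption of this sketch). Returns a pointer to the first
   occurrence of 'to_find', or NULL if it is not present. */
char *nz_strstr_n(char *data, size_t size, const char *to_find)
{
    char *end = data + size;

    if (size == 0 || data[size - 1] != '\0')
        return NULL;                /* precondition not met */

    while (data < end) {
        char *result = strstr(data, to_find);

        if (result != NULL)
            return result;
        data += strlen(data) + 1;   /* step over this segment and its zero */
    }

    return NULL;                    /* reached the end without a match */
}
```

For a file read with fread, this would mean allocating file_size + 1 bytes and storing a zero in the extra byte before searching.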
 
glen herrmannsfeldt

Angus said:
Hello

My code below opens a Word document in binary mode and places the data into
a buffer. I then want to search this buffer for a string. I tried using
strstr but think it stops looking when it reaches first null character or
some control character in data. What C function should I use to be able to
search in a BYTE data buffer?

Use the Aho Corasick algorithm referenced previously in this newsgroup.

Though the discussion doesn't seem to have much to do with the algorithm.

-- glen
 
