Search a binary file for a string... again! (its to slow)

S

spike

Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

Btw, thanks to all of you who answered last time!

code:
-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText == sBuff)
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}

int main()
{
FILE *fp;
char sText[STRSIZE] = "name";
char sBuff[STRSIZE];
unsigned long pos;

if((fp = fopen("demo.dem","rb")) != NULL)
{
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= (sizeof(sText)))
{
if(strequal(sText,sBuff))
{
printf("a match!\n");
}
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);
}
fclose(fp);
}
else
{
printf("Could not open file");
}
return 0;
}
-------------------------------------------------------------------------
 
J

Joerg Schoen

spike said:
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

First of all: Your strequal function should be named differently, since
the standard reserves any name with "str..." for itself. Secondly, it
is basically a slow implementation of "strcmp"
int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText == sBuff)
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}


You should better code it like something like

int strequal(char sText[],char sBuff[])
{
int i;
for(i=0;i<STRSIZE;i++)
if(sText != sBuff)
return FALSE;

return TRUE;
}
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= ....
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);

That is what really kills you! Read 4 bytes, check them, rewind 3
bytes and start again.

Find something better here!
 
M

Mike Wahler

spike said:
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

Yup, it's just a *guess*. See below.
Btw, thanks to all of you who answered last time!

code:
-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText == sBuff)
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}


Some guesses of my own:

Instead of 'rolling your own' function like that, have
you considered trying the 'memcmp()' function from the
standard library? It's likely optimized for your
platform.

But note that the language doesn't specify any particular
'performance', only behavior.

Another option to look into (in case it's the actual i/o
that's the bottleneck and not the memory parsing) is to
see if your implementation offers any 'extension' functions
which would probably have more 'intimate' knowledge of your
file system and exploit it.

But really, all this is just guessing, I suggest you use
a profiler to find out where the bottleneck *really* is.
I've found that it's often not where one's 'logic' says
it should be.

-Mike
 
S

Spacen Jasset

spike said:
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

Btw, thanks to all of you who answered last time!

code:
-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText == sBuff)
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}

int main()
{
FILE *fp;
char sText[STRSIZE] = "name";
char sBuff[STRSIZE];
unsigned long pos;

if((fp = fopen("demo.dem","rb")) != NULL)
{
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= (sizeof(sText)))
{
if(strequal(sText,sBuff))
{
printf("a match!\n");
}
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);
}
fclose(fp);
}
else
{
printf("Could not open file");
}
return 0;
}
-------------------------------------------------------------------------


Have a bigger buffer than 4 chars! Use at least 4k (or more really - spash
out 32k). That should make things much better ( less function calls and io
requests ) - and do not use fsetpos/fgetpos. Just write a looping function
first that reads in chunks of data -- after this, write a function that goes
through a buffer a character at a time, and keep track how much of the
search sting you have found so far.

Other than that consider using an algorithm such as Boyer-Moore
 
M

Malcolm

spike said:
Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it
faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText == sBuff)
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}

You can optimise this by writing it

int mystrncmp(const char *s1, const char *s2, int n)
{
int i;

for(i=0;i<n;i++)
if(s1 != s2)
return s2 - s1;
return 0;
}
This will save some cycles. You can also replace with the stdlib function
strncmp(), which may be even faster as coded in assembly.
int main()
{
FILE *fp;
char sText[STRSIZE] = "name";
char sBuff[STRSIZE];
unsigned long pos;

if((fp = fopen("demo.dem","rb")) != NULL)
{
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= (sizeof(sText)))
You might save a bit of time by simply calling fgetc(), which is designed to
return single characters.
Most of the time the first character in the buffer won't match the first
character in the string. So you can avoid the overhead of the function call
by simply rejecting it.
if(strequal(sText,sBuff))
{
printf("a match!\n");
}
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);
You are doing this on each character read. This will slow you down.
}
fclose(fp);
}
else
{
printf("Could not open file");
}
return 0;
}
The way to speed up is to have a more sophisticated buffer.

Read characters until you come up with a match in the first position.
Then read enough characters for the string into the buffer.
If you have a match, you have a match.

Otherwise, search the buffer for the next matching character to the first
letter, shift left, and read more in. If the alphabet is large and the
string quite short you should frequently flush the buffer, in which case you
go back to reading characters one at a time.

Shifting characters is expensive, but a circular buffer involves the modulus
operation, which is also expensive. To get even more speed, you can declare
an oversize buffer, as large as you like, and keep a travelling pointer to
the start position. Only when you hit the end do you need to move some
characters to the beginning and reset the start pointer. (If the buffer
empties then of course you go back to the initial start position for free).
 
N

Nick Landsberg

Remember the above define! We'll come back to it
later.
int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText == sBuff)
{
iRetval = iRetval && TRUE;
}
else
{
iRetval = iRetval && FALSE;
}
}
return iRetval;
}


You can optimise this by writing it

int mystrncmp(const char *s1, const char *s2, int n)
{
int i;

for(i=0;i<n;i++)
if(s1 != s2)
return s2 - s1;
return 0;
}
This will save some cycles. You can also replace with the stdlib function
strncmp(), which may be even faster as coded in assembly.
int main()
{
FILE *fp;
char sText[STRSIZE] = "name";


Latent bug in the code here. STRSIZE was defined
as 4. "name" is a 5 character string including
the NULL byte '/0' at the end. Therefore a NULL was
written *somewhere* (implementation specific).
It is very possible (but not guaranteed) that
the compiler allocated storage for sText and
sBuff right after each other.
If that was the case, the NULL byte
was written into the first character of sBuff.
Later on, when you read data into sBuff, that
NULL byte would be overwritten with some character
from the file and sText would no longer be
a null-terminated string. This would make
for some "interesting" behavior and probably
prevent you from using any of the standard library
functions.

Take the advice of the other posters about
using standard library functions and larger buffer
sizes, too.


char sBuff[STRSIZE];
unsigned long pos;

if((fp = fopen("demo.dem","rb")) != NULL)
{
while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >=

(sizeof(sText)))

You might save a bit of time by simply calling fgetc(), which is designed to
return single characters.

Most of the time the first character in the buffer won't match the first
character in the string. So you can avoid the overhead of the function call
by simply rejecting it.
if(strequal(sText,sBuff))
{
printf("a match!\n");
}
fgetpos(fp, &pos);
pos = pos-(STRSIZE-1);
fsetpos(fp, &pos);

You are doing this on each character read. This will slow you down.

The comment was absolutely correct. (Sorry, I don't
recall who made it.) You are (conceptually) reading
4 characters, then taking 3 steps back, and reading
4 characters again, 3 of which you have already read
once. While this works, your question was about
speed, and this sequence above looks like a real
pig.
 
C

CBFalconer

Joerg said:
First of all: Your strequal function should be named differently,
since the standard reserves any name with "str..." for itself.
Secondly, it is basically a slow implementation of "strcmp"
.... snip ...


That is what really kills you! Read 4 bytes, check them, rewind 3
bytes and start again.

Find something better here!

Here's a revision to what I published to the OP's original
request. He must have not seen that. Some slight changes have
been made to ensure that we can have binary file access, which is
not necessarily possible via stdin. This never backtracks.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <assert.h>

/* The difference between a binary and a text file, on read,
is the conversion of end-of-line delimiters. What those
delimiters are does not affect the action. There is also
a potential problem with 0x1a EOF markers in DOS/Windoze.

This is a version of Knuth-Morris-Pratt algorithm. The
point of using this is to avoid any backtracking in file
reading, and thus avoiding any use of buffer arrays.
*/

size_t chrcount; /* debuggery, count of input chars, zeroed */

/* --------------------- */

/* Almost straight out of Sedgewick */
/* The next array indicates what index in id should next be
compared to the current char. Once the (lgh - 1)th char
has been successfully compared, the id has been found.
The array is formed by comparing id to itself. */
void initnext(int *next, const char *id, int lgh)
{
int i, j;

assert(lgh > 0);
next[0] = -1; i = 0; j = -1;
while (i < lgh) {
while ((j >= 0) && (id != id[j])) j = next[j];
i++; j++;
next = j;
}
#if (0)
for (i = 0; i < lgh; i++)
printf("id[%d] = '%c' next[%d] = %d\n",
i, id, i, next);
#endif
} /* initnext */

/* --------------------- */

/* reads f without rewinding until either EOF or *marker
has been found. Returns EOF if not found. At exit the
last matching char has been read, and no further. */
int kmpffind(const char *marker, int lgh, int *next, FILE *f)
{
int j; /* char position in marker to check */
int ch; /* current char */

assert(lgh > 0);
j = 0;
while ((j < lgh) && (EOF != (ch = getc(f)))) {
chrcount++;
while ((j >= 0) && (ch != marker[j])) j = next[j];
j++;
}
return ch;
}

/* --------------------- */

int binfsrch(const char *marker, FILE *f)
{
int *next;
int lgh;
int ch;
int items; /* count of markers found */

if (!(next = malloc(strlen(marker) * sizeof *next))) {
puts("No memory");
exit(EXIT_FAILURE);
}
else {
lgh = strlen(marker);
initnext(next, marker, lgh);
items = 0;
while (EOF != kmpffind(marker, lgh, next, f)) {
/* found it, dump the following printable chars */
items++;
printf("%d %s : \"", items, marker);
while (isprint(ch = getc(f))) {
chrcount++;
putchar(ch);
}
puts("\"");
if (EOF == ch) break;
else chrcount++;
}
return items;
}
} /* binfsrch */

/* --------------------- */

int main(int argc, char **argv)
{
FILE *f;

f = stdin;
if (3 == argc) {
if (!(f = fopen(argv[2], "rb"))) {
printf("Can't open %s\n", argv[2]);
exit(EXIT_FAILURE);
}
argc--;
}
if (2 != argc) {
puts("Usage: binfsrch name [binaryfile]");
puts(" (file defaults to stdin text mode)");
}
else if (binfsrch(argv[1], f)) {
printf("\"%s\" : found\n", argv[1]);
}
else printf("\"%s\" : not found\n", argv[1]);
printf("%lu chars\n", (unsigned long)chrcount);
return 0;
} /* main binfsrch */
 
B

Barry Schwarz

Im writing a program to search for a string in a binary file.
And it works. The problem is: It is sooo slow! how can i make it faster?
It takes 27 seconds just to search a 5 meg file.
I guess it has something to do with the strequal() function...

Btw, thanks to all of you who answered last time!

code:
-------------------------------------------------------------------------
#include <stdio.h>
#define MAXLEN 50
#define FALSE 0
#define TRUE !FALSE
#define STRSIZE 4

int strequal(char sText[],char sBuff[])
{
int iRetval = TRUE;
int i;
for(i=0;i<STRSIZE;i++)
{
if(sText == sBuff)
{
iRetval = iRetval && TRUE;


I don't know if this contributes to your timing problem but this can
never change the value of iRetval and is effectively a no-op.
}
else
{
iRetval = iRetval && FALSE;

Similarly, this can only and will always set iRetval to FALSE.
Furthermore, once iRetval is FALSE, it can never become TRUE so there
is no point in continuing the loop.
}
}
return iRetval;
}
snip


<<Remove the del for email>>
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,770
Messages
2,569,584
Members
45,077
Latest member
SangMoor21

Latest Threads

Top