Search a binary file for a string... again! (its to slow)

Discussion in 'C Programming' started by spike, Feb 28, 2004.

  1. spike

    spike Guest

    Im writing a program to search for a string in a binary file.
    And it works. The problem is: It is sooo slow! how can i make it faster?
    It takes 27 seconds just to search a 5 meg file.
    I guess it has something to do with the strequal() function...

    Btw, thanks to all of you who answered last time!

    code:
    -------------------------------------------------------------------------
    #include <stdio.h>
    #define MAXLEN 50
    #define FALSE 0
    #define TRUE !FALSE
    #define STRSIZE 4

    int strequal(char sText[],char sBuff[])
    {
    int iRetval = TRUE;
    int i;
    for(i=0;i<STRSIZE;i++)
    {
    if(sText == sBuff)
    {
    iRetval = iRetval && TRUE;
    }
    else
    {
    iRetval = iRetval && FALSE;
    }
    }
    return iRetval;
    }

    int main()
    {
    FILE *fp;
    char sText[STRSIZE] = "name";
    char sBuff[STRSIZE];
    unsigned long pos;

    if((fp = fopen("demo.dem","rb")) != NULL)
    {
    while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >= (sizeof(sText)))
    {
    if(strequal(sText,sBuff))
    {
    printf("a match!\n");
    }
    fgetpos(fp, &pos);
    pos = pos-(STRSIZE-1);
    fsetpos(fp, &pos);
    }
    fclose(fp);
    }
    else
    {
    printf("Could not open file");
    }
    return 0;
    }
    -------------------------------------------------------------------------
    spike, Feb 28, 2004
    #1
    1. Advertising

  2. spike

    Joerg Schoen Guest

    spike wrote:

    > Im writing a program to search for a string in a binary file.
    > And it works. The problem is: It is sooo slow! how can i make it faster?
    > It takes 27 seconds just to search a 5 meg file.
    > I guess it has something to do with the strequal() function...


    First of all: Your strequal function should be named differently, since
    the standard reserves any name with "str..." for itself. Secondly, it
    is basically a slow implementation of "strcmp"

    > int strequal(char sText[],char sBuff[])
    > {
    > int iRetval = TRUE;
    > int i;
    > for(i=0;i<STRSIZE;i++)
    > {
    > if(sText == sBuff)
    > {
    > iRetval = iRetval && TRUE;
    > }
    > else
    > {
    > iRetval = iRetval && FALSE;
    > }
    > }
    > return iRetval;
    > }


    You should better code it like something like

    int strequal(char sText[],char sBuff[])
    {
    int i;
    for(i=0;i<STRSIZE;i++)
    if(sText != sBuff)
    return FALSE;

    return TRUE;
    }

    > while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >=

    ....
    > fgetpos(fp, &pos);
    > pos = pos-(STRSIZE-1);
    > fsetpos(fp, &pos);


    That is what really kills you! Read 4 bytes, check them, rewind 3
    bytes and start again.

    Find something better here!
    Joerg Schoen, Feb 28, 2004
    #2
    1. Advertising

  3. spike

    Mike Wahler Guest

    "spike" <> wrote in message
    news:...
    > Im writing a program to search for a string in a binary file.
    > And it works. The problem is: It is sooo slow! how can i make it faster?
    > It takes 27 seconds just to search a 5 meg file.
    > I guess it has something to do with the strequal() function...


    Yup, it's just a *guess*. See below.

    > Btw, thanks to all of you who answered last time!
    >
    > code:
    > -------------------------------------------------------------------------
    > #include <stdio.h>
    > #define MAXLEN 50
    > #define FALSE 0
    > #define TRUE !FALSE
    > #define STRSIZE 4
    >
    > int strequal(char sText[],char sBuff[])
    > {
    > int iRetval = TRUE;
    > int i;
    > for(i=0;i<STRSIZE;i++)
    > {
    > if(sText == sBuff)
    > {
    > iRetval = iRetval && TRUE;
    > }
    > else
    > {
    > iRetval = iRetval && FALSE;
    > }
    > }
    > return iRetval;
    > }


    Some guesses of my own:

    Instead of 'rolling your own' function like that, have
    you considered trying the 'memcmp()' function from the
    standard library? It's likely optimized for your
    platform.

    But note that the language doesn't specify any particular
    'performance', only behavior.

    Another option to look into (in case it's the actual i/o
    that's the bottleneck and not the memory parsing) is to
    see if your implementation offers any 'extension' functions
    which would probably have more 'intimate' knowledge of your
    file system and exploit it.

    But really, all this is just guessing, I suggest you use
    a profiler to find out where the bottleneck *really* is.
    I've found that it's often not where one's 'logic' says
    it should be.

    -Mike
    Mike Wahler, Feb 28, 2004
    #3
  4. "spike" <> wrote in message
    news:...
    > Im writing a program to search for a string in a binary file.
    > And it works. The problem is: It is sooo slow! how can i make it faster?
    > It takes 27 seconds just to search a 5 meg file.
    > I guess it has something to do with the strequal() function...
    >
    > Btw, thanks to all of you who answered last time!
    >
    > code:
    > -------------------------------------------------------------------------
    > #include <stdio.h>
    > #define MAXLEN 50
    > #define FALSE 0
    > #define TRUE !FALSE
    > #define STRSIZE 4
    >
    > int strequal(char sText[],char sBuff[])
    > {
    > int iRetval = TRUE;
    > int i;
    > for(i=0;i<STRSIZE;i++)
    > {
    > if(sText == sBuff)
    > {
    > iRetval = iRetval && TRUE;
    > }
    > else
    > {
    > iRetval = iRetval && FALSE;
    > }
    > }
    > return iRetval;
    > }
    >
    > int main()
    > {
    > FILE *fp;
    > char sText[STRSIZE] = "name";
    > char sBuff[STRSIZE];
    > unsigned long pos;
    >
    > if((fp = fopen("demo.dem","rb")) != NULL)
    > {
    > while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >=

    (sizeof(sText)))
    > {
    > if(strequal(sText,sBuff))
    > {
    > printf("a match!\n");
    > }
    > fgetpos(fp, &pos);
    > pos = pos-(STRSIZE-1);
    > fsetpos(fp, &pos);
    > }
    > fclose(fp);
    > }
    > else
    > {
    > printf("Could not open file");
    > }
    > return 0;
    > }
    > -------------------------------------------------------------------------


    Have a bigger buffer than 4 chars! Use at least 4k (or more really - spash
    out 32k). That should make things much better ( less function calls and io
    requests ) - and do not use fsetpos/fgetpos. Just write a looping function
    first that reads in chunks of data -- after this, write a function that goes
    through a buffer a character at a time, and keep track how much of the
    search sting you have found so far.

    Other than that consider using an algorithm such as Boyer-Moore
    Spacen Jasset, Feb 28, 2004
    #4
  5. spike

    Malcolm Guest

    "spike" <> wrote in message
    > Im writing a program to search for a string in a binary file.
    > And it works. The problem is: It is sooo slow! how can i make it
    > faster?
    > It takes 27 seconds just to search a 5 meg file.
    > I guess it has something to do with the strequal() function...
    >
    > -------------------------------------------------------------------------
    > #include <stdio.h>
    > #define MAXLEN 50
    > #define FALSE 0
    > #define TRUE !FALSE
    > #define STRSIZE 4
    >
    > int strequal(char sText[],char sBuff[])
    > {
    > int iRetval = TRUE;
    > int i;
    > for(i=0;i<STRSIZE;i++)
    > {
    > if(sText == sBuff)
    > {
    > iRetval = iRetval && TRUE;
    > }
    > else
    > {
    > iRetval = iRetval && FALSE;
    > }
    > }
    > return iRetval;
    > }
    >

    You can optimise this by writing it

    int mystrncmp(const char *s1, const char *s2, int n)
    {
    int i;

    for(i=0;i<n;i++)
    if(s1 != s2)
    return s2 - s1;
    return 0;
    }
    This will save some cycles. You can also replace with the stdlib function
    strncmp(), which may be even faster as coded in assembly.
    >
    > int main()
    > {
    > FILE *fp;
    > char sText[STRSIZE] = "name";
    > char sBuff[STRSIZE];
    > unsigned long pos;
    >
    > if((fp = fopen("demo.dem","rb")) != NULL)
    > {
    > while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >=

    (sizeof(sText)))
    >

    You might save a bit of time by simply calling fgetc(), which is designed to
    return single characters.
    >
    > {
    >
    >

    Most of the time the first character in the buffer won't match the first
    character in the string. So you can avoid the overhead of the function call
    by simply rejecting it.
    > if(strequal(sText,sBuff))
    > {
    > printf("a match!\n");
    > }
    > fgetpos(fp, &pos);
    > pos = pos-(STRSIZE-1);
    > fsetpos(fp, &pos);
    >

    You are doing this on each character read. This will slow you down.
    >
    > }
    > fclose(fp);
    > }
    > else
    > {
    > printf("Could not open file");
    > }
    > return 0;
    > }
    >

    The way to speed up is to have a more sophisticated buffer.

    Read characters until you come up with a match in the first position.
    Then read enough characters for the string into the buffer.
    If you have a match, you have a match.

    Otherwise, search the buffer for the next matching character to the first
    letter, shift left, and read more in. If the alphabet is large and the
    string quite short you should frequently flush the buffer, in which case you
    go back to reading characters one at a time.

    Shifting characters is expensive, but a circular buffer involves the modulus
    operation, which is also expensive. To get even more speed, you can declare
    an oversize buffer, as large as you like, and keep a travelling pointer to
    the start position. Only when you hit the end do you need to move some
    characters to the beginning and reset the start pointer. (If the buffer
    empties then of course you go back to the initial start position for free).
    Malcolm, Feb 28, 2004
    #5
  6. Malcolm wrote:
    > "spike" <> wrote in message
    >
    >>Im writing a program to search for a string in a binary file.
    >>And it works. The problem is: It is sooo slow! how can i make it
    >>faster?
    >>It takes 27 seconds just to search a 5 meg file.
    >>I guess it has something to do with the strequal() function...
    >>
    >>-------------------------------------------------------------------------
    >>#include <stdio.h>
    >>#define MAXLEN 50
    >>#define FALSE 0
    >>#define TRUE !FALSE
    >>#define STRSIZE 4


    Remember the above define! We'll come back to it
    later.

    >>
    >>int strequal(char sText[],char sBuff[])
    >>{
    >> int iRetval = TRUE;
    >> int i;
    >> for(i=0;i<STRSIZE;i++)
    >> {
    >> if(sText == sBuff)
    >> {
    >> iRetval = iRetval && TRUE;
    >> }
    >> else
    >> {
    >> iRetval = iRetval && FALSE;
    >> }
    >> }
    >> return iRetval;
    >>}
    >>

    >
    > You can optimise this by writing it
    >
    > int mystrncmp(const char *s1, const char *s2, int n)
    > {
    > int i;
    >
    > for(i=0;i<n;i++)
    > if(s1 != s2)
    > return s2 - s1;
    > return 0;
    > }
    > This will save some cycles. You can also replace with the stdlib function
    > strncmp(), which may be even faster as coded in assembly.
    >
    >>int main()
    >>{
    >> FILE *fp;
    >> char sText[STRSIZE] = "name";


    Latent bug in the code here. STRSIZE was defined
    as 4. "name" is a 5 character string including
    the NULL byte '/0' at the end. Therefore a NULL was
    written *somewhere* (implementation specific).
    It is very possible (but not guaranteed) that
    the compiler allocated storage for sText and
    sBuff right after each other.
    If that was the case, the NULL byte
    was written into the first character of sBuff.
    Later on, when you read data into sBuff, that
    NULL byte would be overwritten with some character
    from the file and sText would no longer be
    a null-terminated string. This would make
    for some "interesting" behavior and probably
    prevent you from using any of the standard library
    functions.

    Take the advice of the other posters about
    using standard library functions and larger buffer
    sizes, too.



    >> char sBuff[STRSIZE];
    >> unsigned long pos;
    >>
    >> if((fp = fopen("demo.dem","rb")) != NULL)
    >> {
    >> while ((fread(&sBuff, sizeof(char), sizeof(sText), fp)) >=

    >
    > (sizeof(sText)))
    >
    > You might save a bit of time by simply calling fgetc(), which is designed to
    > return single characters.
    >
    >> {
    >>
    >>

    >
    > Most of the time the first character in the buffer won't match the first
    > character in the string. So you can avoid the overhead of the function call
    > by simply rejecting it.
    >
    >> if(strequal(sText,sBuff))
    >> {
    >> printf("a match!\n");
    >> }
    >> fgetpos(fp, &pos);
    >> pos = pos-(STRSIZE-1);
    >> fsetpos(fp, &pos);
    >>

    >
    > You are doing this on each character read. This will slow you down.


    The comment was absolutely correct. (Sorry, I don't
    recall who made it.) You are (conceptually) reading
    4 characters, then taking 3 steps back, and reading
    4 characters again, 3 of which you have already read
    once. While this works, your question was about
    speed, and this sequence above looks like a real
    pig.

    >
    >> }
    >> fclose(fp);
    >> }
    >> else
    >> {
    >> printf("Could not open file");
    >> }
    >> return 0;
    >>}
    >>

    >
    > The way to speed up is to have a more sophisticated buffer.
    >
    > Read characters until you come up with a match in the first position.
    > Then read enough characters for the string into the buffer.
    > If you have a match, you have a match.
    >
    > Otherwise, search the buffer for the next matching character to the first
    > letter, shift left, and read more in. If the alphabet is large and the
    > string quite short you should frequently flush the buffer, in which case you
    > go back to reading characters one at a time.
    >
    > Shifting characters is expensive, but a circular buffer involves the modulus
    > operation, which is also expensive. To get even more speed, you can declare
    > an oversize buffer, as large as you like, and keep a travelling pointer to
    > the start position. Only when you hit the end do you need to move some
    > characters to the beginning and reset the start pointer. (If the buffer
    > empties then of course you go back to the initial start position for free).
    >
    >


    --
    Ñ
    "It is impossible to make anything foolproof because fools are so
    ingenious" - A. Bloch
    Nick Landsberg, Feb 28, 2004
    #6
  7. spike

    CBFalconer Guest

    Joerg Schoen wrote:
    > spike wrote:
    >
    > > Im writing a program to search for a string in a binary file.
    > > And it works. The problem is: It is sooo slow! how can i make
    > > it faster? It takes 27 seconds just to search a 5 meg file.
    > > I guess it has something to do with the strequal() function...

    >
    > First of all: Your strequal function should be named differently,
    > since the standard reserves any name with "str..." for itself.
    > Secondly, it is basically a slow implementation of "strcmp"
    >

    .... snip ...
    >
    > > fgetpos(fp, &pos);
    > > pos = pos-(STRSIZE-1);
    > > fsetpos(fp, &pos);

    >
    > That is what really kills you! Read 4 bytes, check them, rewind 3
    > bytes and start again.
    >
    > Find something better here!


    Here's a revision to what I published to the OP's original
    request. He must have not seen that. Some slight changes have
    been made to ensure that we can have binary file access, which is
    not necessarily possible via stdin. This never backtracks.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>
    #include <assert.h>

    /* The difference between a binary and a text file, on read,
    is the conversion of end-of-line delimiters. What those
    delimiters are does not affect the action. There is also
    a potential problem with 0x1a EOF markers in DOS/Windoze.

    This is a version of Knuth-Morris-Pratt algorithm. The
    point of using this is to avoid any backtracking in file
    reading, and thus avoiding any use of buffer arrays.
    */

    size_t chrcount; /* debuggery, count of input chars, zeroed */

    /* --------------------- */

    /* Almost straight out of Sedgewick */
    /* The next array indicates what index in id should next be
    compared to the current char. Once the (lgh - 1)th char
    has been successfully compared, the id has been found.
    The array is formed by comparing id to itself. */
    void initnext(int *next, const char *id, int lgh)
    {
    int i, j;

    assert(lgh > 0);
    next[0] = -1; i = 0; j = -1;
    while (i < lgh) {
    while ((j >= 0) && (id != id[j])) j = next[j];
    i++; j++;
    next = j;
    }
    #if (0)
    for (i = 0; i < lgh; i++)
    printf("id[%d] = '%c' next[%d] = %d\n",
    i, id, i, next);
    #endif
    } /* initnext */

    /* --------------------- */

    /* reads f without rewinding until either EOF or *marker
    has been found. Returns EOF if not found. At exit the
    last matching char has been read, and no further. */
    int kmpffind(const char *marker, int lgh, int *next, FILE *f)
    {
    int j; /* char position in marker to check */
    int ch; /* current char */

    assert(lgh > 0);
    j = 0;
    while ((j < lgh) && (EOF != (ch = getc(f)))) {
    chrcount++;
    while ((j >= 0) && (ch != marker[j])) j = next[j];
    j++;
    }
    return ch;
    }

    /* --------------------- */

    int binfsrch(const char *marker, FILE *f)
    {
    int *next;
    int lgh;
    int ch;
    int items; /* count of markers found */

    if (!(next = malloc(strlen(marker) * sizeof *next))) {
    puts("No memory");
    exit(EXIT_FAILURE);
    }
    else {
    lgh = strlen(marker);
    initnext(next, marker, lgh);
    items = 0;
    while (EOF != kmpffind(marker, lgh, next, f)) {
    /* found it, dump the following printable chars */
    items++;
    printf("%d %s : \"", items, marker);
    while (isprint(ch = getc(f))) {
    chrcount++;
    putchar(ch);
    }
    puts("\"");
    if (EOF == ch) break;
    else chrcount++;
    }
    return items;
    }
    } /* binfsrch */

    /* --------------------- */

    int main(int argc, char **argv)
    {
    FILE *f;

    f = stdin;
    if (3 == argc) {
    if (!(f = fopen(argv[2], "rb"))) {
    printf("Can't open %s\n", argv[2]);
    exit(EXIT_FAILURE);
    }
    argc--;
    }
    if (2 != argc) {
    puts("Usage: binfsrch name [binaryfile]");
    puts(" (file defaults to stdin text mode)");
    }
    else if (binfsrch(argv[1], f)) {
    printf("\"%s\" : found\n", argv[1]);
    }
    else printf("\"%s\" : not found\n", argv[1]);
    printf("%lu chars\n", (unsigned long)chrcount);
    return 0;
    } /* main binfsrch */

    --
    Chuck F () ()
    Available for consulting/temporary embedded and systems.
    <http://cbfalconer.home.att.net> USE worldnet address!
    CBFalconer, Feb 28, 2004
    #7
  8. On 28 Feb 2004 12:22:33 -0800, (spike) wrote:

    >Im writing a program to search for a string in a binary file.
    >And it works. The problem is: It is sooo slow! how can i make it faster?
    >It takes 27 seconds just to search a 5 meg file.
    >I guess it has something to do with the strequal() function...
    >
    >Btw, thanks to all of you who answered last time!
    >
    >code:
    >-------------------------------------------------------------------------
    >#include <stdio.h>
    >#define MAXLEN 50
    >#define FALSE 0
    >#define TRUE !FALSE
    >#define STRSIZE 4
    >
    >int strequal(char sText[],char sBuff[])
    >{
    > int iRetval = TRUE;
    > int i;
    > for(i=0;i<STRSIZE;i++)
    > {
    > if(sText == sBuff)
    > {
    > iRetval = iRetval && TRUE;


    I don't know if this contributes to your timing problem but this can
    never change the value of iRetval and is effectively a no-op.

    > }
    > else
    > {
    > iRetval = iRetval && FALSE;


    Similarly, this can only and will always set iRetval to FALSE.
    Furthermore, once iRetval is FALSE, it can never become TRUE so there
    is no point in continuing the loop.

    > }
    > }
    > return iRetval;
    >}
    >

    snip


    <<Remove the del for email>>
    Barry Schwarz, Feb 28, 2004
    #8
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Apple
    Replies:
    3
    Views:
    299
    Apple
    Aug 1, 2005
  2. Bogdan

    Binary tree search vs Binary search

    Bogdan, Oct 18, 2010, in forum: C Programming
    Replies:
    22
    Views:
    3,059
    Michael Angelo Ravera
    Oct 21, 2010
  3. thunk
    Replies:
    1
    Views:
    303
    thunk
    Mar 30, 2010
  4. thunk
    Replies:
    0
    Views:
    469
    thunk
    Apr 1, 2010
  5. thunk
    Replies:
    14
    Views:
    612
    thunk
    Apr 3, 2010
Loading...

Share This Page