Searching for byte string in a binary file.

  • Thread starter Walter Dnes (delete the 'z' to get my real address
  • Start date
W

Walter Dnes (delete the 'z' to get my real address

I took a C course some time ago, but I'm only now beginning to use it,
for a personal pet project. My current stumbling-block is finding an
efficient way to find a match between the beginning of a "counted"
string and data in a binary file. Given...

#include <stdio.h>

int main(int argc, char *argv[])
{
char bstring[255];
int bstring_length, match_length;
long match_location;
FILE *input_file, *output_file;

if(argc != 3){
printf(" Correct usage:\n %s input_filename output_filename\n", argv[0]);
return 1;
}
if((input_file = fopen(argv[1], "rb")) == NULL){
printf("Error opening %s for input\n", argv[1]);
return 1;
}
if((output_file = fopen(argv[2], "wb")) == NULL){
printf("Error opening %s for output\n", argv[2]);
return 1;
}
...

Later on, bstring and bstring_length get initialized. bstring may
contain \0's, so the "logical length" (which will not exceed 255) is
actually stored in bstring_length. What I'm trying to do is to find the
location and length of the longest match from the left side of bstring.
If that's not clear, here's an example...

bstring[0] = 'a';
bstring[1] = '\0';
bstring[2] = 'b';
bstring[3] = 'c';
bstring[4] = 'd';
bstring_length = 5;

Assume that the sequence is not matched anywhere in the file. However,
assume that the sequence 'a', '\0', 'b', 'c' does exist in the file.
What's the most efficient search that returns match_length ( 4 ) and
ftell() of the beginning of the match? If it's a function, should I
declare in main() like so?

int bstring_length, *match_length;
long *match_location;

I assume that the function prototype would be something like...

int findmatch( bstring[255], bstring_length, match_length, match_location);

By passing pointers, I assume that both match_location and
match_length will be modified and available to main(). Or is my
understanding of pointers lacking. findmatch() should return 0 if a
match is found, and 1 if no match. Would memory-mapped files help here?
 
J

Jens.Toerring

Walter Dnes (delete the 'z' to get my real address) said:
I took a C course some time ago, but I'm only now beginning to use it,
for a personal pet project. My current stumbling-block is finding an
efficient way to find a match between the beginning of a "counted"
string and data in a binary file. Given...
#include <stdio.h>

int main(int argc, char *argv[])
{
char bstring[255];
int bstring_length, match_length;
long match_location;
FILE *input_file, *output_file;
if(argc != 3){
printf(" Correct usage:\n %s input_filename output_filename\n", argv[0]);
return 1;

Make that

return EXIT_FAILURE;

not all OSes return 1 for failure and 0 for success, and with the macros
EXIT_FAILURE and EXIT_SUCCESS, defined in <stdlib.h>, you're always on
the safe side.

Another common convention is to print error messages to stderr and not
mix them with the normal output of your program. So better use

fprintf( stderr, "Correct usage:\n %s input_filename output_filename\n",
argv[0]);
}
if((input_file = fopen(argv[1], "rb")) == NULL){
printf("Error opening %s for input\n", argv[1]);
return 1;
}
if((output_file = fopen(argv[2], "wb")) == NULL){
printf("Error opening %s for output\n", argv[2]);
return 1;
}
...
Later on, bstring and bstring_length get initialized. bstring may
contain \0's, so the "logical length" (which will not exceed 255) is
actually stored in bstring_length. What I'm trying to do is to find the
location and length of the longest match from the left side of bstring.
If that's not clear, here's an example...
bstring[0] = 'a';
bstring[1] = '\0';
bstring[2] = 'b';
bstring[3] = 'c';
bstring[4] = 'd';
bstring_length = 5;
Assume that the sequence is not matched anywhere in the file. However,
assume that the sequence 'a', '\0', 'b', 'c' does exist in the file.
What's the most efficient search that returns match_length ( 4 ) and
ftell() of the beginning of the match? If it's a function, should I
declare in main() like so?
int bstring_length, *match_length;
long *match_location;
I assume that the function prototype would be something like...
int findmatch( bstring[255], bstring_length, match_length, match_location);

Sorry, that's not a prototype.
By passing pointers, I assume that both match_location and
match_length will be modified and available to main(). Or is my
understanding of pointers lacking. findmatch() should return 0 if a
match is found, and 1 if no match.

When you want to set 'match_location' within find_match() to a long
value that's visible in main() you need to change that a bit. You
must define 'match_location' in main() as long, i.e. have there

long match_location;

and then pass a pointer to that variable to the function. So your
prototype for the function would look like

int find_match( char buf[ 255 ], int blen, int mlen, long *loc );

(I intentionally changed the names to be used within the function
from the ones you defined in main() to make it more obvious what
belongs to main() and what belongs to find_match().) If you don't
intend to change the char array within the function it probably
would be better to make the first argument "const char buf[ 255 ]".

You now would call that function like

find_match( bstring, bstring_length, match_len, &match_location );

(please note the '&' in front of 'match_location', it tells that
you pass the address of 'match_location' and not its value).
Within find_match() you would then assign the position of the match
to '*loc', i.e. you write it into the memory location 'loc' points
to.

By the way, using names like 'bstring' and 'bstring_length' is a bit
misleading since 'bstring' isn't a string (a real string can't have
embedded '\0' characters), it's just an array of chars. So it might
be prudent to give it a name that doesn't make people assume that
it's going to be a string.
Would memory-mapped files help here?

The rest of your questions on how to do the matching in the fastest,
most effective way are not really C questions but more about what
kind of algorithm to use. That's better discussed in groups like
comp.programming since it really doesn't is is about C. And a google
search for e.g. "Boyer-Moore algorithm" (just to name one) will
probably come up with a lot of interesting links (and show you
how many possible algorithms have been developed for that problem
over the years. E.g.

http://www-igm.univ-mlv.fr/~lecroq/string/

looks rather interesting). Finally, questions about memory-mapping
files are off-topic here because that can only be done with system-
specific extensions to C.
Regards, Jens
 
W

Walter Dnes (delete the 'z' to get my real address


Thanks. That appears to be exactly what I was looking for.
Make that

return EXIT_FAILURE;

not all OSes return 1 for failure and 0 for success, and with the macros
EXIT_FAILURE and EXIT_SUCCESS, defined in <stdlib.h>, you're always on
the safe side.

Another common convention is to print error messages to stderr and not
mix them with the normal output of your program. So better use

fprintf( stderr, "Correct usage:\n %s input_filename output_filename\n",
argv[0]);

Thanks for the tips. Note that I've changed the subject. Another
question; is this the correct way to define TRUE/FALSE?

const int TRUE = (1==1);
const int FALSE = (1!=1);
 
B

Ben Pfaff

Walter Dnes (delete the 'z' to get my real address) said:
Thanks for the tips. Note that I've changed the subject. Another
question; is this the correct way to define TRUE/FALSE?

const int TRUE = (1==1);
const int FALSE = (1!=1);

It is correct, but indicates a lack of understanding. Read the
FAQ. In addition to the FAQ commentary on this issue, it should
be noted that the value of variables, even those defined as
constant, cannot be used in constant expressions.

9.1: What is the right type to use for Boolean values in C? Why
isn't it a standard type? Should I use #defines or enums for
the true and false values?

A: C does not provide a standard Boolean type, in part because
picking one involves a space/time tradeoff which can best be
decided by the programmer. (Using an int may be faster, while
using char may save data space. Smaller types may make the
generated code bigger or slower, though, if they require lots of
conversions to and from int.)

The choice between #defines and enumeration constants for the
true/false values is arbitrary and not terribly interesting (see
also questions 2.22 and 17.10). Use any of

#define TRUE 1 #define YES 1
#define FALSE 0 #define NO 0

enum bool {false, true}; enum bool {no, yes};

or use raw 1 and 0, as long as you are consistent within one
program or project. (An enumeration may be preferable if your
debugger shows the names of enumeration constants when examining
variables.)

Some people prefer variants like

#define TRUE (1==1)
#define FALSE (!TRUE)

or define "helper" macros such as

#define Istrue(e) ((e) != 0)

These don't buy anything (see question 9.2 below; see also
questions 5.12 and 10.2).

9.2: Isn't #defining TRUE to be 1 dangerous, since any nonzero value
is considered "true" in C? What if a built-in logical or
relational operator "returns" something other than 1?

A: It is true (sic) that any nonzero value is considered true in C,
but this applies only "on input", i.e. where a Boolean value is
expected. When a Boolean value is generated by a built-in
operator, it is guaranteed to be 1 or 0. Therefore, the test

if((a == b) == TRUE)

would work as expected (as long as TRUE is 1), but it is
obviously silly. In fact, explicit tests against TRUE and
FALSE are generally inappropriate, because some library
functions (notably isupper(), isalpha(), etc.) return,
on success, a nonzero value which is not necessarily 1.
(Besides, if you believe that "if((a == b) == TRUE)" is
an improvement over "if(a == b)", why stop there? Why not
use "if(((a == b) == TRUE) == TRUE)"?) A good rule of thumb
is to use TRUE and FALSE (or the like) only for assignment
to a Boolean variable or function parameter, or as the return
value from a Boolean function, but never in a comparison.

The preprocessor macros TRUE and FALSE (and, of course, NULL)
are used for code readability, not because the underlying values
might ever change. (See also questions 5.3 and 5.10.)

Although the use of macros like TRUE and FALSE (or YES
and NO) seems clearer, Boolean values and definitions can
be sufficiently confusing in C that some programmers feel
that TRUE and FALSE macros only compound the confusion, and
prefer to use raw 1 and 0 instead. (See also question 5.9.)

References: K&R1 Sec. 2.6 p. 39, Sec. 2.7 p. 41; K&R2 Sec. 2.6
p. 42, Sec. 2.7 p. 44, Sec. A7.4.7 p. 204, Sec. A7.9 p. 206; ISO
Sec. 6.3.3.3, Sec. 6.3.8, Sec. 6.3.9, Sec. 6.3.13, Sec. 6.3.14,
Sec. 6.3.15, Sec. 6.6.4.1, Sec. 6.6.5; H&S Sec. 7.5.4 pp. 196-7,
Sec. 7.6.4 pp. 207-8, Sec. 7.6.5 pp. 208-9, Sec. 7.7 pp. 217-8,
Sec. 7.8 pp. 218-9, Sec. 8.5 pp. 238-9, Sec. 8.6 pp. 241-4;
"What the Tortoise Said to Achilles".
 
C

Chris Torek

"Walter Dnes (delete the 'z' to get my real address)"
It is correct, but indicates a lack of understanding. Read the
FAQ. In addition to the FAQ commentary on this issue, it should
be noted that the value of variables, even those defined as
constant, cannot be used in constant expressions.

I think it is also worth adding that C99 *does* now have a built-in
boolean type. The spelling of this type (and its values) is
deliberately sneaky, and you are supposed to "#include <stdbool.h>"
to make the names "bool", "true", and "false" available.
 
J

John L

Chris Torek said:
I think it is also worth adding that C99 *does* now have a built-in
boolean type. The spelling of this type (and its values) is
deliberately sneaky, and you are supposed to "#include <stdbool.h>"
to make the names "bool", "true", and "false" available.

As a matter of interest, what was the motivation for this?
 
K

Keith Thompson

John L said:
As a matter of interest, what was the motivation for this?

Are you asking about the motivation for adding the type, or for
requiring the use of <stdbool.h> to make the names visible?

The motivation for adding the type seems obvious to me; if it's not
obvious to you, ask again.

The reason for adding <stdbool.h> is to avoid breaking existing code.
"bool", "true", and "false" are the natural names for the boolean type
and its values, but making them keywords would have broken any code
that declared them as identifiers. Making them macros that don't
appear without a "#include <stdbool.h>" avoids this problem (existing
code is unlikely to use a header by that name).
 
D

Dan Pop

In said:
Are you asking about the motivation for adding the type, or for
requiring the use of <stdbool.h> to make the names visible?

The motivation for adding the type seems obvious to me; if it's not
obvious to you, ask again.

The motivation for adding the type is not obvious to me. I've coded a lot
in C without needing such a type. A simple naming convention for flag
variables was more than enough. No need for the TRUE/FALSE nonsense,
either.
The reason for adding <stdbool.h> is to avoid breaking existing code.

This is obvious to me. I wish *all* the C99 additions were made with
such concern in mind...

Dan
 
B

Ben Pfaff

[regarding bool]
Are you asking about the motivation for adding the type, or for
requiring the use of <stdbool.h> to make the names visible?

The motivation for adding the type seems obvious to me; if it's not
obvious to you, ask again.

The motivation for adding the type is not obvious to me. I've coded a lot
in C without needing such a type. A simple naming convention for flag
variables was more than enough. No need for the TRUE/FALSE nonsense,
either.

This is a case where the motivation could only be non-obvious to
an expert. Regardless of whether it is useful, many, many
libraries define their own constants for truth and falsity, and
many of these conflict with each other. Adding standard
constants should help to reduce these conflicts.
 
W

Walter Dnes (delete the 'z' to get my real address

I think it is also worth adding that C99 *does* now have a built-in
boolean type. The spelling of this type (and its values) is
deliberately sneaky, and you are supposed to "#include <stdbool.h>"
to make the names "bool", "true", and "false" available.

Thank you very much for that pointer...

[/usr] find / -name stdbool*
/usr/lib/gcc-lib/i386-linux/3.0.4/include/stdbool.h
/usr/lib/gcc-lib/i386-linux/2.95.4/include/stdbool.h

[23:28:42][/home/waltdnes] gcc --version
2.95.4

OK, my system defaults to version 2.95.4, so...

[/home/waltdnes] cat /usr/lib/gcc-lib/i386-linux/2.95.4/include/stdbool.h

/* stdbool.h for GNU. */
#ifndef __STDBOOL_H__
#define __STDBOOL_H__ 1

/* The type `bool' must promote to `int' or `unsigned int'. The
* constants
`true' and `false' must have the value 0 and 1 respectively. */
typedef enum
{
false = 0,
true = 1
} bool;

/* The names `true' and `false' must also be made available as macros.
* */
#define false false
#define true true

/* Signal that all the definitions are present. */
#define __bool_true_false_are_defined 1

#endif /* stdbool.h */

I'll simply "#include <stdbool.h>" and be done with it.
 
D

Dan Pop

In said:
[email protected] (Dan Pop) said:
[regarding bool]
Are you asking about the motivation for adding the type, or for
requiring the use of <stdbool.h> to make the names visible?

The motivation for adding the type seems obvious to me; if it's not
obvious to you, ask again.

The motivation for adding the type is not obvious to me. I've coded a lot
in C without needing such a type. A simple naming convention for flag
variables was more than enough. No need for the TRUE/FALSE nonsense,
either.

This is a case where the motivation could only be non-obvious to
an expert. Regardless of whether it is useful, many, many
libraries define their own constants for truth and falsity, and
many of these conflict with each other. Adding standard
constants should help to reduce these conflicts.

How? Do these many badly designed libraries magically get fixed by
the addition of _Bool? All this addition achieves is *unnecessarily*
complicating the C type system.

Dan
 
O

Old Wolf

The motivation for adding the type is not obvious to me. I've coded a lot
in C without needing such a type. A simple naming convention for flag
variables was more than enough. No need for the TRUE/FALSE nonsense,
either.

It's useful to have an implicit integral type -> true/false conversion,
for example:

#define FLAG_BAR 0x10000UL
_Bool is_bar_set(unsigned long flags)
{
return flags & FLAG_BAR;
}

If this function returned a 16-bit or smaller type, then the results
would not be as expected. Obviously you can work around this
(eg. typedef the flags type and return that type, or (test)?1:0, etc.)
but it is more readable & maintainable to properly support a boolean type.
 
B

Ben Pfaff

How? Do these many badly designed libraries magically get fixed by
the addition of _Bool? All this addition achieves is *unnecessarily*
complicating the C type system.

The addition doesn't eliminate any existing conflicts. It does,
however, set a precedent that new libraries are likely to follow,
whether they require a C99 implementation or not. In the long
term, this should reduce the problem.
 
D

Dan Pop

In said:
The addition doesn't eliminate any existing conflicts. It does,
however, set a precedent that new libraries are likely to follow,
whether they require a C99 implementation or not. In the long
term, this should reduce the problem.

To an insignificant degree, as the harm has already been done. The
standard merely encourages a brain dead coding practice. That would have
been OK with the me if the addition of _Bool had no cost in terms of
complexity of the language (like, e.g. making %lf well defined for printf)
but this is not the case.

People who don't use C99 don't care about C99 and they will happily keep
using their own ad hoc solutions for a problem that doesn't exist in the
first place.

Dan
 
D

Dan Pop

In said:
It's useful to have an implicit integral type -> true/false conversion,

The conversion is far less useful than many people believe and C89 fully
supports it, via the !! "operator". _Bool is really a solution in search
of a problem.

Dan
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,764
Messages
2,569,565
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top