Looking at binary files with C

Uno

$ indent -kr hist1.c
$ cc -Wall -Wextra hist1.c -o hist
$ ./hist
Segmentation fault
$ cat hist1.c
#include <stdio.h>
#define SIZE 100000

int main(void)
{
int c;
long counter = 0;
long a[SIZE];
long i, j;
FILE *fp;

fp = fopen("shoulder.wmv", "rb+");

for (i = 0; i < SIZE; ++i) {
a[i] = 0;
}

for (j = 0; j < 7000000; ++j) {
c = fgetc(fp);


if (c != EOF) {

a[counter] = c;
counter++;
} else {
for (i = 0; i < 100; ++i) {
printf("i and frequency is %ld %ld\n", i, a[i]);
}


break;
}
}
printf("Counter reached %ld\n", counter);
fclose(fp);
return 0;
}

// cc -Wall -Wextra hist1.c -o hist
$

So, why the seg fault?
 
Ian Collins

$ indent -kr hist1.c
$ cc -Wall -Wextra hist1.c -o hist
$ ./hist
Segmentation fault
$ cat hist1.c
#include<stdio.h>
#define SIZE 100000

int main(void)
{
int c;
long counter = 0;
long a[SIZE];
long i, j;
FILE *fp;

fp = fopen("shoulder.wmv", "rb+");

Checking for success here would be a good idea.
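
A minimal sketch of such a check, assuming a POSIX-like system where fopen() sets errno on failure (ISO C itself does not promise that). The helper name open_for_reading is hypothetical and does not appear in the posted program.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: open a file for binary reading and report
   the reason on stderr if the open fails. Returns NULL on failure. */
FILE *open_for_reading(const char *filename)
{
    FILE *fp = fopen(filename, "rb");   /* "rb": read-only, binary mode */

    if (fp == NULL) {
        /* Assumption: errno is meaningful here (true on POSIX systems). */
        fprintf(stderr, "cannot open %s: %s\n", filename, strerror(errno));
    }
    return fp;
}
```

The caller still has to test the result, for example `if ((fp = open_for_reading("shoulder.wmv")) == NULL) return EXIT_FAILURE;` with <stdlib.h> included.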
 
Uno

$ indent -kr hist1.c
$ cc -Wall -Wextra hist1.c -o hist
$ ./hist
Segmentation fault
$ cat hist1.c
#include<stdio.h>
#define SIZE 100000

int main(void)
{
int c;
long counter = 0;
long a[SIZE];

If long is 4 bytes, this reserves 400,000 bytes. If 8, then 800,000.

I'm sure I have a gig of memory.
long i, j;
FILE *fp;

fp = fopen("shoulder.wmv", "rb+");

for (i = 0; i< SIZE; ++i) {
a[i] = 0;
}

for (j = 0; j< 7000000; ++j) {
c = fgetc(fp);

if (c != EOF) {

a[counter] = c;


Depending on the size of your file, this could easily exceed the
amount of space reserved for a. When that happens, your program
exhibits undefined behavior.


I've tried different values here, but 20 million is greater than 19 and
change.

-rw-r--r-- 1 dan dan 19573712 2011-06-05 17:32 shoulder.wmv
$ cc -Wall -Wextra hist2.c -o hist
$ ./hist
Segmentation fault
$ cat hist2.c
#include <stdio.h>
#define SIZE 20000000

int main(void)
{
int c;
long counter = 0;
long a[SIZE];
long i, j;
FILE *fp;

fp = fopen("shoulder.wmv", "rb+");

for (i = 0; i < SIZE; ++i) {
a[i] = 0;
}

for (j = 0; j < SIZE; ++j) {
c = fgetc(fp);

//printf("c is %d\n", c);


if (c != EOF) {

a[counter] = c;
counter++;
} else {
for (i = 0; i < 100; ++i) {
printf("i and frequency is %ld %ld\n", i, a[i]);
}


break;
}
}
printf("Counter reached %ld\n", counter);
fclose(fp);
return 0;
}

// cc -Wall -Wextra hist2.c -o hist
$

Why are you using a long to store a character?

A long is intended to store the frequency of a given char.
counter++;
} else {
for (i = 0; i< 100; ++i) {
printf("i and frequency is %ld %ld\n", i, a[i]);


This code does not compute any frequencies. Any non-zero a[i] has had
its value changed exactly once and the value is set to the i-th
character in the file.


This code hasn't done anything yet; that is true.
 
Joachim Schmitz

Uno said:
$ indent -kr hist1.c
$ cc -Wall -Wextra hist1.c -o hist
$ ./hist
Segmentation fault
$ cat hist1.c
#include<stdio.h>
#define SIZE 100000

int main(void)
{
int c;
long counter = 0;
long a[SIZE];

If long is 4 bytes, this reserves 400,000 bytes. If 8, then 800,000.

I'm sure I have a gig of memory.

But you're not using it for long a[SIZE], so accessing a[SIZE+1] would
result in a SIGSEGV usually.
long i, j;
FILE *fp;

fp = fopen("shoulder.wmv", "rb+");

for (i = 0; i< SIZE; ++i) {
a[i] = 0;
}

for (j = 0; j< 7000000; ++j) {
c = fgetc(fp);

if (c != EOF) {

a[counter] = c;


Depending on the size of your file, this could easily exceed the
amount of space reserved for a. When that happens, your program
exhibits undefined behavior.


I've tried different values here, but 20 million is greater than 19
and change.

-rw-r--r-- 1 dan dan 19573712 2011-06-05 17:32 shoulder.wmv
$ cc -Wall -Wextra hist2.c -o hist
$ ./hist
Segmentation fault
$ cat hist2.c
#include <stdio.h>
#define SIZE 20000000

int main(void)
{
int c;
long counter = 0;
long a[SIZE];
long i, j;
FILE *fp;

fp = fopen("shoulder.wmv", "rb+");

for (i = 0; i < SIZE; ++i) {
a[i] = 0;
}

for (j = 0; j < SIZE; ++j) {
c = fgetc(fp);

//printf("c is %d\n", c);


if (c != EOF) {

a[counter] = c;
counter++;
} else {
for (i = 0; i < 100; ++i) {
printf("i and frequency is %ld %ld\n", i, a[i]);
}


break;
}
}
printf("Counter reached %ld\n", counter);
fclose(fp);
return 0;
}

// cc -Wall -Wextra hist2.c -o hist
$

Why are you using a long to store a character?

A long is intended to store the frequency of a given char.
counter++;
} else {
for (i = 0; i< 100; ++i) {
printf("i and frequency is %ld %ld\n", i, a[i]);


This code does not compute any frequencies. Any non-zero a[i] has
had its value changed exactly once and the value is set to the i-th
character in the file.


This code hasn't done anything yet; that is true.


Have you tried using a debugger on this? Where does the debugger report the
program to fail?
You probably want to add -g to the compiler options.

Bye, Jojo
 
Ike Naar

$ indent -kr hist1.c
$ cc -Wall -Wextra hist1.c -o hist
$ ./hist
Segmentation fault
$ cat hist1.c
#include <stdio.h>
#define SIZE 100000

int main(void)
{
int c;
long counter = 0;
long a[SIZE];
long i, j;
FILE *fp;

fp = fopen("shoulder.wmv", "rb+");

for (i = 0; i < SIZE; ++i) {
a[i] = 0;
}

for (j = 0; j < 7000000; ++j) {
c = fgetc(fp);


if (c != EOF) {

a[counter] = c;
counter++;
} else {
for (i = 0; i < 100; ++i) {
printf("i and frequency is %ld %ld\n", i, a[i]);
}


break;
}
}
printf("Counter reached %ld\n", counter);
fclose(fp);
return 0;
}

// cc -Wall -Wextra hist1.c -o hist


Don't repeat yourself.
$

So, why the seg fault?

Is shoulder.wmv more than SIZE bytes long?
 
Eric Sosman

#define SIZE 100000
[...]
long a[SIZE];
[...]
for (j = 0; j < 7000000; ++j) {
[...]
So, why the seg fault?

Choose the phrase that most accurately completes this statement:
"7000000 is ________ 100000."

a) less than
b) greater than
c) equal to
d) more equal to
e) more tweeted than
 
BartC

c = fgetc(fp);


if (c != EOF) {

a[counter] = c;

Try:
++a[c];

here if trying to count frequency of each character.

Up above, you might need something like:

long a[256]={0};
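
Put together, the two suggestions amount to something like the following sketch. It assumes the goal is a per-byte-value frequency table; the names count_bytes and NBYTES are hypothetical, and UCHAR_MAX + 1 is used in place of the literal 256.

```c
#include <limits.h>
#include <stdio.h>

#define NBYTES (UCHAR_MAX + 1)   /* one counter per possible byte value */

/* Hypothetical helper: tally how often each byte value occurs in fp.
   counts must have NBYTES elements. Returns the number of bytes read. */
long count_bytes(FILE *fp, long counts[NBYTES])
{
    long total = 0;
    int c;                        /* int, so it can hold EOF as well */

    for (int i = 0; i < NBYTES; ++i)
        counts[i] = 0;

    while ((c = fgetc(fp)) != EOF) {
        ++counts[c];              /* c is in 0..UCHAR_MAX here */
        ++total;
    }
    return total;
}
```

The table never grows with the file: the index is the byte value, not the position in the file, so no second, file-sized limit is needed.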
 
Joe Pfeiffer

Uno said:
So, why the seg fault?

Let's just say this code is a prime example of why I took points off on
student programs that had magic numbers (like 7000000 for instance) in
the executable code.
 
Joe Pfeiffer

Uno said:
$ indent -kr hist1.c
$ cc -Wall -Wextra hist1.c -o hist
$ ./hist
Segmentation fault
$ cat hist1.c
#include<stdio.h>
#define SIZE 100000

int main(void)
{
int c;
long counter = 0;
long a[SIZE];

If long is 4 bytes, this reserves 400,000 bytes. If 8, then 800,000.

I'm sure I have a gig of memory.

That doesn't mean your array is a gig long; your array is only as long
as you made it.
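
One way to make that limit explicit is to pass the element count along with the array and refuse any index beyond it. This is only a sketch; store_checked is a hypothetical name and is not part of the posted program.

```c
#include <stddef.h>

/* Hypothetical helper: store value at a[index] only if index lies
   inside an array of n elements.
   Returns 1 on success, 0 if the index is out of range. */
int store_checked(long *a, size_t n, size_t index, long value)
{
    if (index >= n)
        return 0;                 /* writing here would be undefined behavior */
    a[index] = value;
    return 1;
}
```

For an array declared as `long a[SIZE];` the element count can be written as `sizeof a / sizeof a[0]` in the scope where the array is declared.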
 
Uno

Let's just say this code is a prime example of why I took points off on
student programs that had magic numbers (like 7000000 for instance) in
the executable code.

There were a lot of problems with this source that needed ironing out,
and I needed to start somewhere. When I've got a program that
compiles, I like to have the magic numbers #defined. I haven't really
done much C on a 64-bit processor, so my ballparking seems to be wanting.

$ cc -Wall -Wextra -g hist3.c -o hist
$ ./hist
UCHAR_MAX + 1 is 256
a[0] is 217337
a[1] is 123676
a[2] is 137894
a[3] is 100155
....
a[252] is 97099
a[253] is 75547
a[254] is 113450
a[255] is 128194
Counter reached 19573713
Counter2 reached 19573712
$ cat hist3.c
#include <stdio.h>
#include <limits.h>
#define SIZE 20000000
#define SIZE2 (UCHAR_MAX+1)

int main(void)
{
int c;
long counter = 0;
long counter2 = 0;
long a[SIZE2];
long i, j;
FILE *fp;
fp = fopen("shoulder.wmv", "rb+");
printf("UCHAR_MAX + 1 is %d\n", SIZE2);
for (i = 0; i < (SIZE2); ++i) {
a[i] = 0;
}
for (j = 0; j < SIZE; ++j) {
c = fgetc(fp);
counter++;
if (c != EOF) {
a[c] = ++a[c];
} else
break;
}
for (i = 0; i < (SIZE2); ++i) {
printf("a[%ld] is %ld\n", i, a[i]);
counter2 = counter2 + a[i];
}
printf("Counter reached %ld\n", counter);
printf("Counter2 reached %ld\n", counter2);
fclose(fp);
return 0;
}

// cc -Wall -Wextra -g hist3.c -o hist
$

I needed two different SIZES, one that would be larger than the byte
count of the file to be read and another that was one greater than
UCHAR_MAX. Seems to behave. Thanks all for comments.
 
Keith Thompson

BartC said:
a[c] = ++a[c];

I think just:

++a[c];

will do here, since ++ also modifies its operand as well as returning its
value plus 1.

In fact the behavior of a[c] = ++a[c] is undefined, since it modifies
a[c] twice without an intervening sequence point.
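
For comparison, here is a small sketch of increments that are well defined, with the problematic form left as a comment. The function name bump_three_times is hypothetical and exists only for illustration.

```c
/* Hypothetical demo: each statement below adds exactly one to a[c],
   and each modifies a[c] only once per statement. */
void bump_three_times(long a[], int c)
{
    ++a[c];              /* increment in place */
    a[c] += 1;           /* compound assignment */
    a[c] = a[c] + 1;     /* right side only reads a[c]; one modification */

    /* a[c] = ++a[c];       modifies a[c] twice with no intervening
                            sequence point: undefined behavior */
}
```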
 
Eric Sosman

There were a lot of problems with this source that needed ironing out,
and I needed to start somewhere.

An *excellent* place to start is with a description of what
you want the program to do. As it is, you've just thrown a bunch
of code at us and expected us to divine your intentions. If the
code were not so clearly broken, divination might stand a chance;
even then it would be better to state your purpose. With broken
code, such a statement is even more important.
When I've got a program that
compiles, I like to have the magic numbers #defined.

Your choice, but you're making extra work for yourself. When
you're replacing numeric constants with macro names and you come
across a 3, should you change it to SHIFTS_PER_DAY or to VERSION?
Essentially, you've created a reverse-engineering problem for yourself,
giving yourself an extra opportunity to make misteaks.
I haven't really
done much C on a 64-bit processor, so my ballparking seems to be wanting.

Nothing in your code stands out as bitness-related.
#include <stdio.h>
#include <limits.h>
#define SIZE 20000000
#define SIZE2 (UCHAR_MAX+1)

int main(void)
{
int c;
long counter = 0;
long counter2 = 0;
long a[SIZE2];
long i, j;
FILE *fp;
fp = fopen("shoulder.wmv", "rb+");

Why the "+", since you never write to the file? And why
is there still no check for failure, after others have already
pointed out the lack?
printf("UCHAR_MAX + 1 is %d\n", SIZE2);
for (i = 0; i < (SIZE2); ++i) {

The parentheses around SIZE2 are harmless, but not helpful:
They're just unnecessary clutter that gives the reader more to
parse. Perhaps you think they're necessary because SIZE2 expands
to an expression rather than to a single token? But the expression
is already parenthesized (as good macros should be), so tacking on
an extra layer isn't needed. What you've written is equivalent to

for (i = 0; i < ((UCHAR_MAX+1)); ++i)
a[i] = 0;
}
for (j = 0; j < SIZE; ++j) {
c = fgetc(fp);
counter++;
if (c != EOF) {
a[c] = ++a[c];


Undefined behavior: Two modifications of the same object
with no intervening sequence point.
} else
break;
}
for (i = 0; i < (SIZE2); ++i) {
printf("a[%ld] is %ld\n", i, a[i]);
counter2 = counter2 + a[i];
}
printf("Counter reached %ld\n", counter);
printf("Counter2 reached %ld\n", counter2);
fclose(fp);
return 0;
}

// cc -Wall -Wextra -g hist3.c -o hist
$

I needed two different SIZES, one that would be larger than the byte
count of the file to be read and another that was one greater than
UCHAR_MAX.


SIZE2 (or equivalent) is necessary, yes. But SIZE seems to
serve no useful purpose -- unless, as mentioned above, you have
purposes that aren't stated.
Seems to behave.

"Things are seldom what they seem.
Skim milk masquerades as cream ..."
 
Uno

BartC said:
a[c] = ++a[c];

I think just:

++a[c];

will do here, since ++ also modifies its operand as well as returning its
value plus 1.

In fact the behavior of a[c] = ++a[c] is undefined, since it modifies
a[c] twice without an intervening sequence point.

I changed that. So now I'm trying to get the data in a dynamic array:

$ indent -i3 hist7.c
$ cc -Wall -Wextra hist7.c -o hist
$ ./hist
file_length is 19573712
a[245] is 50064
a[246] is 38224
a[247] is 58741
a[248] is 92306
a[249] is 58773
a[250] is 65617
a[251] is 41401
a[252] is 97099
a[253] is 75547
a[254] is 113450
a[255] is 128194
counter2 reached 819416
counter reached 19573713
execution makes it this far
b[245] is 50064
b[246] is 38224
b[247] is 58741
b[248] is 92306
b[249] is 58773
b[250] is 65617
b[251] is 41401
b[252] is 97099
b[253] is 75547
b[254] is 113450
b[255] is 128194
counter3 reached 819416
$ cat hist7.c
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#define SIZE 20000000
#define SIZE2 (UCHAR_MAX+1)

int
main (void)
{
int c;
long counter = 0;
long counter2 = 0;
long counter3 = 0;
long a[SIZE2], b[SIZE2];
long i, j, end;
FILE *fp;
char filename[] = "shoulder.wmv";
unsigned char *p;

fp = fopen (filename, "rb+");
fseek (fp, 0L, SEEK_END);
end = ftell (fp);
fseek (fp, 0L, SEEK_SET);
printf ("file_length is %ld\n", end);
p = malloc (end);

for (i = 0; i < (SIZE2); ++i)
{
a[i] = 0;
}
for (j = 0; j < SIZE; ++j)
{
c = fgetc (fp);
counter++;
if (c != EOF)
{
p[j] = c;
++a[c];
}
else
break;
}
for (i = 245; i < (SIZE2); ++i)
{
printf ("a[%ld] is %ld\n", i, a[i]);
counter2 = counter2 + a[i];
}
printf ("counter2 reached %ld\n", counter2);
printf ("counter reached %ld\n", counter);
for (i = 0; i < SIZE2; ++i)
{
b[i] = 0;
}
printf ("execution makes it this far\n");
for (j = 0; j < end; ++j)
{
c = p[j];
++b[c];
}

for (i = 245; i < (SIZE2); ++i)
{
printf ("b[%ld] is %ld\n", i, b[i]);
counter3 = counter3 + b[i];
}
printf ("counter3 reached %ld\n", counter3);
fclose (fp);
return 0;
}

// cc -Wall -Wextra hist7.c -o hist
$

Does the malloc'ing look good for reading this file? Also, what do
POSIX and C say about the EOF character itself? If my data are typical,
I would say that neither thinks that EOF is part of the file.

(Because it's pentecost) Yabba-daba-doo!
 
Uno

An *excellent* place to start is with a description of what
you want the program to do. As it is, you've just thrown a bunch
of code at us and expected us to divine your intentions. If the
code were not so clearly broken, divination might stand a chance;
even then it would be better to state your purpose. With broken
code, such a statement is even more important.

It takes me a few posts before I get a syntactical C program to use to
describe what I'm doing. Without referent source, I can't communicate this.
Your choice, but you're making extra work for yourself. When
you're replacing numeric constants with macro names and you come
across a 3, should you change it to SHIFTS_PER_DAY or to VERSION?
Essentially, you've created a reverse-engineering problem for yourself,
giving yourself an extra opportunity to make misteaks.

But you did notice that I have
#define SIZE2 (UCHAR_MAX+1)
, and this was the seg fault I couldn't get my head around when I first
posted.
Nothing in your code stands out as bitness-related.

A client paid me to put a binary file on the net, and I used Perl's
Net::FTP without knowing that I had to call the binary method in order
to upload without *nix eating ASCII 13. (Thx a lot c.l.p.misc: for
nothing.)

I knew by comparing what I uploaded to what I downloaded, I would find
the problem. Others, for example, Alan Curry, might claim that *they*
found the problem, and I do not dispute that.
#include <stdio.h>
#include <limits.h>
#define SIZE 20000000
#define SIZE2 (UCHAR_MAX+1)

int main(void)
{
int c;
long counter = 0;
long counter2 = 0;
long a[SIZE2];
long i, j;
FILE *fp;
fp = fopen("shoulder.wmv", "rb+");

Why the "+", since you never write to the file? And why
is there still no check for failure, after others have already
pointed out the lack?

The + on all opens because I might want to write to them. No check for
failure because the only one I've got in my head is
or die "death $@\n"; Is there a K&R2 reference for how to do it?
printf("UCHAR_MAX + 1 is %d\n", SIZE2);
for (i = 0; i < (SIZE2); ++i) {

The parentheses around SIZE2 are harmless, but not helpful:
They're just unnecessary clutter that gives the reader more to
parse. Perhaps you think they're necessary because SIZE2 expands
to an expression rather than to a single token? But the expression
is already parenthesized (as good macros should be), so tacking on
an extra layer isn't needed. What you've written is equivalent to

for (i = 0; i < ((UCHAR_MAX+1)); ++i)
a[i] = 0;
}
for (j = 0; j < SIZE; ++j) {
c = fgetc(fp);
counter++;
if (c != EOF) {
a[c] = ++a[c];


Undefined behavior: Two modifications of the same object
with no intervening sequence point.


Fixed that. Thx.
} else
break;
}
for (i = 0; i < (SIZE2); ++i) {
printf("a[%ld] is %ld\n", i, a[i]);
counter2 = counter2 + a[i];
}
printf("Counter reached %ld\n", counter);
printf("Counter2 reached %ld\n", counter2);
fclose(fp);
return 0;
}

// cc -Wall -Wextra -g hist3.c -o hist
$

I needed two different SIZES, one that would be larger than the byte
count of the file to be read and another that was one greater than
UCHAR_MAX.


SIZE2 (or equivalent) is necessary, yes. But SIZE seems to
serve no useful purpose -- unless, as mentioned above, you have
purposes that aren't stated.


Right, so I'm looking for your comment on the dynamic allocation which I
posted as a reply to keith.
"Things are seldom what they seem.
Skim milk masquerades as cream ..."

Arizona on fire.
Brewer raises my ire.
Of C I don't tire.
 
Keith Thompson

Uno said:
BartC said:
a[c] = ++a[c];

I think just:

++a[c];

will do here, since ++ also modifies its operand as well as returning its
value plus 1.

In fact the behavior of a[c] = ++a[c] is undefined, since it modifies
a[c] twice without an intervening sequence point.

I changed that. So now I'm trying to get the data in a dynamic array:

And do what with it?
$ indent -i3 hist7.c [...]
$ cat hist7.c
#include <stdio.h>
#include <limits.h>
#include <stdlib.h>
#define SIZE 20000000
#define SIZE2 (UCHAR_MAX+1)

int
main (void)
{
int c;
long counter = 0;
long counter2 = 0;
long counter3 = 0;
long a[SIZE2], b[SIZE2];
long i, j, end;
FILE *fp;
char filename[] = "shoulder.wmv";
unsigned char *p;

fp = fopen (filename, "rb+");
fseek (fp, 0L, SEEK_END);
end = ftell (fp);
fseek (fp, 0L, SEEK_SET);
printf ("file_length is %ld\n", end);
p = malloc (end);

You don't check whether malloc() succeeded. (Don't bother telling
us how much memory you have; just check whether p==NULL.)

Note that the fseek() method isn't guaranteed to tell you the size
of the file. It probably does so on your system, but the standard
doesn't guarantee it.

(And as long as your code is non-portable, you might as well
use some non-portable method like fstat().)

If it works on your system, and if you're sufficiently sure that
the file's size won't change while the program is running, that's
probably ok.
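
A sketch of the same read-into-memory step with every call checked. It keeps the fseek()/ftell() sizing, so it carries the same assumption: that this reports the byte count for a regular file opened in binary mode on the system at hand, which the standard does not guarantee. The name read_whole_file is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read an entire file into a malloc'd buffer.
   On success returns the buffer (caller frees it) and stores the
   length in *len. Returns NULL if any step fails. */
unsigned char *read_whole_file(const char *filename, size_t *len)
{
    FILE *fp = fopen(filename, "rb");
    if (fp == NULL)
        return NULL;

    /* Assumption: seeking to the end and calling ftell() yields the size. */
    long end = -1;
    if (fseek(fp, 0L, SEEK_END) == 0)
        end = ftell(fp);
    if (end < 0 || fseek(fp, 0L, SEEK_SET) != 0) {
        fclose(fp);
        return NULL;
    }

    /* Request at least one byte so an empty file still gets a valid pointer. */
    unsigned char *p = malloc(end > 0 ? (size_t)end : 1);
    if (p == NULL) {
        fclose(fp);
        return NULL;
    }

    size_t got = fread(p, 1, (size_t)end, fp);
    fclose(fp);
    if (got != (size_t)end) {     /* short read: error or file changed */
        free(p);
        return NULL;
    }
    *len = got;
    return p;
}
```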
for (i = 0; i < (SIZE2); ++i)
{
a[i] = 0;
}
for (j = 0; j < SIZE; ++j)
{
c = fgetc (fp);
counter++;


You increment counter even when fgetc() returns EOF? I don't know what
you're counting, so I can't tell whether that's right or not.
if (c != EOF)
{
p[j] = c;
++a[c];
}
else
break;
} [snip]

// cc -Wall -Wextra hist7.c -o hist
$

Does the malloc'ing look good for reading this file? Also, what do
POSIX and C say about the EOF character itself? If my data are typical,
I would say that neither thinks that EOF is part of the file.

There is no "EOF character". fgetc() returns *either* the value
of the character it just read (treated as an unsigned char and
converted to int) *or* the special value EOF if it was unable to
read a character.

Typically UCHAR_MAX==255 (it could be larger) and EOF==-1 (it can
be any negative int value). So fgetc() will return either a value
in the range 0..255, or the value -1.

No, EOF is not part of the file; it's an indication that there's
nothing more in the file, or that there was an error. You can call
feof() and/or ferror() to determine which.
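
A minimal sketch of that pattern: read until fgetc() returns EOF, then ask the stream why it stopped. The function name count_until_eof is hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: count the bytes remaining in fp.
   Returns the count, or -1 if reading stopped because of an error. */
long count_until_eof(FILE *fp)
{
    long n = 0;
    int c;    /* must be int: it has to hold 0..UCHAR_MAX and also EOF */

    while ((c = fgetc(fp)) != EOF)
        ++n;

    if (ferror(fp))
        return -1;     /* EOF was returned because of a read error */

    /* Otherwise feof(fp) is nonzero: the stream simply ran out of data. */
    return n;
}
```

Because c is an int, a data byte with value 255 comes back as 255 and is not mistaken for EOF, which would happen if c were declared as a plain char on many systems.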

Do you really need to read the entire file into memory? Perhaps you
do, but it's often better to process the data as you're reading it.
It depends on what your goal is.

[...]
 
Keith Thompson

Uno said:
It takes me a few posts before I get a syntactical C program to use to
describe what I'm doing. Without referent source, I can't communicate
this.
[...]

You should be able to describe it in English. If you can't, I suggest
that you don't really know what you're doing.
 
Uno

Uno said:
It takes me a few posts before I get a syntactical C program to use to
describe what I'm doing. Without referent source, I can't communicate
this.
[...]

You should be able to describe it in English. If you can't, I suggest
that you don't really know what you're doing.

My english is fine, keith, thx for caring.
 
Ian Collins

Uno said:
On 06/12/2011 03:42 PM, Eric Sosman wrote:
On 6/12/2011 4:47 PM, Uno wrote:

There were a lot of problems with this source that needed ironing out,
and I needed to start somewhere.

An *excellent* place to start is with a description of what
you want the program to do. As it is, you've just thrown a bunch
of code at us and expected us to divine your intentions. If the
code were not so clearly broken, divination might stand a chance;
even then it would be better to state your purpose. With broken
code, such a statement is even more important.

It takes me a few posts before I get a syntactical C program to use to
describe what I'm doing. Without referent source, I can't communicate
this.
[...]

You should be able to describe it in English. If you can't, I suggest
that you don't really know what you're doing.

My english is fine, keith, thx for caring.

thx?
 
