Binary file I/O

J. Campbell

OK...I'm in the process of learning C++. In my old (non-portable)
programming days, I made use of binary files a lot...not worrying
about endian issues. I'm starting to understand why C++ makes it
difficult to read/write an integer directly as a bit-stream to a file.
However, I'm at a bit of a loss for how to do the following. So as
not to obfuscate the issue, I won't show what I've been attempting ;-)

What I want to do is the following, using the standard IO streams.

1) open an arbitrary file (file1).
2) starting with the first byte in (file1), read a chunk of data into
an array of integers.
3) manipulate the array, as integer data, and then output the contents
of the array to another file (file2).
4) read the next data-chunk from file1 into the array.
5) goto 3 until end of file.

If anyone knows of a tutorial that contains concrete examples of this,
I'd appreciate a pointer to the info. Thanks
 

Jonathan Mcdougall

OK...I'm in the process of learning C++. In my old (non-portable)
programming days, I made use of binary files a lot...not worrying
about endian issues. I'm starting to understand why C++ makes it
difficult to read/write an integer directly as a bit-stream to a file.
However, I'm at a bit of a loss for how to do the following. So as
not to obfuscate the issue, I won't show what I've been attempting ;-)

What I want to do is the following, using the standard IO streams.

# include <fstream>
# include <iostream>
# include <vector>
# include <sstream>
# include <string>
1) open an arbitrary file (file1).

std::ifstream file1("f.txt");
2) starting with the first byte in (file1), read a chunk of data into
an array of integers.

const int CHUNK = 128;

char buffer[CHUNK];
file1.read(buffer, CHUNK);

3) manipulate the array, as integer data,

void manipulate(std::vector<int> &v);


manipulate(data);
and then output the contents
of the array to another file (file2).

std::ofstream file2("g.txt");
std::copy(data.begin(), data.end(),
std::ostream_iterator<int>(file2, "\n"));
4) read the next data-chunk from file1 into the array.
5) goto 3 until end of file.

goto 3; :)
If anyone knows of a tutorial that contains concrete examples of this,
I'd appreciate a pointer to the info. Thanks

The C++ Standard Library by Josuttis.

Jonathan
 

Jonathan Mcdougall

# include <fstream>
# include <vector>
# include <algorithm>

Forget these ones:
# include <sstream>
# include <iostream>
1) open an arbitrary file (file1).

std::ifstream file1("f.txt");
2) starting with the first byte in (file1), read a chunk of data into
an array of integers.

const int CHUNK = 128;

char buffer[CHUNK];
file1.read(buffer, CHUNK);

std::vector<int> data;
std::copy(buffer, buffer + 128, std::back_inserter(data));

std::copy(buffer, buffer + CHUNK, std::back_inserter(data));
void manipulate(std::vector<int> &v);


manipulate(data);


std::ofstream file2("g.txt");
std::copy(data.begin(), data.end(),
std::ostream_iterator<int>(std::cout, "\n"));

std::copy(data.begin(), data.end(),
std::ostream_iterator<int>(file2, "\n"));


Sorry about that,

Jonathan
 

J. Campbell

Thanks Jonathan.

Your response is most helpful. Now, I need to digest why it works,
and why it's necessary.

I want to clarify a few things. Assuming int is 32-bits, then,
after:
-----
const int CHUNK = 128;

char buffer[CHUNK];
file1.read(buffer, CHUNK);
------
at this point the char array, "buffer" contains 128 elements of 1-byte
each, right?

-----
std::vector<int> data;
std::copy(buffer, buffer + 128, std::back_inserter(data));
-----
now, the vector named "data" contains 32 elements, each of which is a
4-byte integer, right?

How do I know if the bytes that went into the vector integers went in
head-first or feet-first? in other words, if the first 4 bytes of the
file were (HEX):
FF 00 00 00
will the first int in the vector "data" be FF000000 (dec 4278190080)
or will it be 000000FF (dec 255)? Or is it machine dependent?

can I avoid all the "std::" by using "using namespace std;" or is it
necessary to scope-resolve all the keywords?

Another thing... Do you think it's better to read chunks of a file as
I've indicated, or is it better to load the whole file into memory?

Also, your method leaves 2-duplicates of the data in memory...one as
the char array, and once as the vector. is this a problem?

One more thing...I asked a question here recently:

http://groups.google.com/[email protected]&rnum=1

about accessing a char array as an array of int. How is the vector
method different/safer than the (unsafe & non-portable) method I
demonstrated in the earlier post.

thanks again for the help.

I don't seem to be able to quit typing;-) Sorry to inundate you with
so many questions...I realize that you may not choose to address them
all..
 

Rolf Magnus

Thomas said:
To nitpick, the constant should be "unsigned" since a quantity can't
be negative. i.e.
const unsigned int CHUNK_SIZE = 128;

I'd disagree. It should be signed, since you might have negative offsets
when accessing the array elements, and mixing signed and unsigned
arithmetic can be problematic, and some compilers warn if you do.
Besides, what would you really gain from making it unsigned?
A 4-byte _signed_ integer.

Yes, as int is by default signed.
It is machine dependent. The topic is called Endianism.

I've only seen it called Endianness.
Try this experiment:
const unsigned int endian_test = 0x01020304;
unsigned char byte0;
unsigned char byte1;
unsigned char byte2;
unsigned char byte3;
unsigned char * ptr = (unsigned char *) &endian_test;
byte0 = *ptr++;
byte1 = *ptr++;
byte2 = *ptr++;
byte3 = *ptr++;
cout << hex << (unsigned short) byte0 << endl;
cout << hex << (unsigned short) byte1 << endl;
cout << hex << (unsigned short) byte2 << endl;
cout << hex << (unsigned short) byte3 << endl;

This is a personal, style, issue. Here are some popular styles:
1. Declare each function and class with a separate "using" statement:
using std::cout;
using std::vector;
2. Use the global "using" statement:
using namespace std;
3. Prefix each function and class with its namespace:
std::cout << "hello" << std::endl;
There are different opinions on which to use. Use a search engine
and search this newsgroup for "namespace" and "using".

At least, most people seem to agree that it's a bad idea to put
something like this in a header.
Btw: you can also put using into functions.
If you have the space, read in the whole file; otherwise read it
in as chunks. The fewer reads, the faster the execution.

Not necessarily. If you need maximum speed, you should test it for
different block sizes.
 

Jonathan Mcdougall

Thanks Jonathan.

Your response is most helpful. Now, I need to digest why it works,
and why it's necessary.

I want to clarify a few things. Assuming int is 32-bits, then,
after:

You can't "assume" this, it depends on the platform. Anyways it does
not matter in this case.
-----
const int CHUNK = 128;

char buffer[CHUNK];
file1.read(buffer, CHUNK);

No, after the copy 'data' contains 128 elements of type int (one per
char in the buffer), not 32. Each element has a size of sizeof(int),
which *could* be 4 bytes.

data[0]

contains the value which was in

buffer[0]

For example, if the first byte in the file was 65, then buffer[0]
contains char(65) (which is 'A') and data[0] simply contains 65.
can I avoid all the "std::" by using "using namespace std;" or is it
necessary to scope-resolve all the keywords?

Yes, but I personally don't recommend it. I prefer to qualify
everything, but it is a matter of style (and carefulness).
Another thing... Do you think it's better to read chunks of a file as
I've indicated, or is it better to load the whole file into memory?

Depends on the file size and the memory available.
Also, your method leaves 2-duplicates of the data in memory...one as
the char array, and once as the vector. is this a problem?

Well you explicitly wanted an array of integers and since there is no
function which takes an int[], I needed to do a conversion.
One more thing...I asked a question here recently:

http://groups.google.com/[email protected]&rnum=1

about accessing a char array as an array of int. How is the vector
method different/safer than the (unsafe & non-portable) method I
demonstrated in the earlier post.

Variable-length arrays are, afaik, illegal in C++ anyways. Take a
look at that :

http://www.btinternet.com/~chrisnewton/pp/contarray.xml


Jonathan
 

Jonathan Mcdougall

-----
A 4-byte _signed_ integer.

I just want to remind you that 'data' contains *128* elements, not 32
and that the endianness discussion does not apply.

<snip>

Jonathan
 

J. Campbell

Jonathan,

I just tried out your method, and it leaves me scratching my head.
After stumbling briefly for lack of the header to define
back_inserter() and ostream_iterator() (thanks Google and SGI), the
code compiles fine:
__________code__________________

#include <fstream>
#include <vector>
#include <iterator>

using namespace std;

int main(){
    const int DATACHUNK = 20;
    char buffer[DATACHUNK];

    ifstream filein("shifttest.cpp");
    filein.read(buffer, DATACHUNK);

    vector<int> filedata;
    copy(buffer, buffer + DATACHUNK, back_inserter(filedata));

    ofstream fileout("shifttest.joe");
    copy(filedata.begin(), filedata.end(),
         ostream_iterator<int>(fileout, "\n" ));
}

_____end code_________________

However, when I look at the file out, it contains:

35
105
110
99
108
117
100
101
32
60
105
111
115
116
114
101
97
109
62
10

which is the ASCII representation of the integer representation of the
ASCII sequence "#include <iostream>"

which, strangely enough, happens to be the first line of
"shifttest.cpp" ;-)

This is really not at all what I am wanting to do. Now my 20 bytes is
represented by 93 bytes of a rather odd data-type...neither characters
nor integers, but rather some strange beast that combines the worst of
both worlds.

I'm left wondering, in this strange new world of C++ do I need to get
used to dealing with ASCII representations of numbers for file I/O?
Or do I need to always break my 4-byte integers into individual bytes
prior to I/O if I don't want to waste storage space? I suppose this
would be pretty easy...something like:

//not tested
int bytetowrite;
char holdword[4];

for(int i = 0; i < 4; i++)
    holdword[i] = (bytetowrite & (255 << (i * 8))) >> (i * 8);
//holdword now contains, small-byte first, the data from bytetowrite

However, this seems a bit tedious, considering that this rigamarole
doesn't really do anything to the internal data. I feel like there's
something really basic that I don't *get* about streams... All I
really want to do is "get at" the data in a file and treat that data
as numbers typed to the native processor word size...then, manipulate
the data and write the data out to a second file. Consider, for
example, that the file consists of a binary bitmap and I want to
invert it, or rotate it or something.

Anyway...It's apparent that I have a lot to learn. This C++ is
tantalizing me...the code is about 10 to 20 x faster than my old
16-bit compiler...but jeez...what would seem to be a simple
manipulation can become so frustrating!!! It feels a little like
typing with my toes.

Thanks for the help people. It is beginning to make some sense.

Joe

 

J. Campbell

Jonathan Mcdougall said:
I just want to remind you that 'data' contains *128* elements, not 32
and that the endianness discussion does not apply.

<snip>

Jonathan

Jonathan...I now understand what's going on and the endianness
discussion. My news reader has serious lag, so I may not be current
with the discussion. However...I understand more after this post.
when I said I wanted the file bytes represented by integers, I meant
that I wanted the first sizeof(int) (eg 4) bytes of data to
be put into integerarray[0], the next into integerarray[1]...etc.
Anyway...thanks for clarifying this.
 

Jonathan Mcdougall

I just tried out your method, and it leaves me scratching my head.
After stumbling briefly for lack of the header to define
back_inserter() and ostream_iterator() (thanks Google and SGI), the
code compiles fine:

This depends on the implementation. The standard does not specify
which header must be included by which; <iterator> probably got
pulled in indirectly on your implementation.
__________code__________________

#include <fstream>
#include <vector>
#include <iterator>

using namespace std;

int main(){
    const int DATACHUNK = 20;
    char buffer[DATACHUNK];

    ifstream filein("shifttest.cpp");
    filein.read(buffer, DATACHUNK);

    vector<int> filedata;
    copy(buffer, buffer + DATACHUNK, back_inserter(filedata));

    ofstream fileout("shifttest.joe");
    copy(filedata.begin(), filedata.end(),
         ostream_iterator<int>(fileout, "\n" ));
}

_____end code_________________

However, when I look at the file out, it contains:

35
105
110
99
108
117
100
101
32
60
105
111
115
116
114
101
97
109
62
10

which is the ASCII representation of the integer representation of the
ASCII sequence "#include <iostream>"
which, strangely enough, happens to be the first line of
"shifttest.cpp" ;-)

You asked for binary, that is what I gave you. If you want the ASCII
characters themselves, output chars instead of ints.
This is really not at all what I am wanting to do. Now my 20 bytes is
represented by 93 bytes

93 ?? Why do you say that?
of a rather odd data-type...neither characters
nor integers, but rather some strange beast that combines the worst of
both worlds.

These numbers you saw are the ASCII value of the characters in the
file. The thing is, characters and integers are actually the very
same thing, it's just the output which makes the difference : ints are
displayed as numbers and chars are displayed as characters, which
depend on your implementation (but you are probably using ASCII).

Remember your subject is "Binary file I/O", not "Text file I/O".
I'm left wondering, in this strange new world of C++ do I need to get
used to dealing with ASCII representations of numbers for file I/O?

It depends on what you want. In the case of a simple text file
(remember, *text* is an ambiguous term in programming, everything boils
down to zeros and ones) , values would be ASCII numbers and text would
be the representation on the screen (65 would be 'A').

In the case of a binary file (such as an image), values would be
simple numbers formatted according to the image's type (jpg, bmp..)
and text would be... garbage, since these numbers would be printed
according to the ASCII table (remember when you first started and
tried to display binary files on screen? Loads of smileys and beeps
and ascii graphics..).
However, this seems a bit tedious, considering that this rigamarole
doesn't really do anything to the internal data. I feel like there's
something really basic that I don't *get* about streams... All I
really want to do is "get at" the data in a file and treat that data
as numbers typed to the native processor word size...then, manipulate
the data and write the data out to a second file. Consider, for
example, that the file consists of a binary bitmap and I want to
invert it, or rotate it or something.

In that case, you would store every byte in a vector of whatever
(unsigned char would be the best, I think), you skip the header until
the data, you invert it and store the whole thing in a new file.

The actual type of the vector (or array, as you wish) does not matter
except for the memory wasted.
Anyway...It's apparent that I have a lot to learn. This C++ is
tantalizing me...the code is about 10 to 20 x faster than my old
16-bit compiler...but jeez...what would seem to be a simple
manipulation can become so frustrating!!! It feels a little like
typing with my toes.

Hehe.. and you're still only playing with i/o.


Jonathan
 

Jonathan Mcdougall

-----
I just want to remind you that 'data' contains *128* elements, not 32
and that the endianness discussion does not apply.

<snip>

Jonathan

Jonathan...I now understand what's going on and the endianness
discussion. My news reader has serious lag, so I may not be current
with the discussion. However...I understand more after this post.
when I said I wanted the file bytes represented by integers, I meant
that I wanted the first sizeof(int) (eg 4) bytes of data to
be put into integerarray[0], the next into integerarray[1]...etc.
Anyway...thanks for clarifying this.

Oh, sorry.

Well, std::copy() is no good in that case; you will have to write a
loop and assign the values manually:

for (int i=0; i<CHUNK; i+=4)
{
    int temp = 0;
    for (int j=0; j<4; ++j)
    {
        // cast to unsigned char so the shift doesn't sign-extend
        temp |= static_cast<unsigned char>(buffer[i + j]) << (8 * (3 - j));
    }

    data.push_back(temp);
}

Something like that?

And sorry for the brutal endianness conversation break, I didn't mean
it.

Jonathan
 
