printing binary data?

Steven T. Hatton

I'm trying to write a program like hexel. I guess I could fish out the
source for hexel and look at that, but for now I'm trying to figure out how
I can do it with std::stringstream and std::string. I had something
working with std::string. I simply treated it as an STL container, and
iterated over its elements. The results were a bit confusing to me. Some
of the stuff was printing out as one- or two-character hex numbers, as I
expected. Other characters were printing out in what looks to me to be
representative of a larger data size than a byte. For example:

00 04 00 fffffff1 ffffffff

I decided to try fetching the std::string::data() representation, and then
to use regular char pointers, but that didn't work as I naively expected.
For example, I was trying to add the size of the string to the pointer
returned from std::string::data(). The result was 0.

Is there a better approach to working with bytes of raw data than using
strings? I mean using tools from the Standard Library?

I'm thinking my problem is coming from the fact that the locale is set to
en_US.UTF-8, but I really don't know how that might impact the behavior of
std::string.
 
red floyd

Make sure your data is unsigned (the leading 'f's are sign extension).
Also, if you need to, mask it with 0xff just in your output.

i.e.: instead of

os << *p;

use:

os << (*p & 0xff); // worst case scenario
 
Stephen Howe

Other characters were printing out in what looks to me to be
A pure guess: characters are being converted to signed ints and that is the
source of your 8-digit hex values.
Is there a better approach to working with bytes of raw data than using
strings? I mean using tools from the Standard Library?

We have no idea what you did as there is _NO_ example code
or an example what data output you wanted.

Stephen Howe
 
Steven T. Hatton

Stephen Howe said:
A pure guess: characters are being converted to signed ints and that is the
source of your 8-digit hex values.

That was my supposition. The problem seems to be that std::string and
std::ifstream, etc., are using signed char; which is one of the more
annoying aspects of the C++ Standard.

I read the data using std::ifstream, then I used a std::ostringstream to
convert it to std::string.
We have no idea what you did as there is _NO_ example code
or an example what data output you wanted.

I thought it was fairly clear that I wanted two character hex
representations of each unit of data. I was asking if there were
components of the Standard Library better suited to working with data in
binary form. Perhaps something similar to Java's
java.io.ByteArrayInputStream:

http://java.sun.com/j2se/1.5.0/docs/api/java/io/ByteArrayInputStream.html

This stuff's pretty nice to work with:
http://java.sun.com/j2se/1.5.0/docs/api/java/net/SocketImpl.html
http://java.sun.com/j2se/1.5.0/docs/api/javax/net/ssl/SSLServerSocket.html

Ironically, one of the primary design features which makes it so viable is
taken directly from TC++PL. Even the naming convention is the one
Stroustrup introduced. With C++ products, I often find myself spending
more time trying to second-guess macros and understand the idiosyncrasies
of the particular implementation. It's a shame so few C++ programmers
really understand what I'm talking about. There's really not that much
wrong with C++, per se. The problems result from people failing to
understand how little things add up to big problems.

I started the code listed below based on a 280 line program that used all
kinds of typical C-style convolutions. There is a small bit of the
original functionality missing, but I can restore that with about five lines
of code. The solution I came up with for the negative char values is quite
obvious. I now understand that the 8-place hex values are the result of
converting a negative char to an unsigned int. Casting to int is necessary
because the implementation tries to print char data as characters, whereas
it prints ints as numbers.

#include <iostream>
#include <fstream>
#include <iomanip>
#include <sstream>
#include <string>

namespace hexlite {
  using namespace std;
  typedef string::const_iterator c_itr;

  // Print the chars in [start, stop) as space-separated numbers.
  ostream& printline(c_itr start, c_itr stop, ostream& out) {
    while(start<stop) out<<setw(2)<<(128 + static_cast<int>(*start++))<<" ";
    return out;
  }

  ostream& dump(const string& dataString, ostream& out) {
    // Format through a separate ostream so the caller's flags stay untouched.
    ostream hexout(out.rdbuf());
    hexout.setf(ios::hex, ios::basefield);
    hexout.fill('0');

    c_itr from (dataString.begin());
    c_itr dataEnd (from + dataString.size());
    c_itr end (dataEnd - (dataString.size()%16));

    // Full lines of 16 values first, then whatever is left over.
    for(c_itr start = from; start < end; start += 16)
      printline(start, start + 16, hexout)<<endl;

    printline(end, dataEnd, hexout)<<endl;
    return out;
  }
}

int main(int argc, char* argv[]) {
  // argc counts the program name, so a file name requires argc >= 2.
  if (argc < 2) { std::cerr<<"enter a file name"<<std::endl; return -1; }

  // Binary mode avoids newline translation on platforms that perform it.
  std::ifstream inf(argv[1], std::ios::binary);
  if(inf) {
    std::ostringstream oss;
    oss << inf.rdbuf();
    hexlite::dump(oss.str(), std::cout);
    return 0;
  }
  std::cerr <<"\nCan't open file:"<<argv[1]<<std::endl;
  return -1;
}
 
Karl Heinz Buchegger

Steven T. Hatton said:
Ironically, one of the primary design features which makes it so viable is
taken directly from TC++PL. Even the naming convention is the one
Stroustrup introduced. With C++ products, I often find myself spending
more time trying to second-guess macros and understand the idiosyncrasies
of the particular implementation. It's a shame so few C++ programmers
really understand what I'm talking about. There's really not that much
wrong with C++, per se. The problems result from people failing to
understand how little things add up to big problems.

The problem with people like you is that you continue to think that
your view of the world is the only correct one. If you would stop
doing that and instead start playing the game the C++ way, you would
have fewer problems.

You are simply using the wrong tools for your attempt. std::string
and stringstreams are not meant to be used for manipulating binary
data. If you want to do that, then e.g. std::vector< unsigned char >
is your tool.
 
Tobias Blomkvist

Steven T. Hatton said:
That was my supposition. The problem seems to be that std::string and
std::ifstream, etc., are using signed char; which is one of the more
annoying aspects of the C++ Standard.

Try

typedef std::basic_ifstream<unsigned char> uifstream;

Tobias
 
Dietmar Kuehl

Tobias said:
Try

typedef std::basic_ifstream<unsigned char> uifstream;

It is definitely not that easy. To create stream objects operating
on a different character type than 'char' and 'wchar_t' you have
to do quite a lot of work although it is mostly relatively trivial.
However, I don't think you really need to do this at all because
you don't want all those formatting functions for binary data
anyway. The easiest approach for binary data is, IMO, to create a
"formatting" layer similar to the text formatting layer which
uses stream buffers underneath. In this context it is acceptable
that the stream buffer actually uses 'char' objects and to cast
them to 'unsigned char' where necessary.
 
Steven T. Hatton

Dietmar said:
It is definitely not that easy.

I had to think for a moment to determine whether that had slipped past my edits.
It is exactly what appeared in my code at one point.
To create stream objects operating
on a different character type than 'char' and 'wchar_t' you have
to do quite a lot of work although it is mostly relatively trivial.
However, I don't think you really need to do this at all because
you don't want all those formatting functions for binary data
anyway. The easiest approach for binary data is, [see below]
In this context it is acceptable
that the stream buffer actually uses 'char' objects and to cast
them to 'unsigned char' where necessary.

This is something that has me a bit confused. If I read in data using a
std::ifstream that has signed char as its character set, then cast it to
unsigned char, will that guarantee me that the content of the
representative storage locations faithfully represents the file? Is that
something I can rely on being portable?

Take the example of converting unsigned char to int. When the char is
negative, the int has, on my system, (if I understand correctly) the 128th
bit of the integer representation set. Therefore -127 would look like this:
1000000...00001111111. Now, if that were cast to unsigned char, we might
expect it to be truncated, rather than having the sign bit preserved.

One question becomes; where to cast? IOW, should I cast signed char to
unsigned char one byte at a time as I pull them out of the input buffer? I
might accomplish that by using a back inserter and copy from the istream
into std::vector<unsigned char>.

I'm currently working on creating a numeric type descriptor template that
will print a description of the numeric_limits class associated with a
numeric type. It looks like this in my edit buffer:

template<typename T, const char[] TypeName>
struct numeric_descriptor: public std::numeric_limits<T>{
static const std::string sc_typeName;
numeric_descriptor{

}
virtual std::ostream& print(std::ostream& out) const {

}
};
// Be aware that the above is purely scratch code, and not expected to be
// useful or even to compile.
IMO, to create a
"formatting" layer similar to the text formatting layer which
uses stream buffers underneath. In this context it is acceptable
that the stream buffer actually uses 'char' objects and to cast
them to 'unsigned char' where necessary.

I'm rather surprised there isn't a byte (or 'octet') input stream in the
Standard Library. I mean to say a stream of unsigned integral type with a
guaranteed number of bits per unit of data. Perhaps that seemed too
trivial for the designers to consider.

I like the "formatting layer" suggestion. That could come in handy for lots
of representations. After thinking about this a bit more, I believe what I
should be doing is adding 256 only to the negative valued signed char
instances. The way I did things last night, 0 is represented as 0x80,
which is pretty silly.
 
Steven T. Hatton

Karl said:
The problem with people like you is that you continue to think that
your view of the world is the only correct one.

I am not the subject of this newsgroup.
If you would stop
doing that and instead start playing the game the C++ way, you would
have fewer problems.

The One True C++ Way[TM]? What I did was based on suggestions from
authoritative C++ experts - or, at least my understanding of such. That
is, using string to hold non-text data. I honestly wish more C++
programmers would do things the C++ way, not the "C with BCPL comment
syntax" way. Ironically, my original post in this thread specifically
asked if there was a better way to accomplish what I am attempting.
You are simply using the wrong tools for your attempt. std::string
and stringstreams are not meant to be used for manipulating binary
data. If you want to do that, then e.g. std::vector< unsigned char >
is your tool.

It's not quite that simple. Using std::vector<unsigned char> was one of the
options which crossed my mind, as was using the std::stringbuf inside of
std::ostringstream, rather than spitting it out as std::string.
std::ostringstream seemed to be the easiest way to get the contents of the
file into an in-memory object I could work on. I didn't have to mess with
allocators, or extractors. I'm still not convinced it's a bad idea to use
std::ostringstream to allocate the storage. I might be able to cast its
stringbuf to std::vector<unsigned char> in one step. It may also be
perfectly usable as-is. It does provide many ways of accessing the data.

Fortunately, the code I wrote is fairly generic, and follows STL
conventions, for the most part, so changing the data container should be
relatively easy. The biggest problem I was having is not due to the
underlying data type being signed char. It is due to the fact that
std::ostream derivatives try to print char data as characters rather than
as Hindu-Arabic numerals. Having the data in a std::vector<unsigned
char> doesn't solve that problem.
 
Steven T. Hatton

Steven said:
After thinking about this a bit more, I believe
what I should be doing is adding 256 only to the negative valued signed char
instances. The way I did things last night, 0 is represented as 0x80,
which is pretty silly.

ostream& printline(c_itr start, c_itr stop, ostream& out) {
while(start<stop) out
<<setw(2)
<<(static_cast<unsigned int>(static_cast<unsigned char>(*start++)))<<" ";
return out;
}

Duh!
 
Old Wolf

Steven said:
That was my supposition. The problem seems to be that std::string and
std::ifstream, etc., are using signed char; which is one of the more
annoying aspects of the C++ Standard.

They use plain char. Most compilers have a switch that decides
whether plain char is signed or not. The standard allows plain
char to be unsigned.

Unfortunately there is so much existing code that would break
if plain char were unsigned, that it would be suicidal for a
compiler vendor to make that the default for IA32. We're
stuck with signed char for the foreseeable future.
I read the data using std::ifstream, then I used a std::ostringstream to
convert it to std::string.

Recall that streams are FORMATTERS. If you don't want to reformat
any data, do not use '>>' or '<<'.

std::vector<unsigned char> is well suited.
I write a helper function for appending one such buffer to another,
and then they are convenient to use as well.
I thought it was fairly clear that I wanted two character hex
representations of each unit of data. I was asking if there were
components of the Standard Library better suited to working with data in
binary form. Perhaps something similar to Java's
java.io.ByteArrayInputStream:

To read raw data, use istream::get() and put it in a byte vector.
Ironically, one of the primary design features which makes it so viable is
taken directly from TC++PL. Even the naming convention is the one
Stroustrup introduced. With C++ products, I often find myself spending
more time trying to second-guess macros and understand the idiosyncrasies
of the particular implementation. It's a shame so few C++ programmers
really understand what I'm talking about. There's really not that much
wrong with C++, per se. The problems result from people failing to
understand how little things add up to big problems.

A poor workman blames his tools.
The solution I came up with for the negative char values is quite
obvious. I now understand that the 8-place hex values are the result of
converting a negative char to an unsigned int. Casting to int is necessary
because the implementation tries to print char data as characters, whereas
it prints ints as numbers.

ostream& printline(c_itr start, c_itr stop, ostream& out) {
while(start<stop) out<<setw(2)<<(128 + static_cast<int>(*start++))<<" ";
return out;
}

Firstly, the static_cast<int> is superfluous, because when you
add a char to an int (128 in this case), the char is converted
to int implicitly.

This seems a slightly bizarre solution, as you will print ' '
as 0xA0 instead of 0x20 etc., unless I'm missing something.
My preferred way would be:

out << int((unsigned char)*start++)

unless you have a wide screen and want to write out two
static_casts :)

Another way is:

out << (0xFFU & *start++)

which works in 2's complement (which is all known C++ systems).
 
Tobias Blomkvist

Steven T. Hatton said:
Take the example of converting unsigned char to int. When the char is
negative, the int has, on my system, (if I understand correctly) the 128th
bit of the integer representation set. Therefore -127 would look like this:
1000000...00001111111. Now, if that were cast to unsigned char, we might
expect it to be truncated, rather than having the sign bit preserved.

-127 = 0x81 = 10000001

sign extended to 4 byte int

11111111 11111111 11111111 10000001

Tobias
 
Steven T. Hatton

Old said:
They use plain char. Most compilers have a switch that decides
whether plain char is signed or not. The standard allows plain
char to be unsigned.

Exactly my point. I can, to some extent, appreciate why things are as they
are, but I have to wonder if people have not taken things to extremes.

One thing I have running around in the back of my mind is the idea of
formalizing the idea of an abstract execution host environment. But there
may still be issues of whose machine is closest to the abstraction, and
therefore unfairly favored, etc. I found this interesting bit of Usenet
traffic in my SuSE 9.3 distro.

http://gcc.gnu.org/onlinedocs/libstdc++/27_io/binary_iostreams_kuehl.txt
Unfortunately there is so much existing code that would break
if plain char were unsigned, that it would be suicidal for a
compiler vendor to make that the default for IA32. We're
stuck with signed char for the foreseeable future.

A byte-oriented, or, perhaps even larger, unsigned "raw data" stream "out of
the box" would be nice to have.
Recall that streams are FORMATTERS.

But stream buffers aren't.
If you don't want to reformat
any data, do not use '>>' or '<<'.

This is one way to get the data without messing with the format:
std::ostringstream oss;
oss << somestream.rdbuf();
std::string somestring(oss.str());

I haven't tried to create a std::vector<unsigned char> directly from the
stringbuf inside the std::ostringstream. It seems doable, but there may be a few
tricks involved.
To read raw data, use istream::get() and put it in a byte vector.

What is not clear to me is whether there is a reliable (or, perhaps I should
say 'standardized') way to get the file size. Ideally, I want a way to
read a file regardless of its location, e.g., local harddrive, network,
etc. One advantage to the approach I've taken is that it works for the
current situation. I also have the ability to use both
A poor workman blames his tools.

Not completely sure what you mean here. There are a lot of people ready to
jump over to C++/CLI without a lot of hesitation.

"Stan Lippman's BLog C++/CLI"

http://blogs.msdn.com/slippman/

There _are_ problems with C++ code bases. There _are_ currently some
significant limitations to what can be done easily with C++. There are also
many examples of things which have evolved over the years to become
horrific big balls of mud

http://www.laputan.org/mud/mud.html#BigBallOfMud
out << int((unsigned char)*start++)
Ha! That's basically what I ended up doing.
unless you have a wide screen and want to write out two
static_casts :)

I do have a wide screen, and I use it. However, I'm not sure if using the
static cast is of any real value. I suppose it's a way to document intent.
Another way is:

out << (0xFFU & *start++)

which works in 2's complement (which is all known C++ systems).

OK. If I get that, you are basically converting to unsigned int by masking
the whole char: 000...00011111111 & 10101010 == 000...00010101010. Unless
there's a performance gain to be had, I find that a bit too esoteric.
 
pillbug

not to detract from the utility and type-safety of <string> and
<sstream>, but sometimes the old ways can present a clarity of
intention unrivaled by modern constructs:

// needs <stdio.h> and <fcntl.h>, plus <unistd.h> (or <io.h> on Windows);
// O_BINARY is not POSIX and only exists on Windows-style runtimes
int bytes;
unsigned char data [16];
int fd = open ("data.bin", O_RDONLY | O_BINARY);

while ((bytes = read (fd, data, 16)) == 16)
{
    // adjacent string literals are concatenated, so each piece ends in a space
    printf ("%02X %02X %02X %02X "
            "%02X %02X %02X %02X "
            "%02X %02X %02X %02X "
            "%02X %02X %02X %02X\n",
            data [0],data [1],data [2],data [3],
            data [4],data [5],data [6],data [7],
            data [8],data [9],data [10],data [11],
            data [12],data [13],data [14],data [15]);
}

if (bytes > 0 && bytes != 16)
{
    // handle partial line
}

close (fd);

alternatively, if you are enamored of iostreams, you could try this:

typedef std::basic_string<unsigned char> unsigned_string;
typedef std::basic_stringstream<unsigned char> unsigned_stringstream;

sorry if i'm way off here, back to lurking :)
 
