how to encode a float in base64?

aaragon

Hello,

I've been reading about the base64 format and it seems that it is
straightforward to encode text (I followed a document that I found on
the internet entitled How to Base64 written by Randy Charles Morin).
However, I wonder how we encode a float in base64. Is there an easy
way to do this? A float has 4 bytes, do I have to convert each byte to
an unsigned char and then do the conversion to base64? I really don't
see how I can convert a float to base64. Please help,

 
Jack Klein

I don't see a question about the C++ language here, base64 encoding
can be done in any language.

Still...

You don't encode a float, or a double, or an int, or a text stream in
base64; you encode streams of 8-bit octets, which happen to correspond
to streams of bytes on most architectures.

Also note that you can only encode streams with a length that is
evenly divisible by 3, so if you have a stream of bytes that is not
evenly divisible by 3, you add one or two bytes of 0 at the end before
encoding.

So if you have verified, on your implementation, that sizeof(float) is
4 and CHAR_BIT is 8, and that one float is all that you want to
encode, you should define an array of 6 characters, initialized to 0.
Then you can memcpy() the float into the beginning of the buffer. Then
you can encode the 6 byte buffer.
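
As a rough sketch of the memcpy-then-encode idea above: the function names `base64_encode` and `encode_float_bytes` below are made up for illustration and are not from any particular library. One difference from the 6-byte buffer approach is an assumption of this sketch: it follows the usual RFC 4648 convention, where the encoder handles a trailing group of one or two octets itself and marks it with '=' in the output, so the caller does not add zero bytes and a decoder gets back 4 bytes rather than 6.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Encode an arbitrary octet sequence as base64 (RFC 4648 alphabet, '=' padding).
std::string base64_encode(const unsigned char* data, std::size_t len)
{
    assert(data != nullptr || len == 0);
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((len + 2) / 3) * 4);
    for (std::size_t i = 0; i < len; i += 3) {
        // Gather up to three octets into a 24-bit group;
        // octets missing from the last group count as zero bits.
        unsigned long group = static_cast<unsigned long>(data[i]) << 16;
        if (i + 1 < len) group |= static_cast<unsigned long>(data[i + 1]) << 8;
        if (i + 2 < len) group |= static_cast<unsigned long>(data[i + 2]);
        // Emit four 6-bit symbols, replacing unused ones with '='.
        out += alphabet[(group >> 18) & 0x3F];
        out += alphabet[(group >> 12) & 0x3F];
        out += (i + 1 < len) ? alphabet[(group >> 6) & 0x3F] : '=';
        out += (i + 2 < len) ? alphabet[group & 0x3F] : '=';
    }
    return out;
}

// Copy the object representation of a float into a byte buffer and encode it.
// The result depends on the platform's float format and byte order.
std::string encode_float_bytes(float value)
{
    unsigned char buffer[sizeof(float)];
    std::memcpy(buffer, &value, sizeof value);
    return base64_encode(buffer, sizeof buffer);
}
```

When `sizeof(float)` is 4, `encode_float_bytes(1.0f)` returns an 8-character string ending in `==`; the characters before that depend on the machine's float representation and byte order.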

--
Jack Klein
Home: http://JK-Technology.Com
FAQs for
comp.lang.c http://c-faq.com/
comp.lang.c++ http://www.parashift.com/c++-faq-lite/
alt.comp.lang.learn.c-c++
http://www.club.cc.cmu.edu/~ajo/docs/FAQ-acllc.html
 
Paul Brettschneider

Jack said:
I don't see a question about the C++ language here, base64 encoding
can be done in any language.

Still...

You don't encode a float, or a double, or an int, or a text stream in
base64, you encode streams of 8-bit octets, which happen to correspond
to streams of bytes on most architecture.

Indeed. Typically you will just generate a stream of bytes and convert that,
rather than converting individual floating-point values, which would be
inefficient due to padding, as you say below.
Also note that you can only include streams with a length that is
evenly divisible by 3, so if you have a stream of bytes that is not
evenly divisible by 3, you add one or two bytes of 0 at the end before
encoding.

So if you have verified, on your implementation, that sizeof(float) is
4 and CHAR_BIT is 8, and that one float is all that you want to
encode, you should define an array of 6 characters, initialized to 0.

And kiss portability goodbye. ;)
Also you have to consider the binary float representation of your
architecture, whether it's big or little endian, and whether you ever want to
transmit data between machines with incompatible binary representations of
floating-point values. My point: if you can avoid it, don't BASE64-encode
your floats; just use a plain-text representation delimited by spaces.
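
A minimal sketch of the plain-text alternative, assuming C++11 or later. The helper names `floats_to_text` and `text_to_floats` are hypothetical. The only subtle point is the precision: `std::numeric_limits<float>::max_digits10` significant digits are enough for a finite float to survive the trip to decimal text and back unchanged, provided the library's conversions round correctly.

```cpp
#include <cstddef>
#include <iomanip>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Write floats as space-delimited decimal text with enough digits to round-trip.
std::string floats_to_text(const std::vector<float>& values)
{
    std::ostringstream out;
    out << std::setprecision(std::numeric_limits<float>::max_digits10);
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (i != 0) out << ' ';
        out << values[i];
    }
    return out.str();
}

// Parse space-delimited decimal text back into floats; stops at the first
// token that is not a number.
std::vector<float> text_to_floats(const std::string& text)
{
    std::istringstream in(text);
    std::vector<float> values;
    float v;
    while (in >> v) values.push_back(v);
    return values;
}
```

The text form is independent of byte order and of the in-memory float layout, at the cost of more bytes per value than a binary format.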

Depending on your data you might want to look into XML or csv encodings.
There must be libraries for both, taking care of non-printable characters.
Then you can memcpy() the float into the beginning of the buffer. Then
you can encode the 6 byte buffer.

Isn't using unions the idiomatic thing to do in this case? Like this:

#include <iostream>
#include <cmath>
#include <algorithm>
#include <iterator>

int main()
{
    const size_t size = sizeof(double);
    union {
        double f;
        unsigned char s[size];
    } u;
    u.f = 4.0 * std::atan(1.0); // Pi
    std::cout << u.f << '\n';
    std::copy(&u.s[0], &u.s[size],
              std::ostream_iterator<unsigned int>(std::cout, "-"));
    std::cout << std::endl;
}

On IA32:
3.14159
24-45-68-84-251-33-9-64-

On PA-RISC:
3.14159
64-9-33-251-84-68-45-24-

;)

Paul
 
James Kanze

@gmail.com> wrote in comp.lang.c++:

[...]
You don't encode a float, or a double, or an int, or a text
stream in base64, you encode streams of 8-bit octets, which
happen to correspond to streams of bytes on most architecture.
Also note that you can only include streams with a length that
is evenly divisible by 3, so if you have a stream of bytes
that is not evenly divisible by 3, you add one or two bytes of
0 at the end before encoding.
So if you have verified, on your implementation, that
sizeof(float) is 4 and CHAR_BIT is 8, and that one float is
all that you want to encode, you should define an array of 6
characters, initialized to 0. Then you can memcpy() the float
into the beginning of the buffer. Then you can encode the 6
byte buffer.

Why not just skip the memcpy?

The important point about base64 (which you point out in the
first paragraph above) is that it isn't a primary format, but a
secondary one---it is used to convert a binary format into a
stream of ASCII characters, which can be transmitted over a link
which doesn't support binary transparently. You still have to
define the binary format: how you convert a float into a stream
of 8-bit octets. How you do this, of course, will depend on
what binary format you are using: XDR, for example, requires four
octets, with the sign on bit 7 of the first octet, etc., etc.
Until you've defined this format, it's impossible to say how to
convert a float to base64. (More generally, the question
doesn't have an answer, because base64 itself doesn't encode
floats, etc., but only octet streams.)
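
To make the "define the binary format first" step concrete, here is a sketch that produces four octets, most significant first, which is the layout XDR uses for a single-precision value. It rests on assumptions that are checked or stated in the code: the host `float` is IEEE 754 binary32, and `float` and `std::uint32_t` share the same byte order in memory (true on mainstream platforms, but not guaranteed by the standard). The function name `float_to_big_endian_octets` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes float is IEEE 754 binary32");

// Write value as four octets, sign and exponent first.
// Assumption: float and std::uint32_t use the same byte order on this host.
void float_to_big_endian_octets(float value, unsigned char out[4])
{
    assert(out != nullptr);
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // sizes match, checked above
    // Shifts operate on the integer value, so the output order does not
    // depend on the host's endianness.
    out[0] = static_cast<unsigned char>((bits >> 24) & 0xFF);
    out[1] = static_cast<unsigned char>((bits >> 16) & 0xFF);
    out[2] = static_cast<unsigned char>((bits >> 8) & 0xFF);
    out[3] = static_cast<unsigned char>(bits & 0xFF);
}
```

The resulting four octets are what would then be handed to a base64 encoder; the base64 step itself knows nothing about floats.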
 
James Kanze

Jack Klein wrote:

[...]
Isn't using unions the idiomatic thing to do in this case?

It's formally undefined behavior (although I suspect that most
compilers support it). The standard-sanctioned way is with
reinterpret_cast, but this is not without problems, and is
probably less portable than the union in practice. memcpy is
guaranteed to work, everywhere (and also avoids any alignment
issues which might otherwise crop up).
Like this:

#include <iostream>
#include <cmath>
#include <algorithm>
#include <iterator>

int main()
{
const size_t size = sizeof(double);
union {
double f;
unsigned char s[size];
} u;
u.f = 4.0 * std::atan(1.0); // Pi
std::cout << u.f << '\n';
std::copy(&u.s[0], &u.s[size],
std::ostream_iterator<unsigned int>(std::cout, "-"));
std::cout << std::endl;

}

On IA32:
3.14159
24-45-68-84-251-33-9-64-
On PA-RISC:
3.14159
64-9-33-251-84-68-45-24-

On a Sun Sparc, if I try this somewhere in the middle of a
larger buffer (say at the second byte), I get a core dump:).
(Correctly aligned, or using memcpy, the results are the same as
those of the PA-RISC.) Try it on just about any mainframe, and
you'll get still other values.
 
Paul Brettschneider

James said:
Jack Klein wrote:
[...]
Then you can memcpy() the float into the beginning of the
buffer. Then you can encode the 6 byte buffer.
Isn't using unions the idiomatic thing to do in this case?

It's formally undefined behavior (although I suspect that most
compilers support it). The standard sanctionned way is with
reinterpret_cast, but this is not without problems, and is
probably less portable than the union in practice. memcpy is
guaranteed to work, everywhere (and also avoids any alignment
issues which might otherwise crop up).

Strange, I thought the memcpy() and casts have the problem of strict
aliasing.
Like this:

#include <iostream>
#include <cmath>
#include <algorithm>
#include <iterator>

int main()
{
const size_t size = sizeof(double);
union {
double f;
unsigned char s[size];
} u;
u.f = 4.0 * std::atan(1.0); // Pi
std::cout << u.f << '\n';
std::copy(&u.s[0], &u.s[size],
std::ostream_iterator<unsigned int>(std::cout, "-"));
std::cout << std::endl;

}

On IA32:
3.14159
24-45-68-84-251-33-9-64-
On PA-RISC:
3.14159
64-9-33-251-84-68-45-24-

On a Sun Sparc, if I try this somewhere in the middle of a
larger buffer (say at the second byte), I get a core dump:).

AFAIK, the union guarantees that alignment is correct for all of its
members. The idea was to allocate the union on the stack and copy from/to
there. But since it's undefined behaviour, the discussion is moot...
(Correctly aligned, or using memcpy, the results are the same as
those of the PA-RISC.) Try it on just about any mainframe, and
you'll get still other values.

And since they use EBCDIC you can't simply use text representation. Though I
guess most sensible transport layers can transform EBCDIC to ASCII on the
fly.
 
James Kanze

James said:
Jack Klein wrote:
[...]
Then you can memcpy() the float into the beginning of the
buffer. Then you can encode the 6 byte buffer.
Isn't using unions the idiomatic thing to do in this case?
It's formally undefined behavior (although I suspect that most
compilers support it). The standard sanctionned way is with
reinterpret_cast, but this is not without problems, and is
probably less portable than the union in practice. memcpy is
guaranteed to work, everywhere (and also avoids any alignment
issues which might otherwise crop up).
Strange, I thought the memcpy() and casts have the problem of
strict aliasing.

According to the standards (C and C++), you're allowed to access
any object as an array of unsigned char (or any char type in
C++). Which means that strict aliasing can't be applied as soon
as one of the pointers involved is a char type.
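
The rule described above, that any object may be examined as a sequence of unsigned char, can be sketched like this. The helper name `object_bytes` is invented for the example; it only reads the bytes and copies them out, which is the case the standards explicitly permit.

```cpp
#include <cstddef>
#include <vector>

// Return a copy of the bytes that make up obj, read through a pointer to
// unsigned char. Reading an object this way is permitted for any object type.
template <typename T>
std::vector<unsigned char> object_bytes(const T& obj)
{
    const unsigned char* p = reinterpret_cast<const unsigned char*>(&obj);
    return std::vector<unsigned char>(p, p + sizeof(T));
}
```

For a `double d`, `object_bytes(d)` has `sizeof(double)` elements in the host's native byte order, so the values still depend on the machine, as the IA32 and PA-RISC outputs in this thread show.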

In practice, I think some compilers overlook this point. (I seem
to recall hearing that g++ was one of them.) As I mentioned, for
whatever reasons, and regardless of what the standard says, I
think the use of a union is probably somewhat more portable than
the use of reinterpret_cast here, although it wouldn't surprise
me if either caused problems with some compilers. (On the other
hand, I've cast

The issue with memcpy is different: you've let a pointer escape
to a function. Either the compiler knows the semantics of
memcpy somehow, in which case, it knows that it's modifying your
object, or it doesn't, in which case, it has to assume that it
might modify your object. Something like:
    float f ;
    void* p = &f ;
    float* pf = static_cast< float* >( p ) ;
    *pf = ...
is definitely legal, well defined, and actually not that uncommon
in C code. And the compiler must assume that if you pass the
address of a float to memcpy (converting it implicitly to
void*), that memcpy might do something like the last two lines
internally. And unlike the case of reinterpret_cast to unsigned
char*, I've never heard of a compiler getting this one wrong.

My xdrstreams make no use of aliasing whatsoever, using
something like:

    bool isNeg = source < 0 ;
    if ( isNeg ) {
        source = - source ;
    }
    int exp ;
    if ( source == 0.0 ) {
        exp = 0 ;
    } else {
        // frexp returns a fraction in [0.5, 1); scale it to a 24-bit mantissa
        source = ldexp( frexp( source, &exp ), 24 ) ;
        // IEEE bias (127) less one, since the fraction is below 1
        exp += 126 ;
    }
    unsigned long mant = source ;
    dest.put( (isNeg ? 0x80 : 0x00) | exp >> 1 ) ;
    dest.put( ((exp << 7) & 0x80) | ((mant >> 16) & 0x7F) ) ;
    dest.put( mant >> 8 ) ;
    dest.put( mant ) ;

to output a float. In the end, I think it's the only way to be
100% sure. (But I suspect that it may have an unacceptable
performance hit in some cases, although at least on a Sun Sparc,
it's not nearly as slow as it looks.)
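
For anyone who wants to try the fragment above, here is one way it might be wrapped into a self-contained function. The name `put_xdr_float` and the output-array interface are assumptions made for this sketch; the original writes to a stream called `dest`. Like the fragment, it only covers zero and values in the normal single-precision range: denormals, infinities and NaN would need extra cases, and negative zero comes out as positive zero.

```cpp
#include <cassert>
#include <cmath>

// Encode value as four big-endian IEEE 754 single-precision octets using
// only arithmetic (frexp/ldexp), with no aliasing and no memcpy.
// Assumption: value is zero or a finite number in the normal float range.
void put_xdr_float(float value, unsigned char out[4])
{
    assert(out != nullptr);
    double source = value;
    bool isNeg = source < 0;
    if (isNeg) source = -source;
    int exp = 0;
    if (source != 0.0) {
        // frexp yields a fraction in [0.5, 1); scaling by 2^24 gives a
        // 24-bit integer mantissa whose top (implicit) bit is set.
        source = std::ldexp(std::frexp(source, &exp), 24);
        exp += 126;  // IEEE bias of 127, minus 1 because the fraction is < 1
    }
    unsigned long mant = static_cast<unsigned long>(source);
    // Sign bit plus the upper 7 bits of the 8-bit exponent.
    out[0] = static_cast<unsigned char>(((isNeg ? 0x80 : 0x00) | (exp >> 1)) & 0xFF);
    // Lowest exponent bit plus the upper 7 stored mantissa bits
    // (the implicit leading mantissa bit is masked off).
    out[1] = static_cast<unsigned char>(((exp << 7) & 0x80) | ((mant >> 16) & 0x7F));
    out[2] = static_cast<unsigned char>((mant >> 8) & 0xFF);
    out[3] = static_cast<unsigned char>(mant & 0xFF);
}
```

Calling `put_xdr_float(1.0f, out)` fills `out` with 3F 80 00 00, and `-2.5f` gives C0 20 00 00, regardless of the host's byte order.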
Like this:
#include <iostream>
#include <cmath>
#include <algorithm>
#include <iterator>
int main()
{
const size_t size = sizeof(double);
union {
double f;
unsigned char s[size];
} u;
u.f = 4.0 * std::atan(1.0); // Pi
std::cout << u.f << '\n';
std::copy(&u.s[0], &u.s[size],
std::ostream_iterator<unsigned int>(std::cout, "-"));
std::cout << std::endl;
}
On IA32:
3.14159
24-45-68-84-251-33-9-64-
On PA-RISC:
3.14159
64-9-33-251-84-68-45-24-
;)
On a Sun Sparc, if I try this somewhere in the middle of a
larger buffer (say at the second byte), I get a core dump:).
AFAIK, the union guarantees that alignment is correct for all
of its members. The idea was to allocate the union on the
stack an copy from/to there.

OK. Exactly as you've written it, there's no problem. I
thought you were thinking more along the lines of artificially
placing the union over the buffer. (I've seen more than a few
programmers who try to do that.) When you define a variable
with a union type, of course, the compiler must ensure alignment
(or rather, it must ensure that you can access all of the
elements of the union without a core dump).
But since it's undefined behaviour, the discussion is moot...

Unless you're more concerned about what compilers actually do
than about what the standard says :).
And since they use EBCDIC you can't simply use text
representation. Though I guess most sensible transport layers
can transform EBCDIC to ASCII on the fly.

I've often wondered a bit about this myself. Usually, as you
say, code translation takes place during file transfer. But
what happens on shared disks? Do mainframes support
arbitrary disk sharing, say mounting a file system served by a
Unix machine? Somehow, I rather doubt it.

Note that some mainframes don't use 2's complement for integral
types either. (Unisys has two mainframe architectures. One is
36 bit 1's complement, the other 48 bit signed magnitude, with,
however, only 39 value bits in the 48, and no unsigned
arithmetic, so INT_MAX == UINT_MAX. I can imagine that more
than a few "portable" programs would have problems with one of
those.)
 
Paul Brettschneider

James said:
On Mar 17, 9:01 am, Paul Brettschneider wrote:
Jack Klein wrote:
[...]
Then you can memcpy() the float into the beginning of the
buffer. Then you can encode the 6 byte buffer.
Isn't using unions the idiomatic thing to do in this case?
It's formally undefined behavior (although I suspect that most
compilers support it). The standard sanctionned way is with
reinterpret_cast, but this is not without problems, and is
probably less portable than the union in practice. memcpy is
guaranteed to work, everywhere (and also avoids any alignment
issues which might otherwise crop up).
Strange, I thought the memcpy() and casts have the problem of
strict aliasing.

According to the standards (C and C++), you're allowed to access
any object as an array of unsigned char (or any char type in
C++). Which means that strict aliasing can't be applied as soon
as one of the pointers involved is a char type.

Interesting. I will only use uint8_t * from now on. (Not really.)
In practice, I think some compilers overlook this point. (I see
to recall hearing the g++ was one of them.)

Oh yes, I've definitely been bitten by aliasing issues with char pointers to
non-char data on g++.
As I mentions, for
whatever reasons, and regardless of what the standard says, I
think the use of a union is probably somewhat more portable that
the use of reinterpret_cast here, although it wouldn't surprise
me if either caused problems with some compilers. (On the other
hand, I've cast

The issue with memcpy is different: you've let a pointer escape
to a function. Either the compiler knows the semantics of
memcpy somehow, in which case, it knows that it's modifying your
object, or it doesn't, in which case, it has to assume that it
might modify your object. Something like:
float f ;
void* p = &f ;
float* pf = static_cast< float* >( p ) ;
*pf = ...
is definitly legal, well defined, and actually not that uncommon
in C code. And the compiler must assume that if you pass the
address of a float to memcpy (converting it implicitly to
void*), that memcpy might do something like the last two lines
internally. And unlike the case of reinterpret_cast to unsigned
char*, I've never heard of a compiler getting this one wrong.

My xdrstream's make no use of aliasing whatsoever, using
something like:

bool isNeg = source < 0 ;
if ( isNeg ) {
source = - source ;
}
int exp ;
if ( source == 0.0 ) {
exp = 0 ;
} else {
source = ldexp( frexp( source, &exp ), 24 ) ;
exp += 126 ;
}
unsigned long mant = source ;
dest.put( (isNeg ? 0x80 : 0x00) | exp >> 1 ) ;
dest.put( ((exp << 7) & 0x80) | ((mant >> 16) & 0x7F) ) ;
dest.put( mant >> 8 ) ;
dest.put( mant ) ;

to output a float. In the end, I think it's the only way to be
100% sure. (But I suspect that it may have an unacceptable
performance hit in some cases, although at least on a Sun Sparc,
it's not nearly as slow as it look.)

Sure, also if you have to handle different float representations you will
end up with code like this.
[...]
And since they use EBCDIC you can't simply use text
representation. Though I guess most sensible transport layers
can transform EBCDIC to ASCII on the fly.

I've often wondered a bit about this myself. Usually, as you
say, code translation takes place during file transfer. But
what happens on shared disks. But do mainframes support
arbitrary disk sharing, say mounting a file system served by a
Unix machine? Somehow, I rather doubt it.

I have a hard time imagining something like this, since the concept of files
is quite different on mainframes (highly structured) vs. Unix (stream of
bytes). In this case structured is a good idea since you know the datatype
of every field and can act on it accordingly. OTOH, there's nothing that
doesn't exist. ;)
Note that some mainframes don't use 2's complement for integral
types either. (Unisys has two mainframe architectures. One is
36 bit 1's complement, the other 48 bit signed magnitude, with,
however, only 39 value bits in the 48, and no unsigned
arithmetic, so INT_MAX == UINT_MAX. I can imagine that more
than a few "portable" programs would have problems with one of
those.)

Wow. This must be tough for compiler writers to get all the cases right!
 
