Explanation needed of binary operators


NoNeYa

Howdy folks,
I am trying to read an int from a file written in binary. After
researching that archaic file structure, I have found that it is stored as
little endian/ least significant first. I have seen code to read this as an
int in Java but don't understand what is happening. I guess that I just
don't understand what is going on with the shifting of bits and "anding"
with other values. Could someone please explain this in detail? Or lead me
to a good source for in-depth reading? I don't just want code... I want to
understand. If this seems trivial, please excuse me... I'm only a second
semester Computer Science major who likes to keep his brain busy during the
summer off and would like to make a new front-end for an old program that I
use at work.

Thanks!
 

Twisted

NoNeYa said:
I have seen code to read this as an int in Java but don't understand
what is happening. I guess that I just don't understand what is going
on with the shifting of bits and "anding" with other values.

Endianness is a tricky matter. A 16-bit int provides a simple example
because there are only two orders it would normally go in:

highbyte lowbyte

or

lowbyte highbyte

To reconstruct it you only need to read into byte variables named
"highByte" and "lowByte", the correct one into each, and then

short result = (short) (((highByte & 0xFF) << 8) | (lowByte & 0xFF));

and Bob's your uncle: the high byte, shifted left eight bits (i.e. times
256), bitwise-OR'd with the low byte. The & 0xFF masks matter in Java,
where byte is signed: a byte holding 0x80 widens to the int 0xFFFFFF80,
so both adding and OR-ing the raw value would smear ones over the other
byte. C lets you sidestep this with e.g. "typedef unsigned char byte;",
but in Java you must mask each byte back to its low eight bits before
combining.
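To see the sign-extension pitfall concretely, here's a tiny demo of my
own (the values are made up for illustration):

byte highByte = (byte) 0x12;
byte lowByte  = (byte) 0x80;   // bit pattern 1000 0000

int widened = lowByte;         // widening sign-extends: 0xFFFFFF80
short bad  = (short) ((highByte << 8) | lowByte);                   // 0xFF80 -- wrong
short good = (short) (((highByte & 0xFF) << 8) | (lowByte & 0xFF)); // 0x1280 -- right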

With Java ints and C longs (32 bits) you've got 24 possible orderings
of the four bytes, because the first can be in any of four places, the
second in any of the remaining three, and the third in either of the
remaining two, before the fourth is forced into the only remaining
place -- 4*3*2 is 24. In practice, the orders you usually see are two
little-endian shorts or two big-endian shorts, with the high short
either first or last (so two independent endian choices and at most
four common byte-orderings).

Any order whatsoever can be dealt with by getting byte1, byte2, byte3,
and byte4 to refer to the LSB, the next least significant, and so forth,
reading them in whichever order they occur in the data stream (so you
might read byte3 first, depending on the byte order in the stream).
Then left-shifting and OR-ing:

int result = ((byte4 & 0xFF) << 24) | ((byte3 & 0xFF) << 16) |
             ((byte2 & 0xFF) << 8)  |  (byte1 & 0xFF);

Note the single | (bitwise OR) rather than || (logical OR, which in Java
applies only to booleans), and the & 0xFF masks, which undo the sign
extension Java performs when it widens a signed byte to an int. (Java's
unsigned *right* shift >>> exists for the same reason, but isn't needed
when shifting left.)

The basic explanation is that you have 32 bits in a line. The first
eight are the high byte, and the last eight are the low byte of the
int. (Java int here; C/C++ users must use long to ensure having 32
bits. Java long is always 64 bits, more than you need here.)

Each byte is masked and widened to an int, which gives an int whose
bottom eight bits are the eight we're interested in. For the high byte
we need them in the top eight, and the <<24 shifts them left 24 places,
so bit 7 (the highest of the interesting eight, counting the lowest bit
as bit 0) becomes bit 7+24=31, the leftmost bit of the int. So the shift
moves the eight interesting bits into the top eight. The shifts on byte3
and byte2 move their bits into the middle positions. The last one isn't
shifted and stays in the lowest position. So after the shifts but
before the ORs, we have changed, say,

file1: ZWXY

into

byte1: ...X (. = eight zero bits)
byte2: ...Y
byte3: ...Z
byte4: ...W

(by reading byte3, byte4, byte1, and byte2 in that order)

into

temp1: ...X
temp2: ..Y.
temp3: .Z..
temp4: W...

and now the bitwise OR operations just combine them by copying the
nonzero bits into the result at the same place, so Z or . is Z, . or W
is W, etc. and we get:

result: WZYX

with the correct byte order unscrambled from the file's ZWXY order.
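Putting the pieces together, a minimal sketch of the whole read (my own
illustration, not code from the original program; it assumes the stream
is positioned at the four little-endian bytes):

import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

public class LittleEndianRead {
    // Reads four bytes, least significant first, and assembles an int.
    static int readLittleEndianInt(DataInputStream in) throws IOException {
        int b1 = in.readUnsignedByte();  // LSB arrives first in the stream
        int b2 = in.readUnsignedByte();
        int b3 = in.readUnsignedByte();
        int b4 = in.readUnsignedByte();  // MSB arrives last
        // readUnsignedByte() already returns 0..255, so no & 0xFF is needed.
        return (b4 << 24) | (b3 << 16) | (b2 << 8) | b1;
    }

    public static void main(String[] args) throws IOException {
        DataInputStream in =
            new DataInputStream(new FileInputStream(args[0]));
        try {
            System.out.printf("0x%08x%n", readLittleEndianInt(in));
        } finally {
            in.close();
        }
    }
}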
 

Andreas Leitgeb

NoNeYa said:
I am trying to read an int from a file written in binary. After
researching that archaic file structure, I have found that it is stored as
little endian/ least significant first.

you've got two ways from here:
1.) read it in as an integer, and then do mask&shift-magic
on the integer to obtain an endian-swapped version of it.
2.) read four bytes separately, and compose them to an integer.

anyway, you need to be aware of how the separate bytes
constitute the final result:
in the stream you have four bytes, b1 b2 b3 b4.
The integer value you want is:
0x1*b1 + 0x100*b2 + 0x10000*b3 + 0x1000000*b4
by nature of little ends :)

Multiplication by these constants is equivalent to
*left*-shifting by 0,8,16,24 bits respectively.
(division would be *right*-shifting)

If you read in the integer canonically from stream, you
actually get this number:
0x1000000*b1 + 0x10000*b2 + 0x100*b3 + 0x1*b4
by nature of big ends.

So you'd have to do shifting, masking and finally adding
to re-arrange the bit-patterns of the integer.

Sometimes, shifting does the masking for you: if you divide the
whole number by 0x1000000 ( >>> 24 -- Java's unsigned shift, so the
sign bit isn't copied in from the left), it's obvious that only b1
remains.
If you right-shift by 8 bits, then obviously
0x10000*b1 + 0x100*b2 + 0x1*b3 remains, and after
masking with 0xff00, only 0x100*b2 remains, another one
of the parts which your desired result consists of.

That is: to extract b1 from that wrong-endian int, you just
*right*-shift it by 24 bits (unsigned), the other bits vanishing by
themselves, so no masking is necessary here. For b2, you shift the
original number only 8 bits to the *right* (to change its factor from
0x10000 to 0x100), and then mask it with 0xff00 (adapted to b2's bits'
target position).
For b3 you first mask (again the original value) with 0xff00
and then *left*-shift 8 bits, and for b4 you only need to
*left*-shift the original number by 24 bits; like b1, no
masking is necessary.
All these separately shifted octets are then re-assembled,
either with the or-operator "|" or (in this case equivalently) by adding.

I hope it helped and wasn't itself more complicated than the
original problem ;-)
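A sketch of option 1 in code (my own illustration; it applies exactly
the shifts and masks described above to an int that was read big-endian,
e.g. with DataInputStream.readInt()):

static int swapBytes(int x) {
    return (x >>> 24)             // b1: unsigned shift, nothing left to mask
         | ((x >> 8) & 0xff00)    // b2: shift right 8, keep the 0xff00 bits
         | ((x & 0xff00) << 8)    // b3: mask first, then shift left 8
         | (x << 24);             // b4: the high bits simply fall off the top
}

Note the >>> for b1: Java's plain >> copies the sign bit in from the
left, which would leave garbage in the top 24 bits for negative inputs.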
 

Lew

Andreas said:
you've got two ways from here:
1.) read it in as an integer, and then do mask&shift-magic
on the integer to obtain an endian-swapped version of it.
2.) read four bytes separately, and compose them to an integer.

You can also use a java.nio.IntBuffer, which "knows" about endianness through
its java.nio.ByteOrder.
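For instance (a minimal sketch of my own; the byte array stands in for
four bytes read from the file):

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class NioEndian {
    public static void main(String[] args) {
        // The four bytes as they appear in a little-endian file: LSB first.
        byte[] raw = { (byte) 0x44, (byte) 0x33, (byte) 0x22, (byte) 0x11 };
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        int value = buf.getInt();          // 0x11223344; default order is BIG_ENDIAN
        buf.rewind();                      // back to the start
        int same = buf.asIntBuffer().get();  // the IntBuffer view agrees
        System.out.printf("0x%08x 0x%08x%n", value, same);
    }
}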
 

Nigel Wade

NoNeYa said:
Howdy folks,
I am trying to read an int from a file written in binary. After
researching that archaic file structure, I have found that it is stored as
little endian/ least significant first.

An interesting viewpoint, binary data as an "archaic" file structure. I wonder
how you would store your data in a non-binary form. Don't forget that a "text"
file is simply binary interpreted in a very specific way, and one person's
(ASCII) "text" file may be another person's (EBCDIC) binary garbage.
I have seen code to read this as an int in Java but don't understand
what is happening. I guess that I just don't understand what is going on
with the shifting of bits and "anding" with other values. Could someone
please explain this in detail?

I will do my best to explain - without using any code.

Endian-ness is fun. It adds excitement and joy to the otherwise tedious task of
developing portable code to read arbitrary binary data formats. Java has made
the life of the data processor much less interesting by taking this task and
wrapping it up in the ByteBuffer class. However, for the purposes of learning
it is a good thing to understand what is going on behind the scenes.

There are [essentially] two types of endianness, big-endian and little-endian.
Big-endian hardware stores bytes in memory in their "natural" format, with the
"big" end on the "left" (lower memory address). Little-endian hardware was
designed to do the opposite, just to be awkward.

Let's assume we have 3 variables, containing a char, a 16-bit int ("short") and a
32-bit int ("long"). We'll assign the hex values of 0x11, 0x1122 and 0x11223344
to these variables respectively. If these variables occupied consecutive memory
addresses (or were output to binary file in sequence) on big-endian hardware
the contents of memory would be 0x11, 0x11, 0x22, 0x11, 0x22, 0x33, 0x44. On
little-endian hardware the values would be 0x11, 0x22, 0x11, 0x44, 0x33, 0x22,
0x11. As you can see, little-endian hardware has reversed the bytes of each
value. (NOTE: If you write binary data from Java it is *always* output in
big-endian order).

If you write the data to a file and read it back on the same hardware using the
same variable types [and the same language] then there is no problem. The bytes
will be stored in the correct locations and the variables will have the correct
contents. The fun comes when you read the data as a byte array, or attempt to
read it on the other type of hardware or use a language which makes different
assumptions about the type of data.

To see how it all goes horribly wrong let's try to read the little-endian data
file (written by some language other than Java) into Java. Remember, the order
of the bytes in the little-endian binary file is 11 22 11 44 33 22 11. So we read the
first byte and treat it as a byte, and this is ok. Next we read the two bytes
0x22 and 0x11 and get the short integer 0x2211, not what we wanted at all. The
situation is the same for the "long" integer which will contain 0x44332211.
This is where byte shifting and masking becomes necessary (if you don't use
Java or don't use ByteBuffer in Java), the contents of the "short" and "long"
integers have to be reversed.

You can do this more easily by reading into a byte array and extracting the
correct bytes. For example, for the 4-byte "long" integer, reading the bytes
into a byte array you will get array[0]=0x44, array[1]=0x33, array[2]=0x22 and
array[3]=0x11. To construct the correct integer (0x11223344) you need to shift
array[3] left 24 places so it becomes 0x11000000, combine that with array[2]
left shifted 16 bits (0x220000) etc. How you write the code to do this is up to
you, I said I wouldn't use any code.
 

NoNeYa

Nigel Wade said:
An interesting viewpoint, binary data as an "archaic" file structure. I
wonder how you would store your data in a non-binary form.

I think I may have worded that a little oddly. I meant to say "The *.DBF
file structure itself is archaic, but it does use binary storage in its
header".


 

NoNeYa

Mark Space said:
So did any of these explanations help you out?

Some have clarified the situation somewhat. I do realize that I am in
"way over my head" for my level of education in programming. I am
continuing to learn more elsewhere and have asked one of my professors for
additional sources of reading. I thank all who responded. To directly
answer your question, all of the replies have educated me somewhat, but I am
still underwater looking up at the top. I am using other sources to learn
more and have purchased another book to read. I just wish I could find a
book that deals with reading binary and using binary operators, in a "baby
step" process with great explanation and code examples. The problem isn't
resolved.... but I ain't givin' up yet!

Thanks.
 

John W. Kennedy

Andreas said:
you've got two ways from here:
1.) read it in as an integer, and then do mask&shift-magic
on the integer to obtain an endian-swapped version of it.
2.) read four bytes separately, and compose them to an integer.

3.) in Java 1.5 and up, read it as an integer and then use
Short.reverseBytes(), Integer.reverseBytes(), or Long.reverseBytes(), as
appropriate.
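Which collapses the whole job to a one-liner (a sketch, assuming the
file is wrapped in a java.io.DataInputStream):

// readInt() consumes four bytes big-endian (the only order it knows);
// Integer.reverseBytes(), added in Java 1.5, then swaps the result:
static int readLittleEndianInt(java.io.DataInputStream in)
        throws java.io.IOException {
    return Integer.reverseBytes(in.readInt());
}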
 

Andreas Leitgeb

John W. Kennedy said:
3.) in Java 1.5 and up, read it as an integer and then use
Short.reverseBytes(), Integer.reverseBytes(), or Long.reverseBytes(), as
appropriate.

You're of course right, but I (perhaps mis-)understood the
original poster that he wanted to understand the details of
bit-shifting used for byte-reversing an int.
 

Mike Schilling

Twisted said:
In practice, the orders you usually see are two little-endian shorts or
two big-endian shorts, with the high short either first or last (so two
independent endian choices and at most four common byte-orderings).

In fact, you're very unlikely to see anything other than strict
little-endian (LSB-B2-B3-MSB) or strict big-endian (MSB-B3-B2-LSB). The
only exception I've ever seen was the ordering used by the PDP-11
floating-point processor [1], which was B3-MSB-LSB-B2.

1. Which could process both floats and 32-bit integers.
 

Ben Phillips

Mike said:
In fact, you're very unlikely to see anything other than strict
little-endian (LSB-B2-B3-MSB) or strict big-endian (MSB-B3-B2-LSB). The
only exception I've ever seen was the ordering used by the PDP-11
floating-point processor [1], which was B3-MSB-LSB-B2.

That's the third of the four Twisted mentioned, the fourth being
B2-LSB-MSB-B3.

I can't recall ever seeing a byte sex other than one of those three
myself, and only the two strict-endian ones seem to be used in any
modern PC or server hardware architectures.

OTOH I can recall a proliferation of very incompatible systems back in
the good old days -- 9- and 10-bit bytes, 7-bit bytes, even 6-bit bytes
and binary-coded decimal (yuck!!), and character orderings for the basic
A-Z stuff other than ASCII (EBCDIC, notably) or various bastardized
forms of almost-ASCII. (Pop quiz -- which popular system's pseudo-ASCII
had no {}, rearranged !@#$%^&*(), had an actual up-arrow symbol for ^,
had a £ symbol in the low 127, and had control characters that
represented colours? It actually let you type these in, mostly with
shift-number or other-modifier-key-number.)

These days we have it *easy*, with big-endian and little-endian and
cr/lf/crlf as the only two spots of low level data conversion
awkwardness. At least a byte is a byte is a byte is eight bits long and
character 65 (0x41; 0101; 01000001) is always 'A'! :)

(Imagine trying to write, edit, or otherwise work with C source on a
system with no {} characters! It's probably no coincidence the systems
with {} missing were mainly programmed in assembly, or sometimes in
something icky like BASIC, and absolutely never in anything portable.)
 

Mike Schilling

Ben Phillips said:
(Imagine trying to write, edit, or otherwise work with C source on a
system with no {} characters! It's probably no coincidence the systems
with {} missing were mainly programmed in assembly, or sometimes in
something icky like BASIC, and absolutely never in anything portable.)

Many European keyboards lacked both square and curly brackets, leading to
the use of digraphs. See
http://david.tribble.com/text/cdiffs.htm#C90-digraph. And yes, that's
another horror we no longer contend with.
 

Real Gagnon

(Pop quiz -- which popular system's pseudo-ASCII
had no {}, rearranged !@#$%^&*(), had an actual up-arrow symbol for ^,
had a £ symbol in the low 127, and had control characters that
represented colours? It actually let you type these in, mostly with
shift-number or other-modifier-key-number.)

Looks like the Sinclair ZX Spectrum to me!

Bye!
 

Arne Vajhøj

Ben said:
These days we have it *easy*, with big-endian and little-endian and
cr/lf/crlf as the only two spots of low level data conversion
awkwardness. At least a byte is a byte is a byte is eight bits long and
character 65 (0x41; 081; 01000001) is always 'A'! :)

In the blue world EBCDIC is still used.

Arne
 

Martin Gregorie

Arne said:
In the blue world EBCDIC is still used.

Which, on the AS/400 at least, lacked {} and used trigraphs instead.

Trivia: the reason the character ordering in EBCDIC is such a mess is
that the encodings are binary representations of the IBM 029 card
punch's hole patterns. That's why you get the odd gaps between I and J
and between R and S. This was extended to allow for lower case. And no,
I have no idea why 0-9 are F0-F9 rather than 00-09.
 

Roedy Green

Martin Gregorie said:
I have no idea why 0-9 are F0-F9 rather than 00-09.

ASCII has the same strangeness.
0-9 are 30-39.

I suspect the reason was pedantry. It would be even harder to get
students to understand the difference between the number 0 and the
character 0 if they had the same binary representation.
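That separation is also what makes the classic digit-conversion idiom
work; for example, in Java (a tiny demo of my own):

class DigitDemo {
    public static void main(String[] args) {
        char c = '7';               // digits are contiguous in ASCII and EBCDIC
        int digit = c - '0';        // ASCII: 0x37 - 0x30 == 7
        System.out.println(digit);  // prints 7
    }
}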
 
