Read binary data file

Lew

Charles said:
Let's review what the OP stated

A struct is given in C++

Data needs to be read from a file in Java.

You have the following data types

unsigned long
unsigned short

As previously stated by other posters, the endianness of the platform
affects how the output file is encoded. I assume this to be true but
have not verified it.

We assume all unsigned longs and unsigned shorts will ALWAYS have the
same byte size.

The complete struct is given as

unsigned long data1;
unsigned short data2;
unsigned short data3;
unsigned long data4;

Can we also assume that the data will always be sequenced as described
in the STRUCT?
I don't see any argument why the data will be out of sequence as
defined in the STRUCT.

But we do not know the padding, and the OP doesn't know what those sizes are,
nor the endianness of their files. They don't even know in what format the
floating-point values are stored: IEEE? We need all that information to craft
a Java equivalent, and we don't have it. The OP doesn't have it, by their
account.
Does the input file get modified when it is transported from one
operating system to another?
I assume NO. This is not verified.

But if endianness and padding matter, the fact that it is not modified will
make it unreadable on the second system.
Are there equivalents of unsigned long and unsigned short in Java?
No.

Are they the same byte size?

We do not know. The OP hasn't given us enough information.
Do they encode the data the same?

We do not know. The OP hasn't given us enough information.
Try to read in Java and verify with known data. If you don't know any
of the data values this becomes a harder task.

It's already impossible based on the information given. How much harder can
it get?
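If all of those unknowns were pinned down, the Java side would be mechanical. A minimal sketch of a reader for the OP's struct, assuming 32-bit unsigned longs, 16-bit unsigned shorts, little-endian byte order, and no padding (every one of which must be verified against the actual files before trusting the result):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class StructReader {
    // Reads one record of the OP's struct from raw bytes.
    // Assumed layout: 4-byte data1, 2-byte data2, 2-byte data3, 4-byte data4,
    // little-endian, no padding. Unsigned values are widened into a larger
    // signed Java type, since Java has no unsigned primitives.
    public static long[] read(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN);
        long data1 = buf.getInt()   & 0xFFFFFFFFL; // unsigned long  -> long
        long data2 = buf.getShort() & 0xFFFFL;     // unsigned short -> long
        long data3 = buf.getShort() & 0xFFFFL;
        long data4 = buf.getInt()   & 0xFFFFFFFFL;
        return new long[] { data1, data2, data3, data4 };
    }
}
```

If the files turn out to be big-endian, only the `order(...)` call changes; if the longs are 8 bytes, `getInt()` becomes `getLong()` and the masking goes away.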
 
Martin Gregorie

Lew said:
It's already impossible based on the information given. How much harder
can it get?
If the OP *MUST* move binary data, at least do it in a platform and
language-independent manner and use ASN.1 encoding.
 
DRS.Usenet

I'm not sure if this is the same issue, but I'm trying to interpret
numeric values out of a chunk of data as follows:

int toBinary theValue
124 1111100 3.8
63 111111 4
224 11100000 4.8
63 111111 4
63 111111 4
224 11100000 4.8
64 1000000 3.2
63 111111 4
244 11110100 5
124 1111100 3.8

I can read "int" out of my blob of data, and I ran toBinaryString on
it just to visualize it. I manually typed "theValue" (that is what I
KNOW the test data is). Can someone help me figure out what code to
run in order to get "theValue"?

--Dale--
 
~kurt

Martin Gregorie said:
If the OP *MUST* move binary data, at least do it in a platform and
language-independent manner and use ASN.1 encoding.

I understand Hunter's comments, and while I don't know much about
ASN.1 encoding, what I am pointing out is that binary files are usually
*not* intended to be used across systems. Every binary data file I have
ever worked with was intended to be used either by the program that wrote
it, or separate applications that used the same utility libraries as the
application which wrote the data. There is nothing wrong with simply writing
the C structure to a file, and reading it in the same way. In this case
the code, and not some specification, drives the format of the data - and there
is *nothing* wrong with this. The lack of a need to share the data outside of
the application is what often drives the decision to use binary data in the
first place (why not take advantage of the efficiency binary files have to
offer).

Of course, every once in a while an outside user decides they want to use this
data. Well, then they have a choice. Either generate it themselves, or
spend a few hours writing something that can read it in - not a big price
to pay.

- Kurt
 
Roedy Green

int toBinary theValue
124 1111100 3.8
63 111111 4
224 11100000 4.8
63 111111 4
63 111111 4
224 11100000 4.8
64 1000000 3.2
63 111111 4
244 11110100 5
124 1111100 3.8

I can read "int" out of my blob of data, and I ran toBinaryString on
it just to visualize it. I manually typed "theValue" (that is what I
KNOW the test data is). Can someone help me figure out what code to
run in order to get "theValue"?

If you get enough samples you can create a
private static final double[] translate = new double[256];
to do the translation for you.
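A sketch of that table-driven approach, seeded only with the pairs Dale listed; every byte value not yet observed maps to NaN until more samples come in:

```java
public class ByteTranslator {
    // Roedy's suggested lookup table: index by the raw byte value,
    // get back the decoded value. Entries below are Dale's known pairs only.
    private static final double[] TRANSLATE = new double[256];
    static {
        java.util.Arrays.fill(TRANSLATE, Double.NaN); // unknown bytes -> NaN
        TRANSLATE[124] = 3.8;
        TRANSLATE[63]  = 4.0;
        TRANSLATE[224] = 4.8;
        TRANSLATE[64]  = 3.2;
        TRANSLATE[244] = 5.0;
    }

    public static double toValue(int rawByte) {
        return TRANSLATE[rawByte & 0xFF];
    }
}
```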

In what context did you see this data? It looks like it might be some
sort of sound encoding technique. You can read up on the specs for the
encoding.

see http://mindprod.com/jgloss/sound.html to help get you started.

It might also be some sort of Huffman encoding. See
http://mindprod.com/jgloss/huffman.html
 
Esmond Pitt

~kurt said:
I understand Hunter's comments, and while I don't know much about
ASN.1 encoding, what I am pointing out is that binary files are usually
*not* intended to be used across systems.

Except for all the ones that are, e.g. protocol dumps; databases;
interpretive pseudo-code (e.g. .class files), ...
Every binary data file I have
ever worked with was intended to be used either by the program that wrote
it, or separate applications that used the same utility libraries as the
application which wrote the data.

Except for the ones that aren't: e.g. protocol dumps; databases;
interpretive pseudo-code (e.g. .class files), ...
There is nothing wrong with simply writing
the C structure to a file, and reading it in the same way. In this case
the code, and not some specification, drives the format of the data - and there
is *nothing* wrong with this.

There is plenty wrong with this. The format of binary data written
directly from a struct in memory depends on at least the following:

- the host hardware
- the compiler
- the compiler version
- the surrounding #pragmas
- the compiler options that were in effect when the binary that wrote
the file was compiled

This is too many dependencies, on too many things that can't be controlled.

The only time writing a struct from memory to a file or a network can
sanely be justified is when the target application is constructed with
the same version of the same object file that wrote it. And this is not
a guarantee that in general can be met.
 
Martin Gregorie

~kurt said:
I understand Hunter's comments, and while I don't know much about
ASN.1 encoding, what I am pointing out is that binary files are usually
*not* intended to be used across systems.
I think its use is quite industry-dependent: I've never seen it used in
financial messaging (that's more likely to use SWIFT formats, which are
tagged text) but it's common in the telecommunications industry.

Telcos (both fixed line and mobile) use a lot of binary data for control
and accounting purposes, mainly because this minimizes message size and
there's a LOT of stuff flying around controlling the network in real
time and accounting for its use. Switches from large vendors, e.g.
Ericsson, tend to use proprietary, flat message formats but if the data
will be exchanged between different types of kit (e.g. roaming billing
data) they tend to use ASN.1: CCITT likes it.

ASN.1 has a lot in common with XML in that it's a tagged field protocol,
allows nesting, and uses a tag dictionary to associate meanings with
tags. Compared with XML it's a LOT more compact (tags are one byte, fixed
length fields don't have terminators, variable length fields are
preceded by a one or two byte length) and it has a number of predefined
field types as well as arrays. If you have the dictionary it's easy to
interpret on the fly though, like XML, you can also use the dictionary
to generate code to encode and decode ASN.1 records.
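To show how cheap on-the-fly interpretation is, here is a deliberately simplified TLV decoder sketch in Java: one-byte tags and one-byte lengths only, whereas real BER also allows multi-byte tags and long-form lengths.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

public class TlvDecoder {
    // Simplified sketch of tag-length-value decoding: each field is a
    // one-byte tag, a one-byte length, then that many value bytes.
    // This is illustration only, not a conforming BER/DER parser.
    public static Map<Integer, byte[]> decode(byte[] data) {
        Map<Integer, byte[]> fields = new LinkedHashMap<>();
        int i = 0;
        while (i + 2 <= data.length) {
            int tag = data[i] & 0xFF;
            int len = data[i + 1] & 0xFF;
            fields.put(tag, Arrays.copyOfRange(data, i + 2, i + 2 + len));
            i += 2 + len;
        }
        return fields;
    }
}
```

In a real system the tag dictionary would supply the meaning and type of each tag, exactly as Martin describes.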
Every binary data file I have
ever worked with was intended to be used either by the program that wrote
it, or separate applications that used the same utility libraries as the
application which wrote the data.
There's also a lot of binary data in large commercial systems. Formerly
it was in large serial files, then flat indexed files, now it's probably
in a database. A really good reason for using an RDBMS is that it not
only hides implementation details (like endian conventions) from the
application, but the interfaces (SQL, JDBC, ODBC, etc) typically provide
field conversion facilities.
There is nothing wrong with simply writing
the C structure to a file, and reading it in the same way.
I'd probably use a CSV format any place where a database would be
obvious overkill, but ymmv.

Using CSV rather than binary makes debugging easier and (said with his
*NIX hat on) it allows the data to be handled by common scripted
utilities like awk, perl and even shell scripts. Oh yeah, Java too :)
 
Mike Schilling

Esmond said:
Except for all the ones that are, e.g. protocol dumps; databases;
interpretive pseudo-code (e.g. .class files), ...

How often do database *files* get moved from one system to another? In my
experience, they stay on the server where the DBMS engine is running.
 
~kurt

Esmond Pitt said:
The only time writing a struct from memory to a file or a network can

Who is talking about writing data to a network?
sanely be justified is when the target application is constructed with
the same version of the same object file that wrote it. And this is not
a guarantee that in general can be met.

Uh, this is pretty much what I just said other than I see no need for
the "guarantee" part - it is not necessary unless the *intent* is to
distribute the data externally.

As I said, my gripe is in calling the originator of the OP's data clueless.
That statement is simply clueless itself. Yes, if the original program had
been written in Java, then maybe that statement would be true. But this
is a C++ program. The data files are most likely "private", only to be
used internally. Sure, if you port the code to another platform, the
binary files between the two versions may not be compatible, but so what -
that usually isn't a problem. The new code will create binary files that
are compatible with itself. Creating some external specification that this
binary data must meet would be stupid because then, if you did port the
code, now you may have to modify it to be compatible with the original
specification, and this may require more processing of the data. Suddenly,
some specification is driving internal data, and robbing some degree of
performance from the application.

Just because a bureaucrat comes along some time down the road and says
"thou shalt write a Java program (not that Java is the best solution in
this case, but because it is the 'in' thing to do) that will use Program X's
internal data files" does not mean Program X was poorly designed.

- Kurt
 
Mike Schilling

~kurt said:
Who is talking about writing data to a network?


Uh, this is pretty much what I just said other than I see no need for
the "guarantee" part - it is not necessary unless the *intent* is to
distribute the data externally.

As I said, my gripe is in calling the originator of the OP's data
clueless. That statement is simply clueless itself. Yes, if the
original program had been written in Java, then maybe that statement
would be true. But this
is a C++ program. The data files are most likely "private", only to
be used internally. Sure, if you port the code to another platform,
the binary files between the two versions may not be compatible, but
so what - that usually isn't a problem. The new code will create
binary files that are compatible with itself. Creating some external
specification that this binary data must meet would be stupid because
then, if you did port the code, now you may have to modify it to be
compatible with the original specification, and this may require more
processing of the data. Suddenly, some specification is driving
internal data, and robbing some degree of performance from the
application.

The danger is that a different compiler (or different version of the same
compiler) would cause an incompatibility. The good news is that compiler
vendors tend not to change struct layouts for that very reason. Still, this
needs to be kept in mind and tested for whenever that sort of change is
made.

Another point, not yet mentioned (or if it has been, I missed that post):
Any structured data that's saved persistently should contain a version
number. If it never changes, you've added a small amount of overhead. When
it does change, it's now straightforward to convert older versions and
recognize new ones, which, without the explicit versioning, can be difficult
or impossible.
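A sketch of that versioning idea using the group's usual `DataOutputStream`/`DataInputStream`; the magic number and the version-1 layout below are invented purely for illustration:

```java
import java.io.*;

public class VersionedFile {
    static final int MAGIC = 0xCAFED00D;   // hypothetical magic number
    static final int CURRENT_VERSION = 2;

    // Always write the magic and version before the payload.
    static void save(DataOutput out, long payload) throws IOException {
        out.writeInt(MAGIC);
        out.writeInt(CURRENT_VERSION);
        out.writeLong(payload);
    }

    // The version number tells us which layout to expect on the way back in.
    static long load(DataInput in) throws IOException {
        if (in.readInt() != MAGIC) throw new IOException("not one of our files");
        int version = in.readInt();
        switch (version) {
            case 1:  // hypothetical old format: 32-bit payload
                return in.readInt();
            case 2:
                return in.readLong();
            default:
                throw new IOException("written by a newer version: " + version);
        }
    }
}
```

The cost when nothing changes is eight bytes per file; the payoff when something does change is that old files remain readable and unknown future files fail loudly instead of being misparsed.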
 
Arne Vajhøj

Mike said:
How often to database *files* get moved from one system to another? In my
experience, they stay on the server where the DBMS engine is running.

It has been attempted occasionally.

It is usually not supported and often it does not work.

Arne
 
Martin Gregorie

Mike said:
The danger is that a different compiler (or different version of the same
compiler) would cause an incompatibility. The good news is that compiler
vendors tend not to change struct layouts for that very reason. Still, this
needs to be kept in mind and tested for whenever that sort of change is
made.
Actually, there's a more subtle way of failing that can bite an
executable that reloads data it wrote itself: there's not
necessarily a guarantee that the chunks of data will be read back to the
same virtual memory addresses they were saved from, so they had better not
contain pointers that are expected to remain valid.

I've been there: I had a program that did lookups on a few hundred
million phone numbers. It used a B-tree for in-memory lookups: the same
lookup using a database wouldn't run faster than 700 lookups/second and
we needed 3000, hence the B-tree which ran at 25,000/second. BUT startup
took 40 minutes to populate the B-tree from the database, so I saved the
B-tree by simply dumping its dataspace to files that were reloaded on
startup. The B-tree grew continuously, so it was split over a number of
multi-megabyte memory chunks: each was written to a separate file.
Reloading these reduced startup time to under 5 minutes. However, the
first iteration merely crashed because the OS (a Mach-based UNIX) didn't
reload the chunks into the same places in my process's virtual memory,
so the pointers were so much junk. FWIW the fix was to replace standard
pointers with my own addressing scheme: this occupied the same space,
but replaced pointers with structs containing two fields,
chunkno:chunk_offset. This sidestepped the problem and ran acceptably fast.
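Martin's chunkno:chunk_offset idea translates naturally to Java (hypothetical class; the pair is packed into a single long the way the C version packed it into a pointer-sized struct):

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkStore {
    // Position-independent addressing: an "address" is a (chunk number,
    // offset) pair, so chunks can be reloaded anywhere in memory without
    // invalidating references, unlike raw pointers.
    private final List<byte[]> chunks = new ArrayList<>();

    // Registers a chunk and returns the address of its first byte.
    public long alloc(byte[] chunk) {
        chunks.add(chunk);
        return address(chunks.size() - 1, 0);
    }

    // Pack chunk number into the high 32 bits, offset into the low 32.
    public static long address(int chunkNo, int offset) {
        return ((long) chunkNo << 32) | (offset & 0xFFFFFFFFL);
    }

    public byte get(long addr) {
        int chunkNo = (int) (addr >>> 32);
        int offset  = (int) addr;
        return chunks.get(chunkNo)[offset];
    }
}
```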

I know this is somewhat OT for c.j.j.p but knowing about it may save
somebody's hide one of these days.
 
Mike Schilling

Martin said:
I've been there: I had a program that did lookups on a few hundred
million phone numbers. It used a B-tree for in-memory lookups: the
same lookup using a database wouldn't run faster than 700
lookups/second and we needed 3000, hence the B-tree which ran at
25,000/second. BUT
startup took 40 minutes to populate the B-tree from the database, so
I saved the B-tree by simply dumping its dataspace to files that were
reloaded on startup. The B-tree grew continuously, so it was split
over a number of multi-megabyte memory chunks: each was written to a
separate file. Reloading these reduced startup time to under 5
minutes. However, the first iteration merely crashed because the OS
(a Mach-based UNIX) didn't reload the chunks into the same places in
my process's virtual memory, so the pointers were so much junk. FWIW the
fix was to replace
standard pointers with my own addressing scheme: this occupied the
same space, but replaced pointers with structs containing two fields,
chunkno:chunk_offset. This sidestepped the problem and ran acceptably
fast.

On some OS's you could have created a memory-mapped file at whatever address
you provided, which lets you both use absolute addresses and avoid the
startup overhead by letting the file page itself in. Yours is a nice "with
simple tools" solution.
 
Gordon Beaton

On some OS's you could have created a memory-mapped file at whatever
address you provided, which lets you both use absolute addresses and
avoid the startup overhead by letting the file page itself in. Yours
is a nice "with simple tools" solution.

There are many components that make up the address space of an
application, and there is no guarantee that the same block of
addresses will always be available to the application. A program that
depends on that particular feature of mmap() is extremely fragile and
can't be expected to work across upgrades of the software or any of
the libraries it depends on. That might be ok for hobby projects, but
I'd never ship such a beast to a customer.

/gordon

 
Mike Schilling

Gordon said:
There are many components that make up the address space of an
application, and there is no guarantee that the same block of
addresses will always be available to the application. A program that
depends on that particular feature of mmap() is extremely fragile and
can't be expected to work across upgrades of the software or any of
the libraries it depends on. That might be ok for hobby projects, but
I'd never ship such a beast to a customer.

I'm not really familiar with mmap(); wouldn't it be possible to choose a
starting address well beyond the highest address the application proper
could use? I was actually thinking of VMS, where the address could be in a
part of virtual memory that isn't used by the application at all.

In any case, if it's possible to allocate enough contiguous virtual memory
at some location, all that's needed is to adjust the stored addresses by the
difference [1], and you can still page the file in as needed. If you're not
sure of contiguous memory, you effectively have the OP's solution of (chunk,
offset) pairs.

Though if you're doing this, it's more logical to store offsets to the start
of the file rather than addresses.
 
Nigel Wade

~kurt said:
I understand Hunter's comments, and while I don't know much about
ASN.1 encoding, what I am pointing out is that binary files are usually
*not* intended to be used across systems. Every binary data file I have
ever worked with was intended to be used either by the program that wrote
it, or separate applications that used the same utility libraries as the
application which wrote the data.

Pretty much all scientific data I have worked with over the past 25 years has
been written in binary, and is intended to be read on just about any platform
you'd care to use. The basic principle behind being able to do this is writing
the binary data in a well structured form, in a reliable and portable way.
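One portable pattern along those lines, sketched with `DataOutputStream` (which always writes big-endian "network order") and the OP's field names; the 32-bit widths are an assumption:

```java
import java.io.*;

public class PortableRecord {
    // Write each field explicitly in a fixed, documented byte order,
    // rather than dumping a struct: no padding, no compiler dependence.
    // Unsigned values are carried in wider signed Java types.
    static byte[] write(long data1, int data2, int data3, long data4)
            throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt((int) data1);   // unsigned 32-bit, big-endian
        out.writeShort(data2);       // unsigned 16-bit
        out.writeShort(data3);
        out.writeInt((int) data4);
        return bytes.toByteArray();
    }
}
```

A C writer following the same written-down layout (field by field, with htonl/htons) would produce byte-identical files, which is the whole point.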
There is nothing wrong with simply writing
the C structure to a file, and reading it in the same way.

There is everything wrong with this. This is the fundamental problem. The amount
of padding which is used internally within a struct is undefined by the
language - it is entirely up to the compiler developer. If you write a struct
in binary both the data *and the padding* will be output together, all
intermingled. Further, since the amount of padding is at the discretion of the
compiler writers they are free to change the amount they use in any release of
their compiler. So you could quite easily find that an upgrade to the compiler
causes your code, which you say is perfectly acceptable, to break even on the
same hardware and OS.
In this case
the code, and not some specification, drives the format of the data - and there
is *nothing* wrong with this.

Yes there is. Code which writes unspecified data to a binary file is bad code.
It will almost certainly break at some time in the future.
The lack of a need to share the data outside of
the application is what often drives the decision to use binary data in the
first place (why not take advantage of the efficiency binary files have to
offer).

But it is wise to know what is being written into your binary file so that you
can reliably read it back in. Otherwise it's reverse GIGO, it's GOGI - garbage
out, garbage in.
Of course, every once in a while an outside user decides they want to use this
data. Well, then they have a choice. Either generate it themselves, or
spend a few hours writing something that can read it in - not a big price
to pay.

But somewhat difficult if the original program's author didn't know what they
were writing into their binary files.
 
Lew

Nigel said:
There is everything wrong with this. This is the fundamental problem. The amount
of padding which is used internally within a struct is undefined by the
language - it is entirely up to the compiler developer. If you write a struct
in binary both the data *and the padding* will be output together, all
intermingled. Further, since the amount of padding is at the discretion of the
compiler writers they are free to change the amount they use in any release of
their compiler. So you could quite easily find that an upgrade to the compiler
causes your code, which you say is perfectly acceptable, to break even on the
same hardware and OS.

A point which has been made several times in this thread.
Yes there is. Code which writes unspecified data to a binary file is bad code.
It will almost certainly break at some time in the future.

Most emphatically.
But it is wise to know what is being written into your binary file so that you
can reliably read it back in. Otherwise it's reverse GIGO, it's GOGI - garbage
out, garbage in.

Another point which has been made several times in this thread, in various ways.
But somewhat difficult if the original program's author didn't know what they
were writing into their binary files.

Which is why we keep advising the OP (who seems to have lost interest in their
question) to determine exactly what that format they're using, then to code to
that specification. This point seems to have been lost repeatedly.

I would love for the OP to chime in and let us know that they've done this
step. How 'bout it, Windsor.Locks? Any luck with that analysis? What did
you find?
 
Nigel Wade

Lew said:
A point which has been made several times in this thread.


I know.

But certain posters in the thread still seem to be lacking the necessary clue.
By continuing to hit them again and again with the same clue-stick, the
message might eventually begin to sink in.

Maybe we need to introduce lines: write 1000 times (without using the
cut-and-paste buffer) "I must not write C structs to binary files".

As to reading binary data, I prefer to use ByteBuffer to handle
big-/little-endian issues. Although it might not be particularly efficient for
reading large quantities of binary data it is convenient, reasonably
transparent, and it's part of the standard API so should always be available.
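To illustrate the point: the same four bytes decode to different ints depending on the order set on the buffer, which is why stating it explicitly matters.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class EndianDemo {
    // Decode four bytes as an int under the given byte order.
    public static int decode(byte[] b, ByteOrder order) {
        return ByteBuffer.wrap(b).order(order).getInt();
    }
}
```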
 
