Binary file IO: Converting imported sequences of chars to desired type


James Kanze

[...]
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?

I think he missed a "when possible", or something similar.
Binary formats are an optimization: you sometimes need this
optimization (and you certainly should be aware of the
possibility of using it), but you don't use them unless timing
or data size constraints make it necessary.
 

Alf P. Steinbach

* James Kanze:
I actually did some measurements, to check the numbers. Your
numbers were wrong. More to the point, actual numbers will vary
enormously from one implementation to the next.


Not everyone reads that group. Not everyone agrees with its
moderation policy (as currently practiced).

Would you care to elaborate on that hint, please.


Cheers,

- Alf
 

Rune Allnor

    [...]
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed.  Who has been saying that text formats are
"universally preferable" to binary formats?

I think he missed a "when possible", or something similar.

*You* are accusing *me* of missing the fine print??!!

Let's see what I have written. From my post

http://groups.google.no/group/comp.lang.c++/msg/1c4004bbac86a046

[RA] > > File I/O operations with text-formatted floating-point data take a lot of time
[JK] > A lot of time compared to what?

[RA] Wall clock time. Relative time, compared to dumping
binary data to disk. Any way you want.

...
[RA] > > The rule-of-thumb is 30-60 seconds per 100 MBytes of text-formatted data
[JK] > Try it on what machine:).

[RA] Any machine. The problem is to decode text-formatted numbers
to binary.

...
Here is a test I wrote in matlab a few years ago, to demonstrate
the problem (WinXP, 2.4GHz, no idea about disk):

[matlab code snipped]

Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.0/0.1 = 240x faster than text writes.
Binary reads are 42.2/0.32 = 130x faster than text reads.

...
The timing numbers (both absolute and relative) would be of
similar orders of magnitude if you repeated the test with C++.
...
The application I'm working with would need to crunch through
some 10 GBytes of numerical data per hour. (At the 30-60 seconds
per 100 MBytes rule of thumb above, text decoding alone would eat
3000-6000 seconds - that is, 50-100 minutes - of every hour.)

I think these excerpts should be sufficient to sketch what
kind of world I am living and working in.

Do note that I never - unlike some other participants in this
thread - claimed my numbers to be exact. I am fairly certain
my English is good enough that the above would reasonably be
expected to be interpreted by a reader as *representative*
numbers. If you look closely, I also commented that coding
up a program in C++, instead of matlab as I had done, would
result in *different* numbers, but not solve the fundamental
problem.
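
For reference, a minimal C++ analogue of that Matlab test might
look like the sketch below. This is my sketch, not code from the
thread: the file names and element count are made up, std::clock
measures CPU rather than wall-clock time, and the raw write()
dumps the native, non-portable representation of double.

#include <cstddef>
#include <ctime>
#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    std::size_t const n = 1000000;     // made-up element count
    std::vector<double> data(n, 3.14159265358979);

    std::clock_t t0 = std::clock();
    {
        std::ofstream out("test.txt"); // text: format every value
        for (std::size_t i = 0; i != n; ++i)
            out << data[i] << '\n';
    }
    std::clock_t t1 = std::clock();
    {
        std::ofstream out("test.bin", std::ios::binary);
        out.write(reinterpret_cast<char const*>(&data[0]),
                  n * sizeof(double)); // binary: one raw block write
    }
    std::clock_t t2 = std::clock();

    std::cout << "text write:   "
              << double(t1 - t0) / CLOCKS_PER_SEC << " s\n"
              << "binary write: "
              << double(t2 - t1) / CLOCKS_PER_SEC << " s\n";
    return 0;
}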

So I can't see any reason why you attack me for my numbers
being "wrong"; I never stated they were exact.

A few posts further out:

http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6

[RA] So what does text-based formats actually buy you?
[JK] Shorter development times, less expensive development, greater
reliability...

In sum, lower cost.

[RA] As long as you keep two factors in mind:
1) The user's time is not yours (the programmer) to waste.
2) The user's storage facilities (disk space, network
bandwidth, etc.) are not yours (the programmer) to waste.

[JK] The user pays for your time. Spending it to do something
which results in a less reliable program, and that he doesn't
need, is irresponsible, and borders on fraud.

This one really pissed me off. Here I had explained to you
what application I am working with, made you aware of the users'
requirements in the operational situation, and you explicitly
stated that paying attention to such concerns is 'borderline fraud'!

So I cannot interpret this in any other way than that you will
use text-based formats, come hell or high water. Which essentially
invalidates any otherwise relevant arguments you might have presented
throughout the thread.
Binary formats are an optimization:

No, they are not. The selection of file formats is a strategic design
decision on a par with choosing between a binary O(lg N) and a linear
O(N) search algorithm; like choosing between an O(N lg N) quicksort
and an O(N^2) bubble sort.

Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.

True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.
you sometimes need this
optimization (and you certainly should be aware of the
possibility of using it), but you don't use them unless timing
or data size constraints make it necessary.

Hypocrite!

This is exactly what I have been arguing for days and weeks already.
What changed?

Rune
 

James Kanze

On Nov 8, 11:11 am, Rune Allnor <[email protected]> wrote:
[...]
A couple of weeks ago I posted a question on comp.lang.c++
about some technicality about binary file IO. Over the
course of the discussion, I discovered to my amazement -
and, quite frankly, horror - that there seems to be a school
of thought that text-based storage formats are universally
preferable to binary formats for reasons of portability
and human readability.
That seems to me an inaccurate description of this thread.
Kanze has pointed out the strengths of text formats, but
has also noted that there are times when binary formats
are needed. Who has been saying that text formats are
"universally preferable" to binary formats?
I think he missed a "when possible", or something similar.
*You* are accusing *me* of missing the fine print??!!

[...]
I think these excerpts should be sufficient to sketch what
kind of world I am living and working in.

I fully understand what kind of world you're working in. As a
consultant, I've worked on seismic applications too, albeit not
recently.
Do note that I never - unlike some other participants in this
thread - claimed my numbers to be exact.

Being off by more than an order of magnitude is not just a question
of "exactness".
I am fairly certain my English is good enough that the above
would reasonably be expected to be interpreted by a reader as
*representative* numbers. If you look closely, I also
commented that coding up a program in C++, instead of matlab as
I had done, would result in *different* numbers, but not solve
the fundamental problem.
So I can't see any reason why you attack me for my numbers
being "wrong"; I never stated they were exact.

First, I didn't "attack" you. On the whole, I understand your
problem. Stating that the difference is some 100 times is
misleading, however.
A few posts further out:

[RA] So what does text-based formats actually buy you?
[JK] Shorter development times, less expensive development, greater
reliability...
In sum, lower cost.
[RA] As long as you keep two factors in mind:
1) The user's time is not yours (the programmer) to waste.
2) The user's storage facilities (disk space, network
bandwidth, etc.) are not yours (the programmer) to waste.
[JK] The user pays for your time. Spending it to do something
which results in a less reliable program, and that he doesn't
need, is irresponsible, and borders on fraud.
This one really pissed me off. Here I had explained to you
what application I am working with, made you aware of the
users' requirements in the operational situation, and you
explicitly stated that paying attention to such concerns is
'borderline fraud'!

I didn't say that. I said that ignoring issues of development
time and reliability is fraud. You have to make a trade-off; if
text-based IO isn't sufficiently fast for the user's needs, or
requires too much additional space, then you use binary. But
you consider the cost of doing so, and weigh it against the
other costs.
So I cannot interpret this in any other way than that you
will use text-based formats, come hell or high water.

How do you read that into anything I've said? I've simply
pointed out that using text does buy you something, or in other
words, that using binary has a cost. There's no doubt that using
text has other costs. Engineering is about weighing the
different costs; if you don't know what text-based formats buy
you, then you can't weigh the costs accurately.
Which essentially invalidates any otherwise relevant arguments
you might have presented throughout the thread.
No, they are not. The selection of file formats is a strategic design
decision on a par with choosing between a binary O(lg N) and a linear
O(N) search algorithm; like choosing between an O(N lg N) quicksort
and an O(N^2) bubble sort.

Which are also optimizations:).

There are optimizations and optimizations. Sometimes you do
know up front that you'll need the optimization; if you know
that you'll have to deal with millions of elements, you know up
front that a quadratic algorithm won't do the trick.

In the case of choosing binary, the motivation for doing so up
front is a bit different---after all, the difference will never
be other than linear. Partially, the motivation can be
calculated: if you know the number of elements, you can
calculate the disk space needed up front. In many cases,
however, you know that you'll be locked into the format you
choose, so you have to consider performance issues earlier.
Once you start considering performance issues, however, you're
talking about optimization.
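
To make that calculation concrete (my example, not a figure from
the thread): 10^9 doubles written as raw binary occupy exactly
10^9 x 8 = 8 GBytes, whereas as text each value can take anywhere
from 2 to roughly 25 characters depending on precision and
separators, so the text size can only be bounded in advance, not
computed.
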
Such factors govern what problems can be handled by the software
with reasonable effort and within reasonable time.
True, both binary and text-based numerical IO are O(N), but since
text-based numerical IO is orders of magnitude slower, the strategic
impact on design decisions is the same.

There you go exaggerating again. It's not orders of magnitude
slower. At the most, it's around 10 times slower, and often the
difference is even less. That doesn't mean that it's irrelevant,
and sometimes you will have to use a binary format (and
sometimes, you'll have to adapt the binary format, to make it
quicker).
 

James Kanze

* James Kanze:
Would you care to elaborate on that hint, please.

"Not everyone" means "at least me". I stopped participating in
the group because I found the moderation was becoming too heavy
in some cases. Others, I know, aren't bothered with it. To
each his own.
 

Brian

There you go exaggerating again.  It's not orders of magnitude
slower.  At the most, it's around 10 times slower, and often the
difference is even less.  That doesn't mean that it's irrelevant,
and sometimes you will have to use a binary format (and
sometimes, you'll have to adapt the binary format, to make it
quicker).


This Gianni Mariani quote indicates he saw some
differences of more than 10x.

"However, reading and writing binary files can have HUGE
performance gains. I once came across some numerical code
where it would read and write large datasets. These datasets
were 40-100MB. The performance was horrendus. Using mapped
files and binary data made the reading and writing virtually
zero cost and it improved the performance of the product by
nearly 10x times and in some tests over 1000x. Be careful -
this is one application and the bottle neck was clearly
identified. This may not be where your application spends
its time."
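
In case it helps, the mapped-file technique Gianni describes
might look something like the POSIX sketch below. This is my
assumption - the quote doesn't name an API - and "data.bin"
with its raw-double layout is made up; such a raw dump only
reads back correctly on a machine with the same double
representation.

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <iostream>

int main()
{
    int fd = open("data.bin", O_RDONLY); // hypothetical raw-double file
    if (fd < 0)
        return 1;
    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;
    std::size_t const n = st.st_size / sizeof(double);

    // Map the whole file: pages fault in lazily, so there is no
    // explicit read loop and no per-value decoding at all.
    void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    double const* data = static_cast<double const*>(p);

    double sum = 0.0;
    for (std::size_t i = 0; i != n; ++i)
        sum += data[i];
    std::cout << "sum = " << sum << '\n';

    munmap(p, st.st_size);
    close(fd);
    return 0;
}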


I hope to beef up the C++ Middleware Writer's support
for writing and reading data more generally. To begin
with I'm going to focus on integral types and assume
8-bit bytes. Currently we don't have support for uint8_t,
uint16_t, etc. I guess those are the types I'll start with.
I'm going through the newsgroup archives to find snippets
that are helpful in this area. If anyone has a link wrt
this, I'm interested.
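
For what it's worth, one minimal way to do portable binary I/O of
the fixed-width unsigned types is to fix the byte order in the
stream, as in the sketch below. This is my sketch under the same
8-bit-byte assumption, not the C++ Middleware Writer's actual
interface; the streams must be opened with std::ios::binary.

#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>

// Write the bytes most-significant first, so the stream contents
// do not depend on the writing machine's byte order.
template <typename T>
void writeBE(std::ostream& out, T value)
{
    for (int shift = (sizeof(T) - 1) * 8; shift >= 0; shift -= 8)
        out.put(static_cast<char>((value >> shift) & 0xFF));
}

// Read the bytes back in the same order. (No error handling:
// this is a sketch only, and in.get() is not checked for EOF.)
template <typename T>
T readBE(std::istream& in)
{
    T value = 0;
    for (std::size_t i = 0; i != sizeof(T); ++i)
        value = static_cast<T>((value << 8) | (in.get() & 0xFF));
    return value;
}

// Usage, e.g.:
//   writeBE<std::uint16_t>(file, 42);
//   std::uint16_t x = readBE<std::uint16_t>(file);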


Brian Wood
http://www.webEbenezer.net
 
