Difference between binary file and ASCII file

Herbert Rosenau

<snip>

I think you are in violent agreement with me. I was responding to a
question about why C has text streams as well as binary streams with an
explanation of the problems if it did not. You are explaining why C
programs see an abstraction (e.g. text and binary streams) with the
system specifics handled at a lower level.

Yes - but in the case in question it helps nothing. Some years ago I had
the job of writing a program that had to read text files and reformat them
from line mode to stream mode (meaning a paragraph becomes a single line,
independent of how many lines it occupied in the source). Problem: the
files to convert on a single machine arrived as native text
- originating from DOS/WIN, OS/2, FTP text \r\n
- originating from 370 FTP binary mode \n
- originating from 370 virtual console \r
All found mixed up in a single directory tree on local disk.
Some of them were created with a strange program using 0x8d as soft
line feed.
Reading any of them as text failed to produce clean output.

So reading it in binary mode and interpreting
\r\n\r\n as paragraph separator
\r\r "
\n\n "
convert 0x8d to either nothing or single space
convert (\r)\n\f to \n\n
convert \f to nothing or single space
\t as single space - except in tables
\t as sequence of spaces in tables to fill up the columns
remove any syllabification (that is, rejoin a word that was split across
a line break into a single word) but leave genuine hyphens intact

and then reformat to 80 column fixed font, leaving tables intact.

No problem so far, but the different newline separators made it
impossible to read the files as text: the only way to tell the
different text modes apart was to read them as a binary stream.

Custom myungetc() and mygetc() functions were needed to unget multiple chars.

--
Tschau/Bye
Herbert

Visit http://www.ecomstation.de the home of german eComStation
eComStation 1.2 Deutsch ist da!
 
S.Tobias

P.J. Plauger said:
System compatibility is a damned important reason.

All right. But besides that, is there any advantage that text files/mode
offer that binary files/mode don't have?

Suppose I'm serializing data into a textual representation to be read on
another system (with the same charset). Does it matter whether I open
the file in text or binary mode?

> Whitesmiths, Ltd. introduced the text/binary dichotomy in 1978 when porting C to
> dozens of non-Unix systems, and other companies did much the same thing
> in the coming years. It was a slam dunk to put it in the draft C Standard
> begun in 1983.

I've often heard in c.l.c. some systems (Mainframes) had complicated
internal representation of text files. Why was it that way? What did
it solve? Why couldn't they be replaced with simple "binary" files with
'\n' as record separator?

IMHO how text is represented could be viewed as a per-application
convention rather than system-wide. `Sendmail' doesn't have to read
`inetd' configuration files and v.v., so there is no reason why they
should follow the same text representation convention. It means that
on a system several (or even unlimited) conventions might be present.
Why is there in the C language room only for one type of text stream?

There is only one "binary" file (bytes are stored in the file exactly as
written to; no translation is done). Why isn't the default open mode
(such as "r+") binary? It seems to me more natural to have settled it
this way.
 
Keith Thompson

S.Tobias said:
All right. But besides that, is there any advantage that text files/mode
offer that binary files/mode don't have?

Um, yes. Text files represent text.

> Suppose I'm serializing data into a textual representation to be read on
> another system (with the same charset). Does it matter whether I open
> the file in text or binary mode?

Absolutely. For example, as I'm sure you know, Windows represents an
end-of-line by two characters, a CR followed by an LF ('\r' followed
by '\n'). If you write a "text" file on one Windows system and read
it on another in binary mode, there are two possibilities: either the
program that reads the file has to explicitly discard the '\r'
characters, or the file won't be a valid Windows text file, and you
won't be able to process it with other tools, such as ordinary text
editors.

> I've often heard in c.l.c. some systems (Mainframes) had complicated
> internal representation of text files. Why was it that way? What did
> it solve? Why couldn't they be replaced with simple "binary" files with
> '\n' as record separator?

They could have. They weren't.

Historically, files on mainframes were typically stacks of 80-column
punch cards. The complex internal representations of text files were
based on that. (I'm not very familiar with this, so I could be
mistaken.) Changing to a Unix-style format would break compatibility.

> IMHO how text is represented could be viewed as a per-application
> convention rather than system-wide. `Sendmail' doesn't have to read
> `inetd' configuration files and v.v., so there is no reason why they
> should follow the same text representation convention. It means that
> on a system several (or even unlimited) conventions might be present.

That sounds like a nightmare. Do I have to have one version of vi or
emacs to read sendmail config files and another to read indentd config
files?

> Why is there in the C language room only for one type of text stream?

Why does there need to be more than one?

> There is only one "binary" file (bytes are stored in the file exactly as
> written to; no translation is done). Why isn't the default open mode
> (such as "r+") binary? It seems to me more natural to have settled it
> this way.

There is no default open mode. You always have to specify whether
you're opening the file in text or binary mode. You specify binary
mode by including a 'b' in the mode argument; you specify text mode by
not including a 'b' in the mode argument.
 
Dik T. Winter

> I've often heard in c.l.c. some systems (Mainframes) had complicated
> internal representation of text files. Why was it that way? What did
> it solve? Why couldn't they be replaced with simple "binary" files with
> '\n' as record separator?

Because they also did not have simple "binary" files. A "binary" file
consisted of (for instance) fixed length records of (say) 80 bytes
(whatever the size of a byte). This conformed to the Fortran and
Cobol models (also for text files), with only an implicit record
separator. And if there were variable length records available, they
were either represented by a length preceding the content or (on the
CDC Cyber) as a sequence of words, each containing 10 6-bit bytes or
5 12-bit bytes, where the last word in the sequence contained 12 zero
bits in the low order part. The ultimate reason was that I/O was
record oriented, because of speed.
 
Ben Bacarisse

I've often heard in c.l.c. some systems (Mainframes) had complicated
internal representation of text files. Why was it that way? What did
it solve? Why couldn't they be replaced with simple "binary" files with
'\n' as record separator?

If I can be informatively flippant: the problem was not a complicated
*internal* representation, but of a dominant *external* one -- punched
cards. In a world of cards, why would one waste one of the precious 72
character spaces (the last 8 were often reserved for sequence numbering)
for a marker to show the end of something so obvious as the end of
the card? It ended -- the computer got a signal the card was done. What
was the point of a marker?

In that world, \n (and \r and \0 used to pad the output) were seen as
control characters sent only to a printer so that it would advance the
paper and re-position the head (if it had one!).

Another consequence was that spaces did not really exist (or more
precisely that there were a lot of implicit ones). Most card punches
(if I remember right) punched nothing where there was a space so you could
not tell where the "line" ended, except for the obvious: after 80
characters (or 72 if you were stripping sequence numbers).
 
Richard Heathfield

Keith Thompson said:
Do I have to have one version of vi or
emacs to read sendmail config files and another to read indentd config
files?

The indentd daemon (after just a quick pint of config down at the /etc)
faithfully chunters along in the background, waiting and watching for that
momentous occasion when the user decides to run indent, a thin client which
opens a connection to the daemon, hurls the C code down it, and says
"whaddya mek o' that, then, laddie?"

Nothing daunted, indentd bravely catches the code, and turns it from IOCCC
material into something approximately approaching readability. Handing it
back to the client with a smart salute and a "have a nice day", indentd
awaits the next urgent case of mangled layout, knowing that every readable
program it produces is another victory for God, Queen, and country.
 
Richard Bos

CBFalconer said:
However the user should be aware that everything breaks down if the
input system tries to handle a file as text when that file doesn't
adhere to the conventions for text on the system.

So, as the doctor said to the man who complained that his arm hurt when
he hit his elbow against the wall, Don't Do That, Then. That's why we
have FTP in A mode.

Richard
 
S.Tobias

Keith Thompson said:
....

> Um, yes. Text files represent text.

Well, binary files can contain text, too.

>> Suppose I'm serializing data into a textual representation to be read on
>> another system (with the same charset). Does it matter whether I open
>> the file in text or binary mode?
>
> Absolutely. For example, as I'm sure you know, Windows represents an
> end-of-line by two characters, a CR followed by an LF ('\r' followed [...]

I wasn't clear enough, so I have to restate the problem. Suppose the file
is not meant for interaction with other system tools (editors), but is
a means of transferring the results to another instance of a similar
program that will continue calculations (i.e. the writing mode is known).

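    #include <stdio.h>

    #define BINM "b"                    /* or "" to get text mode */

    double calculate(void);             /* defined elsewhere */
    void cont_calculation(double);      /* defined elsewhere */

    int main(void)
    {
        double result = calculate();
        FILE *fp = fopen("results.txt", "w" BINM);
        fprintf(fp, "%f\n", result);
        fclose(fp);

        fp = fopen("results.txt", "r" BINM);
        fscanf(fp, "%lf", &result);     /* %lf, since result is a double */
        fclose(fp);
        cont_calculation(result);
        return 0;
    }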

Will it matter if BINM is #defined as "b" or as nothing?
Can binary mode replace the text mode in this way? I'm reading from
the Standard that nul characters may be appended to a binary stream;
could this cause problems if I want to handle binary stream in text
manner, like in the above sketch? (Will append mode + multiple closing
and opening work correctly?)

> That sounds like a nightmare. Do I have to have one version of vi or
> emacs to read sendmail config files and another to read indentd config
> files?

Or the editors would have to be able to read multiple text formats.
IIRC, Windows XP WordPad and Notepad can save to plain text, RTF, Unicode
and Utf-8 formats. Can't we consider all these formats as text formats?

It's not that uncommon for special configuration files to have dedicated
editors, e.g. vipw, vigr, visudo (though for reasons other than the
text file format convention).

I was wrong here, actually C does not preclude multiple text modes.
One could specify one like this, as an extension: "r+:DOStext".


> There is no default open mode. You always have to specify whether
> you're opening the file in text or binary mode. You specify binary
> mode by including a 'b' in the mode argument; you specify text mode by
> not including a 'b' in the mode argument.

For me the main difference between binary and text modes is that the
first is untranslated and the other is translated. You can "not do"
something only in one way, therefore I feel it would have been more
logical to have the default (not including mode spec in the argument)
binary.
 
ena8t8si

Richard said:
osmium said:


Knuth says that the 8-bit "standardisation" happened in around 1975 or so.
By then, C was already well under way, and dmr was almost certainly
accustomed to using the word in its non-"standard" sense.

Speaking as someone who worked on System/360's, and other computers,
during the 1960's, the word byte was already established as meaning an
8-bit quantity during that time. For sure, there were machines with
other byte sizes, but those were explicitly qualified - "seven-bit byte",
or whatever. In the absence of any indication otherwise, byte always
meant 8 bits, even when C was growing up.
 
