binary vs text mode for files

kerravon

Hello.

In order for an MSDOS CRLF sequence to be converted
into a single NL, a file needs to be opened in text
mode. If it were opened in binary mode there would
be nothing special about the sequence; it might just
occur by chance when we're perhaps dealing with
a zip file.

My question is - how do other languages like BASIC,
Pascal, Fortran, Cobol, PL/1 deal with this
fundamental difference between binary and text file
processing?

People suggested that C was odd for having this
differentiation.

Thanks. Paul.
 
glen herrmannsfeldt

kerravon said:
In order for an MSDOS CRLF sequence to be converted
into a single NL, a file needs to be opened in text
mode. If it were opened in binary mode there would
be nothing special about the sequence; it might just
occur by chance when we're perhaps dealing with
a zip file.
My question is - how do other languages like BASIC,
Pascal, Fortran, Cobol, PL/1 deal with this
fundamental difference between binary and text file
processing?

There are many different versions of BASIC, so it is hard to say
about that.

Fortran defines FORMATTED and UNFORMATTED I/O, where FORMATTED
means text and UNFORMATTED means not text. UNFORMATTED is record
oriented (for historical reasons), so on non-record-oriented file
systems each record normally had a length prefix (and usually a
suffix). FORMATTED normally reads or writes whole (one or more)
records (lines) per execution of an I/O statement. (Recently,
something more stream-like has been added.)
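
The exact record markers are compiler-dependent, but a common layout is a
4-byte native-endian length before and after each record. A rough C sketch
of reading one such UNFORMATTED record, under that assumption (the file has
to be opened in binary mode, of course, so nothing gets reinterpreted as a
line ending):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Read one Fortran UNFORMATTED sequential record, assuming the common
   (but compiler-dependent) 4-byte length prefix and suffix. */
static unsigned char *read_record(FILE *fp, uint32_t *len)
{
    uint32_t head, tail;
    unsigned char *buf;

    if (fread(&head, 4, 1, fp) != 1)
        return NULL;                  /* end of file */
    buf = malloc(head + 1u);          /* +1 so a zero-length record still allocates */
    if (buf == NULL)
        return NULL;
    if (fread(buf, 1, head, fp) != head ||
        fread(&tail, 4, 1, fp) != 1 || tail != head) {
        free(buf);
        return NULL;                  /* truncated or malformed record */
    }
    *len = head;
    return buf;
}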

PL/I has STREAM (text oriented) and RECORD (non-text) I/O.
STREAM is much like C text I/O, in that newlines are generated
when requested (the SKIP option on I/O statements). RECORD writes
and reads whole records into a single variable (usually a structure
or array).

Much of PL/I I/O was inherited from COBOL, but I can't say more
than that.
People suggested that C was odd for having this
differentiation.

What is unusual about C is that it mostly doesn't differentiate
text and binary. On unix and unix-like systems, there is no difference.
Other systems have to convert '\n' to the appropriate line or
record mark, and invert the conversion on input.

In the other languages, different I/O statements are used, or different
forms of the statement. For C, it is an option to fopen().
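
For example, a minimal sketch (the file names here are made up):

#include <stdio.h>

int main(void)
{
    /* Text mode: on MSDOS/Windows the library turns CRLF into a single
       '\n' on input, and '\n' back into CRLF on output.  On unix the
       mode makes no difference at all. */
    FILE *text = fopen("readme.txt", "r");

    /* Binary mode: every byte passes through untouched, which is what
       you want for something like a zip file. */
    FILE *blob = fopen("archive.zip", "rb");

    if (text) fclose(text);
    if (blob) fclose(blob);
    return 0;
}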

-- glen
 
BartC

kerravon said:
Hello.

In order for an MSDOS CRLF sequence to be converted
into a single NL, a file needs to be opened in text
mode. If it were opened in binary mode there would
be nothing special about the sequence; it might just
occur by chance when we're perhaps dealing with
a zip file.

My question is - how do other languages like BASIC,
Pascal, Fortran, Cobol, PL/1 deal with this
fundamental difference between binary and text file
processing?

People suggested that C was odd for having this
differentiation.

It is odd. I've never come across this in other languages.

When doing I/O using C runtime functions for more serious programs, I tend
to use binary mode, and use my own higher level functions on top of the
basic C routines. These functions tend to separate the end-of-line handling
from the rest of the line, so it might end up using:

printf("Hello");
puts(""); // or sometimes printf("\x0d\x0a");

instead of printf("Hello\n"); While my simpler input routines are
line-oriented. In either case I try to avoid using "\n" in strings.

As you say, using text mode can lead to all sorts of nasty surprises. And
these days, any text file you are going to read that has been downloaded
from somewhere could easily have any of the cr, cr/lf, or lf line-endings
anyway. When reading in (binary) character mode, I just allow for any
combination.
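
A minimal sketch of that kind of reader (the name and details are only
illustrative): it works on a stream opened in binary mode and treats CR,
LF or CRLF as the end of a line:

#include <stdio.h>

/* Read one line from a stream opened in binary mode, accepting CR, LF
   or CRLF as the line ending.  Returns the number of characters stored
   (terminator not included), or -1 at end of file. */
int read_line(FILE *fp, char *buf, int max)
{
    int c, n = 0;

    c = getc(fp);
    if (c == EOF)
        return -1;
    while (c != EOF && c != '\n' && c != '\r') {
        if (n < max - 1)
            buf[n++] = (char)c;
        c = getc(fp);
    }
    if (c == '\r') {                /* CR or CRLF: swallow a following LF */
        c = getc(fp);
        if (c != '\n' && c != EOF)
            ungetc(c, fp);
    }
    buf[n] = '\0';
    return n;
}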

When writing text, I might explicitly generate cr/lf anyway (since I'm
mainly on Windows), and trust whatever has to read it, to deal with it
sensibly. (After all, systems using lf line-endings like to inflict their
files on the rest of us!)

(However, I remember when cr and lf were actual control characters used to
control the print head or cursor position, so it still seems odd to not have
both. Now it is just a way of separating one line from another.)
 
Richard Tobin

BartC said:
It is odd. I've never come across this in other languages.

Surely it's just a consequence of C being written for Unix, and later
ported to other platforms, some of which (notably MS-DOS) used CR-LF
and some plain CR.

By the time C was used on these platforms, there was a substantial
body of Unix C programs which assumed that lines could be identified
by a single linefeed character. To make these programs immediately
useful on these platforms, it was necessary for the standard library
to convert line endings on input and output. And of course it was
also necessary to have a way to avoid this, for non-text files. So
MS-DOS and other non-Unix C compilers did conversion by default and
provided the "b" binary option to disable it, and the ANSI C standard
followed suit.

-- Richard
 
Osmium

kerravon said:
In order for an MSDOS CRLF sequence to be converted
into a single NL, a file needs to be opened in text
mode. If it were opened in binary mode there would
be nothing special about the sequence; it might just
occur by chance when we're perhaps dealing with
a zip file.

My question is - how do other languages like BASIC,
Pascal, Fortran, Cobol, PL/1 deal with this
fundamental difference between binary and text file
processing?

People suggested that C was odd for having this
differentiation.

The other languages comply with the _spirit_ of the ASCII code. C had
"inventors" who visualized a more perfect world. See footnote 18.

http://en.wikipedia.org/wiki/ASCII

The spirit (of ASCII) was to allow overprinting of accenting characters,
such as tilde, to produce a composite character. This requires
that CR and LF be separate characters. The composite notion hardly
worked at all; for example, many (most?) font designers put the tilde
vertically centered in the cell available for printing. A combination of
poor communication, misguided ideas and ego. A perfect storm in the
standards arena. I hope UTF-8 puts a "tilde n" here, ñ.
That's what the ASCII guys dreamed of. But what they actually got was n~.
Well, darn it all, people are very poor mind readers.

So in most languages pressing the "Return" key produces <CR><LF> (perhaps
backwards, I can never remember).
In C, pressing "Return" produces this *REVISED* ASCII character "new line".
Compare the dates in footnote 34 and footnote 36.

Keep in mind that the monitor is not connected to the keyboard, it is
connected to a computer.
 
BartC

Osmium said:
The other languages comply with the _spirit_ of the ASCII code. C had
"inventors" who visualized a more perfect world. See footnote 18.

http://en.wikipedia.org/wiki/ASCII

The spirit (of ASCII) was to allow overprinting of accenting characters,
such as tilde, to produce a composite character. This
requires that CR and LF be separate characters.

A typical teletype of the time had separate operations for carriage return,
and linefeed. I'd imagine because typewriters had the same (although with a
mechanism to easily do them as a pair in one manual operation).

In any case, the need to generate both CR and LF to get to the start
of the next line was common to any computer connected to a teletype.
Presumably some systems decided to also store CR, LF within a text file and
others (Unix) decided not to.

I'd not really heard about over-printing much. But it was more useful on a
VDU so that you could update information on the same line. For this purpose
you need CR at least as a separate control character:

int i;
for (i=0; i<100000; ++i)
printf("%d\r",i);
Keep in mind that the monitor is not connected to the keyboard, it is
connected to a computer.

Yes, the keyboard would generate CR. On a teletype/monitor it would be
echoed as CR, LF.

I suppose, with so many spare control codes (as it seemed to me), they
could have had discrete CR, LF codes, /and/ a composite CRLF (or NL) code to
keep everyone happy and all the options open.
 
Eric Sosman

BartC said:
In any case, the need to generate both CR and LF to get to the
start of the next line was common to any computer connected to a
teletype. Presumably some systems decided to also store CR, LF within a
text file and others (Unix) decided not to.

Another way of looking at it: Unix (and now C) decided to
deal in "logical" rather than "physical" lines, leaving it up
to the drivers and so on to translate appropriately. The program
says "Here is some data" and tacks on "\n" to show where the line
ends, and needn't worry about rendering the line in different ways
for different devices. The device-specific layers of the system
take care of all that stuff, so the applications don't need to.

(One horrendous system I suffered with long ago used CR as a
*start* of line character and LF to end the line. Its hard-
copy console would clonk over to the left, type-type-typeta
across the line, advance the paper one notch, and sit there.
Too bad about the little shield over the typing mechanism that
hid the final few characters printed ...)
I'd not really heard about over-printing much. But it was more useful on
a VDU so that you could update information on the same line. For this
purpose you need CR at least as a separate control character:

int i;
for (i=0; i<100000; ++i)
printf("%d\r",i);

You can also do it with BS, and in fact that may work better
if the stuff being rewritten is just the tail end of the line:

printf("Watch closely, now: ");
for (int i = 0; i < 100000; ++i) {
    int n = printf("%d", i);
    while (--n >= 0)
        putchar('\b');
}
Yes, the keyboard would generate CR. On a teletype/monitor it would be
echoed as CR, LF.

Or as CR LF NUL NUL NUL to allow the mechanism enough time
for all that movement before being handed the next payload
character. That's the kind of detail the logical vs. physical
transformation handles for you.
 
glen herrmannsfeldt

(snip)
The other languages comply with the _spirit_ of the ASCII code. C had
"inventors" who visualized a more perfect world. See footnote 18.

(snip)

So in most languages pressing the "Return" key produces <CR><LF>
(perhaps backwards, I can never remember.)

Most languages have no concept of line terminating characters.

As far as I know, the use of CRLF line termination stored in
files goes back to early DEC systems.

Many earlier systems were record oriented, such that no special
characters were reserved for line termination. If needed for a
specific I/O device, they were added/removed.

For IBM OS/360 and successors, any of the 256 possible bit
patterns can be stored in a character. Card readers supply 80 bytes
as one record. Line printers like the 1403 print a line (record)
one character in each print column. Normally CR, LF, and NL don't
map to a printable character and the column is blank.

Many years ago, I had Fortran programs with EBCDIC NL inside
character constants, but CR and LF can also be included.

IBM terminals, such as the 2741, use their own code, not ASCII
and not EBCDIC. Data is converted to the appropriate code when
printed.
In C, pressing "Return" produces this *REVISED* ASCII character
"new line". Compare the dates in footnote 34 and footnote 36.
Keep in mind that the monitor is not connected to the keyboard,
it is connected to a computer.

and the keyboard and monitor aren't connected to the disk, either.

Well, one additional complication on early systems was using
paper tape and the ASR33 terminal. If you wanted to punch a tape
that could be printed offline (no computer involved) then it
needed the CRLF sequence. (And possibly enough nulls.)

For some early systems, paper tape was the storage system,
and disk followed by storing the image of paper tape.

Early IBM systems were card oriented, and the disk file system
started out storing card images. Fixed length 80 character records
are still popular on IBM systems.

-- glen
 
Stephen Sprunk

Gordon Burditt said:
This only works properly if you feed it files using native line
ending conventions. For example, feeding a C program running on
UNIX a text file with MSDOS CRLF line endings may upset the program
considerably if it's picky about extra strange characters it doesn't
expect to see.

There are some ways that files can be transferred and converted at
the same time (it might slow down the transfer a little, though).
FTP does this, for example, if you transfer files in the appropriate
mode for text or binary. The sender converts the line endings from
sender-native to network-standard, and the receiver converts the
line endings from network-standard to receiver-native.

Transport by carrying media (e.g. CD, DVD, floppies, USB memory
sticks, flash cards, etc.) from one system to another won't by
itself convert the format of text files.

OTOH, converting files isn't that difficult; many POSIX systems come
with unix2dos and dos2unix, which take care of the details for you.
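
The core of a dos2unix-style filter is only a few lines of C anyway; a
sketch (assuming stdin and stdout are in binary mode, or on a system where
that doesn't matter):

#include <stdio.h>

/* Copy stdin to stdout, dropping any CR that immediately precedes an LF. */
int main(void)
{
    int c, prev = EOF;

    while ((c = getchar()) != EOF) {
        if (prev == '\r' && c != '\n')
            putchar('\r');          /* a lone CR is kept as-is */
        if (c != '\r')
            putchar(c);
        prev = c;
    }
    if (prev == '\r')
        putchar('\r');              /* file ended with a bare CR */
    return 0;
}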

Some languages also make this easier than others; for instance, at first
Perl only offered chop() to remove LF characters from the end of a
string but later added chomp(), which removes a CR, LF or CRLF, so you
can handle files using _any_ of the most common conventions.

S
 
Stephen Sprunk

Eric Sosman said:
(One horrendous system I suffered with long ago used CR as a *start*
of line character and LF to end the line. Its hard- copy console
would clonk over to the left, type-type-typeta across the line,
advance the paper one notch, and sit there. Too bad about the little
shield over the typing mechanism that hid the final few characters
printed ...)

Why not just print LFCR at the end of each line?

If we're discussing "odd" text conventions, I read some IBM system puts
text in fixed-width records and pads short lines with spaces, like
virtual punch cards; there is no need for EOL markers as long as you
don't need to preserve trailing spaces.

S
 
Eric Sosman

Stephen Sprunk said:
Why not just print LFCR at the end of each line?

Because the first line would lack its start-of-line CR,
and the last line would be followed by a CR starting a line
that wasn't there.

"I'm not making this up, you know!" -- Anna Russell
If we're discussing "odd" text conventions, I read some IBM system puts
text in fixed-width records and pads short lines with spaces, like
virtual punch cards; there is no need for EOL markers as long as you
don't need to preserve trailing spaces.

Fixed-length records aren't unique to IBM, either. I've
always supposed they are/were the reason for part of 7.21.2p2:

"[...] Whether space characters that are written out
immediately before a new-line character appear when read
in is implementation-defined."

(Fixed-length records may seem hopelessly old-fashioned
nowadays, but they do have a few good points to compensate for
their space- and bandwidth-consuming ways. For one thing, it's
easy to seek to line N of a file of such records: A simple offset
computation does the trick without the need for a search. Another
point is that once you've arrived at line N you can update it in
place without sliding the rest of the file around: Just overwrite
a record-length's worth of characters and you're done. Finally,
since a fixed-length record needs no delimiters, one need not
dedicate any particular character codes to have special meaning:
A fixed-length line could contain as many LF's and NUL's as you
like, at the risk of being difficult to process as C text.)
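
For instance, with a hypothetical 80-byte record length, getting to record
N is one fseek() and rewriting it in place is one fwrite() at that position
(a sketch; RECLEN and the 0-based record numbering are made up for
illustration):

#include <stdio.h>

#define RECLEN 80   /* hypothetical fixed record length */

/* Position fp at record n (0-based) of a fixed-length-record file. */
int seek_record(FILE *fp, long n)
{
    return fseek(fp, n * RECLEN, SEEK_SET);
}

/* Overwrite record n in place; rec must hold exactly RECLEN bytes. */
int rewrite_record(FILE *fp, long n, const char *rec)
{
    if (seek_record(fp, n) != 0)
        return -1;
    return fwrite(rec, 1, RECLEN, fp) == RECLEN ? 0 : -1;
}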
 
Keith Thompson

Osmium said:
The other languages comply with the _spirit_ of the ASCII code. C had
"inventors" who visualized a more perfect world. See footnote 18.

http://en.wikipedia.org/wiki/ASCII

What "footnote 18" are you referring to? Footnote 18 in that Wikipedia
article is a reference to a book called "A TeX Primer for Scientists".
Footnote 18 in the C standard is equally irrelevant, at least in the
N1570 draft.

[...]
 
Keith Thompson

Stephen Sprunk said:
On 21-Mar-14 11:01, Gordon Burditt wrote: [...]
OTOH, converting files isn't that difficult; many POSIX systems come
with unix2dos and dos2unix, which take care of the details for you.

Some languages also make this easier than others; for instance, at first
Perl only offered chop() to remove LF characters from the end of a
string but later added chomp(), which removes a CR, LF or CRLF, so you
can handle files using _any_ of the most common conventions.

No, that's not what Perl's chomp() function does.

The older chop() function removes a single character from the end of
a string. It was commonly used to remove the "\n" character from a
line of text read from a file (Perl's line-reading mechanism leaves
the trailing "\n" in place). But if the last line of a file doesn't
have a "\n" character, or if the string wasn't read from a file,
then chop() blindly removes the last character, whatever it is.

chomp() removes the last character if and only if it's a "\n".
If it's not, then it does nothing. It doesn't treat "\r" characters
specially.

Perl I/O distinguishes between text and binary in a manner very
similar to C; in text mode, each line ending is translated to a
single "\n" character, and vice versa on output. (And if you're
wondering about the double quotes, "\n" in Perl is a string
containing a newline character, and '\n' is a string containing a
backslash and an 'n'.)
 
Keith Thompson

BartC said:
When doing I/O using C runtime functions for more serious programs, I tend
to use binary mode, and use my own higher level functions on top of the
basic C routines. These functions tend to separate the end-of-line handling
from the rest of the line, so it might end up using:

printf("Hello");
puts(""); // or sometimes printf("\x0d\x0a");

instead of printf("Hello\n"); While my simpler input routines are
line-oriented. In either case I try to avoid using "\n" in strings.

As you say, using text mode can lead to all sorts of nasty surprises. And
these days, any text file you are going to read that has been downloaded
from somewhere could easily have any of the cr, cr/lf, or lf line-endings
anyway. When reading in (binary) character mode, I just allow for any
combination.

When writing text, I might explicitly generate cr/lf anyway (since I'm
mainly on Windows), and trust whatever has to read it, to deal with it
sensibly. (After all, systems using lf line-endings like to inflict their
files on the rest of us!)

So your response to the problem of different systems using different
representations for line endings is to bypass C's mechanism that deals
with it, and instead impose the Windows CR-LF convention on anyone who
might have to read any file generated by your programs.

I submit that you are part of the problem.

On Windows, unless you've done something to put stdout in binary mode
(freopen?)

printf("\x0d\x0a");

when run on Windows will print "\r\r\n", which as far as I know is not
valid on *any* system. (A quick experiment with MinGW suggests that
putting stdout in binary mode doesn't help, though that might be a bug
either in MinGW or in my program.)
 
glen herrmannsfeldt

(snip)
If we're discussing "odd" text conventions, I read some IBM system puts
text in fixed-width records and pads short lines with spaces, like
virtual punch cards; there is no need for EOL markers as long as you
don't need to preserve trailing spaces.

RECFM=FB,LRECL=80 is still pretty common, but the alternative
is RECFM=VB, where each record has a four byte header with a length.
(And each block of records also has a four byte length header.)

Object programs for OS/360 and successors also have 80 character
fixed length records and can be punched on cards.

-- glen
 
BartC

Keith Thompson said:
So your response to the problem of different systems using different
representations for line endings is to bypass C's mechanism that deals
with it,

How well does it deal with it? I mean when dealing with non-native files
that may have any line-ending.
I submit that you are part of the problem.

I think that if a major program such as CLANG, a C compiler, which is a
specially built binary /for Windows/, generates text files with LF
line-endings when run under Windows, then it doesn't seem unreasonable for
me to generate explicit CR, LF line-endings.

My applications can anyway generally deal with text files using any of the
main three combinations: cr (although that is rare now), cr/lf, and lf. It's
not difficult, provided they are not mixed up in the same file. So I suggest
the problem lies elsewhere in not being flexible enough.

And if I create a file under Windows via C, which generates cr/lf endings,
and that file is read in Linux for example, then it's the same problem;
whose fault is it then?
On Windows, unless you've done something to put stdout in binary mode
(freopen?)

printf("\x0d\x0a");

when run on Windows will print "\r\r\n", which as far as I know is not
valid on *any* system. (A quick experiment with MinGW suggests that
putting stdout in binary mode doesn't help, though that might be a bug
either in MinGW or in my program.)

I set stdout to binary mode (using _setmode(_fileno(stdout),_O_BINARY);).

Or I used to. I'll have to check that is still the case. It's one of those
things that you just forget about once it's set up.
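
For reference, that incantation needs a couple of Microsoft-specific
headers; a sketch along these lines (MS CRT only, not portable C):

#include <stdio.h>
#include <io.h>      /* _setmode, _fileno (Microsoft CRT) */
#include <fcntl.h>   /* _O_BINARY */

int main(void)
{
    /* Stop the CRT from translating '\n' to CR LF on stdout. */
    _setmode(_fileno(stdout), _O_BINARY);

    printf("Hello\x0d\x0a");   /* now emits exactly CR LF, no extra CR */
    return 0;
}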
 
glen herrmannsfeldt

If I remember correctly, in COBOL and PL/I a file contains a sequence
of records, not a sequence of bytes as in C.

Well, PL/I text I/O is STREAM (that is the keyword for it); it
knows about record boundaries when needed, but other times
it ignores them. More similar to C than to Fortran FORMATTED.
In COBOL and PL/I there is no built-in support for a file consisting of
lines, with there being a special character or sequence of characters
indicating end-of-line. A sequence of lines can be represented as a
sequence of records by having each line be a record.

If the file convention uses special characters, it will have to
generate them. The SKIP keyword is used with I/O statements to
go to the next line on input or output. Otherwise, it looks like
a stream of characters.

It isn't so easy to read a text file a character at a time, and
follow line endings. It is easy to ignore line endings, by

GET EDIT(C)(A(1));

I believe it will then skip to the next line after the end of
the previous one.

GET EDIT(S)(A);

should read a line into a CHARACTER string variable, though it
has been some years since I tried it.
Record-oriented output statements have some way of accepting
the length of the record being written. Record-oriented
input statements would have some way of providing the
length of the record that was read.

RECORD I/O reads or writes one variable, which is often a
structure. It can also be a VARYING length character string,
in which case the length is set based on the input record.
(That might be an IBM extension.)

-- glen
 
Osmium

Keith Thompson said:
What "footnote 18" are you referring to? Footnote 18 in that Wikipedia
article is a reference to a book called "A TeX Primer for Scientists".
Footnote 18 in the C standard is equally irrelevant, at least in the
N1570 draft.

It should say footnote 34. The references at the end of the message look
OK, 34 and 36 on ASCII. 34 is LBJ mandating ASCII (on which IBM, last I
heard, was taking a lifetime waiver) and 36 is some guys I never heard of
justifying the UNIX way at FJCC in 1970. I carried a copy of that damned
LBJ memo in my briefcase for months.

I can't figure out what I did wrong there; the scratch paper I used for
notes is still on my desk and there is not a number 18 anywhere on it.

Here's another link I found in trying to clean up my mess. Anyone really
interested should read it, I didn't.

http://en.wikipedia.org/wiki/Newline
 
Keith Thompson

BartC said:
How well does it deal with it?

Quite well, when reading and writing native files.
I mean when dealing with non-native files
that may have any line-ending.

A better way to deal with that is to translate non-native files before
feeding them to native programs. Let a translation utility do the work,
and let everything else just deal with text.
I think that if a major program such as CLANG, a C compiler, which is a
specially built binary /for Windows/, generates text files with LF
line-endings when run under Windows, then it doesn't seem unreasonable for
me generate explicit CR, LF line-endings.

I haven't used clang on Windows. Assuming it behaves as you
describe, I'd say that's a bug in clang. That doesn't make your
approach reasonable.
My applications can anyway generally deal with text files using any of the
main three combinations: cr (although that is rare now), cr/lf, and lf. It's
not difficult, provided they are not mixed up in the same file. So I suggest
the problem lies elsewhere in not being flexible enough.

It happens that, given the set of formats currently in common use, it's
not too difficult to allow for varying input formats. If your program
can be reasonably flexible *on input* without breaking anything, I have
no objection to that. But if your program's *output* is inconsistent
with native file formats, that's a problem.
And if I create a file under Windows via C, which generates cr/lf endings,
and that file is read in Linux for example, then it's the same problem;
whose fault is it then?

It's probably the fault of whoever copied the file without thinking
about the format.
 
Keith Thompson

Osmium said:
It should say footnote 34. The references at the end of the message look
OK, 34 and 36 on ASCII. 34 is LBJ mandating ASCII (on which IBM, last I
heard, was taking a lifetime waiver) and 36 is some guys I never heard of
justifying the UNIX way at FJCC in 1970. I carried a copy of that damned
LBJ memo in my briefcase for months.

That would be:

Lyndon B. Johnson (March 11, 1968). Memorandum Approving the Adoption by
the Federal Government of a Standard Code for Information
Interchange. The American Presidency Project. Accessed 2008-04-14.

http://www.presidency.ucsb.edu/ws/index.php?pid=28724

Since your article is likely to last longer than that Wikipedia article
is likely to remain in its current form, I suggest quoting the relevant
footnote rather than citing it by number (even if you get the number
right).
I can't figure out what I did wrong there, the scratch paper I used ffor
notes is still on my desk and there is not a number 18 anywhere on it. .

Here's another link I found in trying to clean up my mess. Anyone really
interested should read it, I didn't.

http://en.wikipedia.org/wiki/Newline

This is one of my "If I had a time machine" projects: going back
to the 1960s and persuading everyone to standardize on a single
text file format, with a single character (neither "linefeed" nor
"carriage return") representing the end of a line. And consistent
byte order (flip a coin, I don't care which). And no EBCDIC.
And no UTF-16. And plain char is unsigned.

(There's probably something else I should do with a time machine;
I'll think about it if I ever acquire one.)
 
