binary vs text mode for files

BartC · Mar 24, 2014

James Kuyper said:
On 03/24/2014 05:45 AM, BartC wrote:

An array of thousands of pointers to strings containing file names is
not going to be much larger than a single string containing all of those
file names. If you used C, rather than one of your own languages, it's
an array that would have to be constructed anyway,

I don't think so. If I run this command:

program *.c

The argument would either be a single "*.c" string, or an argv array
containing the one entry "*.c".

It would be up to the application as to whether it expanded the wildcard
expression into an actual array. (For example, it can access them one at a
time, or maybe it was only interested in the "*.c" as it is without needing
to know what the individual files might be. Or maybe it doesn't need to expand
until much later.)

This is a more grown-up way of dealing with it. It doesn't make it alright
to expand into 28,491 argv entries just because there is plenty of memory to
handle it. (How do you quickly access the 3 arguments following the 28,491?
How do you even know there are 3 extra arguments, as they will be
indistinguishable from any of the 28,491 file names? How do you
distinguish the files from the two expanded arguments a*.c* and ab*.c?)

BartC · Mar 24, 2014

James Kuyper said:
It's been a long time since I used Microsoft C, so long that my
experience includes versions earlier than 5.1, and I'm pretty sure they
did the same. The argc/argv interface for main() was already well
established when "The C Programming Language" was first published.

Under Windows, a C program has the choice of using main() or WinMain() as
the entry point. This is for any compiler, not just the MS one.

main() gives you the normal argc, argv parameters.

WinMain() presents the command line as a single string.

Eric Sosman · Mar 24, 2014

(snip)

Sorry, Keith, but you're completely in the wrong here.
A "line" is one hundred thirty-three characters, the first
being metadata that isn't actually printed but governs the
vertical spacing, and the rest being payload.[*] There are
no end-of-line characters, no control characters of any kind
for that matter.

Do the metadata characters indicate what to do before or
after printing the specified data? (FBA or FBM, to be specific.)

I'm unfamiliar with your TLA's. The metacharacters (as
I recall them; it's been a while...) were

' ': Advance one line, then print
'0': Advance two lines, then print
'1': Advance to top of next page, then print
'+' or maybe it was '-': print without advancing

Those were the "usual suspects;" additional metacharacters could
be used with a custom "carriage control tape" to do things like
"advance to two lines above the page bottom, then print."

I once had someone send me some files from an IBM system with
variable length records and no line termination. After I complained
about that, they resent them with fixed length and no termination.

The main point of all this is that it's shortsighted to imagine
that LF, CR, and CR+LF are the only line markers a program might
need to deal with.

0x05 0x00 0x48 0x65 0x6c 0x6c 0x6f 0x00
0x06 0x00 0x77 0x6f 0x72 0x6c 0x64 0x21

is one of the seven different ways one system I've used might
represent a familiar message.
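
One way to read the bytes above, as a sketch only: each record is a 2-byte little-endian count followed by the payload, with one pad byte after an odd-length payload so the next record starts on an even offset. The post does not name the format, so that layout is an assumption inferred from the sixteen bytes shown, and `next_record` is a hypothetical helper name.

```c
#include <stddef.h>
#include <string.h>

/* Decode one counted record starting at buf[pos].
 * Assumed layout (inferred, not stated in the post): a 2-byte little-endian
 * byte count, the payload, then one pad byte if the count is odd.
 * Copies the payload to out as a NUL-terminated string and returns the
 * offset of the next record, or 0 on truncated or oversized input. */
size_t next_record(const unsigned char *buf, size_t len, size_t pos,
                   char *out, size_t outsize)
{
    if (pos > len || len - pos < 2)
        return 0;                       /* no room for a count field */

    size_t n = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
    pos += 2;

    if (n > len - pos || n + 1 > outsize)
        return 0;                       /* payload truncated or too long */

    memcpy(out, buf + pos, n);
    out[n] = '\0';

    pos += n + (n & 1);                 /* skip the pad byte after odd counts */
    if (pos > len)
        pos = len;                      /* tolerate a missing final pad byte */
    return pos;
}
```

Note that nothing in the data is a line terminator: the record boundary exists only in the count field, so a program that scans for LF or CR would never find the end of a line.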

Malcolm McLean · Mar 24, 2014

This is a more grown-up way of dealing with it. It doesn't make it alright
to expand into 28,491 argv entries just because there is plenty of memory to
handle it. (How do you quickly access the 3 arguments following the 28,491?
How do you even know there are 3 extra arguments, as they will be
indistinguishable from any of the 28,491 file names? How do you
distinguish the files from the two expanded arguments a*.c* and ab*.c?)

If you're passing 28,491 files to a program to process, then normally you'd
expect to have a computer that can parse 28,491 shortish strings "in the
twinkling of an eye".
If you allow extra arguments, simply go to the end of the array and read off
those starting with '-' or '/'. Obviously if you've got files with names
like "-optionx" you've got problems, but the only way to solve that is to
use a non-legal file name character to introduce options. As MS_DOS did,
interestingly enough.
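
The "go to the end of the array" idea can be sketched in a few lines of C. This is only an illustration: `first_trailing_option` is a hypothetical helper, and treating a leading '/' as an option marker only makes sense where '/' cannot start a file name, which is not the case on Unix-like systems.

```c
/* Hypothetical helper: returns the index of the first argument in the run
 * of option-like arguments at the END of argv, i.e. those starting with
 * '-' or '/'. Returns argc if the last argument is not option-like.
 * argv[0] (the program name) is never examined. */
int first_trailing_option(int argc, char *argv[])
{
    int i = argc;

    /* Walk backwards while the argument just before i looks like an option. */
    while (i > 1 && (argv[i - 1][0] == '-' || argv[i - 1][0] == '/'))
        i--;

    return i;
}
```

With this, argv[1] up to (but not including) the returned index are the file names, and everything from the returned index to argc - 1 is an option. It does not address the objection that a file called "-optionx" matched by the wildcard is indistinguishable from a real option.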

James Kuyper · Mar 24, 2014

There is a "raw" and "cooked" mode for all versions of DOS
( including Win8 ).

SFAIK, when you do an fopen("x.bin","rb") it uses raw mode.

If you're calling fopen(), you're using the C standard library; kerravon
specified that he was asking about other languages. I'll grant you, it's
possible to call fopen() from other languages, but the behavior
you would get if you did so (assuming you handled it correctly) is C
behavior.

SFAIK, it's only 'C' on DOS/Windows machines, but this was down to the
BIOS level ( int21 ) so who knows?

What is it you're saying is specific to DOS/Windows? Using "\r\n" in
text mode isn't, though the other systems supporting that convention are
far less popular. Support for a distinction between binary and text
modes isn't specific to DOS/Windows systems, though the details of what
that distinction means does vary from one OS to another.

You can configure a file in Tcl with "fconfigure $f -translation binary"
and it works the same cross platform.

There is a similar thing in 'C' in Linux - never can remember what it's
called.

On Unix-like systems, while binary and text modes are supported, they
are equivalent. Why would such a feature be needed?

glen herrmannsfeldt · Mar 24, 2014

James Kuyper said:
On 03/24/2014 12:05 AM, ralph wrote:
(snip)

What MSC has (had) and I forgot in a previous post is a routine
that you link in (just link, don't actually call it) that will
glob (unix term) the command line. I don't remember now how well
it actually worked, especially in the case of quotes.

It's been a long time since I used Microsoft C, so long that my
experience includes versions earlier than 5.1, and I'm pretty sure they
did the same. The argc/argv interface for main() was already well
established when "The C Programming Language" was first published.

However, that's irrelevant - you're talking about the compiler; he's
talking about the OS. A conforming implementation of C must support the
argc/argv interface, though not necessarily in any meaningful sense. For
example, I remember that under VMS, special actions needed to be taken
to set up a program so that it could actually take command line
arguments, though I thankfully no longer remember the details. IIRC, it
was necessary to tell VMS something about how many command line
arguments could be passed, and some details about the syntax for
each one.

I remember something like that from many years ago. As well as I
remember, you only need that if you want VMS-like parsing of the
command line.

Installing TeX and associated programs, one had to also install the
appropriate files to tell DCL (the VMS shell) about the commands,
and their command line options. Not only the number, but the actual
names and type of values that they take. It would then allow the
appropriate abbreviation for the options, and do some checking
of them.

As far as I remember, without that you get the normal argc/argv.
(TeX was written in Pascal with a Knuth specific preprocessor.)

-- glen

glen herrmannsfeldt · Mar 24, 2014

Eric Sosman said:
(snip)

Sorry, Keith, but you're completely in the wrong here.
A "line" is one hundred thirty-three characters, the first
being metadata that isn't actually printed but governs the
vertical spacing, and the rest being payload.[*] There are
no end-of-line characters, no control characters of any kind
for that matter.

Do the metadata characters indicate what to do before or
after printing the specified data? (FBA or FBM, to be specific.)

I'm unfamiliar with your TLA's. The metacharacters (as
I recall them; it's been a while...) were

FBA is Fixed Blocked with Asa control characters.

' ': Advance one line, then print
'0': Advance two lines, then print
'1': Advance to top of next page, then print
'+' or maybe it was '-': print without advancing

Those were the "usual suspects;" additional metacharacters could
be used with a custom "carriage control tape" to do things like
"advance to two lines above the page bottom, then print."

Yes, those are ASA characters. IBM extends them with 2 through 9
specifying lines on the page controlled by a paper tape loop inside
the printer (or an electronic version of one). Those are especially
convenient when printing forms and such. They were (and later ones
still are) popular for printing statements and bills.
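
As a rough sketch of the ASA convention just described, the function below turns one record into an ASCII approximation, acting on the control character before printing the text. Assumptions to note: `asa_emit` is a made-up name, the mapping to '\n', '\f' and '\r' is only an approximation of what a line printer does, '+' is taken as "print without advancing" and '-' as "advance three lines", and the channel skips '2' through '9' are not modelled.

```c
#include <stdio.h>

/* Write one ASA-controlled record to out as an ASCII approximation.
 * record[0] is the carriage control character, the rest is the text.
 * The paper movement happens BEFORE the text is printed.
 * Returns 0 on success, -1 for an empty record or a control character
 * this sketch does not handle (for example the channel skips '2'..'9'). */
int asa_emit(FILE *out, const char *record)
{
    const char *before;

    switch (record[0]) {
    case ' ': before = "\n";     break;  /* advance one line, then print    */
    case '0': before = "\n\n";   break;  /* advance two lines, then print   */
    case '-': before = "\n\n\n"; break;  /* advance three lines, then print */
    case '1': before = "\f";     break;  /* top of next page, then print    */
    case '+': before = "\r";     break;  /* no advance: overprint the line  */
    default:  return -1;
    }

    fputs(before, out);
    fputs(record + 1, out);
    return 0;
}
```

Because the movement comes first, the output never ends with a line terminator, which is the opposite of the usual text-file convention of terminating each line after its content.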

OS/360 and successors also have FBM, Fixed Blocked with Machine
control characters. The characters are the actual command sent
to the printer and, unlike ASA characters, indicate the paper movement
after printing the line. There are also characters that specify
movement without printing.

One convenience I have noted with machine characters, is that when
spooling the output on an emulator, you know when the last line is
printed on a page. With ASA control characters, you don't know that
until the next job begins.

(snip)

-- glen

Les Cargill · Mar 24, 2014

James said:
If you're calling fopen(), you're using the C standard library; kerravon
specified

understood - I was using that as an illustration. Tcl has fconfigure $f
-translation binary.

that he was asking about other languages. I'll grant you, it's
possible to call fopen() from other languages, but the behavior
you would get if you did so (assuming you handled it correctly) is C
behavior.

I used 'C' as the example because it's been my experience
that 'C' behavior is frequently passed on to other languages.

What is it you're saying is specific to DOS/Windows? Using "\r\n" in
text mode isn't,

No, but there are other subtle differences in raw/cooked mode in DOS,
things like ctrl-Z handling.

He was also very specifically MS-DOS, so...

though the other systems supporting that convention are
far less popular. Support for a distinction between binary and text
modes isn't specific to DOS/Windows systems, though the details of what
that distinction means does vary from one OS to another.

I'm pretty sure the scope of the thread is MS-DOS.

On Unix-like systems, while binary and text modes are supported, they
are equivalent. Why would such a feature be needed?

Exactly.

James Kuyper · Mar 25, 2014

....
I used 'C' as the example because it's been my experience
that 'C' behavior is frequently passed on to other languages.

Unless you can actually be more specific about other languages to which
this behavior has been passed on, that's non-responsive to his question,
which was about those other languages.

I'm pretty sure the scope of the thread is MS-DOS.

Re-reading his original message, I'm uncertain whether he's asking about
binary-vs-text modes in general (as implied by the subject line), and
using MS only as an example, or whether he's very specifically asking
about MS. He's asking about whether C is unusual in making that
distinction. C makes that distinction on all platforms (on some
platforms, including all Unix-like ones, it's a distinction without a
corresponding difference, but the distinction is still made). All of the
other languages he asked about target multiple platforms, just like C,
so I had not considered that the question was platform-specific, but I
supposed that could have been his intent.

Exactly.

I'm confused. You asserted that such a thing exists, and I'm disagreeing
with you. I'm not aware of any such thing, and the fact that there's no
apparent need for such a thing makes me fairly confident that my
unawareness of it is more likely to be due to its non-existence, than
to my ignorance. So when you say "exactly", what are you referring to?

kerravon · Mar 25, 2014

Re-reading his original message, I'm uncertain whether he's asking about
binary-vs-text modes in general (as implied by the subject line), and
using MS only as an example, or whether he's very specifically asking
about MS. He's asking about whether C is unusual in making that
distinction. C makes that distinction on all platforms (on some
platforms, including all Unix-like ones, it's a distinction without a
corresponding difference, but the distinction is still made). All of the
other languages he asked about target multiple platforms, just like C,
so I had not considered that the question was platform-specific, but I
supposed that could have been his intent.

My question was intended to apply to all systems
in existence, because C's binary vs text fopen
mode applies to all systems in existence.

I then deliberately chose MSDOS as an example
because I:

1. Wanted a non-mainframe (which uses records
instead of line endings) environment. I am aware other
languages use records so work on the mainframe
environment, and I didn't want anyone to come
back and say "records work fine". I'm interested
in text files (CRLF etc), not records.

2. Wanted a non-Unix environment so that no-one
would come back and say "on Unix there is no
difference between text and binary, so there is
nothing to discuss, end of story".

I hope that clears up the question. With the
answers so far I still don't understand how CRLF
are swallowed on input. The only thing I saw
was that PL/1 has a "skip" which will generate
a CRLF on output, and it's not specified what
will happen on input.

Thanks. Paul.

glen herrmannsfeldt · Mar 25, 2014

(snip)

I hope that clears up the question. With the
answers so far I still don't understand how CRLF
are swallowed on input. The only thing I saw
was that PL/1 has a "skip" which will generate
a CRLF on output, and it's not specified what
will happen on input.

PL/I has SKIP on both input and output.
Java has System.out.println(); or, specially for
C programmers, System.out.format("%n"); for output
to go to the next record (line).

More specific to above, PL/I GET SKIP; will ignore the rest of
the record, such that the next operation is at the beginning
of the next record (line).

-- glen

kerravon · Mar 25, 2014

More specific to above, PL/I GET SKIP; will ignore the rest of
the record, such that the next operation is at the beginning
of the next record (line).

Hi glen. Thanks for your reply.

So can PL/1 do the same as C, where you define
a buffer as to how big you expect the longest
line to be, and then read that buffer?

Something like:

VARCHAR BUF(2000)

and then get skip list buf

And then length(buf) will tell you how long
your line of text is?

Thanks. Paul.
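
For comparison, the C idiom being referred to looks roughly like the sketch below: a fixed buffer sized for the longest expected line, filled by fgets(), with the length taken afterwards. `read_line` is a hypothetical wrapper, not a standard function.

```c
#include <stdio.h>
#include <string.h>

/* Read one line from a text stream into buf, which holds size bytes.
 * The trailing '\n', if present, is removed. Returns the length of the
 * line, or -1 at end of file or on a read error. A line longer than
 * size - 1 characters is returned in pieces by successive calls. */
long read_line(FILE *in, char *buf, int size)
{
    if (fgets(buf, size, in) == NULL)
        return -1;

    size_t n = strcspn(buf, "\n");  /* length up to, not including, '\n' */
    buf[n] = '\0';
    return (long)n;
}
```

On a text-mode stream the program only ever sees '\n' here; whatever the platform stores at the end of a line has already been translated by the library before fgets() returns.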

BartC · Mar 25, 2014

kerravon said:
On Tuesday, March 25, 2014 3:26:51 PM UTC+11, James Kuyper wrote:

(Which can sometimes lead to bugs, or to sloppy coding, if there is no
practical difference between binary and text modes. For example, if you
accidentally use binary mode instead of text mode, there is no difference.
Until the code is run on another system.)

I hope that clears up the question. With the
answers so far I still don't understand how CRLF
are swallowed on input.

As I understand it, if:

(1) You are using C runtime functions

(2) A file is open in text mode

(3) You are running on a system that uses CRLF in text files

(4) The file you are reading is native to that system so also uses CRLF
(which C will not know).

*Then* a CRLF input sequence is converted to just LF. So if a whole file
is read in, then all CRs are stripped out, and the file is correspondingly
smaller in memory. Reading a character at a time, CRs are ignored (you get
the following LF instead). If you do random accesses (fseek etc) to such a
file while it is still on disk, then, AIUI, all offsets are adjusted
automatically so as to pretend the file is a logical one using LF endings
only, and not CRLF or any other scheme.

I've no idea how it might achieve that nor how efficient or otherwise it
might be. (It sounds pretty hairy to me.)

Keith Thompson · Mar 25, 2014

BartC said:
As I understand it, if:

(1) You are using C runtime functions

(2) A file is open in text mode

(3) You are running on a system that uses CRLF in text files

(4) The file you are reading is native to that system so also uses CRLF
(which C will not know).

*Then* a CRLF input sequence is converted to just LF. So if a whole file
is read in, then all CRs are stripped out, and the file is correspondingly
smaller in memory. Reading a character at a time, CRs are ignored (you get
the following LF instead).

Not all CR characters are necessarily stripped out; a CR not immediately
followed by LF probably won't be. (Though a CR not followed by LF might
make for an invalid text file.)

If you do random accesses (fseek etc) to such a
file while it is still on disk, then, AIUI, all offsets are adjusted
automatically so as to pretend the file is a logical one using LF endings
only, and not CRLF or any other scheme.

The value returned by ftell() is typically a byte offset into the
physical file. It needn't correspond directly to the number of
characters read by fgetc(). A conforming implementation *could*
theoretically adjust the offsets as you describe, but it isn't
required to.
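
A small sketch of the usage that works regardless of how the implementation numbers its offsets: treat the ftell() result as an opaque token and hand it back to fseek() unchanged. For a text stream the standard only defines fseek() with SEEK_SET when the offset is zero or a value previously returned by ftell() on the same stream. `peek_line` is a hypothetical helper.

```c
#include <stdio.h>

/* Read the next line into buf without consuming it: record the current
 * position with ftell(), read the line, then seek back to the recorded
 * position. The position is never computed from character counts, only
 * saved and restored. Returns 0 on success, -1 on end of file or any
 * I/O failure. */
int peek_line(FILE *in, char *buf, int size)
{
    long mark = ftell(in);
    if (mark == -1L)
        return -1;

    if (fgets(buf, size, in) == NULL)
        return -1;

    if (fseek(in, mark, SEEK_SET) != 0)
        return -1;

    return 0;
}
```

What this deliberately avoids is arithmetic such as mark + strlen(buf): on a CRLF system in text mode that sum need not be the position of the next line.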

James Kuyper · Mar 25, 2014

On 03/25/2014 02:14 AM, kerravon wrote:
....

My question was intended to apply to all systems
in existence, because C's binary vs text fopen
mode applies to all systems in existence.

I then deliberately chose MSDOS as an example
because I:

1. Wanted a non-mainframe (which uses records
instead of line endings) environment. I am aware other
languages use records so work on the mainframe
environment, and I didn't want anyone to come
back and say "records work fine". I'm interested
in text files (CRLF etc), not records.

On those platforms, text files are stored using records, rather than
CRLF, so that's the wrong way to make the distinction you're trying to make.

2. Wanted a non-Unix environment so that no-one
would come back and say "on Unix there is no
difference between text and binary, so there is
nothing to discuss, end of story".

I hope that clears up the question. With the
answers so far I still don't understand how CRLF
are swallowed on input.

Calling any <stdio.h> function to read data from a text mode stream
requires that the function use whatever method is native for that
platform to identify when a line ends. It is then required to replace
whatever the native method is, with a single '\n' at the end of each
line. In practice, all other input routines are required to behave as if
they used fgetc() to get individual characters from the stream, so
conceptually, at least, you can think in terms of the actual logic that
implements this behavior as being part of fgetc(). On unix-like systems,
that's a trivial requirement, but in general it's more complicated,
which is why the standard says:

"Data read in from a text stream will necessarily compare equal to the
data that were earlier written out to that stream only if: the data
consist only of printing characters and the control characters
horizontal tab and new-line; no new-line character is immediately
preceded by space characters; and the last character is a new-line
character. Whether space characters that are written out immediately
before a new-line character appear when read in is
implementation-defined." (7.21.2p2)

For each of the provisions in that clause, there were known platforms
hosting an implementation of C at the time the standard was written,
where violating that provision could cause problems for the native
method for identifying line lengths, which could manifest themselves by
causing there to be a difference between the data written out and the data
read back in.

In the case of using CRLF to delimit lines, if the current character is
CR, fgetc() must get the next character (which might entail a wait for
more input, even if the stream is unbuffered). If the next character is
LF, move the current position in the stream to after the LF, and return
a single '\n'. Otherwise, fgetc() does whatever the implementor wants it
to do, since a CR not paired with LF violates the provisions I cited above.
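
The CRLF logic described above can be sketched as a wrapper around a binary-mode stream. This is an illustration of the idea, not how any particular C library implements it: `crlf_getc` is a hypothetical name, and passing a lone CR through unchanged is just one of the choices left open to an implementor.

```c
#include <stdio.h>

/* Read one character from a stream opened in BINARY mode, presenting a
 * CR LF pair as a single '\n', roughly as a text-mode fgetc() would on a
 * CRLF platform. A CR that is not followed by LF is returned unchanged. */
int crlf_getc(FILE *in)
{
    int c = fgetc(in);

    if (c == '\r') {
        /* Look ahead one character to see whether this CR starts a CRLF. */
        int next = fgetc(in);
        if (next == '\n')
            return '\n';            /* swallow the CR, report a newline */
        if (next != EOF)
            ungetc(next, in);       /* lone CR: put the lookahead back */
    }
    return c;
}
```

The lookahead is the reason reading a CR may have to wait for more input: the function cannot decide what to return until it has seen the character after it.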

glen herrmannsfeldt · Mar 25, 2014

(snip)

So can PL/1 do the same as C, where you define
a buffer as to how big you expect the longest
line to be, and then read that buffer?

Something like:

VARCHAR BUF(2000)

DCL BUF CHAR(2000) VAR;

(CHAR and VAR allowed abbreviations for CHARACTER and VARYING)

and then get skip list buf

For DATA and LIST directed input, CHAR (and BIT) data is single
quoted, but for EDIT I believe it will set the length appropriately.

GET EDIT(BUF)(A);

In the not-so-uncommon (for IBM) case of cards or fixed record length
disk input, the data is already padded with blanks. For RECFM=V and VB
(varying length) I would expect it not to pad out with blanks, but it
might be implementation specific.

And then length(buf) will tell you how long
your line of text is?

The other choice is to use RECORD oriented I/O, which is more like
C binary, and, for example, corresponds to Fortran UNFORMATTED.
It is specifically designed for the case where the record length might
be longer than the block size, where the block size is restricted
by buffer length or disk track length. Again, it might be
implementation dependent, but the ones I knew would READ into a single
varying length character variable, and set the length to the record
length.

RECORD (but not STREAM) I/O also allow for direct access. For OS/360,
it was usual to distinguish sequential and direct access. Direct access
was done on unblocked fixed length data sets, such that one disk block
(which could be from one byte to the length of the disk track)
corresponded to one record.

-- glen

glen herrmannsfeldt · Mar 25, 2014

Richard Damon said:
On 3/25/14, 5:25 AM, BartC wrote:

(snip on fseek() and ftell())

Random Access on Text mode files has some restrictions for defined
behavior. You are only supposed to fseek to a location you have gotten
via ftell. A seek to location "n", may not get you to the "n"th
character as you have read it, but will generally get you to the "n"th
character in the file (including the CRs that will be stripped out). If
the file is "record" based, it can be even more complicated, but as
long as you go to spots you got via ftell, it is supposed to work.

For record-oriented file systems, like IBM's MVS and VM/CMS, and
also VMS (DEC/Compaq/HP) for some types of files the value returned
isn't a byte offset.

For some IBM systems, there is an option to fopen() that will generate
byte offsets, which I believe requires it to read the file from the
beginning to ftell() or fseek().

This is for file systems that keep track of block and byte offset
within block, but not byte offset into the file. (Consider how
tapes work on unix, and you should get the right idea.)

-- glen

Les Cargill · Mar 25, 2014

James said:
Unless you can actually be more specific about other languages to which
this behavior has been passed on, that's non-responsive to his question,
which was about those other languages.

I've given the example of Tcl multiple times now... Python follows the
'C' convention. I presume Perl does same....

Re-reading his original message, I'm uncertain whether he's asking about
binary-vs-text modes in general (as implied by the subject line), and
using MS only as an example, or whether he's very specifically asking
about MS. He's asking about whether C is unusual in making that
distinction. C makes that distinction on all platforms (on some
platforms, including all Unix-like ones, it's a distinction without a
corresponding difference, but the distinction is still made). All of the
other languages he asked about target multiple platforms, just like C,
so I had not considered that the question was platform-specific, but I
supposed that could have been his intent.

I read it as wanting to emulate behavior that programs on
MS-DOS exhibited.

I'm confused. You asserted that such a thing exists, and I'm disagreeing
with you. I'm not aware of any such thing, and the fact that there's no
apparent need for such a thing

There isn't. That's why it's confusing. This was a feature of MS-DOS
down to the BIOS.

"In order for an MSDOS CRLF sequence to be converted
into a single NL, a file needs to be opened in text
mode."

"Text mode" is also called "cooked" mode.

makes me fairly confident that my
unawareness of it is more likely to be due to its non-existence, than
to my ignorance. So when you say "exactly", what are you referring to?

There were raw and cooked mode file handling in MS-DOS, which was
totally unnecessary but there it was still. Since the framing of the
OP's question was mostly related to DOS...

I can easily see how this would make no sense to anybody unless they'd
encountered it.
