CR-NL, NL and ftell

Martin Johansen · Feb 20, 2005

Hello

When opening a CR-NL file, ftell returns the length of the file with the
CR-NL as two bytes, is it supposed to do so?

I am comparing two file-sizes, one CR-NL and one NL using ftell to get the
filesize. Any alternative suggestion is welcomed.

Thanks
- Martin Johansen

infobahn · Feb 20, 2005

Martin said:
Hello

When opening a CR-NL file, ftell returns the length of the file with the
CR-NL as two bytes, is it supposed to do so?

No. It is not ftell's job to return the length of a file.

I am comparing two file-sizes, one CR-NL and one NL using ftell to get the
filesize. Any alternative suggestion is welcomed.

What do you mean by file size? The number of disk clusters occupied by
the file, multiplied by the cluster size? The number of bytes that
while((ch = getc(fp)) != EOF) { ++count; } would count when the file
is opened in binary mode? Or text mode? The number of tape blocks the
file occupies multiplied by the tape block size?

Until the C world can agree on what "file size" means, there will
continue to be no standard way of finding out.
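One of the definitions above, the number of characters that a getc() loop delivers, can at least be measured portably. A minimal sketch, assuming a hypothetical helper named count_chars (not from any library) that takes the fopen mode as a parameter so the text-mode and binary-mode counts can be compared:

```c
#include <assert.h>
#include <stdio.h>

/* Count the characters getc() delivers when `path` is opened with `mode`
 * ("r" for text, "rb" for binary). Returns -1 if the file cannot be
 * opened or a read error occurs. The name count_chars is illustrative. */
long count_chars(const char *path, const char *mode)
{
    assert(path != NULL && mode != NULL);

    FILE *fp = fopen(path, mode);
    if (fp == NULL)
        return -1;

    long count = 0;
    int ch;                     /* int, not char, so EOF stays distinct */
    while ((ch = getc(fp)) != EOF)
        ++count;

    int failed = ferror(fp);    /* EOF can also mean a read error */
    fclose(fp);
    return failed ? -1 : count;
}
```

On a system that stores line endings as CR-NL, count_chars(path, "r") and count_chars(path, "rb") would be expected to differ by one per line; on a system where text and binary streams are identical they should agree.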

Chris Croughton · Feb 20, 2005

When opening a CR-NL file, ftell returns the length of the file with the
CR-NL as two bytes, is it supposed to do so?

I assume that you did a seek to the end of the file first.

Yes, it can.

If the file is in binary mode, then yes, it will report the number of
characters in the file. This will be the number of characters which you
can read using getc() and counting them (which may be different from the
allocated space on disk or whatever). If it's in text mode, the only
thing guaranteed about ftell() is that it returns a value which can be
used later by fseek() to get to the same position; the value may have no
other relation to the size of the file at all.

I am comparing two file-sizes, one CR-NL and one NL using ftell to get the
filesize. Any alternative suggestion is welcomed.

If you actually want the filesize, you'll have to use operating system
specific functions (on many systems, look for stat() and fstat()). If
you want to know the number of characters which can be read from a file
opened in text mode, the only way is to read it and count them. You
can't even guarantee that the value will be less than that returned by
opening it in binary mode.

Chris C
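A rough sketch of the two approaches mentioned in this post, assuming a POSIX system for the stat() half. The helper names os_file_size and seek_end_size are invented for illustration. The seek-to-end version works on many hosted systems, but standard C does not promise that SEEK_END is meaningful on a binary stream, so treat it as a convenience rather than a guarantee:

```c
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>   /* POSIX, not part of standard C */

/* Size the operating system records for `path`, or -1 on failure. */
long long os_file_size(const char *path)
{
    assert(path != NULL);

    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long long)st.st_size;
}

/* Seek-to-end idiom on a binary stream. Returns -1 on failure.
 * Not guaranteed by the C standard to give the file's length. */
long seek_end_size(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    long size = -1;
    if (fseek(fp, 0L, SEEK_END) == 0)
        size = ftell(fp);       /* ftell itself returns -1L on failure */
    fclose(fp);
    return size;
}
```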

infobahn · Feb 20, 2005

Chris said:
I assume that you did a seek to the end of the file first.

Yes, it can.

If the file is in binary mode, then yes, it will report the number of
characters in the file.

"A binary stream need not meaningfully support fseek calls with a
whence value of SEEK_END."

SM Ryan · Feb 20, 2005

# Hello
#
# When opening a CR-NL file, ftell returns the length of the file with the
# CR-NL as two bytes, is it supposed to do so?

ftell returns a magic cookie that is only required to be sensible to fseek.
Whether the implementation chooses to make it sensible to you is the implementor's
prerogative. If you want a sure, portable way to count entities in a file, open it
and read it from beginning to end, counting whatever your entities are.
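To illustrate the "magic cookie" usage, here is a small sketch in which the ftell value is never inspected, only handed back to fseek with SEEK_SET. The helper name peek_line is hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/* Read one line into buf, then return to where that line began so the
 * caller can read it again. Returns 0 on success, -1 on failure. */
int peek_line(FILE *fp, char *buf, int size)
{
    assert(fp != NULL && buf != NULL && size > 0);

    long mark = ftell(fp);      /* opaque bookmark on a text stream */
    if (mark == -1L)
        return -1;
    if (fgets(buf, size, fp) == NULL)
        return -1;

    /* The portable use of `mark`: give it back to fseek with SEEK_SET. */
    return fseek(fp, mark, SEEK_SET) == 0 ? 0 : -1;
}
```

Nothing here does arithmetic on mark or compares it with a byte count, which is the part that is not portable on text streams.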

Thomas Matthews · Feb 20, 2005

Martin said:
Hello

When opening a CR-NL file, ftell returns the length of the file with the
CR-NL as two bytes, is it supposed to do so?

I am comparing two file-sizes, one CR-NL and one NL using ftell to get the
filesize. Any alternative suggestion is welcomed.

Thanks
- Martin Johansen

The best method to find the size of a file is to use
a platform specific method; not very portable though.

To get the number of characters in a file
open the file in binary mode, which disables
translations, and read each character using fread
while incrementing a counter.

As others have said, the ftell function returns the
current position in the file, which may not reflect
the number of characters in the file.
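The binary-mode fread approach could look something like the following sketch. The function name count_bytes_fread and the 4096-byte buffer are arbitrary choices for illustration:

```c
#include <assert.h>
#include <stdio.h>

/* Count the bytes in `path` by reading it in binary mode with fread.
 * Returns -1 on open or read failure. */
long count_bytes_fread(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "rb");   /* "b" disables newline translation */
    if (fp == NULL)
        return -1;

    unsigned char buf[4096];
    long total = 0;
    size_t got;
    while ((got = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)got;

    int failed = ferror(fp);        /* short read: EOF or an error? */
    fclose(fp);
    return failed ? -1 : total;
}
```

Reading in blocks rather than one character at a time gives the same count; the return type limits the result to what fits in a long.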

--
Thomas Matthews

C++ newsgroup welcome message:
http://www.slack.net/~shiva/welcome.txt
C++ Faq: http://www.parashift.com/c++-faq-lite
C Faq: http://www.eskimo.com/~scs/c-faq/top.html
alt.comp.lang.learn.c-c++ faq:
http://www.comeaucomputing.com/learn/faq/
Other sites:
http://www.josuttis.com -- C++ STL Library book
http://www.sgi.com/tech/stl -- Standard Template Library

Bart C · Feb 21, 2005

No. It is not ftell's job to return the length of a file. ...
What do you mean by file size? The number of disk clusters occupied by ...
Until the C world can agree on what "file size" means, there will
continue to be no standard way of finding out.

I already know that feof does not work as one would expect.

Now I'm learning that ftell may not give an accurate file position and that
there is no guaranteed way to find the size of a normal disk file. You'd
think these are fairly basic file operations but C seems to want to make
life difficult.

Why doesn't feof work as expected, ie return True when positioned at
end-of-file? Is coding 'currfileposition' >= 'filesize' really that
difficult.

Why is it so awkward to get the size of a file anyway? What do cluster sizes
(on a modern OS) have to do with it?

And why bother with text mode with all it's pitfalls; reading or writing
cr-lf explicitly is not that hard is it?

Bart

Ben Pfaff · Feb 21, 2005

Bart C said:
Why doesn't feof work as expected, ie return True when positioned at
end-of-file? Is coding 'currfileposition' >= 'filesize' really that
difficult.

The problem is that your suggested formulation requires feof() to
predict the future. If another process comes along and appends
to the file, then a read might succeed that you thought would
fail. If the stream is connected to an interactive device, such
as a keyboard, you'd have to know whether the user was going to
enter EOF next or not, and that's impossible in general.
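In practice this means the usual idiom is to attempt the read, let its return value end the loop, and only afterwards ask feof() or ferror() why the read failed. A minimal sketch, with count_lines as a made-up example function:

```c
#include <assert.h>
#include <stdio.h>

/* Count lines on any stream. Sets *hit_error to nonzero if reading
 * stopped because of an error rather than end-of-file. */
long count_lines(FILE *fp, int *hit_error)
{
    assert(fp != NULL && hit_error != NULL);

    long lines = 0;
    int ch, last = '\n';

    /* Try the read first; its result, not feof(), controls the loop. */
    while ((ch = getc(fp)) != EOF) {
        if (ch == '\n')
            ++lines;
        last = ch;
    }
    if (last != '\n')
        ++lines;                /* final line without a trailing newline */

    /* Only now are feof/ferror informative: they say why getc failed. */
    *hit_error = ferror(fp) != 0;
    return lines;
}
```

On a freshly opened empty file feof() is still false, even though the position is already at the end; it becomes true only after a read has actually run into end-of-file.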

Peter Nilsson · Feb 21, 2005

Bart said:
I already know that feof does not work as one would expect.

feof is no different to any other function, be it a standard function or
not. It will do what its specification says it will do. No more, no
less.

If you program by guesswork and assumptions, without reading the
relevant specifications, then you can _expect_ programs to fail
in (often capricious) ways.

Now I'm learning that ftell may not give an accurate file position

It will give an accurate file position in most cases, particularly
those where you need it to.

and that there is no guaranteed way to find the size of a normal disk
file.

That's because C does not assume 'normal disk files'.

You'd think these are fairly basic file operations but C seems to want
to make life difficult.

No. C is just more generic than most people wish it to be. But if you
only ever want to program vanilla machines, then you are free to do so.

Why doesn't feof work as expected, ie return True when positioned at
end-of-file?

How often do you *need* it to? Think about this carefully before
answering.

Is coding 'currfileposition' >= 'filesize' really that
difficult.

What if you don't (and can't) know what filesize is?

Why is it so awkward to get the size of a file anyway?

Again, how often do you *need* it?

What do cluster sizes (on a modern OS) have to do with it?

On some file systems of old, there was no recording of 'logical' file
size, merely 'physical' file size. Files were precisely as big as the
space occupied on the disk. C allows null bytes to fill the trailing
disk space.

Hence, fseek-ing to the end of a binary file doesn't always get you
to the 'logical' end.

And why bother with text mode with all it's pitfalls; reading or
writing cr-lf explicitly is not that hard is it?

Again, you show your naivety. Some old file systems didn't use
end-of-line characters. Instead, they used fixed width records, padded
with either null or space characters. A text mode is needed for C
programs to operate consistently across a range of file systems.

All that said, most of what you think you *need* is available in
POSIX. So there's no need to get too upset.

Jack Klein · Feb 21, 2005

# Hello
#
# When opening a CR-NL file, ftell returns the length of the file with the
# CR-NL as two bytes, is it supposed to do so?

ftell returns a magic cookie that is only required to be sensible to fseek.
Whether the implementation chooses to make it sensible to you is the implementor's
prerogative. If you want a sure, portable way to count entities in a file, open it
and read it from beginning to end, counting whatever your entities are.

True for files opened in text mode, but:

"The ftell function obtains the current value of the file position
indicator for the stream pointed to by stream. For a binary stream,
the value is the number of characters from the beginning of the file."

So for a binary file, it is not a magic cookie but an actual value, as
long as the current file position is within the range of a positive
signed long.
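A small sketch of what that guarantee means in practice, assuming a stream opened in binary mode (tmpfile() produces one). The helper name position_after_reading is invented for this example:

```c
#include <assert.h>
#include <stdio.h>

/* Read up to n characters from a binary stream, then return the
 * position ftell reports, or -1L if ftell fails. */
long position_after_reading(FILE *fp, long n)
{
    assert(fp != NULL && n >= 0);

    for (long i = 0; i < n; ++i)
        if (getc(fp) == EOF)
            break;              /* fewer than n characters were left */

    return ftell(fp);           /* binary stream: characters from start */
}
```

For a binary stream holding the six bytes "a\r\nb\r\n", reading three characters from the start should leave ftell reporting 3, with the '\r' counted like any other character.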

Jack Klein · Feb 21, 2005

Hello

When opening a CR-NL file, ftell returns the length of the file with the
CR-NL as two bytes, is it supposed to do so?

That depends. If the files are opened in binary mode, ftell() is
supposed to return a file position which is the exact number of
characters from the beginning of the file. There are no special
characters at all in binary mode, so most certainly every '\r' is
counted as a character whether or not it is immediately followed by a
'\n'.

But there is no way to guarantee that you are at the end of a binary
file unless you have read every single character in the file. fseek()
is not guaranteed to work for binary files in the way you expect.

For files opened in text mode, the value returned by ftell() is not
guaranteed to be useful for anything other than passing to fseek() to
return to the same point in the file. It need not have any
relationship to the size of the file in any meaningful way that your
program can use.

I am comparing two file-sizes, one CR-NL and one NL using ftell to get the
filesize. Any alternative suggestion is welcomed.

If two text files, containing one or more lines, differ in the fact
that one contains only "\n" at the end of each line and the other
contains "\r\n", then they are indeed different sizes. What do you
expect?
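If the real goal is to decide whether the two files hold the same text regardless of line-ending convention, comparing sizes will not do it, but comparing contents can. A hedged sketch, assuming both streams are opened in binary mode and that only the CR-NL versus NL difference should be ignored (the function names are made up, and read errors are treated like end-of-file for brevity):

```c
#include <assert.h>
#include <stdio.h>

/* Next character from a binary stream, with each "\r\n" pair delivered
 * as a single '\n'. A lone '\r' is passed through unchanged. */
static int next_normalized(FILE *fp)
{
    int ch = getc(fp);
    if (ch == '\r') {
        int next = getc(fp);
        if (next == '\n')
            return '\n';
        if (next != EOF)
            ungetc(next, fp);   /* one character of pushback is guaranteed */
    }
    return ch;
}

/* Returns 1 if the two streams hold the same text apart from
 * CR-NL versus NL line endings, 0 otherwise. */
int same_text(FILE *a, FILE *b)
{
    assert(a != NULL && b != NULL);

    for (;;) {
        int ca = next_normalized(a);
        int cb = next_normalized(b);
        if (ca != cb)
            return 0;
        if (ca == EOF)
            return 1;           /* both ended at the same point */
    }
}
```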

Eric Sosman · Feb 21, 2005

Bart said:
[...]
Why doesn't feof work as expected, ie return True when positioned at
end-of-file? Is coding 'currfileposition' >= 'filesize' really that
difficult.

Yes. Keep in mind that feof() et al. operate on FILE*
streams, which may be connected to data sources (and sinks)
that are not fixed-size files. Explain, if you will, how
you would implement a "predictive" feof() on a stream taking
data from a TCP/IP socket, or even from your keyboard.

Why is it so awkward to get the size of a file anyway? What do cluster sizes
(on a modern OS) have to do with it?

And why bother with text mode with all it's pitfalls; reading or writing
cr-lf explicitly is not that hard is it?

"There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy."

Your experience with different file formats is clearly
not very extensive. Here are a few of the byte sequences
you might find in a file after puts("Hello") -- all of these
are from my own experience and none is a fabrication, although
I may have mis-remembered a detail here and there:

H e l l o \n
H e l l o \r
H e l l o \r \n
\005 \000 H e l l o \000
H e l l o \040 \040 \040... (75 spaces all told)
\006 \000 \001 H e l l o
H e l l o \n \032... (plus 121 garbage characters)

Thought question: Would you prefer to learn all the rules of
these (and many other) file formats and write that knowledge
into all your programs, or would it make more sense to use a
text stream to mediate between these and a standardized format?

<off-topic>

In a way, your inexperience can be seen as a Good Thing
and a sign of progress: You have seen few file formats because
the industry has learned to value simplicity, and inventing
strange new formats is not the cottage industry it once was.
In truth, simplicity is not the answer to all things; complex
formats have their purposes and can handle some circumstances
better than simple alternatives. Yet, it has turned out that
the simple designs are more widely applicable than was thought
(in large part because today's computers can afford to spend
more memory and processing power on interpreting them), so the
more complex formats are marginalized to the special purposes
where their strengths are indispensable. The casual and even
not-so-casual user encounters only the simple formats, and
begins to believe no others exist.

May I introduce you to my pet crow, "Whitey?"

</off-topic>

Bart C · Feb 21, 2005

Eric Sosman said:
Bart said:

[...]
Why doesn't feof work as expected, ie return True when positioned at
end-of-file? Is coding 'currfileposition' >= 'filesize' really that
difficult.

Yes. Keep in mind that feof() et al. operate on FILE*
streams, which may be connected to data sources (and sinks)
that are not fixed-size files. Explain, if you will, how
you would implement a "predictive" feof() on a stream taking
data from a TCP/IP socket, or even from your keyboard.

I wouldn't. I would treat disk files differently from devices such as
keyboards or i/o ports. The two kinds of data are different enough to
warrant a separate set of functions

You may want to use C to implement another language and to emulate the
behaviour of that language's equivalent of feof(). C, being general purpose,
should be up to the job but sometimes it's not that easy.

I also found some time back on this newsgroup that reading a single key from
the keyboard was not part of standard C! This is a problem I remember from
mainframes in the 70s. It went away with microcomputers in the 80s, and now
with C it's come back again. 2 major revisions of the C standard and
something so basic is not in?

Your experience with different file formats is clearly
not very extensive. Here are a few of the byte sequences
you might find in a file after puts("Hello") -- all of these
are from my own experience and none is a fabrication, although
I may have mis-remembered a detail here and there:

H e l l o \n
H e l l o \r
H e l l o \r \n
\005 \000 H e l l o \000
H e l l o \040 \040 \040... (75 spaces all told)
\006 \000 \001 H e l l o
H e l l o \n \032... (plus 121 garbage characters)

My specs for puts() say that '\n' is appended after the string argument.
Whether that means cr-lf, cr or lf I'm not sure, but it's best to assume any
of these when reading such a file. If you're getting all this extra garbage
after your data (I don't mean padding bytes to fill up a disk sector) then
I'd complain.

Thought question: Would you prefer to learn all the rules of
these (and many other) file formats and write that knowledge
into all your programs, or would it make more sense to use a
text stream to mediate between these and a standardized format?

I've invented plenty of file formats. But to the OS or the C runtime, my
file should be just a bunch of data, namely a set of N bytes. And a text
file is a set of bytes sprinkled with cr and/or lf characters. The total size
of N bytes should (naturally) include those characters.

If the OS, disk controller, modem, whatever wants to add extra bytes to
that, that's fine provided they are transparent.

Bart

infobahn · Feb 21, 2005

Bart said:
I wouldn't. I would treat disk files differently from devices such as
keyboards or i/o ports. The two kinds of data are different enough to
warrant a separate set of functions

On the other hand, the stream model (complete with stdin, stdout,
and stderr) is extremely convenient much of the time. On platforms
where it makes sense to have a separate set of functions for
disk I/O or the console, a good implementation will typically
provide such functions as an extension. That way, you get the
best of both worlds - you can ignore console I/O in favour of
stream I/O if you need portability, or you can take advantage
of console I/O (at the expense of portability). The choice is
yours.

You may want to use C to implement another language and to emulate the
behaviour of that language's equivalent of feof(). C, being general purpose,
should be up to the job but sometimes it's not that easy.

Take Pascal (where the equivalent of feof() is predictive). How
does it know whether the user is about to terminate keyboard
input? Unless Pascal can read minds, it simply can't know this.
So it can't do predictive feof on stdin except in cases where
it knows for sure that the input is being redirected from a
data source of known size.

I also found some time back on this newsgroup that reading a single key from
the keyboard was not part of standard C!

Correct. You can read a single key from stdin, of course, but that
might not be attached to the keyboard. Or if your system's stream
I/O is line-buffered, you might not get the instant response for
which you may have been hoping.

This is a problem I remember from mainframes in the 70s.

Line-buffered I/O.

It went away with microcomputers in the 80s, and now
with C it's come back again. 2 major revisions of the C standard and
something so basic is not in?

Well, in a way it is (see above). IIRC the Standard does not
/insist/ on line-buffered I/O for stdin. That's just the way
it normally turns out. On some systems (eg Linux) you get to
choose. But the microcomputer BASICs of the 80s didn't have
to worry about portability to all kinds of bizarre systems.
Programs only had to work "RIGHT HERE"; so unbuffered I/O was
easily supplied. Similarly, you can have that with C on
platforms where it makes sense. Witness the getch() of Borland
and Microsoft, the rather different getch() of ncurses, the
Conin() (if I remember rightly) of the Atari, and so on - all
available in C programs for their respective platforms and
implementations.

My specs for puts() say that '\n' is appended after the string argument.
Whether that means cr-lf, cr or lf I'm not sure, but it's best to assume any
of these when reading such a file. If you're getting all this extra garbage
after your data (I don't mean padding bytes to fill up a disk sector) then
I'd complain.

That wouldn't do much good. The reality is that the world is wider
than many programmers realise, and there's a lot of variety out there.

If the OS, disk controller, modem, whatever wants to add extra bytes to
that, that's fine provided they are transparent.

In text mode, they are. In binary mode, C *must* be able to see
every byte. That's part of its job!

Eric Sosman · Feb 21, 2005

Bart said:
Bart said:

[...]
Why doesn't feof work as expected, ie return True when positioned at
end-of-file? Is coding 'currfileposition' >= 'filesize' really that
difficult.

Yes. Keep in mind that feof() et al. operate on FILE*
streams, which may be connected to data sources (and sinks)
that are not fixed-size files. Explain, if you will, how
you would implement a "predictive" feof() on a stream taking
data from a TCP/IP socket, or even from your keyboard.

I wouldn't. I would treat disk files differently from devices such as
keyboards or i/o ports. The two kinds of data are different enough to
warrant a separate set of functions

IMHO such a step would be a backwards step. The first
programming language I used had different I/O verbs for
different devices: "READ CARD," "PUNCH PAPER TAPE," and so
on. A program that had been written to generate output on
punched cards could not be persuaded to write a magnetic
tape instead; a program capable of both card and tape output
needed conditional logic at every output-generating point to
decide which verb was appropriate, which (as you can imagine)
was both ugly and bug-prone.

C's uniform I/O model relieves the programmer of such
headaches by pushing the details of handling different device
types out of the application arena and into the implementation.
Devices differ, of course, and this has two consequences: first,
some of the differences get "filed off" in the sense that the
unified I/O model can't exploit them (e.g., there's no portable
way for a C program to stream data to your sound card), and
second, some of the differences obtrude themselves through the
uniformity in ugly ways (e.g., setvbuf(), fflush() ...). Still
and all, I feel it's a pretty good trade most of the time.

I also found some time back on this newsgroup that reading a single key from
the keyboard was not part of standard C!

"The keyboard" itself is not part of standard C.

This is a problem I remember from
mainframes in the 70s. It went away with microcomputers in the 80s, and now
with C it's come back again. 2 major revisions of the C standard and
something so basic is not in?

C has a perfectly good way of reading a single character
(or getting an EOF or error indication) from any input source:
it's called getc(). However, different input sources have
different ideas about how and when to provide their characters.
Users of systems with keyboards often like to be able to edit
their input while composing it, so most platforms gather keyed
input in a "batch and commit" mode. Some platforms provide a
way to change the mode, but since different platforms do it
differently, the C language Standard cannot legislate one
platform's solution over the others'.

Remember always that the C Standard carries no authority
beyond that voluntarily granted it by organizations that choose
to adopt and/or require it. If the Standard had required a
discipline for keyboard handling that was difficult for some
systems to support, the users of those systems would have been
less inclined to adopt the Standard.

My specs for puts() say that '\n' is appended after the string argument.
Whether that means cr-lf, cr or lf I'm not sure, but it's best to assume any
of these when reading such a file.

All my examples have the '\n' present. You can't see
it in some of them because the '\n' is encoded differently
in different storage schemes, but that doesn't mean it's
missing.

If you're getting all this extra garbage
after your data (I don't mean padding bytes to fill up a disk sector) then
I'd complain.

The point is that you *don't* get "all this extra garbage"
if you use a text stream to read the data back again: the C
library understands the local conventions for files, and mediates
between their idiosyncrasies and a simpler convention.

If the OS, disk controller, modem, whatever wants to add extra bytes to
that, that's fine provided they are transparent.

The text stream makes them so. And that's reason enough
(referring to your earlier message) to "bother with text mode
and all it's [sic] pitfalls."

Flash Gordon · Feb 21, 2005

Bart said:
Bart said:

[...]
Why doesn't feof work as expected, ie return True when positioned at
end-of-file? Is coding 'currfileposition' >= 'filesize' really that
difficult.

Yes. Keep in mind that feof() et al. operate on FILE*
streams, which may be connected to data sources (and sinks)
that are not fixed-size files. Explain, if you will, how
you would implement a "predictive" feof() on a stream taking
data from a TCP/IP socket, or even from your keyboard.

I wouldn't. I would treat disk files differently from devices such as
keyboards or i/o ports. The two kinds of data are different enough to
warrant a separate set of functions

That would make it far harder to implement programs that can take input
from a number of possible sources including a file based on command line
switches.
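As a sketch of that point: with the uniform stream model, the processing code takes a FILE * and does not care what is behind it. The helper names open_input and copy_stream are hypothetical, and treating "-" as standard input is a common command-line convention assumed here, not something the C standard defines:

```c
#include <assert.h>
#include <stdio.h>

/* Pick the input source: stdin when no name is given (or the name is
 * "-"), otherwise the named file. Returns NULL if the file cannot be
 * opened. */
FILE *open_input(const char *name)
{
    if (name == NULL || (name[0] == '-' && name[1] == '\0'))
        return stdin;
    return fopen(name, "r");
}

/* Copy every character from `in` to `out`, whatever `in` is connected
 * to. Returns the number of characters copied, or -1 on error. */
long copy_stream(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);

    long n = 0;
    int ch;
    while ((ch = getc(in)) != EOF) {
        if (putc(ch, out) == EOF)
            return -1;
        ++n;
    }
    return ferror(in) ? -1 : n;
}
```

A main() would typically call copy_stream(open_input(argc > 1 ? argv[1] : NULL), stdout) after checking for NULL; the copying loop is the same whether the data comes from a disk file, a pipe or a keyboard.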

You may want to use C to implement another language and to emulate the
behaviour of that language's equivalent of feof(). C, being general purpose,
should be up to the job but sometimes it's not that easy.

There is no language in which everything is easy.

I also found some time back on this newsgroup that reading a single key from
the keyboard was not part of standard C! This is a problem I remember from
mainframes in the 70s. It went away with microcomputers in the 80s, and now
with C it's come back again.

Actually it never went away.

2 major revisions of the C standard and
something so basic is not in?

I agree that it would be useful, but it was not added to the standard.

My specs for puts() say that '\n' is appended after the string argument.
Whether that means cr-lf, cr or lf I'm not sure, but it's best to assume any
of these when reading such a file.

It adds a \n which then gets translated by the C implementation to
whatever the file system wants to indicate a new line.

If you're getting all this extra garbage
after your data (I don't mean padding bytes to fill up a disk sector) then
I'd complain.

Without all that other stuff, if you loaded your "text" file into a text
editor on the system, or passed it to anything else expecting a text
file, it would not work. That is because text files on those systems are
*defined* as using fixed length space padded lines, or lines where the
line length is indicated by the first byte of the line record or whatever.

Trying to write a simple text processing application that works on all
systems, including those with strange (to you) native text file formats,
*without* having the implementation take care of the details would be
a *major* problem.

I've invented plenty of file formats. But to the OS or the C runtime, my
file should be just a bunch of data, namely a set of N bytes.

*Your* file formats are. Just open them in binary mode and that is what
you get.

And a text
file is a set of bytes sprinkled with cr and/or lf characters. The total size
of N bytes should (naturally) include those characters.

No, a text file is whatever the OS defines a text file as being, which
can be a *lot* more complex.

If the OS, disk controller, modem, whatever wants to add extra bytes to
that, that's fine provided they are transparent.

The whole point of the way text streams are handled in C is that it
*does* make it transparent. You don't have to worry about whether it is
CR, CR/LF, LF, explicit length records, padded fixed length lines or
what. However, this means the file size on disk is *not* always the
number of characters you will read if you read all the way through it.

Christian Kandeler · Feb 22, 2005

Eric said:
The point is that you *don't* get "all this extra garbage"
if you use a text stream to read the data back again

Unless, of course, it was written on a different platform. I can see how
that could surprise people: It is well known that a binary format is not
very portable, but text files are often thought of as generic.

Christian

Flash Gordon · Feb 22, 2005

Christian said:
Unless, of course, it was written on a different platform. I can see how
that could surprise people: It is well known that a binary format is not
very portable, but text files are often thought of as generic.

<OT>
Which is why things like ftp have an ASCII or text mode for copying text
files, where obviously it performs any required transformation.
</OT>

Eric Sosman · Feb 22, 2005

Christian said:
Unless, of course, it was written on a different platform. I can see how
that could surprise people: It is well known that a binary format is not
very portable, but text files are often thought of as generic.

When you move the file from Platform A to Platform B,
you must convert the content from A's conventions to B's.
The widely-used FTP protocol, for example, does such a
conversion for "text mode" file transfers. Phil Katz' ZIP
format provides a facility to tag archive members for
conversion between local text conventions and "in flight"
representation.

Text files are not perfectly portable -- heck, the
media on which the files are written are not perfectly
portable! -- but are *much* more easily exchanged between
dissimilar systems than are "binary" files.
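A very small sketch of such a conversion step, in the export direction: text is read through a text stream, so the C library delivers each local line ending as '\n', and is written to a binary stream with every line ending spelled out as CR-NL (the representation FTP uses in flight for ASCII transfers). The function name export_crlf is invented for illustration:

```c
#include <assert.h>
#include <stdio.h>

/* Copy text from `in` (a text stream) to `out` (a binary stream),
 * writing each line ending as "\r\n". Returns the number of line
 * endings written, or -1 on a read or write error. */
long export_crlf(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);

    long lines = 0;
    int ch;
    while ((ch = getc(in)) != EOF) {
        if (ch == '\n') {
            if (putc('\r', out) == EOF)
                return -1;
            ++lines;
        }
        if (putc(ch, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : lines;
}
```

The receiving side would do the reverse: read the CR-NL form from a binary stream and write '\n' to a text stream, letting its own library produce whatever the local convention requires.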
