Getting file size of binary file

Arnold

Is using fseek and ftell a reliable method of getting the file size on a
binary file? I thought I remember reading somewhere it wasn't... If not what
would be the "right" and portable method to obtain it? Thanks.
 
Richard Bos

Arnold said:
> Is using fseek and ftell a reliable method of getting the file size on a
> binary file?

No. From 7.19.9.2#3: "A binary stream need not meaningfully support
fseek calls with a whence value of SEEK_END".

To say that this irks me would be a bit of an understatement.

> I thought I remember reading somewhere it wasn't... If not what
> would be the "right" and portable method to obtain it?

There is none, in ISO C.

To say that _this_ irks me would be a bit of an understatement, as well.
It should at least be possible to get the value of "what the OS thinks
the file size is", but apparently there are reasons why it isn't; I've
never heard one that is convincing, though.

Richard
 
Richard Head

> Is using fseek and ftell a reliable method of getting the file size on a
> binary file? I thought I remember reading somewhere it wasn't... If not what
> would be the "right" and portable method to obtain it? Thanks.

try fstat()
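
For readers on a POSIX system, a minimal sketch of that suggestion might look like the following. It is not ISO C: `fstat` and `fileno` are POSIX, and `posix_file_size` is a hypothetical helper name used only for illustration. Output still sitting in the stdio buffer is not counted until the stream is flushed.

```c
#define _POSIX_C_SOURCE 200809L  /* assumption: POSIX system; exposes fileno() */
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: size in bytes of the regular file behind fp,
   or -1 if it cannot be determined. POSIX only, not ISO C. */
long long posix_file_size(FILE *fp)
{
    struct stat st;

    if (fp == NULL)
        return -1;
    if (fstat(fileno(fp), &st) != 0)
        return -1;
    if (!S_ISREG(st.st_mode))
        return -1;  /* pipes, terminals and the like have no useful size */
    return (long long)st.st_size;
}
```

The check for a regular file is a design choice: for anything else the reported size is unlikely to mean "bytes you can read".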
 
CBFalconer

Richard said:
>> to obtain it? Thanks.
>
> try fstat()

No, don't. There is no fstat() in standard C. Please do not give
off-topic answers in this newsgroup, where there may be nobody to
make corrections.
 
Kevin Goodsell

Richard said:
> No. From 7.19.9.2#3: "A binary stream need not meaningfully support
> fseek calls with a whence value of SEEK_END".

From the FAQ for this group:

http://www.eskimo.com/~scs/C-faq/q19.12.html

---
How can I find out the size of a file, prior to reading it in?

If the ``size of a file'' is the number of characters you'll be able to
read from it in C, it is difficult or impossible to determine this
number exactly).

Under Unix, the stat call will give you an exact answer. Several other
systems supply a Unix-like stat which will give an approximate answer.
You can fseek to the end and then use ftell, but these tend to have the
same problems: fstat is not portable, and generally tells you the same
thing stat tells you; ftell is not guaranteed to return a byte count
except for binary files. Some systems provide routines called filesize
or filelength, but these are not portable, either.

Are you sure you have to determine the file's size in advance? Since the
most accurate way of determining the size of a file as a C program will
see it is to open the file and read it, perhaps you can rearrange the
code to learn the size as it reads.
---
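
The FAQ's last suggestion, learning the size by reading, can be sketched in ISO C roughly as follows. The function name `count_bytes_by_reading` and the 4096-byte buffer are illustrative assumptions, and overflow of `long` on very large files is ignored.

```c
#include <stdio.h>

/* Count the bytes that can actually be read from fp, starting at the
   current position. Returns the count, or -1 on a read error.
   The stream is left at end-of-file. */
long count_bytes_by_reading(FILE *fp)
{
    unsigned char buf[4096];
    long total = 0;
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;

    return ferror(fp) ? -1 : total;
}
```

In a real program the loop body would process the data as well as count it, which is the rearrangement the FAQ is hinting at.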

Does this look strange to anyone else? There's that lone closing paren
in the first paragraph, but the part that really bothers me is "ftell is
not guaranteed to return a byte count except for binary files." It seems
to be suggesting that the fseek/ftell method would be OK for a binary
file, but the line from the standard that Richard quoted suggests the opposite.

> To say that this irks me would be a bit of an understatement.

> There is none, in ISO C.
>
> To say that _this_ irks me would be a bit of an understatement, as well.
> It should at least be possible to get the value of "what the OS thinks
> the file size is", but apparently there are reasons why it isn't; I've
> never heard one that is convincing, though.

I suppose that it's partly because C deals with streams, not files
directly (for the most part). Many things may not make sense for a
stream, size included. How could the size of stdin be meaningful, for
example? At the same time, there are at least a few standard functions
that only make sense for certain types of streams. Seems like it
wouldn't be such a bad idea to have a few more.

-Kevin
 
glen herrmannsfeldt

Richard Bos wrote:

(snip)
> No. From 7.19.9.2#3: "A binary stream need not meaningfully support
> fseek calls with a whence value of SEEK_END".
>
> To say that this irks me would be a bit of an understatement.

(snip)

> To say that _this_ irks me would be a bit of an understatement, as well.
> It should at least be possible to get the value of "what the OS thinks
> the file size is", but apparently there are reasons why it isn't; I've
> never heard one that is convincing, though.

I was reading not so long ago what one of IBM's C compilers for
VM/CMS or MVS does for fseek/ftell. For files with variable length
records, text or binary, ftell returns the block number in the
upper 17 bits, and position in the block in the lower 15 bits.
(OS restrictions tend to keep blocks less than 32K.) I think
it wraps at 128K blocks.
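
To make the bit layout concrete, here is a small sketch of a position value built that way: block number in the upper 17 bits, offset within the block in the lower 15. The function and macro names are made up for illustration and are not taken from any IBM library; masking the block number is only an assumption that models the wrap at 128K blocks.

```c
#include <stdint.h>

#define OFFSET_BITS 15u
#define OFFSET_MASK ((UINT32_C(1) << OFFSET_BITS) - 1u)  /* 0x7FFF  */
#define BLOCK_MASK  ((UINT32_C(1) << 17) - 1u)           /* 0x1FFFF */

/* Build a position: block number in the upper 17 bits, offset in the
   lower 15. Block numbers of 131072 and above wrap around. */
uint32_t pack_position(uint32_t block, uint32_t offset)
{
    return ((block & BLOCK_MASK) << OFFSET_BITS) | (offset & OFFSET_MASK);
}

/* Recover the block number from a packed position. */
uint32_t position_block(uint32_t pos)
{
    return pos >> OFFSET_BITS;
}

/* Recover the offset within the block from a packed position. */
uint32_t position_offset(uint32_t pos)
{
    return pos & OFFSET_MASK;
}
```

With this layout, block 2 at offset 5 packs to 2*32768 + 5 = 65541. Subtracting two such values does not give a byte distance unless every block happens to hold exactly 32768 bytes, which is why the value is useless as a file size.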

MVS keeps track of files in tracks, which can't reliably be
converted to bytes. CMS maps variable length blocks onto
a fixed block file system, but also doesn't accurately
keep track of bytes of file data.

On traditional IBM mainframe OS's, tracks are formatted when
written. The block size is determined by the program, and can
be either fixed or variable length. As an added complication,
files with fixed length blocks will usually have a short block
at the end. If opened for append, this short block stays in
place, so even for fixed length blocks a block count can't
reliably indicate file size.

-- glen
 
Richard Bos

Kevin Goodsell said:
> I suppose that it's partly because C deals with streams, not files
> directly (for the most part). Many things may not make sense for a
> stream, size included. How could the size of stdin be meaningful, for
> example? At the same time, there are at least a few standard functions
> that only make sense for certain types of streams. Seems like it
> wouldn't be such a bad idea to have a few more.

Exactly; the function could always return -1 for "not available".

Richard
 
Richard Bos

glen herrmannsfeldt said:
> I was reading not so long ago what one of IBM's C compilers for
> VM/CMS or MVS does for fseek/ftell. For files with variable length
> records, text or binary, ftell returns the block number in the
> upper 17 bits, and position in the block in the lower 15 bits.
> (OS restrictions tend to keep blocks less than 32K.) I think
> it wraps at 128K blocks.
>
> MVS keeps track of files in tracks, which can't reliably be
> converted to bytes. CMS maps variable length blocks onto
> a fixed block file system, but also doesn't accurately
> keep track of bytes of file data.
>
> On traditional IBM mainframe OS's, tracks are formatted when
> written. The block size is determined by the program, and can
> be either fixed or variable length. As an added complication,
> files with fixed length blocks will usually have a short block
> at the end. If opened for append, this short block stays in
> place, so even for fixed length blocks a block count can't
> reliably indicate file size.

That doesn't convince me, either.

The OS has _some_ idea of how large the file is, if only to prevent the
user from writing past the end of it. It should be possible to pass this
knowledge on to the C implementation. If the result is approximate, that
is inherent in the OS, and the user will be expecting it.

Richard
 
glen herrmannsfeldt

Richard said:
(snip)

(snip)

> That doesn't convince me, either.
> The OS has _some_ idea of how large the file is, if only to prevent the
> user from writing past the end of it. It should be possible to pass this
> knowledge on to the C implementation. If the result is approximate, that
> is inherent in the OS, and the user will be expecting it.

The OS keeps track of how many tracks are allocated, but not how many
bytes are written to each one. The number of bytes you can fit on a
track with a BLKSIZE of 1 is about 1% of the maximum. There also
could be empty tracks allocated but not yet used, after the data.

There is no standard (or non-standard) way to say approximately how
much space a data set takes.

Assuming that every file system is like unix is not a good idea.

-- glen
 
Dik T. Winter

Note the "variable length records". I think that records can not span
track boundaries, and so each track contains unused data.

No. The OS only has to have some idea where the end of a file is.
> The OS keeps track of how many tracks are allocated, but not how many
> bytes are written to each one. The number of bytes you can fit on a
> track with a BLKSIZE of 1 is about 1% of the maximum. There also
> could be empty tracks allocated but not yet used, after the data.

The empty tracks are no problem I think, it is the partly filled tracks
that will give problems.
> There is no standard (or non-standard) way to say approximately how
> much space a data set takes.

There is a non-standard way. Take each allocated track in succession
and find the number of allocated bytes for each track (that number is
available). Add them and you are done. However, this does not tell
you where the next byte should be written. You could of course write
an ftell and fseek that would use byte-numbers, but implementation
would be slow as for each execution of such a routine you have to
consult a table containing the size of each track.
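
As a rough illustration of that idea, the sketch below sums a table of per-track byte counts and walks the same table to turn a byte number into a track and an offset. The table, the function names and the `unsigned long` counts are all assumptions made for the example; nothing here reflects how a real MVS or CMS runtime stores this information.

```c
#include <stddef.h>

/* Total size: add up the number of bytes stored on each track. */
unsigned long total_bytes(const unsigned long *track_bytes, size_t ntracks)
{
    unsigned long total = 0;

    for (size_t i = 0; i < ntracks; i++)
        total += track_bytes[i];
    return total;
}

/* Map a byte number to (track, offset within track) by walking the table.
   Returns 0 on success, -1 if pos is at or past the end of the data. */
int locate_byte(const unsigned long *track_bytes, size_t ntracks,
                unsigned long pos, size_t *track, unsigned long *offset)
{
    for (size_t i = 0; i < ntracks; i++) {
        if (pos < track_bytes[i]) {
            *track = i;
            *offset = pos;
            return 0;
        }
        pos -= track_bytes[i];  /* skip this track and keep looking */
    }
    return -1;
}
```

The linear walk in `locate_byte` is the per-call table lookup described above; a cached running total per track would speed it up at the cost of keeping the table current on every write.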
> Assuming that every file system is like unix is not a good idea.

Indeed.
 
Mantorok Redgormor

Arnold said:
> Is using fseek and ftell a reliable method of getting the file size on a
> binary file? I thought I remember reading somewhere it wasn't... If not what
> would be the "right" and portable method to obtain it? Thanks.

no. because a binary stream may have padding.
though i'm not so sure why a binary stream would
have padding. This is just what the standard says.
 
glen herrmannsfeldt

Dik T. Winter wrote:

(snip regarding a file system used on some IBM machines that have
a C compiler)
> There is a non-standard way. Take each allocated track in succession
> and find the number of allocated bytes for each track (that number is
> available). Add them and you are done.

I believe the only way to do that is to read all the tracks up
to the EOF. But why would you want to do that? You can't fseek()
with it, but you can with the block/offset form. Though I
am not sure that it doesn't need to read them even for that form.

It might be that the C library reads it once and keeps track of the
length of each block, and the track it is on for later use.
> However, this does not tell
> you where the next byte should be written. You could of course write
> an ftell and fseek that would use byte-numbers, but implementation
> would be slow as for each execution of such a routine you have to
> consult a table containing the size of each track.

It might be that it does keep track of where the last track with
data on it is.

-- glen
 
glen herrmannsfeldt

> no. because a binary stream may have padding.
> though i'm not so sure why a binary stream would
> have padding. This is just what the standard says.

I believe that there are some file systems that use fixed blocks,
such as 512 bytes, and keep track of the number of blocks but not
the number of bytes in the last block.

Rumors are that CP/M did this, and used X'26' on text files to mark
the real end.

Some tape systems also can only write 512 byte blocks.

-- glen
 
Dave Thompson

On Thu, 08 Jan 2004 19:51:46 GMT, Kevin Goodsell
> Under Unix, the stat call will give you an exact answer. Several other
> systems supply a Unix-like stat which will give an approximate answer.
> You can fseek to the end and then use ftell, but these tend to have the
> same problems: fstat is not portable, and generally tells you the same
> thing stat tells you; ftell is not guaranteed to return a byte count
> except for binary files. Some systems provide routines called filesize
> or filelength, but these are not portable, either. <snip>
> ---
>
> Does this look strange to anyone else? There's that lone closing paren
> in the first paragraph, but the part that really bothers me is "ftell is
> not guaranteed to return a byte count except for binary files." It seems
> to be suggesting that the fseek/ftell method would be OK for a binary
> file, but the line from the standard that Richard quoted suggests the opposite.

What it's trying to say, but doesn't spell out well, is that ftell()
of a binary stream, if it works at all, must return a byte count --
and similarly fseek() of a binary stream if it works must accept a
byte count, however much extra work the C runtime must do to deal with
radically non-Unix-like files -- but ftell() of a text stream may
return, and fseek() accept, a "cookie" on which arithmetic does not
work, and (in this context) does not even resemble a file size
measure; 7.19.9.4p2.

As an extreme example, I think someone reliable posted a few months
back (or maybe in c.s.c) that VMS C couldn't fit the necessary info in
a long so it allocated memory space where it stored the RMS record
info and returned the address of that space (on VAX all addresses were
flat 32 bit, with a break at 2 up 31, and so fit in 32-bit long).

In other words, it is saying: if you want to try the fseek(END),ftell
method, only try it on a binary stream; and it should but doesn't note
that even that may fail (at runtime, but at least noisily).
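
That reading can be sketched in ISO C along these lines. `binary_stream_size` is a hypothetical helper, the stream is assumed to have been opened in binary mode (for example with "rb"), and because of 7.19.9.2 the result is best-effort: on some implementations the seek to SEEK_END may fail or land past the data actually written.

```c
#include <stdio.h>

/* Hypothetical helper: try to learn the size of a stream opened in binary
   mode. Returns the size in bytes, or -1L if any step fails.
   The original file position is restored on success. */
long binary_stream_size(FILE *fp)
{
    long start, size;

    if (fp == NULL)
        return -1L;
    start = ftell(fp);              /* remember where we were */
    if (start == -1L)
        return -1L;
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;                 /* stream does not support SEEK_END */
    size = ftell(fp);
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1L;
    return size;                    /* already -1L if the second ftell failed */
}
```

Every call is checked so that a failure shows up as -1 rather than as a plausible-looking number.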

- David.Thompson1 at worldnet.att.net
 
Dave Thompson

Of course even in this case it could and probably would give you the
size allocated, it's just that that's increased from the size written.

> I believe that there are some file systems that use fixed blocks,
> such as 512 bytes, and keep track of the number of blocks but not
> the number of bytes in the last block.
>
> Rumors are that CP/M did this, and used X'26' on text files to mark
> the real end.

CP/M used 128-byte block = 1 sector on floppy; Dan Pop has said it
used at least one larger size (maybe several?) on harddisks and I
believe him, but the CP/M system I used had no harddisk.

And 0x1A = (dec) 26 for EOF. From whence MS-DOS seems to have picked
it up, even though MS-DOS has and IIRC always had exact byte counts.
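
To illustrate the convention, here is a sketch of how a program might recover the logical length of text stored in whole blocks and padded after a 0x1A marker. The helper name `logical_text_length` and the in-memory buffer are assumptions for the example; real CP/M code would read the file record by record.

```c
#include <stddef.h>
#include <string.h>

#define CPM_EOF 0x1A  /* Ctrl-Z, decimal 26 */

/* Logical length of text stored in whole blocks: the number of bytes
   before the first 0x1A marker, or stored_len if no marker is present. */
size_t logical_text_length(const unsigned char *data, size_t stored_len)
{
    const unsigned char *mark = memchr(data, CPM_EOF, stored_len);

    return mark ? (size_t)(mark - data) : stored_len;
}
```

This only works for text. A binary file can legitimately contain 0x1A bytes, which is why such systems cannot report an exact size for binary data.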

RT-11 used 512-byte blocks (on everything), and I *think* the same
character but I don't remember for sure as TECO took care of that for
me (and PIP, but if I did DK:FOO=TT:/A it was so rare I've forgotten);
crosspost added for check.

> Some tape systems also can only write 512 byte blocks.

Including DECtape <G!>. Although you can still have labels or other
metadata that tells you how much padding to ignore.

- David.Thompson1 at worldnet.att.net
 
Brian Inglis

only on disk files -- skip to EOF is not good on other devices

*text* streams may have padding (CRs) or no carriage control (IBM VB
or DEC implied CR) from the POV of ftell()/fseek(), which I believe
are deprecated in favour of the more opaque fgetpos()/fsetpos();
fixed record length binary files should have no padding on most
systems; variable record length binary files may have padding on some
systems where the record metadata is stored with the file data

> Of course even in this case it could and probably would give you the
> size allocated, it's just that that's increased from the size written.

allocated => blocks / clusters
bytes stored on disk >= (| <=) bytes written to disk
 
glen herrmannsfeldt

Dave Thompson wrote:

(snip)
> What it's trying to say, but doesn't spell out well, is that ftell()
> of a binary stream, if it works at all, must return a byte count --
> and similarly fseek() of a binary stream if it works must accept a
> byte count, however much extra work the C runtime must do to deal with
> radically non-Unix-like files -- but ftell() of a text stream may
> return, and fseek() accept, a "cookie" on which arithmetic does not
> work, and (in this context) does not even resemble a file size
> measure; 7.19.9.4p2.
>
> As an extreme example, I think someone reliable posted a few months
> back (or maybe in c.s.c) that VMS C couldn't fit the necessary info in
> a long so it allocated memory space where it stored the RMS record
> info and returned the address of that space (on VAX all addresses were
> flat 32 bit, with a break at 2 up 31, and so fit in 32-bit long).

Previously in this thread, I had indicated that MVS and VM/CMS on
variable length block files, even opened in binary mode, return
32768*(block number)+(offset into block). Standard access methods
limit blocksize to less than 32768, but files can have more than
131071 blocks, especially if they are small.

I don't know how much work it is to come up with that. I don't
believe that the number of blocks is stored, though I am not sure
about that. (MVS keeps track of the number of tracks allocated, but
not the number of blocks on each track.)

-- glen
 
