Programming in standard C

Al Balmer

> Ugh, in-band error signalling. How do you tell the difference between
> a zero-length (or very large) file and an error?

Having an error condition equate to a zero-length file is workable when the
error is likely rare and unimportant.

But yes, all 1s is better, and in this case it won't clash with the
largest returnable size.
I missed the beginning of this thread, but why is "size" unsigned?
ftell returns long int, with -1L indicating failure. Your function
could return long int as well.
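A minimal sketch of such a function (the name is mine, not from the thread). Note the caveat: the C standard does not guarantee that fseek to SEEK_END is meaningful on a binary stream, so this is a common idiom rather than a strictly portable one.

```c
#include <stdio.h>

/* Sketch: report the size a stream would have if read from the start.
 * Returns -1L on failure, mirroring ftell's error convention.
 * Caveat: fseek(fp, 0L, SEEK_END) on a binary stream is not guaranteed
 * meaningful by the C standard, though it works on common platforms. */
long file_size(FILE *fp)
{
    long pos = ftell(fp);      /* remember the current position */
    long size;

    if (pos == -1L)
        return -1L;
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;
    size = ftell(fp);          /* byte offset of end-of-file */
    fseek(fp, pos, SEEK_SET);  /* restore the caller's position */
    return size;
}
```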
 
Gordon Burditt

> So which is the correct size of a partially compressed sparse file?

The file size being discussed was *the size of the file when it is
read into memory* (and you have to specify text or binary mode).
This is not the only "file size" definition, but it's the one
relevant to reading the whole file into memory. The size on disk
is irrelevant for this problem (note also that C provides no way
to get "free disk space", another term that is difficult to define
exactly). A sparse file might take more memory than has ever been
manufactured to read the whole thing in.

> The uncompressed size it would be if it was actually "full"?

The number of bytes when you read it in. This can change with time.
To accurately measure it, you open, read, and close it without any
intervening accesses by another program. Some systems let you
PREVENT such accesses with mandatory file locking.

> The compressed size, based on what's actually in it?

No. That doesn't affect the size of the file when you read it,
assuming you are talking about transparent compression.

> The size it currently occupies on the disk? If the latter, keep in
> mind that it bears absolutely no relationship to the actual number
> of data bytes in the file.

No. That doesn't affect the size of the file when you read it.

> So do tell, which is the "correct" size.

For this problem, the "correct" size is the size of the file when
you read it into memory in a particular mode (text or binary) and
at a particular time. To get a consistent value, you need to do all
your accesses in one sequence with no intervening accesses from
other programs.

There are plenty of other definitions of "file size" for other problems.
 
Bart C

Al said:
I missed the beginning of this thread, but why is "size" unsigned?
ftell returns long int, with -1L indicating failure. Your function
could return long int as well.

Probably my fault, using 12-year-old docs and source originally in non-C.

(Oh, and thanks for the OE-Quotefix link posted elsewhere. Usenet in
technicolor now..)

Bart
 
Bart C

Kelsey said:
[snips]

>> Taking the size of a rapidly changing file like that is asking for
>> problems. But they need not be serious. Ask the OS to copy that file
>> to a unique filename.
>
> This assumes you can. If the file is larger than available free
> space, how do you plan to manage this?

You write software that requires certain resources, and if they can't be
met then it fails. In this case you need enough space to duplicate this
file. Obviously that isn't ideal; then you have to look at other ways of
dealing with these files in a low-disk situation.

> And how does he know it's supposed to be static? Simple example: a
> text file viewer/editor. It's the user's call what file to use it
> on; how does the code know whether the file is supposed to be static?

Good point. In my little test of a slowly expanding file, one editor
denied me access to the file (perhaps a good idea, except this denial
could be intermittent), while one of my own editors let me view the file
so far. Probably wrong, but which approach is more useful to someone who
urgently needs to look at the file?

Actually editing the file would be problematic, but multiple write
access to such data files needs certain approaches, and that's probably
outside the scope of the C file functions we're talking about.

<snip lots of stuff about OS/compressed/sparse files>

You obviously know a lot about this, but I'd still like my simple
interface to the file system, where these details are hidden unless I
call the appropriate functions.

> Exactly what I'm saying: if "you" want a function that determines the
> size of a file, how about "you" define what "the size of a file"
> means. Oddly, the ones most insistent upon having such a function
> refuse to solve these issues.

I have my own ideas which are adequate most of the time, but you will likely
come up with some unusual scenario where these ideas will break down.

I notice MS Windows has a native function, GetFileSize, and I would be
happy to go along with whatever that means (although I use only C
functions). This returns the expanded size of a compressed file; there
is a separate function for the compressed size.

Bart
 
Ben Bacarisse

Kelsey Bjarnason said:
Actually, most systems simply won't let you write the billionth byte
unless you've already written all the bytes before it - meaning you have a
billion bytes on disk.

I don't know about most systems, but on my Linux box (and on most
Unix boxes that I remember using) this program:

#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rc = 0;
    if (argc == 2) {
        FILE *fp = fopen(argv[1], "wb");
        if (fp != NULL) {
            rc = fseek(fp, 100L * 1024 * 1024 - 1, SEEK_SET) == 0 &&
                 fputc(' ', fp) != EOF;
            if (fclose(fp) != 0)
                rc = 0;
        }
    }
    return rc ? EXIT_SUCCESS : EXIT_FAILURE;
}

makes a file with size 100M but which uses only 32 blocks. Obviously this
is very file-system specific, but I think your comment is overly general.
 
Kelsey Bjarnason

[snips]

> Good point. In my little test of a slowly expanding file, one editor
> denied me access to the file (perhaps a good idea, except this denial
> could be intermittent), while one of my own editors let me view the
> file so far. Probably wrong, but which approach is more useful to
> someone who urgently needs to look at the file?

Another editor monitors the file, notes it has changed and asks if you
want to reload. Lots of ways of dealing with this sort of thing, none of
them universally correct.
> I notice MS Windows has a native function, GetFileSize, and I would be
> happy to go along with whatever that means (although I use only C
> functions). This returns the expanded size of a compressed file; there
> is a separate function for the compressed size.

And there's an example of what I'm talking about. The discussion has been
about a C routine to determine file sizes, yet here there are two distinct
sizes, each perfectly correct - and depending on what you want to do with
the file, arguments can be made in favour of both sizes. Yet such a
function can, presumably, only return one - so which one?

The simple fact is there isn't a general solution to the problem, because
the problem simply cannot be solved in the general case. At best you can
make an arbitrary decision: "file size means bytes available to be read
when the file is opened in binary mode, using naive file functions,
treating the entire process as atomic and as if it were being performed at
time of determining size."

Yeah, that'll give you a size, and yeah, that size will work on a whole
lotta files, but it'll crap out on a whole lot, too.
 
Kelsey Bjarnason

> The file size being discussed was *the size of the file when it is
> read into memory* (and you have to specify text or binary mode).

If you can actually specify the mode, in many cases the size will be
different. Is the OS supposed to keep track of both and report back the
one you want? If we're defining a standard C routine to determine file
size, do we pass a parameter specifying which mode to use?
> This is not the only "file size" definition, but it's the one relevant
> to reading the whole file into memory.

It's at least two distinct sizes already.
> The number of bytes when you read it in. This can change with time. To
> accurately measure it, you open, read, and close it without any
> intervening accesses by another program. Some systems let you PREVENT
> such accesses with mandatory file locking.

Sure, and some don't, and in the context of C, you can't rely on such
measures existing, and even if they do, we're still talking a minimum of
two different sizes.
> For this problem, the size of the file when you read it into memory is
> the size of the file when you read it into memory in a particular mode
> (text or binary) and at a particular time.

So two distinct sizes, then. Again - which is the correct one? Which
will our file size function return? Should it have a parameter which lets
you specify? Is the size you determine _now_ the size of the file at the
time you read it? Several examples have been given where this won't be
the case.
> To get a consistent value, you need to do all your accesses in one
> sequence with no intervening accesses from other programs.

Sounds good - can you explain how to ensure this in standard C code? If
you can't, then whether you can determine the file size or not sorta
becomes irrelevant, as it may well change at the drop of a hat.
 
Kelsey Bjarnason

[snips]

> Sure, that's one of the nice and simple things about unices. Once you
> have a file handle, how could you tell the difference?

Mostly you can't.
> You say "file-like" again. What is the difference between "file
> manner" and "file-like manner"?

That's kinda my question. To me, a serial port is not a file, regardless
of how you access it. Nor is a video card or sound card. These are
devices, not files.
> Problem is, the term "file" is not well defined.
> Is a 1:1 copy of the entire contents of a hard drive a file, e.g. "dd
> if=/dev/hdd of=hardrive.backup"?

Er, you're creating a file, why wouldn't it be a file? :)
> Is an archive (e.g. zip-file, tar-file) a file or a filesystem? Does it
> automagically change its status if an implementation of open accepts
> something like "/home/me/archive.zip/folder/somefile"?

Make it simpler: a loopback. Say, for example, mounting an ISO image as
if it were a device. mount -o loop -t iso9660 media.iso /path/to/mount/at.

Obviously, you're dealing with a file. Then again, equally obviously,
you're not dealing with a device; you're dealing with a file mimicking a
device.

Perhaps the simplest way to differentiate is asking "can I duplicate
this?" No matter how many times you try to dupe your sound card, you're
only going to drive two speakers. No matter how many times you try to
dupe your hard drive, you're not going to turn a 10GB drive into a 1TB
drive. A file, by contrast, can be duplicated endlessly, with the result
that you do, in fact, have multiple distinct - and distinctly usable -
copies.

Granted, from a C perspective, you might not be able to tell the
difference, but C doesn't _quite_ encompass all reality. Yet. :)
 
Gordon Burditt

>> So which is the correct size of a partially compressed sparse file?
>
> If you can actually specify the mode, in many cases the size will be
> different. Is the OS supposed to keep track of both and report back the
> one you want? If we're defining a standard C routine to determine file
> size, do we pass a parameter specifying which mode to use?

Yes, you'd have to pass a parameter specifying which mode to use,
or open the file and let the system use the same mode as what you
opened it with. Or have two different functions, like filetextsize()
and filebinarysize(). On POSIX and Windows, stat() could be used
to provide filebinarysize(). On POSIX, filetextsize() is the same
as filebinarysize(). And in any case, you can obtain the required
value by opening the file in the appropriate mode, calling fgetc()
repeatedly, and counting the number of calls. I didn't say it would
be fast.
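A sketch of that count-by-reading approach (the function name is mine): fully portable, but the cost is proportional to the file size.

```c
#include <stdio.h>

/* Sketch: count the bytes a file yields when read in the given mode
 * ("r" for text, "rb" for binary). Returns -1L if the file can't be
 * opened. Portable, but it reads the entire file to get the answer. */
long file_size_by_reading(const char *path, const char *mode)
{
    FILE *fp = fopen(path, mode);
    long count = 0;

    if (fp == NULL)
        return -1L;
    while (fgetc(fp) != EOF)  /* one byte per call, in the chosen mode */
        count++;
    fclose(fp);
    return count;
}
```

In text mode on systems where text and binary differ (e.g. Windows), this returns the translated count, which is exactly the number relevant to reading the file into memory in that mode.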
> It's at least two distinct sizes already.

Which file mode is relevant to the file you intend reading into
memory? If you pass it an open FILE * handle, it should already
know what mode you are interested in (if it makes a difference).
If you pass it a file name, you'll need a mode also.
> Sure, and some don't, and in the context of C, you can't rely on such
> measures existing, and even if they do, we're still talking a minimum of
> two different sizes.


> So two distinct sizes, then. Again - which is the correct one?

If you are intending to read the file into memory, which mode do you
intend to use when reading it? That is the correct mode to use for
computing the size.

> Which will our file size function return? Should it have a parameter
> which lets you specify? Is the size you determine _now_ the size of
> the file at the time you read it? Several examples have been given
> where this won't be the case.

The function returns a value as of a particular time. What gets
nasty is when the function's accesses to the file are interleaved
with accesses by other program or programs unknown.
> Sounds good - can you explain how to ensure this in standard C code?

An *implementation* of a proposed function to add to standard C can
use non-standard hooks which standard C code can't (like file locking).

> If you can't, then whether you can determine the file size or not
> sorta becomes irrelevant, as it may well change at the drop of a hat.

You're right. Files can change size, and trying to get the file
size ahead of time, no matter how you define it, is a problem. You
also see the same problem when people ask "how can I find out if/how
many other programs have the file open?". That won't work either,
unless you can get the function to return an accurate value *AND
PREVENT IT FROM CHANGING* until the caller of the function lets go.
That tends to be a significant opening for a denial-of-service
attack by a buggy or malicious caller of the file size function.

There are other uses of the file size, such as comparing the output
size with the expected output size in a regression test. (Next step,
if the file sizes match, is to compare them).
 
David Thompson

> Yes, and furthermore you can still read from the file, or write to it,
> *AFTER* someone else deletes it (until you fclose() it), on UNIX or
> POSIX.

Real Unix (at least) doesn't truly delete the file (=inode), only the
(last) direntry for it. I'm not sure how closely a POSIX simulation or
wrapper on something else must, or does, track this.
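A small sketch of that behaviour (this relies on Unix/POSIX unlink semantics; standard C does not guarantee that an open stream survives removal of its name, and on some systems remove() on an open file simply fails):

```c
#include <stdio.h>
#include <string.h>

/* Sketch (POSIX semantics, not guaranteed by standard C): an open
 * stream keeps working after its last directory entry is removed;
 * the inode lives on until the stream is closed. Returns 1 if the
 * data was still readable after the name was removed. */
int read_after_delete(void)
{
    char buf[6] = {0};
    FILE *fp = fopen("unlink_demo.txt", "w+b");

    if (fp == NULL)
        return 0;
    fwrite("hello", 1, 5, fp);
    remove("unlink_demo.txt");  /* unlinks the name; fp stays valid on Unix */
    rewind(fp);
    fread(buf, 1, 5, fp);       /* still readable: inode not yet freed */
    fclose(fp);                 /* now the storage is reclaimed */
    return strcmp(buf, "hello") == 0;
}
```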
[Windows example of deleting failing on open file deleted.]

At least some versions of NFS in at least some configurations had(?)
the problem that a file can truly be deleted out from under an open,
causing real breakage.

Multics had files (or formally segments) mapped rather than opened; it
allowed you to truly delete, or just retract permission to, a mapped
file, invalidating the existing mappings so that subsequent attempts
to access it from the already-running process(es) would fail.

- formerly david.thompson1 || achar(64) || worldnet.att.net
 
David Thompson

> Nobody needs to modify the OS. But if those systems support C, they
> MUST support
>
>     FILE *f = fopen("foo", "a+");
>
> And they HAVE to know where the end of the file is somehow. I am
> amazed how you and the others just ignore the basic facts.
They have to know where the end of file is, but that doesn't
necessarily mean knowing the size of the file. I once worked on a
(custom) system where files were stored as double-linked lists of
sectors, with only head and tail pointers in the directory. You could
add to _or take (read&delete) from_ the end, _or the beginning_,
_only_; full editing used a variant of the TECO/Emacs 'buffer gap'
technique where you:
- open oldfile at beginning and create empty newfile open at end;
- to move forward, read&delete from oldfile and append to newfile;
- to insert just append to newfile;
- to delete just read&delete without writing;
- to replace should be obvious at this point;
- to move backward read/delete from (end of) newfile and write
'before' beginning of oldfile;
- repeat until positioned at end, with oldfile empty; then delete
oldfile and keep (i.e. catalog) newfile.
> In C, any file is conceptually a sequence of bytes. Some file systems
> do not support this well. But if they support it, THEN they must
> ALREADY support this abstraction, so that filesize wouldn't mean any
> effort.
A sequence, but not a vector. C files needn't be randomly or directly
addressable: fseek() can fail (and in text mode can use other than
byte positions); fsetpos() too. On some files, like magtape or serial
(or pseudo) or pipe, they can't work and thus must fail.

Disk file systems generally do support direct positioning, because
that was originally one of the main benefits of having disks and disk
files. But they don't inherently require it, and neither does C.

- formerly david.thompson1 || achar(64) || worldnet.att.net
 
Kelsey Bjarnason

[snips]

> Yes, you'd have to pass a parameter specifying which mode to use,
> or open the file and let the system use the same mode as what you
> opened it with. Or have two different functions, like filetextsize()
> and filebinarysize().

Which means the OS, when writing a block of data, no longer has to merely
write it, but parse it - look for any embedded characters which would be
translated into greater or lesser sequences, and record that value as
well. I suspect this is going to have an impact on performance - assuming
you can get 'em to do it at all.

Alternatively, the function itself could do the job, by opening the file
and reading the file, in the appropriate mode, beginning to end. Can you
say performance hit?
> If you are intending to read the file into memory, which mode do you
> intend to use when reading it? That is the correct mode to use for
> computing the size.

This assumes I will only ever read the file in one mode, or determine
size by reading the file, bytewise, at the time of determining the size.
The former isn't reliable, the latter is hellishly inefficient.
> The function returns a value as of a particular time.

Yes, but again - which value?
> There are other uses of the file size, such as comparing the output size
> with the expected output size in a regression test.

Sure. Now, again, *which* file size? Determined *how*?
 
Gordon Burditt

>> Yes, you'd have to pass a parameter specifying which mode to use,
>
> Which means the OS, when writing a block of data, no longer has to
> merely write it, but parse it - look for any embedded characters which
> would be translated into greater or lesser sequences, and record that
> value as well. I suspect this is going to have an impact on
> performance - assuming you can get 'em to do it at all.

I did not say the OS has to have either or both of the sizes
precalculated. If the result impacts performance more than reading
the whole file and counting bytes, taking into account things like
how often file sizes of either type are needed and how often writes
are done, then someone made a poor decision - a pessimization.

I consider the text-mode file size to be needed insufficiently often
to make it worth caching. Your opinion may differ.

POSIX happens to keep *both* precalculated, since there's no
difference between binary and text mode. Windows keeps the binary-mode
size precalculated. Thus, performance for getting the text-mode
size may suck significantly more than getting the binary-mode size
on Windows.
> Alternatively, the function itself could do the job, by opening the file
> and reading the file, in the appropriate mode, beginning to end. Can you
> say performance hit?

If you don't need the correct answer, you can do it in zero bytes
and zero time. But if the performance hit is so bad, maybe you
shouldn't use a method that needs a precalculated file size. That
approach of reading the file in chunks (this is to read it into
memory, NOT precalculate the size) and realloc()ing when needed
(say, doubling each time, with fallback if you run out of memory)
is starting to look more and more efficient all the time, even with
the copying (if any).
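That chunked-read-with-doubling strategy can be sketched as follows (the function name is mine; error handling is simplified):

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: read an entire stream into memory without knowing its size
 * in advance. The buffer capacity doubles whenever it fills, so the
 * total copying cost stays linear in the file size. Caller frees the
 * result; NULL is returned on allocation or read failure. */
char *read_whole_file(FILE *fp, size_t *len)
{
    size_t cap = 4096, used = 0;
    char *buf = malloc(cap);

    if (buf == NULL)
        return NULL;
    for (;;) {
        used += fread(buf + used, 1, cap - used, fp);
        if (used < cap)                     /* short read: EOF or error */
            break;
        char *tmp = realloc(buf, cap * 2);  /* buffer full: double it */
        if (tmp == NULL) {
            free(buf);
            return NULL;
        }
        buf = tmp;
        cap *= 2;
    }
    if (ferror(fp)) {
        free(buf);
        return NULL;
    }
    *len = used;
    return buf;
}
```

No precalculated file size is needed, which sidesteps the whole which-size-is-correct question for this use case.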

While we're at it, how about revisiting the strategy of reading the
entire file into memory? Is it really a good idea? If the file is
large, you may force parts of this program or other programs to page
out. Slow. Now, depending on what you are doing with the file, reading
it in chunks might be worse. Or better. If you're just dumping the
file in hex, reading chunks at a time lets your program run in much
less memory, and makes it work on files MUCH larger than what you can
fit in memory.
> This assumes I will only ever read the file in one mode, or determine
> size by reading the file, bytewise, at the time of determining the size.
> The former isn't reliable, the latter is hellishly inefficient.

Each time you read the file into memory, you read it in *one* mode,
I hope (no switching in the middle of the file). When you want the
file size for that buffer, you read it in that one mode. How you
read it last time or will read it next time is irrelevant.

You made a bad decision, performance-wise, to use a precalculated
file length, especially in text mode if the OS doesn't keep the
value handy and text mode != binary mode. Stick with that decision,
and performance is going to suck.

If you *must* have a precalculated value, have the OS save the one
involving the same mode that the file was written in (and which
kind it is). My guess is that this will cover at least 80% of the
times that file size is needed for the purpose of reading the file
into memory.
> Yes, but again - which value?

It returns the one associated with the mode you intend to use to read
the file into memory. You have to make up your mind which mode to use
before you start reading. Use the same decision when you determine
the file size.
> Sure. Now, again, *which* file size? Determined *how*?

The size associated with the mode the file was written in, if your
application knows what that is (and no, I don't expect the OS to
keep track of it). It's up to your application to know what mode
to open its own files in. Either it knows from what prompt was
answered (e.g. text editors always do text files; graphics editors
always do binary files), or the file extension, or it asks the user,
or it just handles generic files and can do everything in binary
mode.

(This assumes that the reference "correct" output was generated on
THIS system or was converted to the local file format. If it wasn't,
well, size comparisons may be totally worthless). Since here,
you're using file size as a shortcut for comparing the files for
equality to quickly find a mismatch, you can dispense with the step
entirely and proceed to reading the files byte-by-byte and comparing
them if finding the size is a performance bottleneck.
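That byte-by-byte comparison needs no size lookup at all; a sketch (the function name is mine):

```c
#include <stdio.h>

/* Sketch: compare two streams byte by byte, stopping at the first
 * mismatch. Returns 1 if they yield identical bytes, 0 otherwise.
 * No file size needed: a length difference shows up as one stream
 * hitting EOF while the other still has data. */
int streams_equal(FILE *a, FILE *b)
{
    int ca, cb;

    do {
        ca = fgetc(a);
        cb = fgetc(b);
        if (ca != cb)
            return 0;
    } while (ca != EOF);   /* ca == cb == EOF: both ended together */
    return 1;
}
```

Open both streams in the same mode (text or binary) so the comparison matches the intended use of the files.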
 
Kelsey Bjarnason

[snips]

> The size associated with the mode the file was written in, if your
> application knows what that is

I see; you're one of those folks under the impression only one application
is ever allowed to be used on a file. I think we can dispense with any
further discussion, as your beliefs and reality have no bearing on each
other.

In the real world, we're still left with the questions: what size,
determined how, and with what performance penalty?

Apparently, a file size function is perfectly acceptable if it returns
multiple distinct values for the same file (even unmodified) with runtimes
ranging from, oh, a millisecond to, say, 10 minutes or more.

Sorry, not gonna work out here in reality.
 
Eric Sosman

Kelsey said:
I see; you're one of those folks under the impression only one application
is ever allowed to be used on a file. I think we can dispense with any
further discussion, as your beliefs and reality have no bearing on each
other.

The issue is that the behavior of a file being used by
more than one program is well outside the scope of a language
standard. Different operating environments do in fact define
different semantics (and even define "used" differently), and
it is not the proper purview of the C Standard to try to
dictate the environments in which it is used.

Some other languages have different and more limiting
goals. Java, for example, is quite the martinet and dictates
everything about the operating environment that it thinks it
can get away with; this is both a strength and a weakness.
S/370 assembly language is rather definite about the sizes and
representations of its data types: it relieves its users of
worrying about trap representations, but it doesn't interoperate
very well with MacOS. C's "I'll run anywhere" permissive approach
comes at a cost in specificity, but has proven to be of wide
applicability. Learn to appreciate the trade-off -- or learn
to hate it, and turn to languages more suited to your temperament.
 
Rui Maciel

jacob said:
What is important is that we can use a portable function in MOST file
systems. I am tired of this levelling through the worst that is going on
here.

This insistence of getting the WORST system then FORCING all other
people to adapt themselves to that pile of sh...!

In that case I do not believe there is a problem at all. If you are willing
to abandon the support for some platforms in order to get portability, then
you could adopt a solution where portability is accomplished if the target
platform respects certain standards' requirements regarding portability. We
already have that in POSIX.

Moreover, I believe it would be unwise to drop a programming language's
support for some platforms just because they work in a different way.


Rui Maciel
 
Gordon Burditt

>> Sure. Now, again, *which* file size? Determined *how*?
>
> I see; you're one of those folks under the impression only one
> application is ever allowed to be used on a file.

Absolutely not, and I think that idea is downright silly.

If you are comparing the results of a test run on this system vs.
a reference correct output, then you know what format it is in,
because the test IS PART OF YOUR OWN APPLICATION that does
the regression testing.

> I think we can dispense with any further discussion, as your beliefs
> and reality have no bearing on each other.
>
> In the real world, we're still left with the questions: what size,
> determined how, and with what performance penalty?

This is determined entirely by what use you are putting a size result
to. You need a definition of the size you are trying to get, and
there are at least dozens of such definitions, probably thousands.
And they may not be portable across systems (e.g. "the number of
bytes the file occupies on disk, including the directory and the
inode" doesn't make a whole lot of sense on a filesystem that doesn't
have inodes).

If you're going to read the file into memory, the size you want is
the size it's going to take to read it into memory in the particular
mode you intend to use to read it. The space occupied ON DISK is
irrelevant. If you are trying to determine the size occupied on
disk to determine if it will fit, that's a whole different definition
of size.
> Apparently, a file size function is perfectly acceptable if it returns
> multiple distinct values for the same file (even unmodified) with
> runtimes ranging from, oh, a millisecond to, say, 10 minutes or more.

C has two distinct modes in which it can read the same file, and
they don't necessarily yield the same number of bytes for the same
unmodified file. That's an issue that the C standard created: it
is not a decision that the writer of a file size function gets to
change. That's reality. And remember, I suggested using two
different functions for it, filetextsize() and filebinarysize().

The C standard also specifies characteristics of the two modes. In
many cases, those characteristics pretty much demand that an
application use a particular mode. If you want arbitrary bytes in
sequence with nobody messing them up (e.g. deleting \r's on Windows),
use binary. If your output consists of lines and characters that
are intended to be displayed or printed and interface with tools
like text editors, use text. The line between text and binary files
is not a 100% division (some files could be either type), but it's
still a useful distinction. Most output files can be classified
as text or binary, based on the format and how it can be used at
an application level.
> Sorry, not gonna work out here in reality.

Reality is that C has two file modes, and files may have different
lengths when read in different modes. Reality is that performance
can vary drastically depending on what the implementor decided was
worth optimizing. (It is my opinion that the text file size is NOT
used often enough to justify maintaining a count on the fly as the
file is modified; that's a performance killer, not an optimization.
If calculating file size is such a performance problem, consider
using a method that does NOT calculate file size in advance.)

Reality is that there is not one file size but many, and that you
have to use a definition that matches your intended use in order
to get sensible results. Reality is that files change size - there
are even tools called editors that exist to do that.
 
