ftell() arithmetic vs. text files read as binary

  • Thread starter Hallvard B Furuseth

Hallvard B Furuseth

I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

In particular, does anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position, and one can
do arithmetic on ftell() positions within one line. I.e.:

- getc() adds 1 to the ftell() position, except possibly at
the end of a line and EOF.

- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

Or for binary mode FILE*s,

* getc() data looks like it does from a text mode FILE*, except:

- lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
(Fails for fixed-size line records, I know. Or lines stored
as <length, contents>, if there are such files around.)

- files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.
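
One way to check the first text-mode assumption on a particular system and file is a small probe. The following is only a sketch: the name count_nonunit_steps is hypothetical, and it assumes ftell() works on the stream at all. It counts the characters inside lines for which getc() did not move the ftell() position by exactly 1.

```c
#include <assert.h>
#include <stdio.h>

/* Count the non-newline characters for which getc() did not advance
 * the ftell() position by exactly 1.  Returns -1 if ftell() fails. */
long count_nonunit_steps(FILE *fp)
{
    long bad = 0;
    long before, after;
    int c;

    assert(fp != NULL);
    rewind(fp);
    before = ftell(fp);
    if (before == -1L)
        return -1;
    while ((c = getc(fp)) != EOF) {
        after = ftell(fp);
        if (after == -1L)
            return -1;
        if (c != '\n' && after - before != 1)
            bad++;              /* assumption violated inside a line */
        before = after;
    }
    return bad;
}
```

A result of 0 only says the assumption held for that file on that system; it proves nothing about other platforms.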
 

Eric Sosman

Hallvard said:
I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

I can (dimly) recall some OpenVMS file formats that may have
violated some of your assumptions. Not too surprising: OpenVMS
had seven basic file formats, with variations -- and that was
just for the sequential file organization, never mind the others
that departed even further from C's I/O model. Text files would
almost always be sequential, though, so the other organizations
can probably be ignored.

Whether this affects the portability of your program depends
on the likelihood that you'll need to get it running on VMS. If
that likelihood is zero, then ...
In particular, does anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position, and one can
do arithmetic on ftell() positions within one line. I.e.:

- getc() adds 1 to the ftell() position, except possibly at
the end of a line and EOF.

ISTR that on at least some VMS file formats, fseek() could
only position to the start of a line ("record") and hence ftell()
would return the same value all through a single line. This was
back in the pre-Standard days, though, and since this behavior
doesn't meet the requirements of the Standard (or so I believe),
it may have been fixed sometime in the many intervening years.
(Of course, the fix may simply have been a documentation change:
"Don't use XYZ format with C programs.")
- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

Or for binary mode FILE*s,

* getc() data looks like it does from a text mode FILE*, except:

- lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
(Fails for fixed-size line records, I know. Or lines stored
as <length, contents>, if there are such files around.)

The VAR file format was <length, contents> or <length, contents,
padding byte> to make an even total. I think the padding byte was
always a zero, but I don't remember whether that was guaranteed or
just "usual practice."

The VFC format was weirder: <length, prefix, contents> or
<length, prefix, contents, padding byte>. The "prefix" portion was
of fixed length (usually two bytes), and indicated "carriage control"
to be applied before and after "printing" the line: single-advance,
double-advance, skip to new page, and so on. On text-mode input,
the C library translated these by synthesizing LF's and FF's and
such before and after the "payload" of the line.

If you read any of these things in binary mode, you'd get the
raw, uninterpreted data: length, prefix, payload, and padding, as
one undifferentiated stream of bytes.
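
To make the binary-mode problem concrete, here is a minimal sketch that turns a buffer of <length, contents, padding byte> records into newline-terminated text. The layout details are assumptions for illustration, not a verified description of the VMS format: a 2-byte little-endian length that counts only the contents, and one padding byte after odd-length contents. The name var_to_text is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Convert assumed <length, contents[, pad]> records to text lines.
 * Returns the number of bytes written to out, or (size_t)-1 if the
 * input is malformed or out is too small. */
size_t var_to_text(const unsigned char *in, size_t n, char *out, size_t cap)
{
    size_t pos = 0, used = 0;

    assert(in != NULL && out != NULL);
    while (pos < n) {
        size_t len;

        if (n - pos < 2)
            return (size_t)-1;          /* truncated length field */
        len = in[pos] | ((size_t)in[pos + 1] << 8);
        pos += 2;
        if (len > n - pos || len + 1 > cap - used)
            return (size_t)-1;          /* record or output overflow */
        memcpy(out + used, in + pos, len);
        used += len;
        out[used++] = '\n';             /* synthesize the line end */
        pos += len;
        if ((len & 1) && pos < n)
            pos++;                      /* skip the assumed padding byte */
    }
    return used;
}
```

A program that scans such a buffer for CR or LF would find none, which is why the binary-mode line-ending assumption fails for this kind of file.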
- files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.

You might want to make that "an unsigned byte < 32."
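
The reason, sketched in code: where plain char is signed, bytes of 128 and above compare as negative and would pass a naive "< 32" test. The name find_end_marker is hypothetical, and the C locale is assumed for isspace().

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Return the index of the first byte that is below 32 and is not
 * whitespace, or n if there is none.  Each byte is converted to
 * unsigned char before it is compared or passed to isspace(). */
size_t find_end_marker(const char *buf, size_t n)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < n; i++) {
        unsigned char u = (unsigned char)buf[i];

        if (u < 32 && !isspace(u))
            return i;
    }
    return n;
}
```

Without the conversion, a Latin-1 letter such as 0xE5 could be taken for the end of the file on a signed-char platform.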
 

Random832

2006-11-20 said:
I'm trying to clean up a program which does arithmetic on text
file positions, and also reads text files in binary mode. I
can't easily get rid of it all, so I'm wondering which of the
following assumptions are, well, least unportable.

In particular, does anyone know if there are real-life systems
where the text file assumptions below don't hold?

For text mode FILE*s,

* input lines will be ordered by ftell() position,

and one can do arithmetic on ftell() positions within one line.

one can _do_ arithmetic, perhaps... one isn't guaranteed to get
meaningful results, particularly with multibyte streams.
- getc() adds 1 to the ftell() position, except possibly at
the end of a line and EOF.

Multibytes again
- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

If the file is record-oriented, it could plausibly instead bump it to
the next multiple of an arbitrarily large power of two [say, record
number and offset are separate fields]
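
A sketch of what such an encoding could look like. This is purely hypothetical, not taken from any real C library: the position packs a record number above a 16-bit in-record offset.

```c
#include <assert.h>
#include <limits.h>

#define OFFSET_BITS  16                    /* hypothetical field width */
#define OFFSET_LIMIT (1L << OFFSET_BITS)

/* Build a position value from a record number and an in-record offset. */
long make_pos(long record, long offset)
{
    assert(offset >= 0 && offset < OFFSET_LIMIT);
    assert(record >= 0 && record <= LONG_MAX / OFFSET_LIMIT - 1);
    return record * OFFSET_LIMIT + offset;
}

/* Split a position value back into its two fields. */
long pos_record(long pos) { assert(pos >= 0); return pos / OFFSET_LIMIT; }
long pos_offset(long pos) { assert(pos >= 0); return pos % OFFSET_LIMIT; }
```

With this scheme, positions within one record still differ by 1 per character, but going from offset 11 of record 0 to the start of record 1 jumps the value by 65525, which is hardly a "small positive number".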
Or for binary mode FILE*s,

* getc() data looks like it does from a text mode FILE*, except:

- lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
(Fails for fixed-size line records, I know. Or lines stored
as <length, contents>, if there are such files around.)

- files end at EOF or with ^Z (yuck). Or maybe that should be
"a byte < 32 for which isspace()==0". I can assume ASCII or
a superset, otherwise the file must be preprocessed anyway.

Don't forget the extra zero-padding permitted at the end of binary files
(for systems where native file size is stored in units > 1 byte)
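
A minimal sketch of coping with that, assuming the file has already been read into memory and that a text file never legitimately ends in NUL bytes. The name trim_zero_padding is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Return the length of buf with trailing zero bytes dropped. */
size_t trim_zero_padding(const unsigned char *buf, size_t n)
{
    assert(buf != NULL || n == 0);
    while (n > 0 && buf[n - 1] == 0)
        n--;
    return n;
}
```

This cannot tell padding from genuine trailing zero bytes, so it is only reasonable for data that is known to be text.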
 

Hallvard B Furuseth

Eric said:
I can (dimly) recall some OpenVMS file formats that may have
violated some of your assumptions. Not too surprising: OpenVMS
had seven basic file formats, with variations -- and that was
just for the sequential file organization, never mind the others
that departed even further from C's I/O model. Text files would
almost always be sequential, though, so the other organizations
can probably be ignored.

Sounds interesting, I'll see if I can dig out some more info about that.
Whether this affects the portability of your program depends
on the likelihood that you'll need to get it running on VMS. If
that likelihood is zero, then ...

Low, but it's not unlikely that the program will meet _some_ esoteric
system. And what one system can do, others can do as well.

I think I'll downgrade my expectations a bit and instead ask:

Am I likely to encounter a system where accessing a text file in binary
mode will give me fewer headaches than ftell() arithmetic on a line in
a text-mode FILE*? I'm not about to support things like <length,
contents, padding> anyway. Sounds like the binary formats that will
break my "text mode assumptions" will break just as badly in binary
mode, which is a relief in a way:)

In any case, I guess a user option which makes the program read the
file as a text file and save it to a tmpfile() would be a good idea.
Then it'll be the user's worry instead of mine...
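
Roughly what that option could look like, as a sketch only. The name copy_to_tmpfile is hypothetical, and the input is assumed to be already open in text mode. Since tmpfile() opens a binary stream ("wb+"), ftell() on the copy is a plain character count and arithmetic on it is well defined.

```c
#include <assert.h>
#include <stdio.h>

/* Copy everything readable from in to a new temporary file.
 * Returns the temporary stream rewound to its start, or NULL on error. */
FILE *copy_to_tmpfile(FILE *in)
{
    FILE *tmp;
    int c;

    assert(in != NULL);
    tmp = tmpfile();
    if (tmp == NULL)
        return NULL;
    while ((c = getc(in)) != EOF) {
        if (putc(c, tmp) == EOF) {
            fclose(tmp);
            return NULL;
        }
    }
    if (ferror(in) || fflush(tmp) == EOF) {
        fclose(tmp);
        return NULL;
    }
    rewind(tmp);
    return tmp;
}
```

The cost is one extra pass over the file and the temporary storage.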
ISTR that on at least some VMS file formats, fseek() could
only position to the start of a line ("record") and hence ftell()
would return the same value all through a single line. This was
back in the pre-Standard days, though, and since this behavior
doesn't meet the requirements of the Standard (or so I believe),

Correct. fgetc() "advances the associated file position indicator" in
both C89 and C99.
it may have been fixed sometime in the many intervening years.
(Of course, the fix may simply have been a documentation change:
"Don't use XYZ format with C programs.")

You might want to make that "an unsigned byte < 32."

Good point. But I think I'm currently hoping to drop binary mode and
stay with ftell() in text mode.
 

Hallvard B Furuseth

Random832 said:
one can _do_ arithmetic, perhaps... one isn't guaranteed to get
meaningful results, particularly with multibyte streams.

As far as I know, streams are not multibyte unless I make them so.
C99 7.19.2p4 says: "Once a wide character input/output function has
been applied to a stream without orientation, the stream becomes a
wide-oriented stream."

It's a fair point, though: such a program can't be extended to handle
wide-oriented streams.
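
For what it's worth, the orientation can be queried with fwide(), which leaves it unchanged when passed a mode of 0. A small sketch; the wrapper name stream_orientation is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Returns a negative value for a byte-oriented stream, a positive
 * value for a wide-oriented one, and 0 if no orientation is set yet. */
int stream_orientation(FILE *fp)
{
    assert(fp != NULL);
    return fwide(fp, 0);        /* mode 0 queries without changing */
}
```

A freshly opened stream should report 0, and the first getc() or putc() on it makes it byte-oriented.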
- at the end of a line, getc() increments the position with a
small positive number. (Or moderately small, if the file
consists of fixed-size space-padded line records.)

If the file is record-oriented, it could plausibly instead bump it to
the next multiple of an arbitrarily large power of two [say, record
number and offset are separate fields]

True. I don't know of an example though?
Don't forget the extra zero-padding permitted at the end of binary
files (for systems where native file size is stored in units > 1 byte)

Good point.
 

Eric Sosman

Hallvard B Furuseth wrote On 11/20/06 13:18,:
Am I likely to encounter a system where accessing a text file in binary
mode will give me fewer headaches than ftell() arithmetic on a line in
a text-mode FILE*?

My (unscientific) feeling is that text files should be
read in text mode, to take advantage of whatever format
translation the system may need. But much depends on how
the program (ab)uses the ftell() arithmetic.

Can you offer some examples of the kinds of ftell()
arithmetic the program engages in? Are the jumps "short"
(intra-line) or "long" (inter-line)? Frequent or occasional?
 

Hallvard B Furuseth

Eric said:
Hallvard B Furuseth wrote On 11/20/06 13:18,:

My (unscientific) feeling is that text files should be
read in text mode, to take advantage of whatever format
translation the system may need. But much depends on how
the program (ab)uses the ftell() arithmetic.

Can you offer some examples of the kinds of ftell()
arithmetic the program engages in? Are the jumps "short"
(intra-line) or "long" (inter-line)? Frequent or occasional?

Frankly I'm not entirely sure yet, but I think it can be reduced to
something like:
    Walk through the file and save info about each character, with
    index (ftell() position of line + character's index in line).
    Next,
        for (i = 0; i < {max ftell() position}; i++)
            if (there is a character #i)
                handle(getc(stream));
I suppose that for loop can be changed to read line by line, but
that change looks a bit messy.

There are some ugly cases like fseek(arbitrary position) as well,
but I think they can be eliminated without too much fuss.
 

Eric Sosman

Hallvard B Furuseth wrote On 11/21/06 05:52,:
Frankly I'm not entirely sure yet, but I think it can be reduced to
something like:
    Walk through the file and save info about each character, with
    index (ftell() position of line + character's index in line).
    Next,
        for (i = 0; i < {max ftell() position}; i++)
            if (there is a character #i)
                handle(getc(stream));
I suppose that for loop can be changed to read line by line, but
that change looks a bit messy.

The "walk through," I guess, is probably line by line?
(If it were character by character you could forget about
saving the intra-line index and just save each character's
ftell() position, then fseek() back to it. That would make
everything legitimate except the "max ftell() position"
calculation, which isn't guaranteed to make sense but very
likely will.)
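
A sketch of that character-by-character variant, under the assumption that the file is small enough to keep one long per character. The names record_positions and char_at are hypothetical. Every fseek() target is a value previously returned by ftell() on the same stream, which is what the Standard allows for text streams.

```c
#include <assert.h>
#include <stdio.h>

/* Store the ftell() position of every character in pos[0..cap-1].
 * Returns the number of characters, or -1 on error or overflow. */
long record_positions(FILE *fp, long *pos, long cap)
{
    long n = 0;

    assert(fp != NULL && pos != NULL);
    rewind(fp);
    for (;;) {
        long here = ftell(fp);

        if (here == -1L)
            return -1;
        if (getc(fp) == EOF)
            break;
        if (n == cap)
            return -1;          /* more characters than pos[] can hold */
        pos[n++] = here;
    }
    return n;
}

/* Re-read character number i by seeking to its saved position. */
int char_at(FILE *fp, const long *pos, long i)
{
    assert(fp != NULL && pos != NULL && i >= 0);
    if (fseek(fp, pos[i], SEEK_SET) != 0)
        return EOF;
    return getc(fp);
}
```

No arithmetic is done on the saved values at all; the character index i takes over the role of the computed position.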

But it looks like the arithmetic on ftell() values is
strictly within a line, right? That is, the loop looks
more like

    for (i = 0; i < max; i++) {
        if (something_about_position(i)) {
            fseek(stream, ftellpos + offset, SEEK_SET);
            ch = getc(stream);
            ...
        }
    }

If that's it, you may be out of the woods. Most crudely:

    for (i = 0; i < max; i++) {
        if (something_about_position(i)) {
            fseek(stream, ftellpos, SEEK_SET);
            for (j = 0; j < offset; j++)
                (void)getc(stream);
            ch = getc(stream);
            ...
        }
    }

A slightly fancier version would remember what line it
was in and what the previous offset was, to avoid seeking
over and over again to the start of the same line and
getc()'ing past longer and longer prefixes.
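
One possible shape for the fancier version, as a sketch only. The struct and function names are hypothetical, and line_pos must be a value that ftell() returned at the start of a line. It seeks only when the request is for a different line or for an earlier offset than the stream has already passed.

```c
#include <assert.h>
#include <stdio.h>

struct line_cursor {
    FILE *fp;
    long line_pos;  /* ftell() value of the current line, -1 if none */
    long next_off;  /* in-line offset of the next getc() result */
};

/* Return the character at the given offset within the line that starts
 * at line_pos, or EOF on failure.  Re-seeks only when it has to. */
int cursor_getc(struct line_cursor *cur, long line_pos, long offset)
{
    int c;

    assert(cur != NULL && cur->fp != NULL && offset >= 0);
    if (cur->line_pos != line_pos || offset < cur->next_off) {
        if (fseek(cur->fp, line_pos, SEEK_SET) != 0)
            return EOF;
        cur->line_pos = line_pos;
        cur->next_off = 0;
    }
    while (cur->next_off < offset) {    /* skip forward by reading */
        if (getc(cur->fp) == EOF) {
            cur->line_pos = -1;         /* position is now unknown */
            return EOF;
        }
        cur->next_off++;
    }
    c = getc(cur->fp);
    if (c == EOF) {
        cur->line_pos = -1;
        return EOF;
    }
    cur->next_off++;
    return c;
}
```

Start with line_pos set to -1 so that the first call always seeks. When requests arrive in increasing order within a line, each character is read once and no seek is repeated.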
There are some ugly cases like fseek(arbitrary position) as well,
but I think they can be eliminated without too much fuss.

Good luck!
 
