ftell() arithmetic vs. text files read as binary

Discussion in 'C Programming' started by Hallvard B Furuseth, Nov 20, 2006.

  1. I'm trying to clean up a program which does arithmetic on text
    file positions, and also reads text files in binary mode. I
    can't easily get rid of it all, so I'm wondering which of the
    following assumptions are, well, least unportable.

    In particular, does anyone know if there are real-life systems
    where the text file assumptions below don't hold?

    For text mode FILE*s,

    * input lines will be ordered by ftell() position, and one can
    do arithmetic on ftell() positions within one line. I.e.:

    - getc() adds 1 to the ftell() position, except possibly at
    the end of a line and EOF.

    - at the end of a line, getc() increments the position with a
    small positive number. (Or moderately small, if the file
    consists of fixed-size space-padded line records.)

    Or for binary mode FILE*s,

    * getc() data looks like it does from a text mode FILE*, except:

    - lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
    (Fails for fixed-size line records, I know. Or lines stored
    as <length, contents>, if there are such files around.)

    - files end at EOF or with ^Z (yuck). Or maybe that should be
    "a byte < 32 for which isspace()==0". I can assume ASCII or
    a superset, otherwise the file must be preprocessed anyway.
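
    The first text-mode assumption can be probed empirically. Below is a minimal sketch, assuming a stream opened in text mode; the name count_irregular_steps is hypothetical. A result of 0 only shows that the assumption held for one file on one system, not that it is portable.

```c
#include <assert.h>
#include <stdio.h>

/* Count getc() calls that returned a non-newline character but did not
 * advance ftell() by exactly 1. Returns -1 if ftell() fails.
 * A result of 0 only says the assumption held for this file on this system. */
static long count_irregular_steps(FILE *fp)
{
    long irregular = 0;
    long before, after;
    int c;

    assert(fp != NULL);
    before = ftell(fp);
    if (before == -1L)
        return -1;
    while ((c = getc(fp)) != EOF) {
        after = ftell(fp);
        if (after == -1L)
            return -1;
        if (c != '\n' && after - before != 1)
            irregular++;          /* mid-line step that was not +1 */
        before = after;
    }
    return irregular;
}
```

    Calling it on a freshly opened text stream and logging a nonzero result would flag the systems where the arithmetic cannot be trusted.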

    --
    Hallvard
     
    Hallvard B Furuseth, Nov 20, 2006
    #1

  2. Eric Sosman

    Hallvard B Furuseth wrote:

    > I'm trying to clean up a program which does arithmetic on text
    > file positions, and also reads text files in binary mode. I
    > can't easily get rid of it all, so I'm wondering which of the
    > following assumptions are, well, least unportable.


    I can (dimly) recall some OpenVMS file formats that may have
    violated some of your assumptions. Not too surprising: OpenVMS
    had seven basic file formats, with variations -- and that was
    just for the sequential file organization, never mind the others
    that departed even further from C's I/O model. Text files would
    almost always be sequential, though, so the other organizations
    can probably be ignored.

    Whether this affects the portability of your program depends
    on the likelihood that you'll need to get it running on VMS. If
    that likelihood is zero, then ...

    > In particular, does anyone know if there are real-life systems
    > where the text file assumptions below don't hold?
    >
    > For text mode FILE*s,
    >
    > * input lines will be ordered by ftell() position, and one can
    > do arithmetic on ftell() positions within one line. I.e.:
    >
    > - getc() adds 1 to the ftell() position, except possibly at
    > the end of a line and EOF.


    ISTR that on at least some VMS file formats, fseek() could
    only position to the start of a line ("record") and hence ftell()
    would return the same value all through a single line. This was
    back in the pre-Standard days, though, and since this behavior
    doesn't meet the requirements of the Standard (or so I believe),
    it may have been fixed sometime in the many intervening years.
    (Of course, the fix may simply have been a documentation change:
    "Don't use XYZ format with C programs.")

    > - at the end of a line, getc() increments the position with a
    > small positive number. (Or moderately small, if the file
    > consists of fixed-size space-padded line records.)
    >
    > Or for binary mode FILE*s,
    >
    > * getc() data looks like it does from a text mode FILE*, except:
    >
    > - lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
    > (Fails for fixed-size line records, I know. Or lines stored
    > as <length, contents>, if there are such files around.)


    The VAR file format was <length, contents> or <length, contents,
    padding byte> to make an even total. I think the padding byte was
    always a zero, but I don't remember whether that was guaranteed or
    just "usual practice."

    The VFC format was weirder: <length, prefix, contents> or
    <length, prefix, contents, padding byte>. The "prefix" portion was
    of fixed length (usually two bytes), and indicated "carriage control"
    to be applied before and after "printing" the line: single-advance,
    double-advance, skip to new page, and so on. On text-mode input,
    the C library translated these by synthesizing LF's and FF's and
    such before and after the "payload" of the line.

    If you read any of these things in binary mode, you'd get the
    raw, uninterpreted data: length, prefix, payload, and padding, as
    one undifferentiated stream of bytes.
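
    To illustrate what a binary-mode reader would have to do with <length, contents, padding> data, here is a rough sketch that walks such records in a memory buffer. The layout details are assumptions, not taken from the description above: a 2-byte little-endian length that counts payload bytes only, and a single pad byte after odd-length payloads. The names walk_var_records and record_fn are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: receives one record's payload. */
typedef void (*record_fn)(const unsigned char *payload, size_t len, void *ctx);

/* Walk a buffer of <length, contents[, padding]> records.
 * ASSUMPTIONS (for illustration): the length is a 2-byte little-endian
 * count of payload bytes, and one padding byte follows odd-length payloads.
 * Returns the number of complete records seen; stops at truncated data. */
static size_t walk_var_records(const unsigned char *buf, size_t size,
                               record_fn fn, void *ctx)
{
    size_t pos = 0, count = 0;

    assert(buf != NULL || size == 0);
    while (size - pos >= 2) {
        size_t len = (size_t)buf[pos] | ((size_t)buf[pos + 1] << 8);
        pos += 2;
        if (len > size - pos)
            break;                      /* truncated record */
        if (fn)
            fn(buf + pos, len, ctx);
        pos += len;
        if ((len & 1) && pos < size)
            pos++;                      /* skip pad byte to keep total even */
        count++;
    }
    return count;
}
```

    Nothing in such a stream resembles CR or LF line endings, which is why the binary-mode assumptions fail outright on these formats.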

    > - files end at EOF or with ^Z (yuck). Or maybe that should be
    > "a byte < 32 for which isspace()==0". I can assume ASCII or
    > a superset, otherwise the file must be preprocessed anyway.


    You might want to make that "an unsigned byte < 32."
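
    The reason for "unsigned": where plain char is signed, bytes of 0x80 and above are negative, so they compare as less than 32, and they are not valid arguments to isspace(). A small sketch of the safer test follows; the helper name find_nonspace_control is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: index of the first byte in buf[0..len) that is a
 * non-whitespace control character (value below 32, assuming ASCII or a
 * superset), or len if there is none. */
static size_t find_nonspace_control(const char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        /* Convert first: a signed char holding 0xE9 would be negative,
         * would pass "< 32", and must not be handed to isspace(). */
        unsigned char uc = (unsigned char)buf[i];
        if (uc < 32 && !isspace(uc))
            return i;
    }
    return len;
}
```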

    --
    Eric Sosman
     
    Eric Sosman, Nov 20, 2006
    #2

  3. Random832

    On 2006-11-20, Hallvard B Furuseth wrote:
    > I'm trying to clean up a program which does arithmetic on text
    > file positions, and also reads text files in binary mode. I
    > can't easily get rid of it all, so I'm wondering which of the
    > following assumptions are, well, least unportable.
    >
    > In particular, does anyone know if there are real-life systems
    > where the text file assumptions below don't hold?
    >
    > For text mode FILE*s,
    >
    > * input lines will be ordered by ftell() position,
    >
    > and one can do arithmetic on ftell() positions within one line.


    one can _do_ arithmetic, perhaps... one isn't guaranteed to get
    meaningful results, particularly with multibyte streams.

    > I.e.:


    > - getc() adds 1 to the ftell() position, except possibly at
    > the end of a line and EOF.


    Multibyte streams again.

    >
    > - at the end of a line, getc() increments the position with a
    > small positive number. (Or moderately small, if the file
    > consists of fixed-size space-padded line records.)


    If the file is record-oriented, it could plausibly instead bump it to
    the next multiple of an arbitrarily large power of two [say, record
    number and offset are separate fields]
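
    A purely hypothetical encoding makes this concrete: suppose the library packed the record number into the high bits of the ftell() value and the offset within the record into the low 16 bits. No real implementation is claimed to use this exact layout; the names below are invented for the sketch.

```c
#include <assert.h>

/* HYPOTHETICAL position encoding, for illustration only: record number in
 * the high bits, byte offset within the record in the low 16 bits. */
#define OFFSET_BITS 16
#define OFFSET_MASK ((1L << OFFSET_BITS) - 1)

static long make_pos(long record, long offset)
{
    assert(record >= 0 && offset >= 0 && offset <= OFFSET_MASK);
    return record * (OFFSET_MASK + 1) + offset;
}

static long pos_record(long pos) { return pos >> OFFSET_BITS; }
static long pos_offset(long pos) { return pos & OFFSET_MASK; }
```

    Under this scheme, stepping within a line still adds 1, but going from offset 40 of record 0 to the start of record 1 jumps the position by 65496, so "a small positive number" at end of line no longer holds.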

    > Or for binary mode FILE*s,
    >
    > * getc() data looks like it does from a text mode FILE*, except:
    >
    > - lines end with CR/LF/CRLF/LFCR, maybe preceded with spaces.
    > (Fails for fixed-size line records, I know. Or lines stored
    > as <length, contents>, if there are such files around.)
    >
    > - files end at EOF or with ^Z (yuck). Or maybe that should be
    > "a byte < 32 for which isspace()==0". I can assume ASCII or
    > a superset, otherwise the file must be preprocessed anyway.


    Don't forget the extra zero-padding permitted at the end of binary files
    (for systems where native file size is stored in units > 1 byte)
     
    Random832, Nov 20, 2006
    #3
  4. Eric Sosman wrote:
    >Hallvard B Furuseth wrote:
    >> I'm trying to clean up a program which does arithmetic on text
    >> file positions, and also reads text files in binary mode. I
    >> can't easily get rid of it all, so I'm wondering which of the
    >> following assumptions are, well, least unportable.

    >
    > I can (dimly) recall some OpenVMS file formats that may have
    > violated some of your assumptions. Not too surprising: OpenVMS
    > had seven basic file formats, with variations -- and that was
    > just for the sequential file organization, never mind the others
    > that departed even further from C's I/O model. Text files would
    > almost always be sequential, though, so the other organizations
    > can probably be ignored.


    Sounds interesting, I'll see if I can dig out some more info about that.

    > Whether this affects the portability of your program depends
    > on the likelihood that you'll need to get it running on VMS. If
    > that likelihood is zero, then ...


    Low, but it's not unlikely that the program will meet _some_ esoteric
    system. And what one system can do, others can do as well.

    I think I'll downgrade my expectations a bit and instead ask:

    Am I likely to encounter a system where accessing a text file in binary
    mode will give me fewer headaches than ftell() arithmetic on a line in
    a text-mode FILE*? I'm not about to support things like <length,
    contents, padding> anyway. Sounds like the binary formats that will
    break my "text mode assumptions" will break just as badly in binary
    mode, which is a relief in a way :)

    In any case, I guess a user option which makes the program read the
    file as a text file and save it to a tmpfile() would be a good idea.
    Then it'll be the user's worry instead of mine...
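
    The tmpfile() idea could look roughly like the sketch below. It relies on tmpfile() opening its file in "wb+" mode, so the copy is a binary stream in which each line is its characters followed by one '\n'. The function name normalize_to_tmpfile is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Copy everything readable from text-mode stream `in` into a new temporary
 * file and return it rewound, or NULL on failure. tmpfile() opens in "wb+"
 * mode, so each line ends up as its characters plus a single '\n' byte. */
static FILE *normalize_to_tmpfile(FILE *in)
{
    FILE *tmp;
    int c;

    assert(in != NULL);
    tmp = tmpfile();
    if (tmp == NULL)
        return NULL;
    while ((c = getc(in)) != EOF) {
        if (putc(c, tmp) == EOF) {
            fclose(tmp);
            return NULL;
        }
    }
    if (ferror(in) || fflush(tmp) == EOF) {
        fclose(tmp);
        return NULL;
    }
    rewind(tmp);
    return tmp;
}
```

    For a binary stream, ftell() is specified as the number of characters from the beginning of the file, so position arithmetic on the copy is on much firmer ground than on the original text stream.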

    > ISTR that on at least some VMS file formats, fseek() could
    > only position to the start of a line ("record") and hence ftell()
    > would return the same value all through a single line. This was
    > back in the pre-Standard days, though, and since this behavior
    > doesn't meet the requirements of the Standard (or so I believe),


    Correct. fgetc() "advances the associated file position indicator" in
    both C89 and C99.

    > it may have been fixed sometime in the many intervening years.
    > (Of course, the fix may simply have been a documentation change:
    > "Don't use XYZ format with C programs.")


    > (...)
    >> - files end at EOF or with ^Z (yuck). Or maybe that should be
    >> "a byte < 32 for which isspace()==0". I can assume ASCII or
    >> a superset, otherwise the file must be preprocessed anyway.

    >
    > You might want to make that "an unsigned byte < 32."


    Good point. But I think I'm currently hoping to drop binary mode and
    stay with ftell() in text mode.

    --
    Hallvard
     
    Hallvard B Furuseth, Nov 20, 2006
    #4
  5. Random832 wrote:
    >Hallvard B Furuseth wrote:
    >> In particular, does anyone know if there are real-life systems
    >> where the text file assumptions below don't hold?
    >>
    >> For text mode FILE*s,
    >>
    >> * input lines will be ordered by ftell() position,
    >>
    >> and one can do arithmetic on ftell() positions within one line.

    >
    > one can _do_ arithmetic, perhaps... one isn't guaranteed to get
    > meaningful results, particularly with multibyte streams.


    As far as I know, streams are not multibyte unless I make them so.
    C99 7.19.2p4 says: "Once a wide character input/output function has
    been applied to a stream without orientation, the stream becomes a
    wide-oriented stream."

    Though you have a point: such a program can't be extended to handle
    wide-oriented streams.
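
    A stream's orientation can be checked at run time with fwide(), which only queries when its second argument is 0. A small sketch follows; the wrapper name stream_orientation is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Returns -1 for byte-oriented, 1 for wide-oriented, 0 if not yet set.
 * fwide(fp, 0) only queries; it does not change the orientation. */
static int stream_orientation(FILE *fp)
{
    int o;

    assert(fp != NULL);
    o = fwide(fp, 0);
    return (o > 0) - (o < 0);
}
```

    A program that depends on byte-oriented positions could assert that this never returns 1 for the streams it does arithmetic on.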

    >> - at the end of a line, getc() increments the position with a
    >> small positive number. (Or moderately small, if the file
    >> consists of fixed-size space-padded line records.)

    >
    > If the file is record-oriented, it could plausibly instead bump it to
    > the next multiple of an arbitrarily large power of two [say, record
    > number and offset are separate fields]


    True, though I don't know of an example.

    >> Or for binary mode FILE*s,
    >> (...)

    > Don't forget the extra zero-padding permitted at the end of binary
    > files (for systems where native file size is stored in units > 1 byte)


    Good point.

    --
    Hallvard
     
    Hallvard B Furuseth, Nov 20, 2006
    #5
  6. Eric Sosman

    Hallvard B Furuseth wrote on 11/20/06 13:18:
    >
    > Am I likely to encounter a system where accessing a text file in binary
    > mode will give me fewer headaches than ftell() arithmetic on a line in
    > a text-mode FILE*?


    My (unscientific) feeling is that text files should be
    read in text mode, to take advantage of whatever format
    translation the system may need. But much depends on how
    the program (ab)uses the ftell() arithmetic.

    Can you offer some examples of the kinds of ftell()
    arithmetic the program engages in? Are the jumps "short"
    (intra-line) or "long" (inter-line)? Frequent or occasional?

    --
     
    Eric Sosman, Nov 20, 2006
    #6
  7. Eric Sosman writes:
    > Hallvard B Furuseth wrote on 11/20/06 13:18:
    >> Am I likely to encounter a system where accessing a text file in binary
    >> mode will give me fewer headaches than ftell() arithmetic on a line in
    >> a text-mode FILE*?

    >
    > My (unscientific) feeling is that text files should be
    > read in text mode, to take advantage of whatever format
    > translation the system may need. But much depends on how
    > the program (ab)uses the ftell() arithmetic.
    >
    > Can you offer some examples of the kinds of ftell()
    > arithmetic the program engages in? Are the jumps "short"
    > (intra-line) or "long" (inter-line)? Frequent or occasional?


    Frankly I'm not entirely sure yet, but I think it can be reduced to
    something like:
        Walk through the file and save info about each character, with
        index (ftell() position of line + character's index in line).
        Next,
            for (i = 0; i < {max ftell() position}; i++)
                if (there is a character #i)
                    handle(getc());
    I suppose that for loop can be changed to read line by line, but
    that change looks a bit messy.

    There are some ugly cases like fseek(arbitrary position) as well,
    but I think they can be eliminated without too much fuss.

    --
    Hallvard
     
    Hallvard B Furuseth, Nov 21, 2006
    #7
  8. Eric Sosman

    Hallvard B Furuseth wrote on 11/21/06 05:52:
    > Eric Sosman writes:
    >
    >> Hallvard B Furuseth wrote on 11/20/06 13:18:
    >>
    >>> Am I likely to encounter a system where accessing a text file in binary
    >>> mode will give me fewer headaches than ftell() arithmetic on a line in
    >>> a text-mode FILE*?

    >>
    >> My (unscientific) feeling is that text files should be
    >> read in text mode, to take advantage of whatever format
    >> translation the system may need. But much depends on how
    >> the program (ab)uses the ftell() arithmetic.
    >>
    >> Can you offer some examples of the kinds of ftell()
    >> arithmetic the program engages in? Are the jumps "short"
    >> (intra-line) or "long" (inter-line)? Frequent or occasional?

    >
    >
    > Frankly I'm not entirely sure yet, but I think it can be reduced to
    > something like:
    >   Walk through the file and save info about each character, with
    >   index (ftell() position of line + character's index in line).
    >   Next,
    >     for (i = 0; i < {max ftell() position}; i++)
    >       if (there is a character #i)
    >         handle(getc());
    > I suppose that for loop can be changed to read line by line, but
    > that change looks a bit messy.


    The "walk through," I guess, is probably line by line?
    (If it were character by character you could forget about
    saving the intra-line index and just save each character's
    ftell() position, then fseek() back to it. That would make
    everything legitimate except the "max ftell() position"
    calculation, which isn't guaranteed to make sense but very
    likely will.)
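
    The character-by-character variant described in the parenthesis might be sketched as follows. It passes fseek() only values that ftell() returned for the same stream, which is what the Standard allows for text streams. The function names record_positions and char_at are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* First pass: store in pos[] the ftell() value taken just before each
 * character is read, up to `cap` characters. Returns the count stored,
 * or (size_t)-1 if ftell() fails. */
static size_t record_positions(FILE *fp, long *pos, size_t cap)
{
    size_t n = 0;

    assert(fp != NULL && pos != NULL);
    while (n < cap) {
        long here = ftell(fp);
        if (here == -1L)
            return (size_t)-1;
        if (getc(fp) == EOF)
            break;
        pos[n++] = here;
    }
    return n;
}

/* Second pass: return to a saved position and re-read that character.
 * Only values obtained from ftell() on this stream are passed to fseek(). */
static int char_at(FILE *fp, long saved)
{
    if (fseek(fp, saved, SEEK_SET) != 0)
        return EOF;
    return getc(fp);
}
```

    The cost is one long per character, so for large files the per-line approach may still be preferable.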

    But it looks like the arithmetic on ftell() values is
    strictly within a line, right? That is, the loop looks
    more like

        for (i = 0; i < max; i++) {
            if (something_about_position(i)) {
                /* arithmetic on an ftell() value */
                fseek(stream, ftellpos + offset, SEEK_SET);
                ch = getc(stream);
                ...
            }
        }

    If that's it, you may be out of the woods. Most crudely:

        for (i = 0; i < max; i++) {
            if (something_about_position(i)) {
                /* seek only to a value that ftell() returned ... */
                fseek(stream, ftellpos, SEEK_SET);
                /* ... then step forward by reading */
                for (j = 0; j < offset; j++)
                    (void)getc(stream);
                ch = getc(stream);
                ...
            }
        }

    A slightly fancier version would remember what line it
    was in and what the previous offset was, to avoid seeking
    over and over again to the start of the same line and
    getc()'ing past longer and longer prefixes.
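
    One possible shape for that fancier version is sketched here. It assumes requests mostly arrive in increasing order within a line; struct line_cursor and cursor_char_at are hypothetical names, and line_start values are expected to come from ftell() at the start of each line.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical cursor: remembers which line start we last sought to and
 * how many characters of that line have been consumed since. */
struct line_cursor {
    FILE *fp;
    long line_start;   /* ftell() value at start of current line, -1 if none */
    long consumed;     /* characters read past line_start */
};

/* Return the character `offset` characters into the line whose start
 * position (from ftell()) is `line_start`. Seeks only when the request is
 * for a different line or lies behind the current read point. */
static int cursor_char_at(struct line_cursor *cur, long line_start, long offset)
{
    int ch;

    assert(cur != NULL && cur->fp != NULL && offset >= 0);
    if (cur->line_start != line_start || offset < cur->consumed) {
        if (fseek(cur->fp, line_start, SEEK_SET) != 0)
            return EOF;
        cur->line_start = line_start;
        cur->consumed = 0;
    }
    while (cur->consumed < offset) {          /* skip forward with getc() */
        if (getc(cur->fp) == EOF) {
            cur->line_start = -1;             /* force a seek next time */
            return EOF;
        }
        cur->consumed++;
    }
    ch = getc(cur->fp);
    if (ch == EOF)
        cur->line_start = -1;
    else
        cur->consumed++;
    return ch;
}
```

    Initialize the cursor with line_start set to -1 so that the first request always seeks.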

    > There are some ugly cases like fseek(arbitrary position) as well,
    > but I think they can be eliminated without too much fuss.


    Good luck!

    --
     
    Eric Sosman, Nov 21, 2006
    #8