The tellg bug

Discussion in 'C++' started by Eivind Grimsby Haarr, Sep 2, 2004.

  1. I know that this has been posted before on several other newsgroups, but I
    need to make sure I got this right, so I hope you can forgive me for
    posting this.

    In MSVC 6.0, and also in several Borland C++ compilers from what I can see
    from newsgroup postings, ifstream::tellg() alters the position of the file
    reading pointer when reading UNIX files (only LF character, not CRLF) in
    text mode. I can see why it does this, keeping consistency while treating
    CRLF as a single character.

    Using subsequent getline(...) calls, no problems arise, but once I need
    to save a position with tellg, to be able to seek back to this position
    with seekg later, problems arise if the file has accidentally been
    converted to UNIX LF format. I know I can solve this by opening the file
    in binary mode, but then I have to write my own code handling the
    reading of lines and different newline characters.

    My questions are:
    * Is this compiler-dependent, or a general problem with text-mode file
    reading? Does the standard specify anything about this?
    * Is it impossible to write a program using only standard library
    functions, that handles tellg/seekg positioning with both UNIX/DOS files
    in text mode? (Not to mention Mac-files...)

    I know I'm not the first one that has encountered this problem, so I would
    expect that somewhere someone has solved this before...

    Finally, another question: Does anyone know of a good online
    tutorial/reference for Windows programming with C++? Or can
    someone alternatively tell me which newsgroup I should have posted
    that question to instead...


    - Eivind Grimsby Haarr

    "Trying is the first step towards failure."
    - Homer Simpson
    Eivind Grimsby Haarr, Sep 2, 2004
    #1
  2. Mike Wahler

    "Eivind Grimsby Haarr" <> wrote in message
    news:p...
    >
    > I know that this has been posted before on several other newsgroups, but I
    > need to make sure I got this right, so I hope you can forgive me for
    > posting this.
    >
    > In MSVC 6.0, and also in several Borland C++ compilers from what I can see
    > from newsgroup postings, ifstream::tellg() alters the position of the file
    > reading pointer when reading UNIX files (only LF character, not CRLF) in
    > text mode. I can see why it does this, keeping consistency while treating
    > CRLF as a single character.
    >
    > Using subsequent getline(...) calls, no problems arise, but once I need
    > to save a position with tellg, to be able to seek back to this position
    > with seekg later, problems arise if the file has accidentally been
    > converted to UNIX LF format. I know I can solve this by opening the file
    > in binary mode, but then I have to write my own code handling the
    > reading of lines and different newline characters.
    >
    > My questions are:
    > * Is this compiler-dependent, or a general problem with text-mode file
    > reading? Does the standard specify anything about this?
    > * Is it impossible to write a program using only standard library
    > functions, that handles tellg/seekg positioning with both UNIX/DOS files
    > in text mode? (Not to mention Mac-files...)
    >
    > I know I'm not the first one that has encountered this problem, so I would
    > expect that somewhere someone has solved this before...


    Since I have little experience with 'tellg()', I'll let
    someone else address that issue.

    > Finally, another question: Does anyone know of a good online
    > tutorial/reference for Windows programming with C++?


    I like the tutorials at www.relisoft.com
    YMMV. In any case, I'd recommend going through the Petzold book
    (5th edition) first (which uses C) for learning the fundamentals.

    > Or can
    > someone alternatively tell me which newsgroup I should have posted
    > that question to instead...


    Good advice on Windows programming is available in the newsgroup
    comp.os.ms-windows.programmer.win32

    -Mike
    Mike Wahler, Sep 2, 2004
    #2
  3. "Eivind Grimsby Haarr" <> wrote in message
    news:p...
    >
    > I know that this has been posted before on several other newsgroups, but I
    > need to make sure I got this right, so I hope you can forgive me for
    > posting this.
    >
    > In MSVC 6.0, and also in several Borland C++ compilers from what I can see
    > from newsgroup postings, ifstream::tellg() alters the position of the file
    > reading pointer when reading UNIX files (only LF character, not CRLF) in
    > text mode. I can see why it does this, keeping consistency while treating
    > CRLF as a single character.
    >
    > Using subsequent getline(...) calls, no problems arise, but once I need
    > to save a position with tellg, to be able to seek back to this position
    > with seekg later, problems arise if the file has accidentally been
    > converted to UNIX LF format. I know I can solve this by opening the file
    > in binary mode, but then I have to write my own code handling the
    > reading of lines and different newline characters.
    >
    > My questions are:
    > * Is this compiler-dependent, or a general problem with text-mode file
    > reading? Does the standard specify anything about this?


    The standard specifies that if you open a file in text mode then only four
    kinds of seekg call are guaranteed to work.

    1) Seek to the start of a file
    2) Seek to the end of a file
    3) Seek to the current position
    4) Seek to a position previously saved with tellg.

    This last one seems to be the one you are interested in. Although I don't
    get the bit about 'accidentally converted to UNIX LF-format'. If you're
    writing the program you should be able to stop anything being accidentally
    converted.

    On some systems with some compilers you may get other possibilities to work,
    but these are the only ones guaranteed by the standard.
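
    To illustrate case 4, here is a minimal sketch that saves a position with
    tellg, reads a line, and then seeks back to exactly that saved value. The
    names read_line_twice and read_line_twice_demo are made up for this
    example, and the assert only documents the assumption that the stream is
    seekable.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <utility>

// Reads one line, seeks back to the position saved before the read,
// and reads the same line again. Returns both results.
// (Hypothetical helper, not part of any library.)
std::pair<std::string, std::string> read_line_twice(std::istream& in)
{
    // Case 4: the only value passed to seekg() is one obtained from tellg().
    const std::istream::pos_type saved = in.tellg();
    assert(saved != std::istream::pos_type(-1));  // tellg() reports failure as -1

    std::string first, second;
    std::getline(in, first);
    in.clear();       // drop eofbit/failbit in case the read hit end of file
    in.seekg(saved);  // no arithmetic on the saved position
    std::getline(in, second);
    return std::make_pair(first, second);
}

// Small self-check using an in-memory stream instead of a file.
void read_line_twice_demo()
{
    std::istringstream in("Line 1 in file\nLine 2 in file\n");
    const std::pair<std::string, std::string> r = read_line_twice(in);
    assert(r.first == "Line 1 in file");
    assert(r.second == "Line 1 in file");
}
```

    The point of the sketch is that the saved value is treated as opaque: it is
    never incremented, decremented, or compared against byte counts.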

    > * Is it impossible to write a program using only standard library
    > functions, that handles tellg/seekg positioning with both UNIX/DOS files
    > in text mode? (Not to mention Mac-files...)


    It's perfectly possible provided you stick to the four possibilities above.

    john
    John Harrison, Sep 2, 2004
    #3
  4. I can see I did not explain the problem thoroughly enough in the previous
    posting.

    The problem arises when reading a UNIX text file, where line feeds are
    represented by the line feed character (one byte, '\n' or LF) only. In
    DOS text files, the line feeds are represented by two characters ("\r\n",
    carriage return and line feed).

    An example:

    If I have a file in UNIX text format, with line feed represented by a
    single character, e.g:

    Line 1 in file\n
    Line 2 in file\n
    Line 3 in file

    Using this code:

    --------------

    #include <fstream>
    #include <iostream>
    #include <string>

    int main()
    {
        // Opened in text mode (the default)
        std::ifstream fstrm("filename.txt");
        std::ios::pos_type tellg_result(0);
        std::string str;

        // Save position in file before reading the line
        tellg_result = fstrm.tellg();
        std::getline(fstrm, str);
        std::cout << str << std::endl;
        // Save position again
        tellg_result = fstrm.tellg();
        std::getline(fstrm, str);
        std::cout << str << std::endl;
    }

    --------------

    This code would output:
    Line 1 in file
    ine 2 in file

    Without the calls to tellg(), the output would be correct, similar to
    the file. Since the stream expects line feed to consist of two characters,
    tellg() actually moves the internal file pointer one byte when
    encountering the UNIX type single line feed character.

    Usually, somewhere internally in the stream classes, the two-character
    line-feed in DOS files is converted to the single line feed character '\n'
    when writing and reading. I guess this is done for portability, and it
    also suggests that it should be possible to enable/disable this feature.

    I'm reading a big set of text files that is shared on the net among many
    users, and it often occurs that the files are converted to and from UNIX
    and DOS formats, some files ending up in UNIX format on my Windows system.
    It seems very bothersome to have to write my own binary mode
    read-functions, especially since I want my classes to be general-purpose,
    accepting only an istream-reference, leaving to the client to open the
    file. Without knowing if the istream is an ifstream or something else, it
    is impossible to test whether it is opened in binary mode or text mode.
    (Or is it?)

    I hope this made more sense, and I appreciate feedback of any type.


    -eivind

    Eivind Grimsby Haarr, Sep 2, 2004
    #4
  5. "Eivind Grimsby Haarr" <> wrote in message
    news:p...
    >
    > I can see I did not explain the problem thoroughly enough in the previous
    > posting.
    >
    > The problem arises when reading a UNIX text file, where line feeds are
    > represented by the line feed character (one byte, '\n' or LF) only. In
    > DOS text files, the line feeds are represented by two characters ("\r\n",
    > carriage return and line feed).
    >
    > An example:
    >
    > If I have a file in UNIX text format, with line feed represented by a
    > single character, e.g:
    >
    > Line 1 in file\n
    > Line 2 in file\n
    > Line 3 in file
    >
    > Using this code:
    >
    > --------------
    >
    > std::ifstream fstrm("filename.txt");
    > std::ios::pos_type tellg_result(0);
    > std::string str("");
    >
    > // Save position in file before reading the line
    > tellg_result = fstrm.tellg();
    > getline(fstrm, str);
    > std::cout << str << std::endl;
    > // Save position again
    > tellg_result = fstrm.tellg();
    > getline(fstrm, str);
    > std::cout << str << std::endl;
    >
    > --------------
    >
    > This code would output:
    > Line 1 in file
    > ine 2 in file
    >
    > Without the calls to tellg(), the output would be correct, similar to
    > the file. Since the stream expects line feed to consist of two characters,
    > tellg() actually moves the internal file pointer one byte when
    > encountering the UNIX type single line feed character.


    My compiler does not do that. It's smart enough to treat this case correctly.
    However you have a file without correct line endings, which you are trying
    to read as if it did have correct line endings, so I think all bets are off
    and you shouldn't be too surprised that things don't work. So I'm not sure
    I'd call this a bug but I'd certainly call it a deficiency in your library.

    >
    > Usually, somewhere internally in the stream classes, the two-character
    > line-feed in DOS files is converted to the single line feed character '\n'
    > when writing and reading. I guess this is done for portability, and it
    > also suggests that it should be possible to enable/disable this feature.
    >


    That's correct (assuming that you are working on a DOS system of course).
    And of course you disable it by opening the file in binary mode.
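
    For reference, a minimal sketch of what opening in binary mode looks like.
    The helper names read_raw and read_raw_demo and the scratch file name are
    assumptions made up for this example. With std::ios::binary the bytes
    arrive exactly as stored, so a CR/LF pair stays two characters.

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <string>

// Returns the file's bytes exactly as stored. std::ios::binary switches off
// whatever newline translation the implementation performs in text mode.
// (Hypothetical helper; returns an empty string if the file cannot be opened.)
std::string read_raw(const char* path)
{
    assert(path != 0);
    std::ifstream in(path, std::ios::in | std::ios::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}

// Small self-check: write a file with mixed line endings, read it back raw.
void read_raw_demo()
{
    const char* path = "read_raw_demo.txt";  // hypothetical scratch file
    {
        std::ofstream out(path, std::ios::out | std::ios::binary);
        out << "a\r\nb\n";
    }
    const std::string bytes = read_raw(path);
    assert(bytes == "a\r\nb\n");  // both the CR and the LFs survive
}
```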

    > I'm reading a big set of text files that is shared on the net among many
    > users, and it often occurs that the files are converted to and from UNIX
    > and DOS formats, some files ending up in UNIX format on my Windows system.
    > It seems very bothersome to have to write my own binary mode
    > read-functions, especially since I want my classes to be general-purpose,
    > accepting only an istream-reference, leaving to the client to open the
    > file. Without knowing if the istream is an ifstream or something else, it
    > is impossible to test whether it is opened in binary mode or text mode.
    > (Or is it?)


    It is impossible in standard C++.

    I think you are going to have to write your own version of a getline routine.
    One that can cope with different line ending styles and/or files open in
    binary or text mode. It also wouldn't hurt to document to your clients that
    they should open files in binary mode. You might also need to use a
    different compiler and/or C++ library, I don't like the way yours is
    behaving.
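
    One possible shape for such a routine, as a rough sketch only: a thin
    wrapper around std::getline that drops a single trailing '\r'. The name
    getline_strip_cr is made up for this example. It takes a plain
    std::istream&, so it does not need to know how the client opened the file.
    A trailing CR only shows up when no newline translation happened (binary
    mode, or a system whose text mode does not translate), and in that case it
    is removed.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Like std::getline, but removes one trailing '\r' so that CR/LF input read
// without translation yields the same strings as LF-only input.
// (Hypothetical helper, not part of the standard library.)
std::istream& getline_strip_cr(std::istream& in, std::string& line)
{
    if (std::getline(in, line) && !line.empty() && line[line.size() - 1] == '\r')
        line.erase(line.size() - 1);
    return in;
}

// Small self-check with mixed line endings in an in-memory stream.
void getline_strip_cr_demo()
{
    std::istringstream in("one\r\ntwo\nthree");
    std::string line;
    getline_strip_cr(in, line);
    assert(line == "one");    // CR/LF terminated
    getline_strip_cr(in, line);
    assert(line == "two");    // LF terminated
    getline_strip_cr(in, line);
    assert(line == "three");  // no terminator at end of input
}
```

    This sketch does not handle CR-only line endings, since std::getline still
    splits on '\n' alone.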

    john
    John Harrison, Sep 3, 2004
    #5
  6. Jack Klein

    On Fri, 3 Sep 2004 06:47:13 -0400, "P.J. Plauger" <>
    wrote in comp.lang.c++:

    > "John Harrison" <> wrote in message
    > news:...
    >
    > > > I'm reading a big set of text files that is shared on the net among many
    > > > users, and it often occurs that the files are converted to and from UNIX
    > > > and DOS formats, some files ending up in UNIX format on my Windows
    > > > system. It seems very bothersome to have to write my own binary mode
    > > > read-functions, especially since I want my classes to be
    > > > general-purpose, accepting only an istream-reference, leaving to the
    > > > client to open the file. Without knowing if the istream is an ifstream
    > > > or something else, it is impossible to test whether it is opened in
    > > > binary mode or text mode. (Or is it?)

    > >
    > > It is impossible in standard C++.

    >
    > Nonsense.
    >
    > > I think you are going to have to write your own version of a getline
    > > routine. One that can cope with different line ending styles and/or
    > > files open in binary or text mode. It also wouldn't hurt to document to
    > > your clients that they should open files in binary mode. You might also
    > > need to use a different compiler and/or C++ library, I don't like the
    > > way yours is behaving.

    >
    > Whoa, there. He's trying to deal with two kinds of "text" files:
    >
    > 1) those that end each line with CR/LF (standard DOS format)
    >
    > 2) those that end each line with LF (standard Unix format)
    >
    > If he reads all files in binary mode, each will have an LF at the
    > end, which is the standard internal line terminator in C/C++
    > ('\n'). Existing getline, etc. will work fine. The only issues I
    > see are:
    >
    > 1) Do any CRs at the end of lines matter, or can they just be carried
    > along? Worst case is you delete all CRs and hope that no text plays
    > overstrike games with embedded CRs.
    >
    > 2) Do you want to produce canonical (CR/LF terminated) output from
    > such arbitrary input? In that case CRs *do* matter and you have to
    > be sure to write new files in text mode.
    >
    > No big deal.


    I've had to deal with this quite a bit in communications routines in
    the old days.

    The simplest solution I found was to consider every '\r' as a newline.
    Any '\n' immediately preceded by a '\r' is ignored, any '\n'
    preceded by any other character is considered a newline.

    Works quite well for '\r\n' (was CP/M in those days, MS-DOS wasn't
    around yet), '\r' only (Apple and some others, the others mostly
    defunct now), and Unix '\n' only. Even handled files produced by a
    few perverse utilities on '\r\n' that would skip the '\r' on repeated
    blank lines. That is:

    line1
    line2

    line3

    ...would appear as:

    "line1\r\nline2\r\n\nline3\n"

    This would not correctly handle something that used '\n\r' to end
    lines, but I knew of no such systems and never heard from any users
    that ran into one.

    In any case, this logic is quite simple to perform on files opened in
    binary mode.
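
    A minimal sketch of that rule, assuming the stream was opened in binary
    mode so every '\r' and '\n' arrives untranslated. The names
    read_lines_any_eol and read_lines_any_eol_demo are made up for this
    example. The demo runs the sample string from above through it.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits the input into lines using the rule described above:
//   - every '\r' ends a line,
//   - a '\n' immediately preceded by '\r' is ignored,
//   - any other '\n' ends a line.
// (Hypothetical helper; expects untranslated input, e.g. a binary-mode stream.)
std::vector<std::string> read_lines_any_eol(std::istream& in)
{
    std::vector<std::string> lines;
    std::string current;
    char previous = '\0';
    char c;
    while (in.get(c)) {
        if (c == '\r') {
            lines.push_back(current);
            current.clear();
        } else if (c == '\n') {
            if (previous != '\r') {
                lines.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
        previous = c;
    }
    if (!current.empty())
        lines.push_back(current);  // final line without a terminator
    return lines;
}

// Small self-check using the mixed-terminator sample from the post.
void read_lines_any_eol_demo()
{
    std::istringstream in("line1\r\nline2\r\n\nline3\n");
    const std::vector<std::string> lines = read_lines_any_eol(in);
    assert(lines.size() == 4);
    assert(lines[0] == "line1");
    assert(lines[1] == "line2");
    assert(lines[2].empty());  // the blank line written as a bare '\n'
    assert(lines[3] == "line3");
}
```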

    --
    Jack Klein
    Home: http://JK-Technology.Com
    FAQs for
    comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
    comp.lang.c++ http://www.parashift.com/c++-faq-lite/
    alt.comp.lang.learn.c-c++
    http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
    Jack Klein, Sep 4, 2004
    #6
  7. On Fri, 03 Sep 2004 23:50:45 -0500, Jack Klein wrote:

    > The simplest solution I found was to consider every '\r' as a newline. Any
    > '\n' immediately preceded by a '\r' is ignored, any '\n' preceded by any
    > other character is considered a newline.
    >
    > Works quite well for '\r\n' (was CP/M in those days, MS-DOS wasn't around
    > yet), '\r' only (Apple and some others, the others mostly defunct now),
    > and Unix '\n' only. Even handled files produced by a few perverse
    > utilities on '\r\n' that would skip the '\r' on repeated blank lines.
    > That is:
    >
    > line1
    > line2
    >
    > line3
    >
    > ...would appear as:
    >
    > "line1\r\nline2\r\n\nline3\n"


    That's only perverse if you're not familiar with the origins of "carriage
    return" versus "line feed". (It is perverse in the modern sense of "line
    break" as a separator between lines, but that's newer than ASCII.)

    --
    Some say the Wired doesn't have political borders like the real world,
    but there are far too many nonsense-spouting anarchists or idiots who
    think that pranks are a revolution.
    Owen Jacobson, Sep 4, 2004
    #7