The tellg bug

  • Thread starter Eivind Grimsby Haarr

Eivind Grimsby Haarr

I know that this has been posted before on several other newsgroups, but I
need to make sure I got this right, so I hope you can forgive me for
posting this.

In MSVC 6.0, and also in several Borland C++ compilers from what I can see
from newsgroup postings, ifstream::tellg() alters the position of the file
reading pointer when reading UNIX files (only LF character, not CRLF) in
text mode. I can see why it does this, keeping consistency while treating
CRLF as a single character.

Using subsequent getline(...)-calls, no problems arise, but once I need
to save a position with tellg, to be able to seek back to this position
with seekg later, problems arise if the file has accidentally been
converted to UNIX LF-format. I know I can solve this by opening the file
in binary mode, but then I have to write my own code handling the
reading of lines and different newline characters.

My questions are:
* Is this compiler-dependent, or a general problem with text-mode file
reading? Does the standard specify anything about this?
* Is it impossible to write a program using only standard library
functions, that handles tellg/seekg positioning with both UNIX/DOS files
in text mode? (Not to mention Mac-files...)

I know I'm not the first one that has encountered this problem, so I would
expect that somewhere someone has solved this before...

Finally, another question: Does anyone know about a good online
tutorial/reference for Windows programming with C++? Or can
someone alternatively tell me which newsgroup I rather should have posted
that question to...


- Eivind Grimsby Haarr

"Trying is the first step towards failure."
- Homer Simpson
 

Mike Wahler

Eivind Grimsby Haarr said:
I know that this has been posted before on several other newsgroups, but I
need to make sure I got this right, so I hope you can forgive me for
posting this.

In MSVC 6.0, and also in several Borland C++ compilers from what I can see
from newsgroup postings, ifstream::tellg() alters the position of the file
reading pointer when reading UNIX files (only LF character, not CRLF) in
text mode. I can see why it does this, keeping consistency while treating
CRLF as a single character.

Using subsequent getline(...)-calls, no problems arise, but once I need
to save a position with tellg, to be able to seek back to this position
with seekg later, problems arise if the file has accidentally been
converted to UNIX LF-format. I know I can solve this by opening the file
in binary mode, but then I have to write my own code handling the
reading of lines and different newline characters.

My questions are:
* Is this compiler-dependent, or a general problem with text-mode file
reading? Does the standard specify anything about this?
* Is it impossible to write a program using only standard library
functions, that handles tellg/seekg positioning with both UNIX/DOS files
in text mode? (Not to mention Mac-files...)

I know I'm not the first one that has encountered this problem, so I would
expect that somewhere someone has solved this before...

Since I have little experience with 'tellg()', I'll let
someone else address that issue.
Finally, another question: Does anyone know about a good online
tutorial/reference for Windows programming with C++?

I like the tutorials at www.relisoft.com
YMMV. In any case, I'd recommend going through the Petzold book
(5th edition) first (which uses C) for learning the fundamentals.
Or can
someone alternatively tell me which newsgroup I rather should have posted
that question to...

Good advice on Windows programming is available in the newsgroup
comp.os.ms-windows.programmer.win32

-Mike
 

John Harrison

Eivind Grimsby Haarr said:
I know that this has been posted before on several other newsgroups, but I
need to make sure I got this right, so I hope you can forgive me for
posting this.

In MSVC 6.0, and also in several Borland C++ compilers from what I can see
from newsgroup postings, ifstream::tellg() alters the position of the file
reading pointer when reading UNIX files (only LF character, not CRLF) in
text mode. I can see why it does this, keeping consistency while treating
CRLF as a single character.

Using subsequent getline(...)-calls, no problems arise, but once I need
to save a position with tellg, to be able to seek back to this position
with seekg later, problems arise if the file has accidentally been
converted to UNIX LF-format. I know I can solve this by opening the file
in binary mode, but then I have to write my own code handling the
reading of lines and different newline characters.

My questions are:
* Is this compiler-dependent, or a general problem with text-mode file
reading? Does the standard specify anything about this?

The standard specifies that if you open a file in text mode then only four
versions of seekg are going to work.

1) Seek to the start of a file
2) Seek to the end of a file
3) Seek to the current position
4) Seek to a position previously saved with tellg.

This last one seems to be the one you are interested in. Although I don't
get the bit about 'accidentally converted to UNIX LF-format'. If you're
writing the program you should be able to stop anything being accidentally
converted.

On some systems with some compilers you may get other possibilities to work,
but these are the only ones guaranteed by the standard.
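
As a minimal sketch of the fourth case, the function below saves a position with tellg() and later hands exactly that value back to seekg(). The function name and the scratch file name are made up for illustration. The file is written by the same program in text mode, so it has the platform's native line endings; this shows the guaranteed round trip, not the LF-on-Windows situation from the original post.

```cpp
#include <cassert>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>

// Reads line 2 twice by saving a position with tellg() and returning to it.
// "tellg_demo.txt" is an arbitrary scratch file name chosen for this sketch.
std::pair<std::string, std::string> read_second_line_twice()
{
    const char* path = "tellg_demo.txt";
    {
        std::ofstream out(path);   // text mode: '\n' is written as the native line ending
        out << "Line 1 in file\nLine 2 in file\nLine 3 in file\n";
    }

    std::ifstream in(path);        // text mode
    assert(in.is_open());

    std::string skipped, first, again;
    std::getline(in, skipped);                        // consume line 1
    const std::ifstream::pos_type mark = in.tellg();  // save a position
    std::getline(in, first);                          // line 2
    in.seekg(mark);                                   // return to the saved value, nothing else
    std::getline(in, again);                          // line 2 once more

    in.close();
    std::remove(path);
    return std::make_pair(first, again);
}
```

The saved value is treated as an opaque token: no arithmetic is done on it, and it is only passed back to the stream it came from.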
* Is it impossible to write a program using only standard library
functions, that handles tellg/seekg positioning with both UNIX/DOS files
in text mode? (Not to mention Mac-files...)

It's perfectly possible provided you stick to the four possibilities above.

john
 

Eivind Grimsby Haarr

I can see I did not explain the problem thoroughly enough in the previous
posting.

The problem arises when reading a UNIX text file, where line feeds are
represented by the line feed character (one byte, '\n' or LF) only. In
DOS text files, the line feeds are represented by two characters ("\r\n",
carriage return and line feed).

An example:

If I have a file in UNIX text format, with line feed represented by a
single character, e.g:

Line 1 in file\n
Line 2 in file\n
Line 3 in file

Using this code:

--------------

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream fstrm("filename.txt");   // opened in text mode
    std::ios::pos_type tellg_result(0);
    std::string str;

    // Save position in file before reading the line
    tellg_result = fstrm.tellg();
    std::getline(fstrm, str);
    std::cout << str << std::endl;
    // Save position again
    tellg_result = fstrm.tellg();
    std::getline(fstrm, str);
    std::cout << str << std::endl;
}

--------------

This code would output:
Line 1 in file
ine 2 in file

Without the calls to tellg(), the output would be correct, similar to
the file. Since the stream expects line feed to consist of two characters,
tellg() actually moves the internal file pointer one byte when
encountering the UNIX type single line feed character.

Usually, somewhere internally in the stream classes, the two-character
line-feed in DOS files is converted to the single line feed character '\n'
when writing and reading. I guess this is done for portability, and it
also suggests that it should be possible to enable/disable this feature.

I'm reading a big set of text files that is shared on the net among many
users, and it often occurs that the files are converted to and from UNIX
and DOS formats, some files ending up in UNIX format on my Windows system.
It seems very bothersome to have to write my own binary mode
read-functions, especially since I want my classes to be general-purpose,
accepting only an istream-reference, leaving to the client to open the
file. Without knowing if the istream is an ifstream or something else, it
is impossible to test whether it is opened in binary mode or text mode.
(Or is it?)

I hope this made more sense, and I appreciate feedback of any type.


-eivind
 

John Harrison

Eivind Grimsby Haarr said:
I can see I did not explain the problem thoroughly enough in the previous
posting.

The problem arises when reading a UNIX text file, where line feeds are
represented by the line feed character (one byte, '\n' or LF) only. In
DOS text files, the line feeds are represented by two characters ("\r\n",
carriage return and line feed).

An example:

If I have a file in UNIX text format, with line feed represented by a
single character, e.g:

Line 1 in file\n
Line 2 in file\n
Line 3 in file

Using this code:

--------------

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream fstrm("filename.txt");   // opened in text mode
    std::ios::pos_type tellg_result(0);
    std::string str;

    // Save position in file before reading the line
    tellg_result = fstrm.tellg();
    std::getline(fstrm, str);
    std::cout << str << std::endl;
    // Save position again
    tellg_result = fstrm.tellg();
    std::getline(fstrm, str);
    std::cout << str << std::endl;
}

--------------

This code would output:
Line 1 in file
ine 2 in file

Without the calls to tellg(), the output would be correct, similar to
the file. Since the stream expects line feed to consist of two characters,
tellg() actually moves the internal file pointer one byte when
encountering the UNIX type single line feed character.

My compiler does not do that. It's smart enough to treat this case correctly.
However you have a file without correct line endings, which you are trying
to read as if it did have correct line endings, so I think all bets are off
and you shouldn't be too surprised that things don't work. So I'm not sure
I'd call this a bug but I'd certainly call it a deficiency in your library.
Usually, somewhere internally in the stream classes, the two-character
line-feed in DOS files is converted to the single line feed character '\n'
when writing and reading. I guess this is done for portability, and it
also suggests that it should be possible to enable/disable this feature.

That's correct (assuming that you are working on a DOS system of course).
And of course you disable it by opening the file in binary mode.
I'm reading a big set of text files that is shared on the net among many
users, and it often occurs that the files are converted to and from UNIX
and DOS formats, some files ending up in UNIX format on my Windows system.
It seems very bothersome to have to write my own binary mode
read-functions, especially since I want my classes to be general-purpose,
accepting only an istream-reference, leaving to the client to open the
file. Without knowing if the istream is an ifstream or something else, it
is impossible to test whether it is opened in binary mode or text mode.
(Or is it?)

It is impossible in standard C++.

I think you are going to have to write your own version of a getline routine,
one that can cope with different line ending styles and/or files open in
binary or text mode. It also wouldn't hurt to document to your clients that
they should open files in binary mode. You might also need to use a
different compiler and/or C++ library; I don't like the way yours is
behaving.

john
 

Jack Klein

Whoa, there. He's trying to deal with two kinds of "text" files:

1) those that end each line with CR/LF (standard DOS format)

2) those that end each line with LF (standard Unix format)

If he reads all files in binary mode, each will have an LF at the
end, which is the standard internal line terminator in C/C++
('\n'). Existing getline, etc. will work fine. The only issues I
see are:

1) Do any CRs at the end of lines matter, or can they just be carried
along? Worst case is you delete all CRs and hope that no text plays
overstrike games with embedded CRs.

2) Do you want to produce canonical (CR/LF terminated) output from
such arbitrary input? In that case CRs *do* matter and you have to
be sure to write new files in text mode.

No big deal.
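
A minimal sketch of the binary-mode approach described above, assuming that a CR at the end of a line never matters and can simply be dropped (issue 1). The names `getline_strip_cr` and `demo_mixed_endings` are made up for illustration, and an in-memory stream stands in for a file opened with std::ios::binary. It does not handle CR-only line endings.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Reads one line with std::getline and drops a single trailing '\r',
// so "abc\r\n" and "abc\n" both yield "abc".
// Assumes the stream performs no newline translation (binary mode).
std::istream& getline_strip_cr(std::istream& in, std::string& line)
{
    if (std::getline(in, line) && !line.empty() && line[line.size() - 1] == '\r')
        line.erase(line.size() - 1);
    return in;
}

// Usage sketch: reads a buffer with mixed DOS and UNIX line endings
// and returns the lines wrapped in brackets.
std::string demo_mixed_endings()
{
    std::istringstream in("dos line\r\nunix line\nlast");
    std::string line, joined;
    while (getline_strip_cr(in, line))
        joined += "[" + line + "]";
    assert(in.eof());   // the loop ends only after the input is exhausted
    return joined;
}
```

Because binary mode performs no CR/LF translation, the tellg problem from the original post, which was tied to text mode, should not arise with this kind of reader. The function takes a plain std::istream reference, so it cannot verify that the caller really opened the file in binary mode.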

I've had to deal with this quite a bit in communications routines in
the old days.

The simplest solution I found was to consider every '\r' as a newline.
Any '\n' immediately preceded by a '\r' is ignored, any '\n'
preceded by any other character is considered a newline.

Works quite well for '\r\n' (was CP/M in those days, MS-DOS wasn't
around yet), '\r' only (Apple and some others, the others mostly
defunct now), and Unix '\n' only. Even handled files produced by a
few perverse utilities on '\r\n' that would skip the '\r' on repeated
blank lines. That is:

line1
line2

line3

...would appear as:

"line1\r\nline2\r\n\nline3\n"

This would not correctly handle something that used '\n\r' to end
lines, but I knew of no such systems and never heard from any users
that ran into one.

In any case, this logic is quite simple to perform on files opened in
binary mode.
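
The rule above can be sketched roughly as follows. This is one possible reading of it, not the original communications code; the names `split_any_newline` and `demo_split` are made up for illustration, and the stream is assumed to do no newline translation (binary mode).

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Splits a stream into lines using the rule described above:
//   '\r'                  -> always ends a line
//   '\n' right after '\r' -> ignored (second half of a "\r\n" pair)
//   any other '\n'        -> ends a line
std::vector<std::string> split_any_newline(std::istream& in)
{
    std::vector<std::string> lines;
    std::string current;
    char prev = '\0';
    char c;
    while (in.get(c)) {
        if (c == '\r') {
            lines.push_back(current);
            current.clear();
        } else if (c == '\n') {
            if (prev != '\r') {
                lines.push_back(current);
                current.clear();
            }
        } else {
            current += c;
        }
        prev = c;
    }
    if (!current.empty())
        lines.push_back(current);   // final line with no terminator
    return lines;
}

// Usage sketch with the mixed-ending sample from the post above.
void demo_split()
{
    std::istringstream in("line1\r\nline2\r\n\nline3\n");
    const std::vector<std::string> lines = split_any_newline(in);
    assert(lines.size() == 4);      // "line1", "line2", "", "line3"
    assert(lines[2].empty());       // the blank line survives
}
```

As noted above, a '\n\r' convention would not be handled: each such pair would be counted as two line breaks.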
 

Owen Jacobson

The simplest solution I found was to consider every '\r' as a newline. Any
'\n' immediately preceded by a '\r' is ignored, any '\n' preceded by any
other character is considered a newline.

Works quite well for '\r\n' (was CP/M in those days, MS-DOS wasn't around
yet), '\r' only (Apple and some others, the others mostly defunct now),
and Unix '\n' only. Even handled files produced by a few perverse
utilities on '\r\n' that would skip the '\r' on repeated blank lines.
That is:

line1
line2

line3

...would appear as:

"line1\r\nline2\r\n\nline3\n"

That's only perverse if you're not familiar with the origins of "carriage
return" versus "line feed". (It is perverse in the modern sense of "line
break" as a separator between lines, but that's newer than ASCII.)
 
