peek() and tellg()

wizofaus

Is there any reason according to the standard that calling tellg() on an
std::ifstream after a call to peek() could place the filebuf in an
inconsistent state?
I think it's a bug in the VC7 Dinkumware implementation (and I've
reported it to them as such), but the following code

std::ofstream ofs("test.txt");
ofs << "0123456789";
ofs.close();
std::wifstream ifs("test.txt");
std::wcout << wchar_t(ifs.peek());
ifs.tellg();
std::wcout << wchar_t(ifs.peek()); ifs.get();
std::wcout << wchar_t(ifs.peek()); ifs.get();
std::wcout << wchar_t(ifs.peek()); ifs.get();
std::wcout << wchar_t(ifs.peek()); ifs.get();
std::wcout << std::endl;

Prints out 00246, when I would expect 00123. Remove the tellg() (or
move it to after a get) and it prints exactly that.


[ See http://www.gotw.ca/resources/clcm.htm for info about ]
[ comp.lang.c++.moderated. First time posters: Do this! ]
 
P.J. Plauger

Actually, I would expect 01234, and that's what our latest
library gives, both in our shipped product and VC++ V8
(Whidbey) which is soon to be formally released. V7.0
and earlier "fail" in a different way than V7.0.

I put "fail" in quotes because the above code is asking
for trouble. First, it writes a text line with no
terminating newline. That's not a problem here, but it
generally causes trouble. More important, it mixes two
different ways of accessing a stream:

-- as a one-pass input stream with limited pushback

-- as a random-access sequence with bookmarks

It has been known for decades that trying to access
the same stream both ways is fraught with peril.
Whether you call the resulting surprising behavior
buggy or regrettable is a matter of taste.

The biggest stress point in the code above is the
initial peek followed by a tell. It's hard enough
pushing back a character and still generating a
proper seek offset; if you push back a character
at the beginning of a file it's way harder to get
"right". The C I/O model, which underlies C++,
permits the implementation to discard any pushed
back characters when determining a seek offset.
That's why we read the "0" only once. It may still
not be what you want, but I believe that it's
defensible.

FWIW, you'll find this code terribly nonportable.
Other Standard C++ library implementations go off
in all sorts of interesting directions in this
area. If you want robust code, don't mix peek
and seek/tell.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com




Howard Hinnant

Fwiw, the CodeWarrior/Freescale implementation outputs:

0101234567

The reason it looks so strange is that you create the file with
narrow chars and then read it back as wide wchar_t's. The default
wide character "encoding" for this product is to read and write all
bytes of the wchar_t in native byte order. If I change your example to
read the file back in as narrow characters:

std::ofstream ofs("test.txt");
ofs << "0123456789";
ofs.close();
std::ifstream ifs("test.txt");
std::cout << char(ifs.peek());
ifs.tellg();
std::cout << char(ifs.peek()); ifs.get();
std::cout << char(ifs.peek()); ifs.get();
std::cout << char(ifs.peek()); ifs.get();
std::cout << char(ifs.peek()); ifs.get();
std::cout << std::endl;

the output is then:

00123

-Howard
 
wizofaus

Ok, but what about if you create the file using wofstream? I know the
problem doesn't happen with narrow streams, and I'm not terribly
concerned about narrow streams. In actual fact I'm using my own UTF8
codecvt, but that seems not to be the problem.
 
Howard Hinnant

This:

#include <iostream>
#include <fstream>

int main()
{
    std::wofstream ofs("test.txt");
    ofs << L"0123456789";
    ofs.close();
    std::wifstream ifs("test.txt");
    std::wcout << char(ifs.peek());
    ifs.tellg();
    std::wcout << wchar_t(ifs.peek()); ifs.get();
    std::wcout << wchar_t(ifs.peek()); ifs.get();
    std::wcout << wchar_t(ifs.peek()); ifs.get();
    std::wcout << wchar_t(ifs.peek()); ifs.get();
    std::wcout << std::endl;
}

Outputs:

00123

for me.

-Howard
 
wizofaus

Howard said:
This:

std::wofstream ofs("test.txt");
ofs << L"0123456789";
ofs.close();
std::wcout << std::endl;

Outputs:

00123

for me.

Ok, well that's exactly what I would expect.
Quite surprised that CodeWarrior defaults to storing the file as wide
characters though - is it big or little endian, 16-bit or 32-bit
depending on the platform? Does it have any simple ways of controlling
this?
I assume then there are no requirements in the standard regarding the
default behaviour when converting wchar_t's to char's for file output.
I pretty much always use UTF8 these days: at least until Chinese
becomes the lingua franca of cyberspace it seems about the best choice!
 
Howard Hinnant

Quite surprised that CodeWarrior defaults to storing the file as wide
characters though - is it big or little endian, 16-bit or 32-bit
depending on the platform?

Yes. That is, it is whatever wchar_t is on the platform (an image on
disk of what was in memory).

Does it have any simple ways of controlling
this?
I assume then there are no requirements in the standard regarding the
default behaviour when converting wchar_t's to char's for file output.
I pretty much always use UTF8 these days: at least until Chinese
becomes the lingua franca of cyberspace it seems about the best choice!

You can of course write your own codecvt and install/imbue it into your
streams. It also comes with several prewritten codecvt's, including one
for UTF8.

std::__ucs_2
std::__jis
std::__shift_jis
std::__euc
std::__utf_8

All of these prewritten codecvt's are templated on the internal
character type and will work with either 16 or 32 bit internal character
types - which need not be a wchar_t.

These can simply be picked up and installed the same way you would
install your own codecvt. There is also a locale data file format that
can be used to control which prewritten codecvt gets installed into a
locale.
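
As a rough illustration of installing a codecvt into a stream, here is a minimal sketch. It does not use the CodeWarrior facets listed above; it assumes the standard `std::codecvt_utf8` from `<codecvt>` (added in C++11 and deprecated since C++17), and `utf8_round_trip` is a hypothetical helper name made up for this example.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes one line of wide text as UTF-8 and reads it
// back through the same facet. Assumes std::codecvt_utf8 is available.
std::wstring utf8_round_trip(const char* path, const std::wstring& text)
{
    // The locale takes ownership of the facet allocated here.
    const std::locale utf8(std::locale::classic(),
                           new std::codecvt_utf8<wchar_t>);

    std::wofstream out;
    out.imbue(utf8);   // imbue before opening, so every character is converted
    out.open(path);
    out << text;
    out.close();

    std::wifstream in;
    in.imbue(utf8);
    in.open(path);

    std::wstring result;
    std::getline(in, result);
    return result;
}
```

Imbuing before open() is a deliberate choice: changing the locale of a file stream after I/O has started is only well defined in limited cases.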

You are correct in your assumption about the requirements on the default
encoding scheme. It is supposed to be whatever is appropriate for the
vendor's customers on a given platform.

For better or worse I decided years ago that storing the whole wchar_t
on disk, unencoded, was the most obvious thing to do, and thus a good
default. At the time I was aware of the popular "drop the high byte(s)"
encoding but the resultant loss of information made me nervous. I also
worked with the standards committee to try to ensure that a "don't
encode" default behavior was conforming.

http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#305

Fyi, CodeWarrior for Windows is no longer for sale, at least not from
Metrowerks/Freescale.

-Howard
 
kanze

Is there any reason according to the standard that calling
tellg() on an std::ifstream after a call to peek() could place
the filebuf in an inconsistent state?
I think it's a bug in the VC7 Dinkumware implementation (and
I've reported it to them as such), but the following code

std::ofstream ofs("test.txt");
ofs << "0123456789";
ofs.close();
std::wifstream ifs("test.txt");

Careful. You're reading a file written with narrow characters
as if it contained wide characters. Any results will depend on
the locale; whether they're useful or sensible is almost pure
luck.

On most modern machines, narrow characters use an encoding in
which all of the characters in the basic character set are
ASCII. If this is the case, imbuing a UTF-8 locale should allow
reading them correctly. Still, IMHO, if you want the file to
contain UTF-8, you should write it with a wofstream imbued with
a UTF-8 locale.

The "C" locale depends on a lot of things; I don't think the
standard actually says what it should be in this case. And of
course, most programs will have done a std::locale::global(
std::locale( "" ) ) as the first thing in main; under Unix, at
least, this sets the locale to a value determined by environment
variables.

Off hand, from what little I know of Windows, I would expect the
default to use UTF-16LE, not UTF-8. In which case, you're likely
to get some very strange results: letters from strange
alphabets, or illegal characters.

std::wcout << wchar_t(ifs.peek());
ifs.tellg();
std::wcout << wchar_t(ifs.peek()); ifs.get();
std::wcout << wchar_t(ifs.peek()); ifs.get();
std::wcout << wchar_t(ifs.peek()); ifs.get();
std::wcout << wchar_t(ifs.peek()); ifs.get();
std::wcout << std::endl;

Prints out 00246, when I would expect 00123. Remove the
tellg() (or move it to after a get) and it prints exactly
that.

I'm not sure what effect the tellg() has -- that part seems
strange. But for the rest, I'd say that you're playing with
undefined, or poorly defined behavior. (Not necessarily
undefined behavior in the sense of the standard, but in the
sense that there really aren't any requirements as to what the
results should be.)

--
James Kanze GABI Software
Conseils en informatique orientée objet/
Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


wizofaus

kanze said:
Careful. You're reading a file written with narrow characters
as if it contained wide characters. Any results will depend on
the locale; whether they're useful or sensible is almost pure
luck.

Yes, I should have provided the example using std::wofstream (and
L"0123456789"). Exactly the same problem occurs (but not when using
narrow streams for both input and output).

Off hand, from what little I know of Windows, I would expect the
default to use UTF-16LE, not UTF-8. In which case, you're likely
to get some very strange results: letters from strange
alphabets, or illegal characters.

The Dinkumware implementation, and indeed I think all the others I've
used default to simply converting wchar_t's to char's (thus any values
over 255 cannot be stored).


P.J. Plauger

Yes, we use the same conversions as the Standard C library by default.
But we also supply just about every conversion you can imagine for
C++ with our codecvt library in our CoreX product.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com


