Does wstream make sense?

Daniel

I don't understand it. It seems to be as pointless and outside the spirit of c/c++ io as would be a floatstream or intstream.

Daniel
 
Öö Tiib

I don't understand it.
It seems to be as pointless and outside the spirit of c/c++ io as would
be a floatstream or intstream.

What is wstream, and what do you mean by the spirit? Do you mean
boost::spirit::qi::wstream? I further assume that you mean just the usage of
wchar_t for streaming. If I misunderstood, then sorry. Does "stream of text"
make any sense to you?

Pointless? Let me try to give two points:

First point: Something like wchar_t is "the character type" in languages
like Java or C#. C and C++ are often used for writing efficient native
libraries for the likes of Java or C# so wchar_t can be used for
interfacing with those languages.

Second point: On modern hardware an octet of bits tends to be too small
a unit of data. Bytes are often emulated by extracting octets from
bigger words. wchar_t can actually be more efficient than char on
such platforms. Therefore usage of wchar_t can be a platform-specific
optimization.

So ... what is wrong with those points?

"C/C++" seems actually rather unfortunate to say in this context. In
C++, wchar_t is a distinct fundamental type. In C, it is a typedef of
an integral type.
 
Daniel

What is wstream

and what do you mean by the spirit?

The C language approach has always been to do I/O using library functions which view data as a stream of bytes.
I further assume that you mean just the usage of
wchar_t for streaming. If I misunderstood, then sorry. Does "stream of text"
make any sense to you?
No more than a stream of floats, a stream of ints, etc. A unicode character is an encoding of bytes, as are floats, ints, etc.
Pointless? Let me try to give two points:

First point: Something like wchar_t is "the character type" in languages
like Java or C#.

Java streams characters as UTF8 octets, which in my understanding is also emerging as best practice in C++.
C and C++ are often used for writing efficient native
libraries for the likes of Java or C# so wchar_t can be used for
interfacing with those languages.
Second point: On modern hardware an octet of bits tends to be too small
a unit of data. Bytes are often emulated by extracting octets from
bigger words. wchar_t can actually be more efficient than char on
such platforms. Therefore usage of wchar_t can be a platform-specific
optimization.
So ... what is wrong with those points?
Not convinced, but pleased that you took the time to reply, thanks.
"C/C++" seems actually rather unfortunate to say in this context. In
C++, wchar_t is a distinct fundamental type. In C, it is a typedef of
an integral type.

I see the representation of characters and strings as completely orthogonal to how streaming should work, which traditionally for C has been streams of bytes.

Thanks,
Daniel
 
James Kanze

Apologies for the imprecision, just colloquial for std::basic_iostream<wchar_t>
etc.

The C language approach has always been to do I/O using library functions
which view data as a stream of bytes.

This corresponds to the view taken by most OS's, and by the internet.
No more than a stream of floats, a stream of ints, etc. A unicode character is an
encoding of bytes, as are floats, ints, etc.

With the difference that std::basic_istream<float> results in
Java streams characters as UTF8 octets, which in my understanding is
also emerging as best practice in C++.

Java streams characters as UTF16BE. You can also define a byte stream,
in which case, you have to specify its encoding (or it is specified
somehow from the environment).
The string representation, whether std::basic_string<char>,
std::basic_string<wchar_t>, or std::basic_string<char32_t>, should be
orthogonal to streaming (but in C++ isn't). The language should (I
hasten to add, in my opinion) support streaming wchar_t and
std::basic_string<wchar_t> to and from byte streams.

In C++, how `wchar_t` is streamed is locale dependent. As it should (or
maybe even, must) be.

I'm not sure what his point is, but on modern systems, everything
outside the process is octet oriented. The Internet only understands
octets, for example, as do the most widespread networked file systems.
One system (and one only) did try to implement UTF-16; in theory, it
might have been a nice idea, but in practice, even that system has to
interact with the rest of the world, so its nice ideas don't amount to
much.
I see the representation of characters and strings as completely
orthogonal to how streaming should work, which traditionally for C has
been streams of bytes.

Both C and C++ tend to be pragmatic. If internally UTF-16 or UTF-32 is
a better choice, an implementation may implement it on wchar_t. But
externally, the world is octet oriented. Even systems with 9 bit bytes
have to provide some means of reading and writing octets.
 
Daniel

I would turn your question around and ask why are 8 bit representations
of text not equally "pointless"? UTF-8 is about as useless and as
useful as UTF-16. In neither case can any one "character" represent an
entire unicode code point. UTF-32 can, but that does not take you a
great deal further because of combining characters, and because not all
graphemes can be represented by a single unicode codepoint anyway.
UTF-8 is marginally more space efficient than UTF-16 with European
writing systems. UTF-16 is marginally easier to decompose into
graphemes than UTF-8 in non-European writing systems.
But won't things have been simpler if we could have written

std::ostream os;

wchar_t c;
std::basic_string<wchar_t> s;

os << c << "," << s;

with the appropriate conversions happening, defaulting to UTF8 (or to an alternative specified 8-bit encoding)? Of course, I concur that's somewhat of a rhetorical point, but consider the sheer amount of machinery in C++ that's devoted to supporting the wide streams, and as far as I can tell, it's largely unused. Just checking the table of contents, I don't think there's any guidance in The C++ Programming Language (4th Edition) for reading and writing unicode characters with or without the wide character streams, but perhaps reading and writing text in C++ is considered an advanced topic. It just seems that it could have been so much simpler.

And there are practical issues; for instance, does a library writer of unicode text processing software need to support the wide streams? It seems that all functional requirements for text IO can be met with the byte streams.

And finally, most of the text processing software I see, Windows or UNIX, supports unicode but doesn't use wide streams.

Daniel
 
Daniel

Java streams characters as UTF16BE. You can also define a byte stream,
in which case, you have to specify its encoding (or it is specified
The Java InputStream and OutputStream classes read and write bytes; Java streams everything as bytes. The Java Reader and Writer classes adapt the byte streams to characters according to an encoding.

Daniel
 
Bart van Ingen Schenau

In practice the char16_t and char32_t variants in C++11 will probably
turn out to be more useful in future, because they can represent UTF-16
and UTF-32 in a platform independent way. UTF-16 seems to have become
the de facto unicode standard for the web [...]

This is not true. The de-facto standard for unicode in internet standards
that use human-readable headers or content is UTF-8. This includes the
HTTP and HTML standards.
The reason for using UTF-8 there is probably because the ASCII character-
set forms a proper subset of UTF-8, with the same encoding being used for
corresponding characters. And the protocol elements for those text-based
internet standards are/were originally defined in terms of the ASCII
character set.
[...] and of course for anything
using Microsoft products. UTF-8 has more or less become the de facto
unicode standard for unix-like systems.

That is true.

Bart v Ingen Schenau
 
Daniel

Streams are used for locale-dependent string formatting. Connecting them
directly to output files only works if the internal representation of
characters exactly matches the output format. This is the same for narrow
and wide character streams.
I don't understand this sentence. I'm sure it's correct, but I don't understand it. At some point the internal representation of a character needs to be converted into the format of the data that flows on the stream. But that could happen at different places. For example, C++ could have supported (surely?)

std::ostream os;
std::basic_string<wchar_t> s;
float x;

os << s << " " << x << std::endl;

That doesn't preclude further conversion on a non-binary output stream.
Any serious text processing software needs to support files with
different character encodings

Agreed: at the point where bytes are read from and written into the stream,
different encodings need to be supported.
I expect wide streams to be useful only for non-portable simple Windows
programs where both the internal and external text representations are
UTF-16. E.g. things like:

std::wcout << L"PATH=" << _wgetenv(L"PATH") << std::endl;

Here, using std::wcout makes it possible to output Unicode data. By using
narrow std::cout this would not be possible as Windows does not support
UTF-8 locales.

I see, I was overlooking this issue, thanks. All of the files that I work
with on Windows and other platforms are stored as UTF8. All of the software that I've seen that works on them essentially uses binary streams and libraries to handle encoding conversions.
So if there were no std::wcout (nor any other wide char
output function) the program should prepare the text in an internal
buffer as UTF-16 and then copy it out to the terminal in binary mode,
which would not be so nice for some casual text output.

That I have noticed :) Except in the context, I didn't care.

Thanks,
Daniel
 
Öö Tiib

All of the files that I work with on Windows and other platforms
are stored as UTF8. All of the software that I've seen that works
on them essentially uses binary streams and libraries to handle
encoding conversions.

In such a case it is interesting what you need that wchar_t for. I
would simply write into the coding standard that usage of wstring
etc. is forbidden in those projects and that all string literals must
be UTF8 (like u8"\u00F6\U0010FFFF"), and be done.
 
Daniel

Not sure what you mean there. Do you propose the output stream would
contain a mix of wchar_t and char?

No, except see below. The conversion to a specific byte encoding would
take place at the point at which the value was written to the stream.
Or do you mean wchar_t should be automatically converted into char?
Yes

This is not possible unless the stream knows both the encodings used for
wchar_t and char, and they happen to be convertible into each other.
Which is supportable for the unicode encodings.
Yes, in an ideal world we would only have the UTF-32, UTF-16 and UTF-8
encodings, and such seamless conversion would be possible. However, the
real world is still far away from that goal; there are zillions of
encodings, and UTF-8 is not even a possible option everywhere (Windows
console!).
Right, EBCDIC and many others. The way it works in Java is that there is a
platform default encoding, but the actual encoding can be overridden when
reading and writing an array of bytes.

I guess, going back to your first point, Java does support reading and writing
mixed encodings on a byte stream, and for some things you do want that,
for example, when reading and writing EBCDIC files that contain BCD numbers.
But the default would be a single encoding.
It might well be that the C++ stream design is bad and there are far
better alternatives possible.
In its current form it seems it attempts to
do many different things at the same time, with the result that it is not
really good in any of them.

I think (hope) most people would agree that the C++ stream classes are a bit
screwed up, which is probably one of the reasons why 1264 page introductory
C++ books have less to say about reading and writing text than a student
might want :) And why many people who need to read and write text flee the
standard features and head to the libraries. Hopefully it can someday be
revisited, with more prior experience.
But this is a general streams problem, not
directly related to wchar_t.
Appreciate your comments, thanks, just trying to formulate my own thoughts on
this subject, and your feedback is very helpful.

Daniel
 
Nobody

I see the representation of characters and strings as completely
orthogonal to how streaming should work, which traditionally for C has
been streams of bytes.

Note that:

1. C99 also supports wide streams (fwide() etc).

2. A C++ wide stream doesn't necessarily involve conversion to/from bytes.
E.g. a std::wstringstream is backed by a std::wstringbuf and its .str()
method returns a std::wstring.

Also, wide streams allow for the possibility of the OS implementing files
(incl. pipes, etc) which operate in terms of wide characters rather than
bytes.
 
James Kanze


[concerning wstream?]
It was possibly introduced for supporting a major OS whose whole SDK was
defined in UCS-2 (later redeclared to be UTF-16), plus maybe Java also had
some indirect influence.

If this is wstream, it was *not* introduced to support wchar_t
at the system level. Just the opposite: it was introduced to
support localization internally (e.g. where the characters you
want to output as digits aren't present in the single byte
encodings of char). It defines all system level IO in terms of
char, and defines how the wchar_t should be mapped to and from
char (in filebuf, so that you can write char, even though
everything upstream was in wchar_t).

And Java obviously had no influence, since it was designed long
before Java ever existed.
As it happens, today wchar_t is non-portable (size
heavily depending on the platform) so the utility of related concepts like
wstream is also greatly reduced.

char is also non-portable. Perhaps more so than wchar_t, because
with very few exceptions, wchar_t will be some presentation form
of Unicode (UTF-16 or UTF-32), whereas even today, the encoding
in char will often change from one locale to the next.

In my own code, for various reasons, I've adopted the policy of
using char and UTF-8 internally. But this means that I've had
to write *all* of the usual character handling functions to
support this, with new interfaces.
 
Daniel

If this is wstream, it was *not* introduced to support wchar_t
at the system level. Just the opposite: it was introduced to
support localization internally (e.g. where the characters you
want to output as digits aren't present in the single byte
encodings of char). It defines all system level IO in terms of
char, and defines how the wchar_t should be mapped to and from
char (in filebuf, so that you can write char, even though
everything upstream was in wchar_t).
James,

Just for my own understanding, could you clarify where in the wostream composition char first appears? In basic_filebuf<wchar_t>, the signatures of the put methods are still wchar_t, and the buffer variable appears also to be an array of wchar_t.

Thanks,
Daniel
 
Nobody

Just for my own understanding, could you clarify where in the wostream
composition char first appears? In basic_filebuf<wchar_t>, the
signatures of the put methods are still wchar_t, and the buffer variable
appears also to be an array of wchar_t.

std::basic_filebuf::overflow() is defined as converting the contents of
the put area into a char[] using the .out() method of the stream's
locale's codecvt facet, then outputting the result (27.9.1.5p10).

IOW, the internal buffer uses the template character type; conversion to
char is the final step before output.

Note that this is specific to file-based streams. std::wstringstream uses
wchar_t throughout; char doesn't come into it.
 
James Kanze

Just for my own understanding, could you clarify where in the
wostream composition char first appears? In
basic_filebuf<wchar_t>, the signatures of the put methods are
still wchar_t, and the buffer variable appears also to be an
array of wchar_t.

And in §27.9.1.5, it explains how the wchar_t in the buffer are
converted to and from char so that they can be written or read.
(See the semantics of underflow and overflow.)

It's normal that all of the interfaces are wchar_t, since they
are there for the client code to use. The filebuf buffers the
characters as <charT>; when the buffer overflows, it converts
these to char (using the codecvt facet of the imbued locale) in
order to write them, and when underflow occurs, it reads char,
then converts them to wchar_t (again, using the codecvt facet of
the imbued locale).

The same thing holds, sort of, for wide character streams in C.
(Sort of, because there is only one type of FILE*, you use
different functions for wide character IO and narrow character
IO, and the code conversion depends on the global locale, since
there are no imbued locales.)
 
