std iostreams design question, why not like java stream wrappers?



Joshua Maurice

I've always found the C++ std iostreams interface to be convoluted,
not well documented, and most non-standard uses to be far out of the
reach of novice C++ programmers, and dare I say most competent C++
programmers. (When's the last time you've done anything with facets,
locales, etc.?)

I've never been a fan of parallel duplicate class hierarchies. It's a
huge design smell to me. The C++ standard streams have this design
smell. They have ifstream and the associated file streambuf,
stringstream its associated string streambuf, etc. The design smell
tends to indicate duplicate code and an overly complex set of classes.
If each class appears as a pair, why have two separate hierarchies?

Also, I've always liked C++ templates as compile time polymorphism. I
would think it natural to do something like create a stream which
writes to a file, then put a buffering wrapper over that, then put a
formatter over it to change '\n' to the system's native line ending,
then put a formatter over that whose constructor takes an encoding
(e.g. UTF-8, ASCII) and whose operator<< functions take your Unicode
string and convert it to the encoding passed in the constructor.
The current std streams allow you to do this (sort of), but it's much
more complicated than what it needs to be, and it's done with runtime
polymorphism, not compile-time polymorphism of templates, so
potentially much slower.

Also, the std streams internationalization support is at best
pisspoor. The existence of locales and their meanings are
implementation defined. One cannot rely upon any of the C++ standard
locale + facet stuff for a portable program. It's also entirely
convoluted and complex, and doesn't support simple things like
changing from one encoding to another. Now, in the standard
committee's defense, internationalization is hard (tm). However, I
wish they had not tried at all rather than clutter up a good standard
library with nearly useless features like locales and facets. Also,
seriously, wchar_t's size is implementation defined? Why even bother?
The new fixed-size character types for UTF-8, UTF-16, and UTF-32 are
a step in the right direction, but from what little I understand they
still come up quite short. (Ex: std::string still will not support
variable-width encodings like UTF-8 and UTF-16, yet it will bear the
names UTF-8 and UTF-16, a huge misnomer and disservice.)

As a start, I would suggest the following code. Now, I admit that I'm
not the most familiar with the C++ std streams, but I think I know
enough to make the criticisms in this post. I'm just wondering: is
there anything the current C++ std streams support that something
like the code below could not? And any other comments on the
viability and usability of the interface and implementation below?


#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>   /*memcpy*/

#include <algorithm>  /*std::min, std::swap*/
#include <new>        /*placement new*/
#include <string>

namespace jjm
{
#ifdef WIN32
#pragma warning (disable : 4800)
#endif

/*
Three flags: eofbit, failbit, badbit.
eofbit is set whenever end of file (or stream, std::string, etc.)
    is encountered.
failbit is set whenever the operation could not complete successfully:
    ex: failure to open a file
    ex: getline called with no characters left in the stream
    ex: not set when getline is called with characters left in the
        stream, even if the delimiting character is not found
badbit is set when some underlying system call failed for an
    "uncommon", "unexpected", "not normal course of operations" reason:
    ex: not because of end of file, nor file doesn't exist, etc.

Some streams are wrappers over other streams. When a stream X has been
wrapped by a stream Y, do not use stream X again for reading or
writing, as stream Y may have read ahead from X and cached it, or Y
may be caching data to write to X and not written it yet.
*/


/*DO NOT USE THE INTEGER VALUES DIRECTLY!
USE THE ENUM NAMES ONLY!*/
enum { eofbit = 1, failbit = 2, badbit = 4 };


/*To facilitate a safer conversion in an if / for / while / etc.
condition.
DO NOT USE THESE NAMES AT ALL (outside of this implementation)!*/
struct condition_conversion_class
    { void condition_conversion_function() {} };
typedef void (condition_conversion_class::*condition_conversion)();


/*each istream should have these verbatim*/
#define JJM_ISTREAM_COMMON_FUNCTIONS \
    operator condition_conversion () const \
    { return (rdstate() & (failbit | badbit)) \
            ? 0 \
            : &condition_conversion_class \
                    ::condition_conversion_function; \
    } \
    bool operator! () const \
    { return (rdstate() & (failbit | badbit)) ? true : false; } \
    bool good() const { return ! rdstate(); } \
    bool eof() const  { return rdstate() & eofbit; } \
    bool fail() const { return rdstate() & failbit; } \
    bool bad() const  { return rdstate() & badbit; } \
    void setstate(unsigned char or_state) { clear(rdstate() | or_state); }


class istream_wrapper
{
private:
    template <bool b> struct static_assert_struct_template {};
#define JJM_STATIC_ASSERT(cond) \
    typedef static_assert_struct_template<true> static_assert_typedef; \
    typedef static_assert_struct_template<cond> static_assert_typedef;

    class wrapper_base
    {
    public:
        wrapper_base(void* stream_) : stream(stream_) {}
        virtual ~wrapper_base() {}
        virtual int get() = 0;
        virtual size_t read(char* s, size_t const n) = 0;
        virtual void clear(unsigned char new_state = 0) = 0;
        virtual unsigned char rdstate() const = 0;
        virtual void cloneInPlace(void* ptr) const = 0;
        void swap(wrapper_base& x) { std::swap(stream, x.stream); }
    protected:
        wrapper_base() {}
        void* stream; /*declared in base class to do sizeof(wrapper_base)
                        later in code*/
    private:
        wrapper_base(wrapper_base const& ); /*not defined, not copyable*/
        wrapper_base& operator= (wrapper_base const& ); /*not defined,
                        not copyable*/
    };

    template <typename stream_t>
    class wrapper_template : public wrapper_base
    {
    public:
        wrapper_template(stream_t* stream_) : wrapper_base(stream_)
        { JJM_STATIC_ASSERT(sizeof(wrapper_base)
                == sizeof(wrapper_template<stream_t>)); }
        virtual ~wrapper_template() {}
        virtual int get()
        { return static_cast<stream_t*>(stream)->get(); }
        virtual size_t read(char* s, size_t const n)
        { return static_cast<stream_t*>(stream)->read(s, n); }
        virtual void clear(unsigned char new_state = 0)
        { static_cast<stream_t*>(stream)->clear(new_state); }
        virtual unsigned char rdstate() const
        { return static_cast<stream_t*>(stream)->rdstate(); }
        virtual void cloneInPlace(void* ptr) const
        { new (ptr) wrapper_template<stream_t>(
                static_cast<stream_t*>(stream)); }
    private:
        wrapper_template(wrapper_template const& ); /*not defined,
                        not copyable*/
        wrapper_template& operator= (wrapper_template const& ); /*not
                        defined, not copyable*/
    };

    template <typename stream_t>
    class wrapper_owning_template : public wrapper_template<stream_t>
    {
    public:
        wrapper_owning_template(stream_t* stream)
            : wrapper_template<stream_t>(stream)
        { JJM_STATIC_ASSERT(sizeof(wrapper_base)
                == sizeof(wrapper_owning_template<stream_t>)); }
        virtual ~wrapper_owning_template()
        { delete static_cast<stream_t*>(this->stream); }
        virtual void cloneInPlace(void* ptr) const
        { new (ptr) wrapper_owning_template<stream_t>(
                static_cast<stream_t*>(this->stream)); }
    };
#undef JJM_STATIC_ASSERT

public:
    /*This ctor wraps the argument stream, making it usable through
    the generic type istream_wrapper, aka runtime polymorphism.
    It does not take ownership of its argument stream.*/
    template <typename stream_t>
    istream_wrapper(stream_t& stream)
    {   /*Yes, I know this is hack-ish and probably not standard.
        I'm attempting to improve locality by keeping the contained
        class inside this object's memory as opposed to in a separate
        piece of separately allocated memory. Asserting that the
        sizes make sense in wrapper_template's ctor.*/
        new (impl) wrapper_template<stream_t>(&stream);
    }

    /*This ctor wraps the argument stream, making it usable through
    the generic type istream_wrapper, aka runtime polymorphism.
    It takes ownership of its argument stream.*/
    struct take_ownership {};
    template <typename stream_t>
    istream_wrapper(stream_t* stream,
            take_ownership /*tag argument to pick this constructor*/)
    {   /*Same locality hack as above. Asserting that the sizes make
        sense in wrapper_owning_template's ctor.*/
        new (impl) wrapper_owning_template<stream_t>(stream);
    }

    /*This is not equivalent to the template constructor.
    Instead, it merely copies the underlying stream pointer, making
    it an equivalent copy.*/
    istream_wrapper(istream_wrapper const& x)
    { x.getimpl()->cloneInPlace(impl); }

    istream_wrapper& operator= (istream_wrapper x)
    { swap(x); return *this; }

    void swap(istream_wrapper& x) { getimpl()->swap(*x.getimpl()); }

    ~istream_wrapper() { getimpl()->~wrapper_base(); }

    /*will set eofbit if no characters left in stream
    will set failbit if no characters left in stream
    on EOF will return EOF*/
    int get() { return getimpl()->get(); }

    /*will set eofbit if no characters left in stream
    will set failbit if no characters read into s
    will not set failbit if some characters are read into s
    will return number of characters read into s*/
    size_t read(char* s, size_t const n) { return getimpl()->read(s, n); }

    /*more accurately: this sets the stream's state*/
    void clear(unsigned char new_state = 0) { getimpl()->clear(new_state); }

    /*more accurately: this gets the stream's state*/
    unsigned char rdstate() const { return getimpl()->rdstate(); }

    JJM_ISTREAM_COMMON_FUNCTIONS
private:
    char impl[sizeof(wrapper_base)];
    wrapper_base* getimpl()
    { return reinterpret_cast<wrapper_base*>(impl); }
    wrapper_base const* getimpl() const
    { return reinterpret_cast<wrapper_base const*>(impl); }
};
void swap(istream_wrapper& x, istream_wrapper& y) { x.swap(y); }


#ifdef WIN32
#pragma warning(push)
#pragma warning(disable : 4996)
#endif
class ifstream
{
public:
    ifstream() : file(0), state(0) {}
    ifstream(char const* filename, const char* mode = "r")
        : file(0), state(0) { open(filename, mode); }

    ~ifstream() { if (is_open()) close(); }

    bool is_open() const { return 0 != file; }
    void open(char const* filename, const char* mode = "r")
    {   if (is_open())
            setstate(failbit);
        if (!*this)
            return;
        file = fopen(filename, mode);
        if (!file)
            setstate(failbit);
    }
    void close()
    {   if (!is_open())
        {   setstate(failbit);
            return;
        }
        if (0 != fclose(file)) /*fclose returns 0 on success*/
            setstate(badbit);
        file = 0;
    }
    int get()
    {   char c = EOF;
        read(&c, 1);
        return c;
    }
    size_t read(char* s, size_t n)
    {   if (!is_open() || !*this)
        {   setstate(failbit);
            return 0;
        }
        if (0 == n)
            return 0;
        size_t const bytes_read = fread(s, 1, n, file);
        if (bytes_read != n)
        {   if (feof(file))
                setstate(eofbit | failbit);
            else
                setstate(badbit);
        }
        if (0 == bytes_read)
            setstate(failbit);
        return bytes_read;
    }

    void clear(unsigned char new_state = 0) { state = new_state; }
    unsigned char rdstate() const { return state; }

    JJM_ISTREAM_COMMON_FUNCTIONS

private:
    FILE* file;
    unsigned char state;

private:
    ifstream(ifstream const& ); /*not defined, not copyable*/
    ifstream& operator= (ifstream const& ); /*not defined, not copyable*/
};
#ifdef WIN32
#pragma warning(pop)
#endif


template <typename stream_t, size_t buf_size = 1024>
class buffered_istream
{
public:
    buffered_istream(stream_t& stream_)
        : stream(stream_), start(buf), end(buf), state(0) {}

    int get()
    {   if (!*this)
        {   setstate(failbit);
            return EOF;
        }
        if (start == end)
        {   size_t const x = stream.read(buf, buf_size);
            setstate(stream.rdstate() & badbit);
            if (bad())
                return EOF;
            if (x == 0)
                setstate(eofbit);
            start = buf;
            end = buf + x;
        }
        if (eof())
        {   setstate(failbit);
            return EOF;
        }
        char const c = *start;
        ++start;
        return c;
    }
    size_t read(char* s, size_t const n)
    {   if (!*this)
        {   setstate(failbit);
            return 0;
        }
        if (0 == n)
            return 0;
        size_t bytes_read_into_user_buf = 0;
        for (char const* caller_buf_end = s + n;
                good() && s != caller_buf_end; )
        {   if (start == end)
            {   size_t const x = stream.read(buf, buf_size);
                setstate(stream.rdstate() & badbit);
                if (bad())
                    break;
                if (x == 0)
                    setstate(eofbit);
                start = buf;
                end = buf + x;
            }
            size_t const y = std::min(end - start, caller_buf_end - s);
            memcpy(s, start, y); /*copy from the current read position*/
            s += y;
            start += y;
            bytes_read_into_user_buf += y;
        }
        if (0 == bytes_read_into_user_buf)
            setstate(failbit);
        return bytes_read_into_user_buf;
    }

    void clear(unsigned char new_state = 0) { state = new_state; }
    unsigned char rdstate() const { return state; }

    JJM_ISTREAM_COMMON_FUNCTIONS

private:
    stream_t& stream;
    char buf[buf_size];
    char* start;
    char* end;
    unsigned char state;

private:
    buffered_istream(buffered_istream const& ); /*not defined,
                    not copyable*/
    buffered_istream& operator= (buffered_istream const& ); /*not
                    defined, not copyable*/
};


template <typename stream_t>
class formatter_istream
{
public:
    formatter_istream(stream_t& stream_)
        : stream(stream_), state(0), next_char(EOF) {}
    int get()
    {   if (!*this)
        {   setstate(failbit);
            return EOF;
        }
        if (EOF == next_char)
            next_char = stream.get();
        if (!stream)
        {   setstate(stream.rdstate());
            return EOF;
        }
        int x = next_char;
        next_char = stream.get();
        if ('\r' == x && '\n' == next_char)
        {   next_char = EOF;
            return '\n';
        }
        if ('\r' == x)
            return '\n';
        return x;
    }
    size_t read(char* s, size_t const n)
    {   if (0 == n)
            return 0;
        size_t bytes_read_into_s = 0;
        for ( ; bytes_read_into_s < n; )
        {   int x = get();
            if (!*this)
                break;
            *s = static_cast<char>(x);
            ++s;
            ++bytes_read_into_s;
        }
        if (0 == bytes_read_into_s)
            setstate(failbit);
        return bytes_read_into_s;
    }

    void clear(unsigned char new_state = 0) { state = new_state; }
    unsigned char rdstate() const { return state; }

    JJM_ISTREAM_COMMON_FUNCTIONS

private:
    stream_t& stream;
    unsigned char state;
    int next_char;

private:
    formatter_istream(formatter_istream const& ); /*not defined,
                    not copyable*/
    formatter_istream& operator= (formatter_istream const& ); /*not
                    defined, not copyable*/
};


template <typename stream_t>
stream_t& getline(stream_t& stream, std::string& str,
        char const delim = '\n')
{   str.clear();
    for (;;)
    {   int const c = stream.get();
        if (((failbit | eofbit) == stream.rdstate()) && str.size())
        {   stream.clear(eofbit);
            return stream;
        }
        if (!stream)
            return stream;
        if (c == delim)
            return stream;
        str.push_back(static_cast<char>(c));
    }
}


/*meant to mimic std::ifstream in not-binary (text) mode*/
class ifbfstream
    : private ifstream,
      private buffered_istream<ifstream>,
      private formatter_istream<buffered_istream<ifstream> >
{
public:
#ifdef WIN32
#pragma warning(push)
#pragma warning(disable : 4355)
#endif
    ifbfstream()
        : buffered_istream<ifstream>(static_cast<ifstream&>(*this)),
          formatter_istream<buffered_istream<ifstream> >(
              static_cast<buffered_istream<ifstream>&>(*this))
    {}
    ifbfstream(char const* filename, const char* mode = "r")
        : ifstream(filename, mode),
          buffered_istream<ifstream>(static_cast<ifstream&>(*this)),
          formatter_istream<buffered_istream<ifstream> >(
              static_cast<buffered_istream<ifstream>&>(*this))
    {}
#ifdef WIN32
#pragma warning(pop)
#endif
    using ifstream::is_open;
    using ifstream::open;
    using ifstream::close;
    using formatter_istream<buffered_istream<ifstream> >::get;
    using formatter_istream<buffered_istream<ifstream> >::read;
    using formatter_istream<buffered_istream<ifstream> >::clear;
    using formatter_istream<buffered_istream<ifstream> >::rdstate;

    JJM_ISTREAM_COMMON_FUNCTIONS
private:
    ifbfstream(ifbfstream const& ); /*not defined, not copyable*/
    ifbfstream& operator= (ifbfstream const& ); /*not defined,
                    not copyable*/
};
#undef JJM_ISTREAM_COMMON_FUNCTIONS
}


#include <iostream>
#include <fstream>
#include <sstream>
#include <string>

template <typename stream_t>
void print_compiletime_polymorphism(stream_t& stream)
{ std::cout << "---- " << __FILE__ << " " << __LINE__ << std::endl;
for (std::string line; getline(stream, line); )
std::cout << "X" << line << std::endl;
}

void print_runtime_polymorphism(jjm::istream_wrapper stream)
{ std::cout << "---- " << __FILE__ << " " << __LINE__ << std::endl;
for (std::string line; getline(stream, line); )
std::cout << "X" << line << std::endl;
}


int main()
{
{
std::ofstream fout("foo.txt");
fout << "a\n\nb\n";
}

/*Each of these should print out the same thing.
Just a test to make sure it's all working right.*/

    { jjm::ifstream fin("foo.txt");
        print_compiletime_polymorphism(fin);
    }
    { jjm::ifstream fin("foo.txt");
        print_runtime_polymorphism(fin);
    }

    { jjm::ifstream fin("foo.txt");
        jjm::buffered_istream<jjm::ifstream> buffered_fin(fin);
        print_compiletime_polymorphism(buffered_fin);
    }
    { jjm::ifstream fin("foo.txt");
        jjm::buffered_istream<jjm::ifstream> buffered_fin(fin);
        print_runtime_polymorphism(buffered_fin);
    }

    { jjm::ifstream fin("foo.txt");
        jjm::buffered_istream<jjm::ifstream> buffered_fin(fin);
        jjm::buffered_istream<jjm::buffered_istream<jjm::ifstream> >
                buffered_fin_2(buffered_fin);
        print_compiletime_polymorphism(buffered_fin_2);
    }
    { jjm::ifstream fin("foo.txt");
        jjm::buffered_istream<jjm::ifstream> buffered_fin(fin);
        jjm::buffered_istream<jjm::buffered_istream<jjm::ifstream> >
                buffered_fin_2(buffered_fin);
        print_runtime_polymorphism(buffered_fin_2);
    }

    { jjm::ifstream fin("foo.txt");
        jjm::istream_wrapper opaque_istream(fin);
        jjm::buffered_istream<jjm::istream_wrapper>
                buffered_fin(opaque_istream);
        print_compiletime_polymorphism(buffered_fin);
    }
    { jjm::ifstream fin("foo.txt");
        jjm::istream_wrapper opaque_istream(fin);
        jjm::buffered_istream<jjm::istream_wrapper>
                buffered_fin(opaque_istream);
        print_runtime_polymorphism(buffered_fin);
    }

    { jjm::ifstream fin("foo.txt");
        jjm::buffered_istream<jjm::ifstream> buffered_fin(fin);
        jjm::formatter_istream<jjm::buffered_istream<jjm::ifstream> >
                formatting_buffered_fin(buffered_fin);
        print_compiletime_polymorphism(formatting_buffered_fin);
    }
    { jjm::ifstream fin("foo.txt");
        jjm::buffered_istream<jjm::ifstream> buffered_fin(fin);
        jjm::formatter_istream<jjm::buffered_istream<jjm::ifstream> >
                formatting_buffered_fin(buffered_fin);
        print_runtime_polymorphism(formatting_buffered_fin);
    }

    { jjm::ifbfstream fin("foo.txt");
        print_compiletime_polymorphism(fin);
    }
    { jjm::ifbfstream fin("foo.txt");
        print_runtime_polymorphism(fin);
    }
    std::cout << "----" << std::endl;
}
 

Jerry Coffin

I've always found the C++ std iostreams interface to be convoluted,
not well documented, and most non-standard uses to be far out of the
reach of novice C++ programmers, and dare I say most competent C++
programmers. (When's the last time you've done anything with facets,
locales, etc.?)

Last week, though I realize I'm somewhat unusual in that respect.
I've never been a fan of parallel duplicate class hierarchies. It's a
huge design smell to me. The C++ standard streams have this design
smell. They have ifstream and the associated file streambuf,
stringstream its associated string streambuf, etc. The design smell
tends to indicate duplicate code and an overly complex set of classes.
If each class appears as a pair, why have two separate hierarchies?

In this case, there's very little (if any) duplicated code. For the
most part, what you have is a set of stream buffer classes, and a
single iostream class (which is, itself, derived from istream,
ostream, ios_base, and so on -- but we probably don't need to get
into that part at the moment).

The other iostream classes (stringstreams, fstreams, etc.) are just
puny adapter classes. They add a function or two to the iostream to
let you easily pass suitable parameters from the client code to the
underlying buffer class without going through creating a stream
buffer in one step, and then attaching it to a formatter in a
separate step.
Also, I've always liked C++ templates as compile time polymorphism. I
would think it natural to do something like create a stream which
writes to a file, then put a buffering wrapper over that, then put a
formatter over it to change '\n' to the system's native line ending,
then put a formatter over that whose constructor takes an encoding
(eg: UTF 8, ASCII, etc.) and whose operator<< functions take your
unicode string and converts it the encoding passed in the constructor.
The current std streams allow you to do this (sort of), but it's much
more complicated than what it needs to be, and it's done with runtime
polymorphism, not compile-time polymorphism of templates, so
potentially much slower.

That's actually fairly similar to how iostreams really do work. There
are a _few_ minor differences, but most of them really are fairly
minor.

First, the designers (mostly rightly, IMO) didn't bother with having
two classes, one for unbuffered and another for buffered access to a
stream. Unbuffered access to a stream just isn't common enough to
justify a separate class for this purpose (at least IME).

Formatting is divided between a locale and an iostream. A locale
contains all the details about things like how to format numbers
(including what characters to use for digit grouping and such). The
iostream mostly keeps track of flags (e.g. set by manipulators) to
decide what locale to use, and how to use it.
Also, the std streams internationalization support is at best
pisspoor. The existence of locales and their meanings are
implementation defined. One cannot rely upon any of the C++ standard
locale + facet stuff for a portable program.

Yes and no. The only piece that's implementation defined is exactly
what locales will exist (and what name will be given to each).

An awful lot of programs can get by quite nicely with just using
whatever locale the user wants, and C++ makes it pretty easy to
access that one -- an empty name gets it for you.
It's also entirely
convoluted and complex, and doesn't support simple things like
changing from one encoding to another.

I beg your pardon?
Now, in the standard
committee's defense, internationalization is hard (tm). However, I
wish they did not try at all rather than clutter up a good standard
library with nearly useless features like locales and facets. Also,
seriously, wchar_t's size is implementation defined? Why even bother?

At the time, the ISO 10646 camp figured wide characters required 32
bits. The Unicode camp still thought UCS-2 would do the job.
Eventually Unicode decided 32 bits was really necessary too, but a
number of major vendors were still thinking in terms of UCS-2 at the
time.

At least they didn't do like Java and decree that wide characters
were, and would always remain, 16 bits. A C++ implementation can get
things right or wrong, but a Java implementation is stuck with being
wrong.

Don't get me wrong: I'm not trying to defend iostreams as being the
perfect design, or anything like that -- but it seems to me that the
design is a bit better than you're portraying it.
 

Alf P. Steinbach

* Joshua Maurice:
I've never been a fan of parallel duplicate class hierarchies. It's a
huge design smell to me.

It depends. If one hierarchy is just very similar to another then you have a
smell. If one hierarchy is wrapping another hierarchy then you have a wrapper.

And there is this thing with wrapping an existing class hierarchy of
concrete classes: if you want C++ RAII and type safety, with
everything Ready To Use after successful construction, then using two
parallel hierarchies for the wrapping is a practical technique, and
then you have *three* parallel hierarchies (the original and the two
used for wrapping).

For example, off the cuff,


struct OriginalBase { OriginalBase() {} };
struct OriginalDerived : OriginalBase { OriginalDerived( int ) {} };

class WrapperBase
{
private:
OriginalBase* myOriginal;
protected:
OriginalBase& original() { return *myOriginal; }

struct Factory
{
virtual ~Factory() {}
virtual OriginalBase* newOriginal() const
{
return new OriginalBase();
}
};

WrapperBase( Factory const& factory )
: myOriginal( factory.newOriginal() )
{}
public:
WrapperBase()
: myOriginal( Factory().newOriginal() )
{}
};

class WrapperDerived
: public WrapperBase
{
typedef WrapperBase Base;
protected:
OriginalDerived& original()
{
return static_cast<OriginalDerived&>( Base::original() );
}

struct Factory: Base::Factory
{
int myInt;
Factory( int const anInt ): myInt( anInt ) {}
virtual OriginalDerived* newOriginal() const
{
return new OriginalDerived( myInt );
}
};

WrapperDerived( Factory const& aFactory )
: Base( aFactory )
{}

public:
WrapperDerived( int const anInt )
: Base( Factory( anInt ) )
{}
};


It's somewhere in the FAQ (with a title like "what your mother never
told you" :) ). When I asked Marshall to put it there, it was in the
context of supporting derived-class-specific initialization actions
for base class construction, which this technique also does. E.g.,
renaming a little bit: when the most derived class is WrapperButton,
the Factory above might produce an OriginalButton for the top base
class WrapperWidget, which is fully constructed by any of its
constructors.


Cheers & hth.,

- Alf
 

Joshua Maurice

Last week, though I realize I'm somewhat unusual in that respect.

Ok. Also, do you use wstream in any real code? Or wstring?
In this case, there's very little (if any) duplicated code. For the
most part, what you have is a set of stream buffer classes, and a
single iostream class (which is, itself, derived from istream,
ostream, ios_base, and so on -- but we probably don't need to get
into that part at the moment).

The other iostream classes (stringstreams, fstreams, etc.) are just
puny adapter classes. They add a function or two to the iostream to
let you easily pass suitable parameters from the client code to the
underlying buffer class without going through creating a stream
buffer in one step, and then attaching it to a formatter in a
separate step.

Meh. True.
That's actually fairly similar to how iostreams really do work. There
are a _few_ minor differences, but most of them really are fairly
minor.

First, the designers (mostly rightly, IMO) didn't bother with having
two classes, one for unbuffered and another for buffered access to a
stream. Unbuffered access to a stream just isn't common enough to
justify a separate class for this purpose (at least IME).

I can live with this mostly. The expensive sinks or sources are
automatically buffered. Makes sense.
Formatting is divided between a locale and an iostream. A locale
contains all the details about things like how to format numbers
(including what characters to use for digit grouping and such). The
iostream mostly keeps track of flags (e.g. set by manipulators) to
decide what locale to use, and how to use it.

But only very basic formatting, and only formatting which the
standard library's authors thought useful. You can't add another kind
of formatting; you can only tweak the existing formatting rules.
Yes and no. The only piece that's implementation defined is exactly
what locales will exist (and what name will be given to each).

So, basically entirely implementation defined, and as there is no
particular standard in use beyond the C++ standard, they're basically
worthless for portable code.
An awful lot of programs can get by quite nicely with just using
whatever locale the user wants, and C++ makes it pretty easy to
access that one -- an empty name gets it for you.

Only if you enjoy serving English speakers. The iostream library is
woefully insufficient on its own for anything but that. When you start
making products in which English is not the language of choice,
iostreams become little more than binary mode byte streams.
I beg your pardon?

I'm still not exactly clear on what a facet is, and how all of that
stuff works. Given what it does, IMHO my class design shown above
seems far simpler to understand. Instead of every stream having an
attached formatter which can only format things built into the
interface, you can make a stream wrapper which can format anything
you want.
At the time, the ISO 10646 camp figured wide characters required 32
bits. The Unicode camp still thought UCS-2 would do the job.
Eventually Unicode decided 32 bits was really necessary too, but a
number of major vendors were still thinking in terms of UCS-2 at the
time.

At least they didn't do like Java and decree that wide characters
were, and would always remain, 16 bits. A C++ implementation can get
things right or wrong, but a Java implementation is stuck with being
wrong.

No. Java may have gotten it "wrong" quote unquote, but a wrong answer
now is infinitely better than "implementation defined". How many real
successful Java programs are there out there making use of built-in
support of Unicode? Lots. How many real successful C++ programs are
there out there making use of wstring and wiostream? I would assume
basically none.
Don't get me wrong: I'm not trying to defend iostreams as being the
perfect design, or anything like that -- but it seems to me that the
design is a bit better than you're portraying it.

This is what I want:

1- A cleaner and simpler interface than iostreams wrapping streambuf,
and than the relatively nasty interface of a streambuf itself. When
some standard library implementations of iostreams are 100x slower
than their printf counterparts, I'd say that yes, it's pretty
complex.

I'd also say that it's convoluted given that it doesn't really solve
any problems. Sure, it correctly handles the systems end of line, and
it correctly uses the right code points for converting an integer to
string representation for the locale, and cool stuff like a comma or a
period for the thousand separators and "1 to tenths" separator.
However, the entire iostream library is overkill if those are the only
problems it solves, hence convoluted.

2- No virtual overhead if it's not required. I should not have to go
through a virtual function to write to a file or to a string with a
stringstream. I'd like to be able to use it in performance critical
aspects of code without invoking the holy war of printf vs cout. Also
see point 1 for why it might be slow.

3- Actual support for changing between encodings like UTF 16 and UTF
32. Ex:
raw_ofstream out("some_file");
buffered_ostream<raw_ofstream> out_2(out);
out_encoder<buffered_ostream<raw_ofstream>> out_3(out_2, "UTF-16");

//you have some data in utf8 format, like in a utf8string
utf8string str;
out_3 << str;

//or perhaps some raw data which you know the encoding
//(Yes, a string literal may not be ASCII. I know.)
out_3.writeString("foo", "ASCII");
Or more likely, a helper class would exist:
ofstream_with_encoding /*for want of a better name*/ out
("some_file", "ASCII");
utf8str str;
out << str;

utf16str str2;
out << str2;
with whatever failure mechanism by default if it can't do the encoding
translation, be that by exception or setting the fail bit, and of
course it would be configurable on each iostream like setw.
 

James Kanze

Last week, though I realize I'm somewhat unusual in that respect.

It's interesting that he complains of iostream, and then cites
facets. The base design of iostream (pre-standard, if you
prefer) is actually quite clean (although with lousy naming
conventions and very limited error handling); the facets stuff
ranks as some of the most poorly designed software I've seen,
however, and the way it is integrated into iostream is pretty
bad.
In this case, there's very little (if any) duplicated code.
For the most part, what you have is a set of stream buffer
classes, and a single iostream class (which is, itself,
derived from istream, ostream, ios_base, and so on -- but we
probably don't need to get into that part at the moment).
The other iostream classes (stringstreams, fstreams, etc.) are
just puny adapter classes. They add a function or two to the
iostream to let you easily pass suitable parameters from the
client code to the underlying buffer class without going
through creating a stream buffer in one step, and then
attaching it to a formatter in a separate step.

The important point is that they are just convenience classes;
they make the most frequent cases easier. The separation of
formatting from data sink/source is IMHO an essential concept,
however, and any modern IO design must recognize this.
That's actually fairly similar to how iostreams really do
work. There are a _few_ minor differences, but most of them
really are fairly minor.
First, the designers (mostly rightly, IMO) didn't bother with
having two classes, one for unbuffered and another for
buffered access to a stream. Unbuffered access to a stream
just isn't common enough to justify a separate class for this
purpose (at least IME).

I'd guess that 99% of the streambufs I write use unbuffered
access. About the only time you want buffering is when you're
going to an external source or sink, like a file (filebuf, or a
custom memorybuf or socketbuf).

The reason it's a single class is far more fundamental. Back in
the 1980's, when the concept was being developed, actually
calling a virtual function for each character really was too
expensive in runtime to be acceptable; the public interface to
streambuf is typically implemented as inline functions, and the
virtual call only occurs when there is nothing in the buffer.
Today, given modern machines, I think I'd separate the two, as
in Java. But it's not a big deal; you can more or less ignore
the buffering for output, and some sort of buffering is always
necessary for input anyway, at least if you want to provide a
peek function (named sgetc in streambuf---as I said the naming
conventions were horrible).
Formatting is divided between a locale and an iostream. A
locale contains all the details about things like how to
format numbers (including what characters to use for digit
grouping and such). The iostream mostly keeps track of flags
(e.g. set by manipulators) to decide what locale to use, and
how to use it.

The fact that some of the flags are permanent, and others not,
can cause some confusion. More generally, one would like some
means of "scoping" use of formatting options, but I can't think
of a good solution. (In the meantime, explicit RAII isn't that
difficult.)
Yes and no. The only piece that's implementation defined is
exactly what locales will exist (and what name will be given
to each).
An awful lot of programs can get by quite nicely with just
using whatever locale the user wants, and C++ makes it pretty
easy to access that one -- an empty name gets it for you.

Yes and no. It's not an iostream problem, but I use UTF-8
internally, and I've had to implement all of my own isalpha,
etc. This really belongs in the standard. (Back when C was
defined, limiting support to single byte encodings was quite
comprehensible. But even in the 1990's, it was clear that
functions like toupper couldn't provide a character to character
mapping, and multibyte encodings were common.)
I beg your pardon?

The issue is complex, and he's at least partially right. But
part of the problem is inherent---logically, the encoding is a
separate issue from the locale (which is concerned with things
like whether the decimal is a dot or a comma), but practically,
at least with single byte encodings, things like toupper or
isdigit depend on both. If you're dealing with a stream, the
solution I use is to imbue the stream itself with the locale
you're interested in, then (the order is important) to imbue the
streambuf with the correct locale for the encoding---this is
especially true if the encoding isn't known until part of the
file has been read (the usual case, in my experience). I find
this more logical than creating a new locale on the fly (although
in principle, that should also work).
At the time, the ISO 10646 camp figured wide characters
required 32 bits. The Unicode camp still thought UCS-2 would
do the job. Eventually Unicode decided 32 bits was really
necessary too, but a number of major vendors were still
thinking in terms of UCS-2 at the time.
At least they didn't do like Java and decree that wide
characters were, and would always remain, 16 bits. A C++
implementation can get things right or wrong, but a Java
implementation is stuck with being wrong.

:). In practice, there's nothing "wrong" with the Java
solution (UTF-16). Nor with either of the two widespread C++
solutions. Or with my solution of using UTF-8 and char. What
might be considered "wrong" is imposing one, and only one, on
all code. But I don't know of any system which offers both
UTF-16 and UTF-32; Java imposes one "right" solution, whereas
C++ allows the implementation to choose (guess?) which solution
is right for its customers.

Of course, the real reason why C++ is so open with regards to
what an implementation can do with wchar_t is because C is. And
the reason C is so open is because when C was being normalized,
no one really knew what encodings would end up being the most
wide spread; Unicode hadn't really become the standard back
then.
 

James Kanze

Ok. Also, do you use wstream in any real code? Or wstring?

I don't, because I generally use UTF-8, and have my own
libraries, etc., for doing so. But asking around, I'd say that
wstring, wistream, etc. are fairly widespread---at least as
frequently as their narrow character equivalents.

And applications which don't use locale are really the
exception, although the use is generally limited to
std::locale::global( std::locale( "" ) ) ;
at the start of main. Unless the application doesn't do any
text handling.
But only very basic formatting, and only formatting which the
standard library thought useful. You add another kind of
formatting, just tweak existing formatting rules.

One of the most important principles of iostream is that the
formatting (and the sink/sources) should be user extendable.
It's easy to add additional formatting options---just overload
operator<< and operator>> for your own types.
So, basically entirely implementation defined, and as there is
no particular standard in use beyond the C++ standard, they're
basically worthless for portable code.

The most important one is "", which is portable. That handles
all of the real "locale" stuff. The only real problem I've
experienced has been encoding.
Only if you enjoy serving English speakers.

We use it all the time for French, with no problem. Under Unix,
locale( "" ) means pick up the correct locale from the user's
environment variables.
The iostream library is woefully insufficient on its own for
anything but that. When you start making products in which
English is not the language of choice, iostreams become little
more than binary mode byte streams.

Having written extensive projects in both France and Germany,
using iostream for all our input and output, I can definitely
say that that's false. About the only place I've found the
standard deficient here in terms of provided functionality is in
formatting complex (where it imposes a comma between the real
and the imaginary part).

[...]
No. Java may have gotten it "wrong" quote unquote, but a wrong
answer now is infinitely better than "implementation defined".
How many real successful Java programs are there out there
making use of built-in support of Unicode? Lots. How many real
successful C++ programs are there out there making use of
wstring and wiostream? I would assume basically none.

I don't know. Their use seems rather common here.
This is what I want:
1- A cleaner and simpler interface than iostreams wrapping
streambuf, and the relatively nasty interface of a streambuf
itself. When some standard library implementations are 100x
slower than its printf library, I'd say that yea, it's pretty
complex.

With the exception of the encoding issues, the only problem I
see with the streambuf interface is the naming. The encoding
issues are poorly designed; what is needed is a filtering
streambuf which reads from a byte oriented streambuf, and
presents the interface of a wchar_t streambuf. (That's also how
I implement things in my own code... except that my filtering
streambuf still returns char, but guarantees legal UTF-8.)

As for the speed... people have been optimizing printf for 30
years now; when printf appeared, it was very important. The
best iostream implementations (e.g. the one by Dietmar Kühl)
beat printf/scanf in speed; if they haven't been widely adopted
in commercial libraries, it's because the current
implementations are felt to be "fast enough", and there's no
pressure for more speed from them (with a few exceptions).
I'd also say that it's convoluted given that it doesn't really
solve any problems. Sure, it correctly handles the systems end
of line, and it correctly uses the right code points for
converting an integer to string representation for the locale,
and cool stuff like a comma or a period for the thousand
separators and "1 to tenths" separator. However, the entire
iostream library is overkill if those are the only problems it
solves, hence convoluted.

You seem to be confusing the issues. The iostream library isn't
concerned about much of that. The iostream library provides:

-- a standard interface for data sinks and sources
(std::streambuf), along with two sample implementations
covering the most frequent cases (files and strings),

-- a standard interface for formatting, using the strategy
pattern to cleanly separate sinking and sourcing bytes from
the formatting---the formatting interface uses overloading
so that client code can extend it for types it's never heard
of,

-- a standard error handling strategy (which is rather
simplistic), and

-- a couple of wrapper classes to make it easier to set up the
most common cases (reading or writing from a file or a
string).

2- No virtual overhead if it's not required. I should not have
to go through a virtual function to write to a file or to a
string with a stringstream.

And the alternative is? The alternative is simply an
unacceptable amount of code bloat, with every single function
which does any I/O a template. Without the virtual functions,
iostream becomes almost unusable, like printf and company.
I'd like to be able to use it in performance critical aspects
of code without invoking the holy war of printf vs cout. Also
see point 1 for why it might be slow.
3- Actual support for changing between encodings like UTF 16 and UTF
32. Ex:
raw_ofstream out("some_file");
buffered_ostream<raw_ofstream> out_2(out);
out_encoder<buffered_ostream<raw_ofstream>> out_3(out_2, "UTF-16");
//you have some data in utf8 format, like in a utf8string
utf8string str;
out_3 << str;
//or perhaps some raw data whose encoding you know
//(Yes, a string literal may not be ASCII. I know.)
out_3.writeString("foo", "ASCII");
Or more likely, a helper class would exist:
ofstream_with_encoding /*for want of a better name*/ out
("some_file", "ASCII");
utf8str str;
out << str;
utf16str str2;
out << str2;
with whatever failure mechanism by default if it can't do the
encoding translation, be that by exception or setting the fail
bit, and of course it would be configurable on each iostream
like setw.

That's less obvious than you think, because of buffering. You
can change the encoding on the fly, e.g.:
std::cin.rdbuf()->imbue( localeWithNewEncoding ) ;
or
std::cin.imbue( std::locale( std::cin.getloc(),
newEncodingFacet ) ) ;
but there are serious practical limitations (which affect e.g.
the Java solution as well). Buffering and changing encodings on
the fly don't work well together.
 


Joshua Maurice

You seem to be confusing the issues.  The iostream library isn't
concerned about much of that.  The iostream library provides:

 -- a standard interface for data sinks and sources
    (std::streambuf), along with two sample implementations
    covering the most frequent cases (files and strings),

 -- a standard interface for formatting, using the strategy
    pattern to cleanly separate sinking and sourcing bytes from
    the formatting---the formatting interface uses overloading
    so that client code can extend it for types it's never heard
    of,

Except for system specific newline handling for non-binary mode, but I
can live with that. I think. Not exactly sure how that works.
 -- a standard error handling strategy (which is rather
    simplistic), and

 -- a couple of wrapper classes to make it easier to set up the
    most common cases (reading or writing from a file or a
    string).

For all localization issues, it defers to <locale>, which is
overly complicated for what it does (which isn't really enough).

And remind me again, exactly what does the iostream library without
<locale> do again? It handles newlines for you if not in binary mode,
and uhh... is it with facet support that it handles printing integers
and floats into human readable strings? So, back to my original
complaint of why a separate streambuf and iostream class hierarchies
when something like the Java Writer hierarchy or the Java OutputStream
hierarchy seems so much clearer and simpler IMHO?

Well, not exactly that. That would imply virtual overhead at every
step of the way. I like my plan where each root stream type is a stand
alone type, and then you have wrapping filtering streams which take
other streams as 'template arguments'. Something very much like how
the containers of the STL are stand alone classes, and a lot of the
algorithms are written using templates to work on any of the STL
containers. All you need for a dynamic dispatch stream wrapper is that
one class I wrote above, jjm::istream_wrapper (though probably with
buffering on by default for jjm::istream_wrapper to avoid virtual
function calls on every call).
And the alternative is?  The alternative is simply an
unacceptable amount of code bloat, with every single function
which does any I/O a template.  Without the virtual functions,
iostream becomes almost unusable, like printf and company.

Please review my first post where my solution supports both compile-
time polymorphism and runtime polymorphism. Specifically look at the
functions

template <typename stream_t>
void print_compiletime_polymorphism(stream_t& stream)
{
    std::cout << "---- " << __FILE__ << " " << __LINE__ << std::endl;
    for (std::string line; getline(stream, line); )
        std::cout << "X" << line << std::endl;
}

void print_runtime_polymorphism(jjm::istream_wrapper stream)
{
    std::cout << "---- " << __FILE__ << " " << __LINE__ << std::endl;
    for (std::string line; getline(stream, line); )
        std::cout << "X" << line << std::endl;
}

I fully recognize that templates do not solve everything, and that you
need to be able to work on a generic stream which uses dynamic
dispatch to do the actual work, for compilation speed reasons, code
bloat reasons (and thereby runtime speed and size reasons), etc. My
solution offers both. With it, you would only pay for the virtual
overhead when you actually need to use it. (Note that
jjm::istream_wrapper should probably be buffered by default.) I haven't
thought this through fully, but enough to ask "Why isn't it done this
way which seems superior?"
That's less obvious than you think, because of buffering.  You
can change the encoding on the fly, e.g.:
    std::cin.rdbuf()->imbue( localeWithNewEncoding ) ;
or
    std::cin.imbue( std::locale( std::cin.getloc(),
newEncodingFacet ) ) ;
but there are serious practical limitations (which affect e.g.
the Java solution as well).  Buffering and changing encodings on
the fly don't work well together.

Sorry. Indeed. You are correct. Oversight there.
 

Joshua Maurice

We use it all the time for French, with no problem.  Under Unix,
locale( "" ) means pick up the correct locale from the user's
environment variables.

Meh. Let me slightly correct myself. It's sufficient for most Latin
scripts if you're not doing any sort of text transformations (like
substring), and it's generally sufficient for English which doesn't
have combining characters, weird non-lexicographic sort rules, or
anything else nasty.
 

Jerry Coffin

[ ... ]
Ok. Also, do you use wstream in any real code? Or wstring?

I use wstring semi-regularly, but not wstream.

[ ... ]
I can live with this mostly. The expensive sinks or sources are
automatically buffered. Makes sense.

It's not so much a matter of dealing with expense, as it is that C
streams are always buffered, and they seem to have figured that
iostreams should be allowed to be built as a layer on top of a C
stream implementation. Or at least they recognized why the C standard
didn't include anything much like UNIX-style files with file
descriptors, and didn't want to restrict iostreams to working on top
of that model.
But only very basic formatting, and only formatting which the standard
library thought useful. You add another kind of formatting, just tweak
existing formatting rules.

[meant to say 'can't add another kind'...]

I'm not sure exactly what you mean by "another kind of formatting".
Ultimately, the job of a stream comes down to this: take some input
data and translate it into a stream of bytes. If you had enough
memory to do so, an iostream could be created as basically nothing
more than a huge 2D lookup table, with types along one axis, and
values along the other. For each possible value of each type, you get
a (possibly empty) sequence of bytes to append to the output buffer.

Storing the table is impractical, so we generally compute one cell of
it (so to speak) when needed. Nonetheless, if we look at it from that
viewpoint all possible formatting is of the same "kind" -- it just
changes the sequence of bytes to append to the output buffer.

Within that constraint, I'm pretty sure you can do almost anything
you want to -- i.e. you can associate almost any sequence of bytes
you want with an arbitrary type and value.

[ ... ]
Only if you enjoy serving English speakers.

Nonsense! I have tested this. Just for example, consider VC++ under
Windows. If I have Windows configured for use in Germany, VC++
figures out enough that a std::locale(""); creates a locale that
seems to be perfectly reasonable for use in Germany. Likewise with
French, Italian, Chinese, etc.

There are a _few_ exceptions, where std::locale doesn't support what
would be really traditional for a particular country and culture --
but most of the time it does perfectly well.

Even when you need more than that, the names of locales are still
strings. Being strings, they do NOT need to be literals in the
program -- the program can, for example, use a string it reads from a
configuration file.
I'm still not exactly clear on what a facet is, and how all of that
stuff works.

Sorry, I was replying to the second part -- it _does_ allow changing
from one encoding to another.

As far as what a facet is, it's not really convoluted at all. A facet
holds localized information about one aspect of formatting. One facet
"knows" about lower-case versus upper-case letters, another facet
about reading and writing numbers, another about reading and writing
times, and so on.

This is for the relatively simple reason that each of those facets
(aspects, if you prefer) of localization can and does vary
independently of the others. People in Quebec speak close enough to
the same language that I believe they follow the same rules for
letter case, but unless I'm mistaken, most Canadians write numbers
about like Americans do. Conversely, Belgium has three official
languages, each with its own rules about letter case, but I believe
all three normally write numbers the same way.

If there's any real convolution, it's in locales. This is less
because of real complexity than from locales being a lot different
than most people think they are. Since facets handle all the
individual pieces of localization, a locale is really a collection
class. Unlike most (e.g. STL) collections, it's heterogeneous,
holding objects of a number of different types. It's also extensible,
so you can add entirely new facets all your own.

[ ... ]
No. Java may have gotten it "wrong" quote unquote, but a wrong
answer now is infinitely better than "implementation defined". How
many real successful Java programs are there out there making use
of built-in support of Unicode? Lots. How many real successful C++
programs are there out there making use of wstring and wiostream? I
would assume basically none.

Java did get it wrong -- no need of quotes or anything else. They
screwed up, and someday people who use Java are going to pay dearly
for it (that's not a threat, just recognition of reality).

I couldn't even hope to estimate how many programs make successful
use of wstring and the "w" variants of streams. From what I've seen,
it's a long ways from none though.

[ ... ]
1- A cleaner and simpler interface than iostreams wrapping
streambuf, and the relatively nasty interface of a streambuf
itself. When some standard library implementations are 100x
slower than its printf library, I'd say that yea, it's pretty
complex.

The facts don't support the conclusion. Some things that are very
simple are also very slow. The streambuf doesn't have all that nasty
an interface either -- what is nasty is mostly just poorly chosen
names.
I'd also say that it's convoluted given that it doesn't really solve
any problems. Sure, it correctly handles the systems end of line, and
it correctly uses the right code points for converting an integer to
string representation for the locale, and cool stuff like a comma or a
period for the thousand separators and "1 to tenths" separator.
However, the entire iostream library is overkill if those are the only
problems it solves, hence convoluted.

Those aren't the only problems being solved though, so your "proof"
of overkill is clearly flawed.
2- No virtual overhead if it's not required. I should not have to go
through a virtual function to write to a file or to a string with a
stringstream. I'd like to be able to use it in performance critical
aspects of code without invoking the holy war of printf vs cout. Also
see point 1 for why it might be slow.

You seem to be concluding that since one implementation is slower,
that all implementations _must_ be slower. The logic is clearly
flawed.

In fairness, it's true that quite a few implementations are slower --
but I'm still not sure it's due to the design.
3- Actual support for changing between encodings like UTF 16 and UTF
32. Ex:

[ ... ]
with whatever failure mechanism by default if it can't do the
encoding translation, be that by exception or setting the fail bit,
and of course it would be configurable on each iostream like setw.

Read through the documentation for the codecvt facet. Some parts of
the design really are hard to understand, and I've never seen what
I'd call decent documentation for it either. Nonetheless, the basic
idea is clear enough -- what you're asking for is already provided.
 

Jerry Coffin

[ ... ]
This is for the relatively simple reason that each of those facets
(aspects, if you prefer) of localization can and does vary
independently of the others. People in Quebec speak close enough to

Oops -- that should say: "Quebec and France" instead of just
"Quebec".
 

Jerry Coffin

[ ... ]
The fact that some of the flags are permanent, and others not,
can cause some confusion. More generally, one would like some
means of "scoping" use of formatting options, but I can't think
of a good solution. (In the meantime, explicit RAII isn't that
difficult.)

While certainly true, this seems to be (mostly) unrelated to the
issues he's citing.

[ ... ]
The issue is complex, and he's at least partially right.

Yes -- I was thinking in terms of the supposed inability to change
from one encoding to another, which most certainly IS possible.
Admittedly (and as you've pointed out) there's a bit of a problem
with changing encoding when buffering is involved.

I'd note that this is really just an instance of a much larger
problem though. Buffering is intended to decouple the actions on the
two sides of the buffer, and it does that quite well. In this case,
however, we care about the state on the "far" side of the buffer --
exactly what the buffer is supposed to hide from us.

[ ... ]
:). In practice, there's nothing "wrong" with the Java
solution (UTF-16).

Sort of true -- it's certainly true that UTF-16 (like UTF-8) is a
much "nicer" encoding than things like the old shift-JIS. At least
it's easy to recognize when you're dealing with a code point that's
encoded as two (or more) words.

At the same time, you do still need to deal with the possibility that
a single logical character will map to more than one 16-bit item,
which keeps most internal processing from being as clean and simple
as you'd like. Then again, at least in the Java code I've seen, the
internal code is kept clean and simple -- which is fine until
somebody feeds it the wrong data, and code that's been "working" for
years suddenly fails completely...
Nor with either of the two widespread C++
solutions. Or with my solution of using UTF-8 and char. What
might be considered "wrong" is imposing one, and only one, on
all code. But I don't know of any system which offers both
UTF-16 and UTF-32; Java imposes one "right" solution, whereas
C++ allows the implementation to choose (guess?) which solution
is right for its customers.

IMO, UTF-16 causes the biggest problem. The problem arises when the
"length" of a string is ambiguous -- the number of characters differs
from the number of units of storage.

With UTF-8, those differences are large enough and common enough that
a mistake in this area will cause visible problems almost
immediately.

With UCS-4/UTF-32, there's never a difference, so no problem ever
arises.

With UTF-16, however, there's only rarely a difference -- and even
when there is, it's often small enough that if (for example) your
memory manager rounds up memory allocation sizes, you can use buggy
code almost indefinitely without the bug becoming apparent. Then,
(Murphy still being in charge) exactly when it's most crucial for it
to work, the code fails _completely_, but duplicating the problem is
next to impossible...
Of course, the real reason why C++ is so open with regards to
what an implementation can do with wchar_t is because C is. And
the reason C is so open is because when C was being normalized,
no one really knew what encodings would end up being the most
wide spread; Unicode hadn't really become the standard back
then.

At the time, Unicode was still _competing_ with ISO 10646 rather than
cooperating with it. I think there's more involved though: C++ (like
C) embodies a general attitude toward allowing (and even embracing)
variation. While I think in recent years it has moderated to a
degree, I think for a while (and still, though to a lesser degree)
there was rather an amount of pride taken in leaving the languages
loosely enough defined that they could be implemented on almost any
machine (past or future), including some for which there was no
realistic hope of anybody actually porting an implementation.
 


James Kanze

Except for system specific newline handling for non-binary
mode, but I can live with that.

That's *not* in iostream, per se. It's a detail of one
particular subclass of streambuf (and it concerns more than just
newline handling---try reading a file which contains
"abc\032xyz\n" in text mode under Windows, for example).
I think. Not exactly sure how that works.
And remind me again, exactly what does the iostream library
without <locale> do again?

Read the two points I just made above. Some of the predefined
formatting functions do use locale, but that's just an
implementation detail, and not part of the basic concept. And
the formatting functions you write don't have to use locale
unless you want them to.
It handles newlines for you if not in binary mode, and uhh...

No it doesn't.
is it with facet support that it handles printing integers and
floats into human readable strings?

It provides the basic structure which allows formatting for any
type. Some of the actual formatters do use locale, but that's
rather secondary to iostream.
So, back to my original complaint of why a separate streambuf
and iostream class hierarchies when something like the Java
Writer hierarchy or the Java OutputStream hierarchy seems so
much clearer and simpler IMHO?

The Java OutputStream and Writer hierarchies are the Java
equivalent of streambuf. For formatting, you need
java.text.Format, and if you think that's simpler than iostream,
you've never used it.
 

James Kanze

Meh. Let me slightly correct myself. It's sufficient for most
Latin scripts if you're not doing any sort of text
transformations (like substring), and it's generally sufficient
for English which doesn't have combining characters, weird
non-lexicographic sort rules, or anything else nasty.

If you're using non-Latin scripts, you're probably using
wchar_t. But you're right that there's no real standard support
for things like combining characters. In any language I know.
(The standard comparison operators for string, both in C++ and
in Java, are very naïve.)
 

James Kanze

Yes -- I was thinking in terms of the supposed inability to
change from one encoding to another, which most certainly IS
possible. Admittedly (and as you've pointed out) there's a
bit of a problem with changing encoding when buffering is
involved.
I'd note that this is really just an instance of a much larger
problem though. Buffering is intended to decouple the actions
on the two sides of the buffer, and it does that quite well.
In this case, however, we care about the state on the "far"
side of the buffer -- exactly what the buffer is supposed to
hide from us.

Certainly, and I don't know of a system which really supports it
fully. In general, you can pass from a one-to-one (byte)
encoding to something more complex, but once you've started with
something more complex, you can't go back. About the only
difference between C++ and Java here is that C++ documents this
fact. (Or else... IIRC, Java does the encoding after the
buffering, so the problems should be less.)
[ ... ]
:). In practice, there's nothing "wrong" with the Java
solution (UTF-16).
Sort of true -- it's certainly true that UTF-16 (like UTF-8)
is a much "nicer" encoding than things like the old shift-JIS.
At least it's easy to recognize when you're dealing with a
code point that's encoded as two (or more) words.

And it's trivial to resynchronize if you get lost.
At the same time, you do still need to deal with the
possibility that a single logical character will map to more
than one 16-bit item, which keeps most internal processing
from being as clean and simple as you'd like.

But if you're doing any serious text processing, that's true for
UTF-32 as well. \u0071\u0302 is a single character (a q with a
circumflex accent), even if it takes two code points to
represent. And if you're not concerned down to that level,
UTF16 will usually suffice.

But speaking from experience... Handling multibyte characters
isn't that difficult, and I find UTF8 the most appropriate for
most of what I do.
Then again, at least in the Java code I've seen, the internal
code is kept clean and simple -- which is fine until somebody
feeds it the wrong data, and code that's been "working" for
years suddenly fails completely...

Yes and no. Where I live, there are very strong arguments to go
beyond ISO 8859-1---the Euro character, the oe ligature, etc.,
not to mention supporting foreign names. But everything must be
in a Latin script; anything not Latin script is "wrong data".
In this case (and it's a frequent one in Europe), whether the
byte is a surrogate or a CJK character really doesn't
matter---it's wrong data, and must be detected as such.
IMO, UTF-16 causes the biggest problem. The problem arises
when the "length" of a string is ambiguous -- the number of
characters differs from the number of units of storage.

But that's just as true with UTF-8 (which I regularly use), and
in a very real sense, with UTF-32 as well (because of combining
diacritical marks).
With UTF-8, those differences are large enough and common
enough that a mistake in this area will cause visible problems
almost immediately.
With UCS-4/UTF-32, there's never a difference, so no problem
ever arises.
With UTF-16, however, there's only rarely a difference -- and
even when there is, it's often small enough that if (for
example) your memory manager rounds up memory allocation
sizes, you can use buggy code almost indefinitely without the
bug becoming apparent. Then, (Murphy still being in charge)
exactly when it's most crucial for it to work, the code fails
_completely_, but duplicating the problem is next to
impossible...
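The off-by-surrogate length bug described above is easy to see in a small code-point counter. This is an illustrative sketch only (the name `utf16_length` is invented, and unpaired surrogates are deliberately tolerated rather than rejected, to keep it short):

```cpp
#include <cstddef>

// Count Unicode code points in a UTF-16 sequence, treating a high
// surrogate (0xD800-0xDBFF) followed by a low surrogate
// (0xDC00-0xDFFF) as a single code point.  Unpaired surrogates are
// counted as one unit each rather than rejected.
std::size_t utf16_length(const char16_t* s, std::size_t units) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < units; ++i) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < units &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low surrogate of the pair
        }
        ++count;
    }
    return count;
}
```

On pure BMP text the result equals the unit count, which is exactly why code that confuses the two can "work" for years before a surrogate pair shows up.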

OK. I can almost see that point. Almost, because I'm still not
sure from where you're getting the length value for the
allocator. If you have a routine for counting characters that
is intelligent enough to handle surrogates correctly (where two
code points form a single character), then it might be
intelligent enough to handle combining diacritical marks
correctly as well, and the same problem will occur with UTF-32.
At the time, Unicode was still _competing_ with ISO 10646
rather than cooperating with it.

The Unicode Consortium was incorporated in January, 1991, and
the C committee adopted wchar_t sometime in the late 1980s---
certainly before 1988, when the final committee draft was voted
on. And ISO assigns standard numbers successively, which means
that ISO 10646 was adopted after ISO 9899 (the C standard). At the
time, I think that while it was generally acknowledged that
characters should be more than 8 bits, there was absolutely no
consensus as to what they should be.
I think there's more involved though: C++ (like C) embodies a
general attitude toward allowing (and even embracing)
variation. While I think in recent years it has moderated to a
degree, I think for a while (and still, though to a lesser
degree) there was rather an amount of pride taken in leaving
the languages loosely enough defined that they could be
implemented on almost any machine (past or future), including
some for which there was no realistic hope of anybody actually
porting an implementation.

I think this is a good point---an essential point in some ways.
Don't formally standardize until you know what the correct
solution is. Today (2009), I think it's safe to say that the
"correct" solution is to support all of the Unicode encoding
formats (UTF-8, UTF-16 and UTF-32), and let the user choose; if
I were designing a language from scratch today, that's what I'd
do. Today, however, both Java and C++ have existing code to
deal with, which complicates the issues---Java has an additional
problem in that evolutions of the language must still run on the
original JVM. (But Java could define a new character type for
UTF-32 at the language level, using 'int' to implement it at the
JVM level. Except that some knowledge of the class String is
built into the language.)

FWIW: I'm not really convinced that we know enough about what is
"correct" even today to dare build it into the language (which
means casting it in stone). For the moment, I think that the
C++ solution is about the most we dare do.
 

Jerry Coffin

[ ... ]
And remind me again, exactly what does the iostream library without
<locale> do again?

Nothing, of course -- since it depends on locale, it won't compile
without locale, and code that doesn't compile almost never does much
of anything.
It handles newlines for you if not in binary mode,
and uhh... is it with facet support that it handles printing integers
and floats into human-readable strings? So, back to my original
complaint of why there are separate streambuf and iostream class hierarchies

Because they do entirely different sorts of things. A stream buffer
is a standardized interface to some sort of stream. An iostream is a
formatter. Yes, a lot of the default formatting is locale specific,
but it doesn't really change much. _Something_ has to connect the
bits and pieces in a locale and apply them to a stream, and that code
certainly does NOT belong in a stream buffer, because it's entirely
unrelated to anything a stream buffer does.
when something like the Java Writer hierarchy or the Java OutputStream
hierarchy seems so much clearer and simpler IMHO?

Of course, it's impossible to answer why you hold a particular
opinion -- especially a positive opinion of a hierarchy like Java's
OutputStream and relatives, that its own inventors think little
enough of that they've deprecated its use (in favor of the rather
more complex Writer hierarchy).

[ ... ]
I fully recognize that templates do not solve everything, and that
you need to be able to work on a generic stream which uses dynamic
dispatch to do the actual work, for compilation speed reasons, code
bloat reasons (and thereby runtime speed and size reasons), etc. My
solution offers both. With it, you would only pay for the virtual
overhead when you actually need to use it. (Note that
jjm::istream_wrapper should probably be buffered by default.) I haven't
thought this through fully, but enough to ask "Why isn't it done this
way which seems superior?"

Although a lot has been modified and added to iostreams along the
way, the _basic_ design goes back to (for one thing) long before the
existence of templates. I'm pretty sure that if anybody was designing
them today, they'd come out rather different than they are now.

I guess that's a rather roundabout way of saying that your design may
well be superior, but unless you're willing to put in a _lot_ of work
(to turn it into something the committee might accept as a real
proposal for an addition to the standard) it's probably not going to
go much of anywhere. At least to me, it appears that most users of
C++ just don't care very much.

This highlights one major difference between Java and C++. In C++
it's _much_ easier to bypass the built-in I/O primitives (or almost
any other part of the standard library) and do something else (often
system-specific) when/if it's a problem. They work well enough most
of the time, and they're easy to ignore when they're a real problem.
In Java, if the built-in I/O facilities are inadequate, it's _much_
more difficult to just ignore them and put something together on your
own.
 

Joshua Maurice

Well, I started putting my money where my mouth was, and I might have
to take some / most of what I said back.

I took the simple use case of creating an ifstream which automatically
converted system "end of line" to '\n'. With the standard library,
that's simply std::ifstream. There is a single level of buffering in
the file streambuf. When I do std::istream::get, it sees the buffer in
the streambuf is empty, does a virtual call to fill the streambuf's
buffer, that is a single call to fread which writes directly to the
buffer in the streambuf. std::istream may then access this buffer
directly without virtual calls.

To do so with my basic plan would require two different levels of
buffering, and that's why I'm retracting some / most of my complaints
of std streams. To do it my way, you would need:

- A virtual istream class to act as a generic stream. Internally it
buffers its wrapped stream.
- It wraps a stream wrapper which properly converts newlines.
- which wraps a buffering stream wrapper.
- which wraps the file stream.

You need a buffering level between the newline stream wrapper and the
file stream wrapper, and you need a buffering level in the virtual
istream class to minimize virtual calls. You simply cannot do it by
trying to separate the functionality like I wanted to. The formatting
needs to be on top of the virtual calls. I don't know what I think at
the moment.
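For what it's worth, the compile-time layering under discussion can be sketched in a few lines. Every name here is invented, and the lone-'\r' case is deliberately glossed over; the point is only that each layer takes the wrapped stream's type as a template parameter, so get() calls resolve statically with no virtual dispatch:

```cpp
#include <cstddef>
#include <string>

// Bottom layer for the sketch: a stream over an in-memory string.
struct string_source {
    std::string data;
    std::size_t pos = 0;
    int get() { return pos < data.size() ? data[pos++] : -1; }
};

// Formatting layer: converts "\r\n" to '\n'.  A lone '\r' is
// (incorrectly, but briefly) dropped in this sketch.
template <class Source>
struct crlf_to_lf {
    Source src;
    int get() {
        int c = src.get();
        if (c == '\r') {
            int next = src.get();
            return next == '\n' ? '\n' : next;
        }
        return c;
    }
};
```

A buffering layer would slot in between these in exactly the same way, which is where the two-levels-of-buffering problem described above comes from.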

(I agree that the names of the streambuf functions are definitely
borderline obfuscated, and I still don't have a good reference for its
interface and the rest of the std streams library.)

PS:
I do stand by my unrelated points made earlier that the standard
library's support for unicode is bad, though I would love to be
corrected here too.

I want a string class for UTF8 and UTF16 which are implemented with
actual variable-width encodings for performance reasons. I want each
to have a unicode codepoint iterator and a grapheme cluster iterator
(so like begin_codepoint(), and begin_grapheme()). I want each to have
an equivalent interface suitable for use as an argument to the same
template function (facilitated by the unicode codepoint iterator and
the grapheme cluster iterator). Each should at least support push_back
(unicode_codepoint).
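As a rough sketch of what code-point iteration over a genuinely variable-width representation involves (written as a free function rather than an iterator to keep it short; it assumes well-formed UTF-8, and all names are invented):

```cpp
#include <string>
#include <vector>

// Decode a well-formed UTF-8 string into code points -- the work a
// begin_codepoint() iterator would do incrementally.  A real
// implementation must also validate continuation bytes.
std::vector<char32_t> codepoints(const std::string& utf8) {
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < utf8.size();) {
        unsigned char b = utf8[i];
        int len = b < 0x80 ? 1 : b < 0xE0 ? 2 : b < 0xF0 ? 3 : 4;
        char32_t cp = b & (len == 1 ? 0x7F : (0xFF >> (len + 1)));
        for (int k = 1; k < len; ++k)
            cp = (cp << 6) | (utf8[i + k] & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

A grapheme-cluster iterator would sit one level above this, grouping code points by the Unicode cluster-boundary rules, which need property tables and are well beyond a sketch.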

I want a simple function to read from or write to char arrays (aka
byte array) to change encodings, like to go from UTF8 to UTF16. I want
these encodings to be portable, either defined by the standard, or by
convention and easily installed on a new system by an installer.

I want a portable list of collators usable as comparison objects in
the standard library containers, either defined by the standard, or by
convention and easily installed on a new system by an installer. ex:
std::set<ustring, std::collation_english>
std::map<ustring, some_type, std::collation_english>
And because that's inefficient, optimized containers
std::string_set<std::collation_english>
std::string_map<some_type, std::collation_english>
Same for unordered_map and unordered_set.
ustring::operator< would use the global locale I guess.
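Incidentally, std::locale itself is already usable as a comparison object: it has an operator() that compares two strings through its collate facet. A sketch (the helper name is invented; only the always-present classic "C" locale is portable, which is exactly the complaint about named locales):

```cpp
#include <locale>
#include <set>
#include <string>

// std::locale's operator()(const string&, const string&) makes it a
// drop-in Compare type for the ordered containers.
using collated_set = std::set<std::string, std::locale>;

collated_set make_set(const std::locale& loc) {
    return collated_set(loc);
}
```

With a named locale such as "en_US.UTF-8" (if the system provides it), the same container would sort by that locale's collation rules instead.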
 


Jerry Coffin

[ ... ]
I do stand by my unrelated points made earlier that the standard
library's support for unicode is bad, though I would love to be
corrected here too.

I (at least sort of) agree, but I'll point out a few problems...
I want a string class for UTF8 and UTF16 which are implemented with
actual variable-width encodings for performance reasons.

Although they've tried to minimize the differences between the two,
Unicode and ISO 10646 are still two separate standards with slightly
different views on things. One difference shows up in how each
defines UTF-8. The Unicode definition says UTF-8 sequences can only
be up to 4 bytes apiece, but the ISO 10646 definition says they can
be up to 6 bytes apiece. A proper Unicode function should reject some
legal ISO 10646 sequences, so you need to decide whether you really
want to restrict it to Unicode, or to support all of ISO 10646.

The C (and C++) standards originally tried to deal with things a bit
differently. They assumed that any multi-byte encoding like UTF-8
would be converted to a wide-character encoding like UCS-4 during
input. Internally, your program would deal only with wide characters,
then (if need be) they'd be converted back to multi-byte encodings
like UTF-8 during output.
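A minimal sketch of that decode-on-input model using the C library's mbstowcs (the helper name widen is invented; which multibyte encodings actually decode depends on the locales the implementation ships, so only the basic character set is guaranteed):

```cpp
#include <clocale>
#include <cstdlib>
#include <cwchar>
#include <string>

// Decode a multibyte string into wide characters, as the C/C++ model
// intends to happen at input time.  The current C locale determines
// the multibyte encoding that mbstowcs expects.
std::wstring widen(const std::string& mb) {
    std::wstring out(mb.size(), L'\0');
    std::size_t n = std::mbstowcs(&out[0], mb.c_str(), mb.size());
    if (n == static_cast<std::size_t>(-1))
        return L"";  // invalid multibyte sequence; real code should report it
    out.resize(n);
    return out;
}
```

The inverse direction (wcstombs at output time) completes the round trip the standards had in mind.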

IMO, that's a fairly reasonable way to handle things, but I'd like to
see a few additions. First, they should define names for a few
required locales (or at least codecvt facets) to handle obvious
conversions like UTF-8, and UTF-16 as a bare minimum. Second, I'd
like to see a Unicode normalization function (or set of functions) to
handle the decomposition/composition rules, so you can turn a random
(legal) Unicode string into, for example, normalization form C.

The latter is necessary because even in UCS-4/UTF-32, you can have
two strings that contain different code points, but are still
equivalent -- e.g. an A with a ring above it can be encoded as either
"A with ring" or 'A' followed by 'combining ring'. The same is true
of a number of diacritical marks -- acutes, graves, umlauts, etc.

To support just about anything like string comparison or searching,
you nearly need to do some sort of normalization so the same symbol
will be represented the same way -- otherwise, your string search,
for example, has to be aware that an 'A' followed by 'combining
acute' is the same as 'A with acute'.
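The problem is visible without any Unicode library at all: the two encodings of A-with-ring differ byte for byte, so naive comparison calls equivalent strings unequal. (The byte sequences below are the standard UTF-8 encodings of U+00C5 and of 'A' followed by U+030A.)

```cpp
#include <string>

// Two canonically equivalent spellings of the same character.
const std::string precomposed = "\xC3\x85";   // U+00C5 A WITH RING ABOVE
const std::string decomposed  = "A\xCC\x8A";  // 'A' + U+030A COMBINING RING
```

Any search or comparison built on operator== sees these as different, which is precisely why normalization has to happen first.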

OTOH, you don't want to apply normalization indiscriminately either
-- for example, Unicode includes both Omega and an Ohm symbol, which
look exactly alike, and if you normalize, every ohm symbol will
disappear and be replaced by an omega instead. In a file dealing with
electronics, you might really want to preserve the ohm symbols as ohm
symbols.

[ ... ]
I want a simple function to read from or write to char arrays (aka
byte array) to change encodings, like to go from UTF8 to UTF16. I want
these encodings to be portable, either defined by the standard, or by
convention and easily installed on a new system by an installer.

The UTF-8 and UTF-16 encodings are defined by two separate standards,
and it's pretty easy to write portable code for both. The Unicode
site has portable code for conversions between UTF-32, UTF-16, and
UTF-8 as well as some code for converting between UCS-2 and UTF-7
(deprecated, but probably still perfectly usable for quite a few
purposes). For better or worse, this code was written to suit
somebody's idea of optimization rather than readability. RFC 3629
gives (IMO) a better description of UTF-8 conversion (and leaves it
to you to decide how you want to optimize that, or even if you want
to optimize it at all).
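As an illustration of how small the RFC 3629 byte layout really is, here is an unoptimized encoder for a single code point (the function name is invented, and checking for the surrogate range U+D800-U+DFFF is omitted for brevity):

```cpp
#include <string>

// Encode one code point (up to U+10FFFF) as 1-4 UTF-8 bytes,
// following the bit layout in RFC 3629.  Written for clarity.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```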
I want a portable list of collators usable as comparison objects in
the standard library containers, either defined by the standard, or by
convention and easily installed on a new system by an installer. ex:
std::set<ustring, std::collation_english>
std::map<ustring, some_type, std::collation_english>
And because that's inefficient, optimized containers
std::string_set<std::collation_english>
std::string_map<some_type, std::collation_english>
Same for unordered_map and unordered_set.
ustring::operator< would use the global locale I guess.

That's a little more difficult -- collation is a hard problem. Not
hard like NP-hard, but hard like poorly defined. Even if we restrict
things to English, we quickly find that libraries use one collation
sequence, dictionaries a second, and phone directories a third.

Both Unicode (in UTS #10) and ISO (in ISO 14651) define collation
algorithms as well. Nicely enough, Unicode also publishes a table
(DUCET) of the weights to be used in collation, which makes the
algorithm pretty easy to implement (as long as you don't mind using a
fairly large table to do the job).

Overall, my advice would be to look up the ICU library at:

http://site.icu-project.org/

It already does a pretty reasonable job of most of what you're asking
for, and a few things you haven't as well (but that almost inevitably
arise when you try to do much).

Another project you might want to look at is CLDR, at:
cldr.unicode.org.

They have a _large_ quantity of locale data in Locale Data Markup
Language (a dialect of XML), which can be transformed into data files
for a locale under almost any system (e.g. they already have a tool
to automatically produce a properly formatted POSIX locale file).
 

Joshua Maurice

[ ... ]
I do stand by my unrelated points made earlier that the standard
library's support for unicode is bad, though I would love to be
corrected here too.

I (at least sort of) agree, but I'll point out a few problems...

[...]
I want a simple function to read from or write to char arrays (aka
byte array) to change encodings, like to go from UTF8 to UTF16. I want
these encodings to be portable, either defined by the standard, or by
convention and easily installed on a new system by an installer.

The UTF-8 and UTF-16 encodings are defined by two separate standards,
and it's pretty easy to write portable code for both. The Unicode
site has portable code for conversions between UTF-32, UTF-16, and
UTF-8 as well as some code for converting between UCS-2 and UTF-7
(deprecated, but probably still perfectly usable for quite a few
purposes). For better or worse, this code was written to suit
somebody's idea of optimization rather than readability. RFC 3629
gives (IMO) a better description of UTF-8 conversion (and leaves it
to you to decide how you want to optimize that, or even if you want
to optimize it at all).

Well yes, if we restrict ourselves to UTF-8, UTF-16, and UTF-32, I
could write reasonably optimized code to do that myself in under 200
lines of source. However, I meant that I want encoding conversions
between Unicode and the global locale encoding, and other installed
encodings as well, like Shift JIS.
That's a little more difficult -- collation is a hard problem. Not
hard like NP-hard, but hard like poorly defined. Even if we restrict
things to English, we quickly find that libraries use one collation
sequence, dictionaries a second, and phone directories a third.

Then I will go back to what I said earlier, that there should be a
standard framework so you can install your collator of choice easily.
Both Unicode (in UTS #10) and ISO (in ISO 14651) define collation
algorithms as well. Nicely enough, Unicode also publishes a table
(DUCET) of the weights to be used in collation, which makes the
algorithm pretty easy to implement (as long as you don't mind using a
fairly large table to do the job).

Overall, my advice would be to look up the ICU library at:

http://site.icu-project.org/

It already does a pretty reasonable job of most of what you're asking
for, and a few things you haven't as well (but that almost inevitably
arise when you try to do much).

Another project you might want to look at is CLDR, at:
cldr.unicode.org.

They have a _large_ quantity of locale data in Locale Data Markup
Language (a dialect of XML), which can be transformed into data files
for a locale under almost any system (e.g. they already have a tool
to automatically produce a properly formatted POSIX locale file).

Indeed. My company uses ICU---well, technically a home-modified ICU,
because we decided we wanted to be marginally efficient. Apparently
some of the ICU interfaces do a virtual lookup _for each character_,
so we rewrote some significant portions ourselves.

I basically want a lot of ICU to be standardized. At least, that was
my initial point, though I'm not sure I trust the standards committee
to do it "correctly", so maybe I will just stick with our
home-modified ICU.
 


Jorgen Grahn

Ok. Also, do you use wstream in any real code? Or wstring?

If he answers that, will you ask him about some third thing which you
think is stupid? A little bit more politeness, please.

I used locales a few months ago for collation and was happy with the
result -- although I didn't use them together with iostreams.

[ ... ]
So, basically entirely implementation defined, and as there is no
particular standard in use beyond the C++ standard, they're basically
worthless for portable code.

What -- worthless just because you cannot portably know that a locale
for Swedish as spoken by a minority of Finns is installed, and called
"sv_FI"? That doesn't make sense.

/Jorgen
 

