* James Kanze:
[...]
Well then, if there's something in the standard, it's up to
users like Shahid to ask for the features from the compiler
vendors. We may hope for some progress on this front.
Maybe. I haven't seen much progress on export.
In 1990, (just to mention how advanced they were at the time),
NeXTSTEP compiler accepted sources in .rtf format, so you
could write your code in "Rich Text", with fonts, colors, etc.
(The driver would just convert the rtf to ascii and proceed).
That's cool.
According to the C++ standard, how the compiler maps "physical
source file characters" to the "basic source character set" is
implementation defined. So the next step would be to map
superscripts to std::pow, and subscripts to []. I'm pretty sure
such an implementation would be legal. And at least for numeric
applications, the possibilities are interesting, to say the
least---mapping a capital Greek sigma to a call to
std::accumulate, etc.
Unfortunately, with current compiler technology -- after all, we're only about
50 years on in that game and one can't expect much in just 50 years -- the
problem is not how to get beyond Latin-1, but rather, how to be able to use
Latin-1, as opposed to just plain ASCII, in our C++ programs' string constants.
For example, with MinGW g++ 3.4.5, the current version, the following *does not
compile* when the source code is in Latin-1:
L"blåbærsyltetøy" // Norwegian for "blueberry jam"
This may be surprising to some because apparently g++ handles non-ASCII Latin-1
characters just fine in narrow character literals.
However, the reason they "work" for narrow characters is a bug in the compiler,
where it doesn't recognize the source code bytes as invalid UTF-8 (with a wide
character literal it's forced to attempt a conversion to UTF-16 and chokes).
Save the source code as UTF-8 without a BOM and that compiler is happy, of course, but
then, the source code is not portable (e.g. MSVC won't eat it, just spits it
out) and for a console application the executable is then useless, because the
Windows command interpreter's UTF-8 codepage doesn't work.
One very inefficient solution is to preprocess the source code to pure ASCII.
E.g., the following program does that (it is not optimized at all; the
obvious optimizations would not affect the total efficiency much):
<code>
#include <iomanip>      // std::setfill, std::setw
#include <iostream>
#include <locale.h>     // setlocale
#include <stdlib.h>     // abort, mbtowc

wchar_t unicodeFrom( char c )
{
    wchar_t wc;
    int const returnValue = mbtowc( &wc, &c, 1 );

    if( returnValue == -1 )
    {
        abort();    // mbtowc failed.
    }
    return wc;
}

bool isInAsciiRange( char c )
{
    typedef unsigned char UChar;
    return (UChar( c ) < 0x80);
}

int const outsideLiteral  = 0;
int const afterPrefix     = 1;
int const inWideLiteral   = 2;
int const inEscape        = 3;

struct State
{
    int     current;
    char    terminator;
};

void onOutsideLiteralChar( char c, State& state )
{
    if( c == 'L' )
    {
        state.current = afterPrefix;
    }
}

void onAfterPrefixChar( char c, State& state )
{
    if( c == '\'' || c == '\"' )
    {
        state.terminator = c;  state.current = inWideLiteral;
    }
    else
    {
        state.current = outsideLiteral;
    }
}

void onInWideLiteralChar( char c, State& state )
{
    if( c == '\\' )
    {
        state.current = inEscape;
    }
    else if( c == state.terminator )
    {
        state.current = outsideLiteral;
    }
}

void onInEscapeChar( char c, State& state )
{
    state.current = inWideLiteral;
}

int main()
{
    using namespace std;
    char    c;
    State   state;

    setlocale( LC_ALL, "" );    // Affects mbtowc translation.
    cout << hex << uppercase << setfill( '0' );
    cout.sync_with_stdio( false );

    state.current = outsideLiteral;
    while( cin.get( c ) )
    {
        if( state.current != inWideLiteral || isInAsciiRange( c ) )
        {
            cout << c;
        }
        else
        {
            cout << "\\u" << setw( 4 ) << unsigned( unicodeFrom( c ) );
        }
        switch( state.current )
        {
        case outsideLiteral:    onOutsideLiteralChar( c, state );   break;
        case afterPrefix:       onAfterPrefixChar( c, state );      break;
        case inWideLiteral:     onInWideLiteralChar( c, state );    break;
        case inEscape:          onInEscapeChar( c, state );         break;
        }
    }
}
</code>
To use this preprocessing properly, the source should first be run through
the C/C++ preprocessor, i.e. compilation then becomes a pipeline of three
processes. Which, due to the amount of text the C/C++ preprocessor
generates, I suspect is very inefficient.
So, I contend that before asking compiler vendors to support the full range of
characters required by the C++ standard, we should ask them to support Latin-1.
Cheers,
- Alf