Portable Code that supports Unicode

Tomás

Let's start off with:

class Nation {
public:
    virtual const char* GetName() const = 0;
};

class Norway : public Nation {
public:
    virtual const char* GetName() const
    {
        return "Norway";
    }
};


Let's say we want to give the name of the nation in the nation's official
language... and so we want to use the Unicode character set to achieve this.

How does one go about using Unicode in portable code? Something like the
following?:

typedef wchar_t UnicodeChar;

class Nation {
public:
    virtual const UnicodeChar* GetName() const = 0;
};

class Norway : public Nation {
public:
    virtual const UnicodeChar* GetName() const
    {
        return L"Norway"; // Note the preceding L
    }
};


Would you use "wchar_t", or would you use "unsigned short"? (Unicode is
16-bit).

Furthermore, how do you go about writing your code in such a way that it can
use either normal characters or wide characters? Microsoft do it something
like the following: (You define the UNICODE macro if you're using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) L##x  // paste the L prefix onto the literal
#else
typedef char Character;
#define StringLiteral(x) x
#endif

class Nation {
public:
    virtual const Character* GetName() const = 0;
};

class Norway : public Nation {
public:
    virtual const Character* GetName() const
    {
        return StringLiteral("Norway");
    }
};


What do you think of this? At the moment I'm writing code which I want to
support the normal character set and also Unicode... but I want to keep it
portable!

Any suggestions on how to go about this? Is the Microsoft way decent enough?

-Tomás
 
Ben Pope

Tomás said:
Let's start off with:

class Nation {
public:
    virtual const char* GetName() const = 0;
};

class Norway : public Nation {
public:
    virtual const char* GetName() const
    {
        return "Norway";
    }
};

Why are you using char* instead of std::basic_string?

Tomás said:
Let's say we want to give the name of the nation in the nation's official
language... and so we want to use the Unicode character set to achieve this.

WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.

Tomás said:
How does one go about using Unicode in portable code? Something like the
following?:

Unicode is still not part of the standard, so it is not portable.

Tomás said:
typedef wchar_t UnicodeChar;

class Nation {
public:
    virtual const UnicodeChar* GetName() const = 0;
};

class Norway : public Nation {
public:
    virtual const UnicodeChar* GetName() const
    {
        return L"Norway"; // Note the preceding L
    }
};


Would you use "wchar_t", or would you use "unsigned short"? (Unicode is
16-bit).

Not all Unicode is 16 bit, and not all 16 bit encodings are Unicode.
wchar_t is often not suitable for Unicode.
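
As a rough illustration of why wchar_t can be a poor fit: its width is implementation-defined, so whether one wchar_t can hold any Unicode code point depends on the platform. The sketch below checks this with WCHAR_MAX; the helper names fits_in_wchar and to_wchar are made up for this example.

```cpp
#include <cassert>
#include <cwchar>   // WCHAR_MAX

// True when a single wchar_t is wide enough to store the code point `cp`.
inline bool fits_in_wchar(char32_t cp)
{
    return cp <= static_cast<unsigned long>(WCHAR_MAX);
}

// Store one code point in one wchar_t; only meaningful when it fits.
inline wchar_t to_wchar(char32_t cp)
{
    assert(fits_in_wchar(cp));
    return static_cast<wchar_t>(cp);
}
```

On a platform with a 16-bit wchar_t, fits_in_wchar(0x1F600) would be false, while it would be true where wchar_t is 32 bits wide.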

Until I was sure what I was doing, I would probably use:

class unicode_char {
    /* wrap wchar_t */
};

typedef std::basic_string<...> ...

Tomás said:
Furthermore, how do you go about making your code in such a way that it can
use either normal characters or wide characters. Microsoft do it something
like the following: (You define the UNICODE macro if you're using Unicode)

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) L##x  // paste the L prefix onto the literal
#else
typedef char Character;
#define StringLiteral(x) x
#endif

That's ugly and is not a model to be copied. If you need Unicode
support, just support Unicode.

Anyway, this is merely a way of supporting wide and narrow characters,
not encodings.

Tomás said:
class Nation {
public:
    virtual const Character* GetName() const = 0;
};

class Norway : public Nation {
public:
    virtual const Character* GetName() const
    {
        return StringLiteral("Norway");
    }
};


What do you think of this? At the moment I'm writing code which I want to
support the normal character set and also Unicode... but I want to keep it
portable!

Any suggestions on how to go about this? Is the Microsoft way decent enough?

I think you need to decide what exactly it is you are doing, and read up
on Unicode.

So far you have only demonstrated wide and narrow character support, and
nothing to do with encodings.

You need to decide on an internal representation, and then you need to
provide mappings to your OS of choice, probably through stream operators
and facets. I don't know what your definition of portable is.
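
One possible reading of "internal representation plus mappings", sketched under the assumption that the program keeps UTF-8 in std::string internally and converts to code points at the boundary. The function name decode_utf8 is hypothetical, and the sketch assumes well-formed input (no validation of overlong forms or stray continuation bytes).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decode a well-formed UTF-8 string into a sequence of code points.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t extra;   // number of continuation bytes that follow
        char32_t cp;         // payload bits collected so far
        if (lead < 0x80)      { extra = 0; cp = lead; }
        else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; }
        else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; }
        else                  { extra = 3; cp = lead & 0x07; }
        assert(i + extra < in.size());  // sketch assumes no truncated sequence
        for (std::size_t k = 1; k <= extra; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        out += cp;
        i += extra + 1;
    }
    return out;
}
```

A real mapping layer would also need the reverse direction and a conversion to whatever the target OS expects.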

Ben Pope
 
loufoque

Tomás said:
(Unicode is 16-bit).

Unicode is defined on 21 bits.
You can use various encodings to represent it, like UTF-8, UTF-16 or
UTF-32 alias UCS-4.
There is also UCS-2 that Microsoft uses, but it doesn't support the
whole Unicode range.

If you need something with Random Access, you can only take UCS-2 or UCS-4.
If you only need a Reversible Container, UTF-8 or UTF-16 will do.
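
To make the UCS-2 versus UTF-16 difference concrete, here is a minimal sketch of UTF-16 encoding. Code points up to U+FFFF take one 16-bit unit; anything above needs a surrogate pair, which UCS-2 cannot express and which is why UTF-16 cannot offer random access by character. The function name encode_utf16 is an assumption for this example.

```cpp
#include <cassert>
#include <string>

// Encode one code point (U+0000..U+10FFFF, excluding surrogates) as UTF-16.
std::u16string encode_utf16(char32_t cp)
{
    assert(cp <= 0x10FFFF);              // end of the Unicode code space
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);                    // one unit
    } else {
        cp -= 0x10000;                                       // 20 bits remain
        out += static_cast<char16_t>(0xD800 + (cp >> 10));   // high surrogate
        out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF)); // low surrogate
    }
    return out;
}
```

For example, U+00E1 comes out as one unit, while U+1F600 comes out as the two units 0xD83D 0xDE00.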

Anyway you shouldn't use pointers for strings, but string objects.

std::wstring can be used for UCS-2 or UCS-4 depending on your system.
Be aware that in the standard, though, std::wstring wasn't made for
Unicode. You'd better use something dedicated IMO.
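
A minimal sketch of the "string objects, not pointers" advice applied to the Nation example from the original post. It assumes the strings are UTF-8 bytes held in std::string, which is only one possible choice; the Iceland class is a hypothetical addition to show a non-ASCII name.

```cpp
#include <string>

class Nation {
public:
    virtual ~Nation() {}
    // Assumption of this sketch: the returned bytes are UTF-8.
    virtual std::string GetName() const = 0;
};

class Norway : public Nation {
public:
    virtual std::string GetName() const
    {
        return "Norge";  // pure ASCII, so identical in UTF-8
    }
};

// Hypothetical extra class, added only to show a non-ASCII name.
class Iceland : public Nation {
public:
    virtual std::string GetName() const
    {
        // "Ísland": U+00CD is the two bytes C3 8D in UTF-8. The literal is
        // split so the hex escape cannot swallow the following letters.
        return "\xC3\x8D" "sland";
    }
};
```

Note that size() on such a string counts bytes, not characters: the Iceland name has six characters but seven bytes.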

I don't think the UNICODE macro of Microsoft is a good idea. That makes
libs compiled with unicode support incompatible with the ones which
aren't etc.
Just make your application Unicode-aware; compile flags that mess
everything up are useless.

I would advise to use Glib::ustring from glibmm.
It contains some nice tools about general Unicode stuff too.

There is also ICU from IBM that you could check out.
 
Gianni Mariani

loufoque said:
Unicode is defined on 21 bits.
You can use various encodings to represent it, like UTF-8, UTF-16 or
UTF-32 alias UCS-4.
There is also UCS-2 that Microsoft uses, but it doesn't support the
whole Unicode range.

If you need something with Random Access, you can only take UCS-2 or UCS-4.
If you only need a Reversible Container, UTF-8 or UTF-16 will do.

What is "Reversible"? If UTF-16 is "reversible" then so must be UTF-32.

loufoque said:
Anyway you shouldn't use pointers for strings, but string objects.

std::wstring can be used for UCS-2 or UCS-4 depending on your system.
Be aware that in the standard, though, std::wstring wasn't made for
unicode. You'd better use something dedicated IMO.

I don't think the UNICODE macro of Microsoft is a good idea. That makes
libs compiled with unicode support incompatible with the ones which
aren't etc.
Just make your application Unicode-aware; compile flags that mess
everything up are useless.

I second that.

UTF-16 is also a big waste of time IMHO.
 
loufoque

Ben Pope wrote:
WHICH unicode "character set"? There are several, such as UTF-8,
UTF-16, UTF-32, UCS-2, UCS-4 as well as big and little endian versions.

I think those are character encodings, not character sets.

Character sets specify a table that maps characters to integers and
character encodings define ways to encode that integer in bytes.

Unicode would indeed be a character set.
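
The split can be shown in code: the character set assigns the integer, and an encoding turns that integer into bytes. Below is a minimal sketch of one such encoding, UTF-8; the function name encode_utf8 is an assumption for this example.

```cpp
#include <cassert>
#include <string>

// Encode one code point (U+0000..U+10FFFF, excluding surrogates) as UTF-8.
std::string encode_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF);              // end of the Unicode code space
    assert(cp < 0xD800 || cp > 0xDFFF);  // surrogate values are not characters
    std::string out;
    if (cp < 0x80) {                     // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {             // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {           // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                             // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The same integer U+00E1 (the "á" in Tomás) becomes the bytes C3 A1 here, whereas a different encoding of the same character set would produce different bytes.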

It is actually rather confusing, because "charset" has come to mean
"character encoding" through its usage in various protocols.

Ben Pope wrote:
Unicode is still not part of the standard, so it is not portable.

Having a sequence of bytes in memory representing a character according
to a well defined encoding and character set is very portable.

Making the OS display the characters correctly is another thing.

Just because something isn't part of the standard doesn't mean it isn't
portable; one can write a portable std::string-like class rather easily.
 
Ben Pope

loufoque said:
I think those are character encodings, not character sets.

Character sets specify a table that maps characters to integers and
character encodings define ways to encode that integer in bytes.

Unicode would indeed be a character set.

It is actually rather confusing, because "charset" has come to mean
"character encoding" through its usage in various protocols.

Yeah, sorry. I'm not helping the confusion. I actually started with
"charset" and expanded it as I scanned through for mistakes. D'oh!

loufoque said:
Having a sequence of bytes in memory representing a character according
to a well defined encoding and character set is very portable.

Of course, but there is no native support. In order to get full Unicode
support, you need a rather large library, or at least a decent framework
in which to stick encodings.

loufoque said:
Making the OS display the characters correctly is another thing.

...that was my point.

loufoque said:
Just because something isn't part of the standard doesn't mean it isn't
portable; one can write a portable std::string-like class rather easily.

Indeed, which is fine for internal use, it's the outside world which is
the problem. That's where standardisation (and support) needs to be.

Thanks for the clarifications.

Ben Pope
 
Tomás

I always try to keep my posts implementation independent... but anywho
here's what I'm doing:

(About to drift off-topic...)

I'm writing a Windows control that you can place on a dialog box. As some of
you may know, the earlier versions of Windows (95, 98, Me) all used ASCII
internally when dealing with strings. Characters were stored in 8 bits.

Now, all the Windows versions are using Unicode. My control will display
text, and so I want it to be able to display Unicode text. Unicode
characters are stored using 16 bits on Windows.

There are two flavours of each Windows function, the ASCII one and the Unicode
one, for instance:

SetWindowTextA ( ASCII version )
SetWindowTextW ( Unicode version )

A person can use my control by adding a header file and source file to their
project. Like this:

#include <control.hpp>
using namespace Control;

int main()
{
    PlaceCtrlOnDialog();
}


Anyway, the whole point is that while I want the control to support
Unicode, I also want it to support ASCII. I think the best way to do this is
to have a project-wide preprocessor macro such as UNICODE. Then, I could
have:

#ifdef UNICODE
typedef wchar_t Character;
#define StringLiteral(x) L##x  // paste the L prefix onto the literal
#else
typedef char Character;
#define StringLiteral(x) x
#endif

const Character* GetAuthorName()
{
    return StringLiteral("Tomás");
}


You may not think it's the most beautiful code, but it achieves its
objective.

Any thoughts?


-Tomás
 
loufoque

Gianni Mariani wrote:
What is "Reversible" ? If UTF-16 is "reversible" then so must be UTF-32.

This is Standard C++ terminology.
A Reversible Container is a Forward Container whose iterators are
Bidirectional Iterators.
A Random Access Container is a Reversible Container whose iterator type
is a Random Access Iterator.

As you can see, UTF-32/UCS-4 being a possible implementation for a
Random Access Container, it is "reversible" too.
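
The difference between the two container categories can be sketched as follows, assuming UTF-32 is held in std::u32string and UTF-8 in std::string. With UTF-32 the n-th character is a plain index; with UTF-8 the byte offset of the n-th character has to be found by walking the string. Both function names are made up for this example, and the UTF-8 input is assumed to be well-formed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-32: code point number n is simply element n (constant time).
char32_t nth_utf32(const std::u32string& s, std::size_t n)
{
    assert(n < s.size());
    return s[n];
}

// UTF-8: find the byte offset where code point number n starts by counting
// lead bytes (linear time). Returns s.size() if n is past the end.
std::size_t offset_of_nth_utf8(const std::string& s, std::size_t n)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // Continuation bytes look like 10xxxxxx; everything else starts a character.
        bool lead = (static_cast<unsigned char>(s[i]) & 0xC0) != 0x80;
        if (lead) {
            if (seen == n)
                return i;
            ++seen;
        }
    }
    return s.size();
}
```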
 
Gianni Mariani

loufoque said:
This is Standard C++ terminology.
A Reversible Container is a Forward Container whose iterators are
Bidirectional Iterators.
A Random Access Container is a Reversible Container whose iterator type
is a Random Access Iterator.

As you can see, UTF-32/UCS-4 being a possible implementation for a
Random Access Container, it is "reversible" too.

Ah. I thought you were referring to Unicode terminology.

The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

Then again, when you look at the requirements for Unicode's composing
characters, it's a problem as well, for any encoding.

G
 
loufoque

Gianni Mariani wrote:
The problem with utf-8 and utf-16 is that they're multibyte
(multi-value) in nature. Making a reversible iterator is non-trivial.

A bidirectional iterator for utf-8 or utf-16 is pretty easy to make.
It's because the characters have variable length in bytes that you can
only iterate forward and backward and not use random access.
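
A minimal sketch of the stepping logic such a bidirectional iterator would need for UTF-8, assuming well-formed input held in a std::string. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx, so a character boundary can be found in either direction. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}

// Byte index where the code point after the one starting at `pos` begins.
std::size_t next_code_point(const std::string& s, std::size_t pos)
{
    assert(pos < s.size());
    ++pos;
    while (pos < s.size() && is_continuation(static_cast<unsigned char>(s[pos])))
        ++pos;
    return pos;
}

// Byte index where the code point before position `pos` begins.
std::size_t prev_code_point(const std::string& s, std::size_t pos)
{
    assert(pos > 0 && pos <= s.size());
    --pos;
    while (pos > 0 && is_continuation(static_cast<unsigned char>(s[pos])))
        --pos;
    return pos;
}
```

Wrapping these two functions in operator++ and operator-- would give the bidirectional iterator; jumping n characters at once still requires n steps.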
 
Martin Vejnar

Tomás said:
I'm writing a Windows control that you can place on a dialog box. As some of
you may know, the earlier versions of Windows (95, 98, Me) all used ASCII
internally when dealing with strings. Characters were stored in 8-Bits.

Now, all the Windows versions are using Unicode. My control will display
text, and so I want it to be able to display Unicode text. Unicode
characters are stored using 16 bits on Windows.

<OT>

If by "portable" you mean "running on any Windows system using its
native character encoding" and you are willing to have two binaries, one
for 9x/Me and one for 2k/XP, why don't you use the solution that
Microsoft has already created?

Use 'TCHAR' for characters, 'std::basic_string<TCHAR>' for strings, and
enclose string literals in '_T'.

I think you are unnecessarily trying to reinvent the wheel.

</OT>
 
Gianni Mariani

loufoque said:
A bidirectional iterator for utf-8 or utf-16 is pretty easy to make.
It's because the characters have variable length in bytes that you can
only iterate forward and backward and not use random access.

I didn't say "hard", I said "non trivial", i.e. it's not a simple
increment or decrement of a pointer.

utf-16 is especially hard if the data is a mix of endianness since you
would need to check for embedded BOMs unless the string is normalized.
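
As a rough sketch of the endianness issue, the function below turns a UTF-16 byte stream into code units using only a leading byte order mark. It does not handle BOMs embedded later in the data, which is the harder case mentioned above. The function name utf16_units and the big-endian default when no BOM is present are assumptions of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Convert a UTF-16 byte stream to code units, honouring a leading BOM.
// Assumption of this sketch: big-endian when no BOM is present.
std::u16string utf16_units(const std::string& bytes)
{
    assert(bytes.size() % 2 == 0);  // UTF-16 data comes in byte pairs
    bool little = false;
    std::size_t i = 0;
    if (bytes.size() >= 2) {
        unsigned char b0 = static_cast<unsigned char>(bytes[0]);
        unsigned char b1 = static_cast<unsigned char>(bytes[1]);
        if (b0 == 0xFF && b1 == 0xFE)      { little = true; i = 2; } // LE BOM
        else if (b0 == 0xFE && b1 == 0xFF) { i = 2; }                // BE BOM
    }
    std::u16string out;
    for (; i + 1 < bytes.size(); i += 2) {
        unsigned first  = static_cast<unsigned char>(bytes[i]);
        unsigned second = static_cast<unsigned char>(bytes[i + 1]);
        unsigned hi = little ? second : first;   // most significant byte
        unsigned lo = little ? first : second;   // least significant byte
        out += static_cast<char16_t>((hi << 8) | lo);
    }
    return out;
}
```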
 
Tomás

I've gone as far as to let both character sets be used at the same time.
Any opinions and suggestions welcome.

#include <iostream>
using std::cout;
using std::endl;

#define Literal(x) StringLiteral( x, L##x )
/*
The macro creates an anonymous object of type "StringLiteral".
It passes two arguments to its constructor: the char version
of the string, and the wchar_t version of the string.
*/

class StringLiteral
{
private:

    const char* const p_c;
    const wchar_t* const p_w;

public:

    StringLiteral( const char* const c, const wchar_t* const w)
        : p_c(c), p_w(w) {}

    // The conversion operators are const so they also work on const objects.
    operator const char*() const { return p_c; }

    operator const wchar_t*() const { return p_w; }

};

void GiveMeAnsiString(const char* p)
{
    cout << "Ansi!" << endl;
}

void GiveMeUnicodeString(const wchar_t* p)
{
    cout << "Unicode!" << endl;
}

int main()
{
    // The same macro call converts to whichever pointer type the callee wants.
    GiveMeAnsiString( Literal("Amn't I a pretty string!") );

    GiveMeUnicodeString( Literal("Amn't I a pretty string!") );
}


-Tomás
 
