Converting from UTF-16 to UTF-32

Jimmy Shaw

Hi everybody,

Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
mixed up, but is it possible that all UTF-16 "code points" that are 16
bits long appear just the same in UTF-32, but with zero padding and
hence no real conversion is necessary?

If I am completely wrong and some intricate conversion operation needs
to take place, can anyone give me some primer on the subject?

Thanks!
 
Clark S. Cox III

Hi everybody,

Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
mixed up, but is it possible that all UTF-16 "code points" that are 16
bits long appear just the same in UTF-32, but with zero padding and
hence no real conversion is necessary?

First, your question is off-topic here, as it isn't really a C++ question.

[offtopic]But there is indeed a conversion that is needed (otherwise,
UTF-32 would be a pointless waste of space or UTF-16 would be
incomplete)[/offtopic]
If I am completely wrong and some intricate conversion operation needs
to take place, can anyone give me some primer on the subject?

Thanks!

[offtopic]The conversion isn't really intricate at all. See
http://www.zvon.org/tmRFC/RFC2781/Output/chapter2.html#sub2 for a
description of the algorithms used to convert to/from UTF-16.[/offtopic]
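For reference, the decode direction of the RFC 2781 algorithm fits in a few lines of C++. This is only an illustrative sketch (the function name and error handling here are invented, and the input is assumed to be in native byte order):

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Decode one code point from a UTF-16 sequence, following the algorithm
// in RFC 2781 section 2.2. Sets *units to the number of 16-bit units
// consumed (1 or 2) and returns the code point.
std::uint32_t decode_one( const std::uint16_t *seq, std::size_t *units ) {
    std::uint32_t w1 = seq[ 0 ];
    if ( w1 < 0xD800 || w1 > 0xDFFF ) { // not a surrogate: the unit IS the code point
        *units = 1;
        return w1;
    }
    if ( w1 > 0xDBFF ) // a trailing surrogate cannot appear first
        throw std::runtime_error( "unpaired trailing surrogate" );
    std::uint32_t w2 = seq[ 1 ];
    if ( w2 < 0xDC00 || w2 > 0xDFFF ) // a leading surrogate must be followed by a trailing one
        throw std::runtime_error( "missing trailing surrogate" );
    *units = 2;
    // Combine the low ten bits of each unit and re-add the 0x10000 offset.
    return ( ( w1 - 0xD800 ) << 10 ) + ( w2 - 0xDC00 ) + 0x10000;
}
```

So a BMP character such as U+20AC passes through unchanged, while the pair 0xD834 0xDD1E decodes to U+1D11E.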
 
Julián Albo

Jimmy said:
Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
mixed up, but is it possible that all UTF-16 "code points" that are 16
bits long appear just the same in UTF-32, but with zero padding and
hence no real conversion is necessary?
If I am completely wrong and some intricate conversion operation needs
to take place, can anyone give me some primer on the subject?

http://www.unicode.org/
 
Kirit Sælensminde

Jimmy said:
Hi everybody,

Is there any SIMPLE way to convert from UTF-16 to UTF-32? I may be
mixed up, but is it possible that all UTF-16 "code points" that are 16
bits long appear just the same in UTF-32, but with zero padding and
hence no real conversion is necessary?

If I am completely wrong and some intricate conversion operation needs
to take place, can anyone give me some primer on the subject?

Thanks!

These are the important bits from the functions that I use. utf32 is a
typedef for a signed 32-bit integer (__int32 on MSVC). utf16 is
normally the same as wchar_t on most platforms, but in case it isn't,
it needs to be a sixteen-bit type. The UTF-16 sequence is assumed to be
little-endian - the same as Windows uses.

You do need this sort of belt-and-braces approach, though, as malformed
sequences are a prime vector for security exploits. The checks are even
more important for UTF-8 sequences. There are a lot of improvements
that could be made, but the code does work.


std::size_t FSLib::utf::utf16length( const utf32 ch ) {
    if ( ch < 0x10000 ) return 1;
    else return 2;
}

utf32 FSLib::utf::assertValid( const utf32 ch ) {
    try {
        if ( ch >= 0xD800 && ch <= 0xDBFF )
            throw FSLib::Exceptions::UnicodeEncoding( L"UTF-32 character is in the UTF-16 leading surrogate pair range." );
        if ( ch >= 0xDC00 && ch <= 0xDFFF )
            throw FSLib::Exceptions::UnicodeEncoding( L"UTF-32 character is in the UTF-16 trailing surrogate pair range." );
        if ( ch == 0xFFFE || ch == 0xFFFF )
            throw FSLib::Exceptions::UnicodeEncoding( L"UTF-32 character is disallowed (0xFFFE/0xFFFF)" );
        if ( ch > 0x10FFFF )
            throw FSLib::Exceptions::UnicodeEncoding( L"UTF-32 character is beyond the allowable range." );
        return ch;
    } catch ( FSLib::Exceptions::UnicodeEncoding &e ) {
        e.info() << L"Character value is: " << ch << std::endl;
        throw;
    }
}

utf32 FSLib::utf::decode( const utf16 *seq ) {
    try {
        utf32 ch = *seq;
        if ( ch >= 0xD800 && ch <= 0xDBFF ) {
            if ( seq[ 1 ] == 0 )
                throw FSLib::Exceptions::UnicodeEncoding( L"Trailing surrogate missing from UTF-16 sequence (it is ZERO)" );
            if ( seq[ 1 ] < 0xDC00 || seq[ 1 ] > 0xDFFF )
                throw FSLib::Exceptions::UnicodeEncoding( L"Trailing character in a UTF-16 surrogate pair is missing (outside correct range)" );
            return assertValid( ( ch << 10 ) + seq[ 1 ] + 0x10000 - ( 0xD800 << 10 ) - 0xDC00 );
        }
        return assertValid( ch );
    } catch ( FSLib::Exceptions::Exception &e ) {
        e.info() << L"Decoding UTF-16 number: " << toString( unsigned int( seq[ 0 ] ) ) << std::endl;
        e.info() << L"Preceding UTF-16 number: " << toString( unsigned int( seq[ -1 ] ) ) << std::endl;
        e.info() << L"Following UTF-16 number: " << toString( unsigned int( seq[ 1 ] ) ) << std::endl;
        throw;
    }
}
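For the reverse direction (UTF-32 back to UTF-16), the same arithmetic runs backwards. A minimal sketch along the lines of the code above - this is not part of FSLib, and the function name is invented here:

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Encode one code point as UTF-16 (native byte order) into out[0..1].
// Returns the number of 16-bit units written (1 or 2).
std::size_t encode_one( std::uint32_t ch, std::uint16_t *out ) {
    if ( ch >= 0xD800 && ch <= 0xDFFF ) // surrogate values are not code points
        throw std::runtime_error( "code point in surrogate range" );
    if ( ch > 0x10FFFF )
        throw std::runtime_error( "code point beyond U+10FFFF" );
    if ( ch < 0x10000 ) { // BMP: the value fits in a single unit
        out[ 0 ] = static_cast< std::uint16_t >( ch );
        return 1;
    }
    ch -= 0x10000; // split the remaining 20 bits across a surrogate pair
    out[ 0 ] = static_cast< std::uint16_t >( 0xD800 + ( ch >> 10 ) );
    out[ 1 ] = static_cast< std::uint16_t >( 0xDC00 + ( ch & 0x3FF ) );
    return 2;
}
```

Encoding U+1D11E, for instance, produces the pair 0xD834 0xDD1E, which the decode above maps straight back.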
 
dayton

Clark said:
That's certainly not true.

Concur. Most non-Windows platforms use a full integer for
wchar_t. Using locales and <codecvt> your iostreams probably
already provide the capability. Check your platform's wchar_t
to see if it already is in UTF32.

Dinkumware (http://www.dinkumware.com/) sells an extension
library that includes <codecvt> converters for UTF16.
 
P.J. Plauger

Concur. Most non-Windows platforms use a full integer for wchar_t. Using
locales and <codecvt> your iostreams probably already provide the
capability. Check your platform's wchar_t to see if it already is in
UTF32.

Dinkumware (http://www.dinkumware.com/) sells an extension library that
includes <codecvt> converters for UTF16.

Yep, except that they're now included as part of our standard
(Compleat) library product.

P.J. Plauger
Dinkumware, Ltd.
http://www.dinkumware.com
 
