Upgrade from Windows-1252 to UCS-2

Boris

I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows "ANSI"
code page) to UCS-2. Currently the program reads and writes files encoded
in Windows-1252 but should be able to read files encoded in UCS-2, too.

As I don't want to deal with two character representations in the program
I plan to use UCS-2 internally. I should be able to simply use
std::wstring then? When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends now
on the implementation of the C++ standard library if and what kind of
conversions are supported? I might need to use a third-party library like
the Dinkum Conversions Library which converts data on the fly or something
like UTF-8 CPP where I can call functions explicitly to convert between
character sets?

After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct that
I'm safe to use member functions of std::wstring as long as the character
set used is not multibyte?

Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?

Anything I might have missed?

Boris
 
John Harrison

Boris said:
I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows
"ANSI" code page) to UCS-2. Currently the program reads and writes files
encoded in Windows-1252 but should be able to read files encoded in
UCS-2, too.

As I don't want to deal with two character representations in the
program I plan to use UCS-2 internally. I should be able to simply use
std::wstring then?

Yes.

When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends
now on the implementation of the C++ standard library if and what kind
of conversions are supported? I might need to use a third-party library
like the Dinkum Conversions Library which converts data on the fly or
something like UTF-8 CPP where I can call functions explicitly to
convert between character sets?

AFAIK a third party library (or writing your own code) is the only way
to go. For Windows-1252 to UCS-2 why not write your own? It can't be
that hard.
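
A hand-written converter along the lines suggested above could look like the following minimal sketch. It assumes C++11 and stores the result in std::u16string so that a code unit is 16 bits on every platform (on Windows, std::wstring would serve the same purpose). The names cp1252_to_ucs2 and kCp1252High are hypothetical. Passing the five bytes that Windows-1252 leaves unassigned through as the control character of the same value is an assumption, not something the thread specifies.

```cpp
#include <cassert>  // only needed for quick checks of the helper
#include <string>

// Code points for Windows-1252 bytes 0x80..0x9F. Every other byte maps to the
// code point with the same numeric value. The unassigned bytes (0x81, 0x8D,
// 0x8F, 0x90, 0x9D) are passed through unchanged here (an assumption).
static const char16_t kCp1252High[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

// Hypothetical helper: converts Windows-1252 bytes to 16-bit code units.
// Every Windows-1252 character is in the BMP, so one unit per byte suffices.
std::u16string cp1252_to_ucs2(const std::string& in) {
    std::u16string out;
    out.reserve(in.size());
    for (unsigned char b : in) {
        if (b >= 0x80 && b <= 0x9F)
            out.push_back(kCp1252High[b - 0x80]);
        else
            out.push_back(static_cast<char16_t>(b));
    }
    return out;
}
```

For instance, the Windows-1252 byte 0x80 (the euro sign) comes out as the single code unit 0x20AC, while bytes such as 0xE9 keep their numeric value.
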
After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct
that I'm safe to use member functions of std::wstring as long as the
character set used is not multibyte?

That's correct for UCS-2.
Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?

Some confusion here, I think: UTF-8 and Windows-1252 are not the same.
The first is a character encoding, the second is a character set.

But yes, to convert UCS-2 to UTF-8 is another step for which you could
either get a third party library or write your own code.
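
As a rough illustration of the "write your own code" option for the output side, the sketch below encodes UCS-2 code units as UTF-8. It assumes C++11 and std::u16string as the internal string type. The function name ucs2_to_utf8 is hypothetical, and the assert documents the assumption that the input really is UCS-2, i.e. contains no surrogate code units.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: encodes UCS-2 (BMP-only) text as UTF-8.
// Each code unit becomes one, two, or three bytes.
std::string ucs2_to_utf8(const std::u16string& in) {
    std::string out;
    for (char16_t c : in) {
        // Precondition (assumption): genuine UCS-2, so no surrogates.
        assert(c < 0xD800 || c > 0xDFFF);
        if (c < 0x80) {
            // ASCII range: one byte, identical to the ASCII value.
            out.push_back(static_cast<char>(c));
        } else if (c < 0x800) {
            // Two bytes: 110xxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        } else {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
        }
    }
    return out;
}
```

With this sketch, U+00E9 becomes the two bytes C3 A9 and U+20AC becomes the three bytes E2 82 AC.
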
Anything I might have missed?

Boris

john
 
John Harrison

Some confusion here, I think: UTF-8 and Windows-1252 are not the same.
The first is a character encoding, the second is a character set.

I want to take that back: Windows-1252 is an encoding too, but it's still
the case that it's not the same as UTF-8.

john
 
Boris

I want to take that back: Windows-1252 is an encoding too, but it's still
the case that it's not the same as UTF-8.

Thanks, John! I should have clarified it better: The idea is that files
with an ASCII-compatible subset of UTF-8 look like normal ASCII files when
encoded in UTF-8 (so other programs can simply assume they are ASCII
files).
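
The point about the ASCII-compatible subset can be checked mechanically: a file whose bytes are all below 0x80 reads the same as ASCII, Windows-1252 and UTF-8. The sketch below assumes C++11, and the helper name is_pure_ascii is hypothetical.

```cpp
#include <cassert>  // only needed for quick checks of the helper
#include <string>

// Hypothetical helper: true if every byte is in the 7-bit ASCII range.
// Such data is byte-for-byte identical in ASCII, Windows-1252 and UTF-8.
bool is_pure_ascii(const std::string& bytes) {
    for (unsigned char b : bytes)
        if (b >= 0x80) return false;
    return true;
}
```

As soon as a character such as U+00E9 appears, the encodings diverge: Windows-1252 writes the single byte E9, UTF-8 writes the two bytes C3 A9, and a program that assumes Windows-1252 will display the UTF-8 file incorrectly.
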

Boris
 
Kirit Sælensminde

I'm trying to find out what the steps look like to upgrade a program
(which is used on Windows and Unix) from Windows-1252 (the Windows "ANSI"
code page) to UCS-2. Currently the program reads and writes files encoded
in Windows-1252 but should be able to read files encoded in UCS-2, too.

I think you mean UCS-4 and UTF-16. Old documents talk about UCS-2, but
current Windows (and I assume Linux etc.) is UCS-4. This causes no end
of confusion especially as for most purposes there isn't much
difference. Check that your software manages to handle the treble
clef character properly. Let's see how it works here :)


As I don't want to deal with two character representations in the program
I plan to use UCS-2 internally. I should be able to simply use
std::wstring then? When Windows-1252 encoded files are read I have to
convert the data to UCS-2 though. My understanding is that it depends now
on the implementation of the C++ standard library if and what kind of
conversions are supported? I might need to use a third-party library like
the Dinkum Conversions Library which converts data on the fly or something
like UTF-8 CPP where I can call functions explicitly to convert between
character sets?

Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of breaking the
string.
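
To illustrate the truncation problem, here is a minimal sketch of a prefix function that never cuts a surrogate pair in half. It assumes C++11, well-formed UTF-16 input, and std::u16string as a stand-in for a 16-bit std::wstring. The names is_high_surrogate and safe_prefix are hypothetical.

```cpp
#include <cassert>  // only needed for quick checks of the helper
#include <cstddef>
#include <string>

// True for the first code unit of a surrogate pair (0xD800..0xDBFF).
inline bool is_high_surrogate(char16_t c) {
    return c >= 0xD800 && c <= 0xDBFF;
}

// Hypothetical helper: returns at most max_units code units from the front
// of s, backing off by one unit if the cut would separate a high surrogate
// from the low surrogate that follows it. Assumes well-formed UTF-16.
std::u16string safe_prefix(const std::u16string& s, std::size_t max_units) {
    if (max_units >= s.size()) return s;
    std::size_t n = max_units;
    if (n > 0 && is_high_surrogate(s[n - 1])) --n;
    return s.substr(0, n);
}
```

For the four-unit string "ab" followed by U+1D11E (the treble clef, stored as the pair D834 DD1E), asking for three units yields only "ab", because the third unit is the first half of the pair.
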
After converting everything to UCS-2 and storing it in std::wstring I
suppose I can use the well-known string functions to search, replace,
compare strings (including < and >) etc. Is my understanding correct that
I'm safe to use member functions of std::wstring as long as the character
set used is not multibyte?

If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.
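
The following sketch shows one way to work around the single-character members, assuming C++11 and well-formed UTF-16 in a std::u16string. All three function names are hypothetical. The idea is to count code points by skipping low surrogates, and to search for a character by first turning it into a one- or two-unit string.

```cpp
#include <cassert>  // only needed for quick checks of the helpers
#include <cstddef>
#include <string>

// Counts code points in well-formed UTF-16 by not counting low surrogates
// (0xDC00..0xDFFF), which are always the second half of a pair.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t c : s)
        if (c < 0xDC00 || c > 0xDFFF) ++n;
    return n;
}

// Encodes one code point as one or two UTF-16 code units.
std::u16string to_utf16(char32_t cp) {
    std::u16string s;
    if (cp < 0x10000) {
        s.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        s.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        s.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
    return s;
}

// Searches for a code point by searching for its UTF-16 string form;
// returns the code-unit index or std::u16string::npos.
std::size_t find_code_point(const std::u16string& haystack, char32_t cp) {
    return haystack.find(to_utf16(cp));
}
```

A string holding "a", U+1D11E and "b" has size() 4 but three code points, and its element at index 1 is 0xD834, which is only half of the treble clef.
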
Last but not least the program needs to save files again. It might make
sense to use UTF-8 here for backward compatibility (as other programs
might be able to read the files more easily if they support only
Windows-1252). Thus I would need another converter to make sure that
std::wstring is encoded in UTF-8 correctly which means I need a
third-party tool again?

Anything I might have missed?

To convert from UTF-16 to UTF-8 is fairly simple, but don't forget you
HAVE to go through UTF-32.
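
"Going through UTF-32" means decoding each surrogate pair to its code point before encoding that code point as UTF-8. A minimal sketch, assuming C++11 and std::u16string as input, is shown below. The names append_utf8 and utf16_to_utf8 are hypothetical, and replacing unpaired surrogates with U+FFFD is a policy choice made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding (1 to 4 bytes) of one code point.
void append_utf8(std::string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);  // outside this range is not a Unicode code point
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Hypothetical helper: UTF-16 -> code point (the UTF-32 value) -> UTF-8.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        bool high = cp >= 0xD800 && cp <= 0xDBFF;
        if (high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine the pair into a single code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // unpaired surrogate: substitute U+FFFD
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Encoding the two halves of a pair separately as three-byte sequences would not produce valid UTF-8, which is why the pair has to be combined first. With this sketch U+1D11E becomes the four bytes F0 9D 84 9E.
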

It's not directly about your situation, but you may find this
interesting as it does discuss some of the issues about encodings and
Unicode.

http://www.kirit.com/Getting the correct Unicode path within an ISAPI filter


The way you're going about it is a good way to start this sort of
conversion. In the end for our systems we made our own
std::basic_string like class that knows it is UTF-16 and alters parts
of the interface accordingly.

Once you start working with Unicode you won't want to go back.


K
 
Boris

Your std::wstring will be in UTF-16 (on Windows, maybe UTF-32 on
Linux). Whether this matters or not depends on what string
manipulation you do. E.g. you can safely substr on a boundary where
you find a certain character, but you cannot safely take the first
twenty wchar_ts from a std::wstring without a chance of breaking the
string.

What's so special about the first twenty wchar_ts? It's the first time I
hear about it.
[...]If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.

My idea was to use UCS-2 internally as with UTF-16 you'll get all kinds of
problems like the one you described. I understand that you even created
your own std::basic_string class for your products. However I'm trying to
go the easy way. :) I understand that UCS-2 might not be sufficient for
all Unicode characters but for now that's the price I'm ready to pay. Or
do I really miss anything important (if for example the Klingon characters
don't fit in UCS-2 anymore I really don't mind :)?

Boris
 
Kirit Sælensminde

What's so special about the first twenty wchar_ts? It's the first time I
hear about it.

Nothing. It was just an example. You can't take any internal range
between positions n and m in a UTF-16 sequence without checking that
you don't cut surrogate pairs in half.

You will generally be OK so long as you use a string instead of a
wchar_t for single character operations - ie. every place you would
get user input as one character handle it internally as a string. You
also need to make sure that you never use functions like substr at any
boundary that has not been found by searching within the string.
[...]If you ignore normal forms and collation ordering. std::wstring's
members that return single wchar_ts may only give you half of a
surrogate pair. Searching for some Unicode characters through the
string's single character members won't be possible. UTF-16 is
multibyte.

My idea was to use UCS-2 internally as with UTF-16 you'll get all kinds of
problems like the one you described. I understand that you even created
your own std::basic_string class for your products. However I'm trying to
go the easy way. :) I understand that UCS-2 might not be sufficient for
all Unicode characters but for now that's the price I'm ready to pay. Or
do I really miss anything important (if for example the Klingon characters
don't fit in UCS-2 anymore I really don't mind :)?

Then you need to strip the surrogate pairs from your code, but I'm not
sure that I'd recommend it. So long as you are careful with the string
operations you'll be fine with UTF-16 and converting to UTF-8 is
pretty simple. You should be able to write a simple iterator based
algorithm that does it in a short amount of code.

If you are interested in licensing the implementations that we have
then you can contact me via email or my web site.


K
 
Boris

[...]Nothing. It was just an example. You can't take any internal range

Ah, okay, then I understand. :)
[...]Then you need to strip the surrogate pairs from your code, but I'm not
sure that I'd recommend it. So long as you are careful with the string
operations you'll be fine with UTF-16 and converting to UTF-8 is

The problem is that the code base is pretty big. There are strings used
everywhere and of course the string member functions we all know from the
C++ standard. I expect that it's rather simple to replace std::string with
std::wstring. But I try to avoid having to make a complete code review to
figure out if the strings are used everywhere correctly. If I simply use
UCS-2 in std::wstring I should be more or less done? Or is there any trick
to make a std::wstring aware of UTF-16 - can't possibly work?

Boris
 
John Harrison

Boris said:
[...]Nothing. It was just an example. You can't take any internal range

Ah, okay, then I understand. :)
[...]Then you need to strip the surrogate pairs from your code, but I'm not
sure that I'd recommend it. So long as you are careful with the string
operations you'll be fine with UTF-16 and converting to UTF-8 is

The problem is that the code base is pretty big. There are strings used
everywhere and of course the string member functions we all know from
the C++ standard. I expect that it's rather simple to replace
std::string with std::wstring. But I try to avoid having to make a
complete code review to figure out if the strings are used everywhere
correctly. If I simply use UCS-2 in std::wstring I should be more or
less done? Or is there any trick to make a std::wstring aware of UTF-16
- can't possibly work?

Boris

You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.

On the other hand if you really mean the more modern UTF-16 then
surrogate pairs are an issue. Frankly though I'd stick with UCS-2.

john
 
John Harrison

Boris said:
Thanks, John! I should have clarified it better: The idea is that files
with an ASCII-compatible subset of UTF-8 look like normal ASCII files
when encoded in UTF-8 (so other programs can simply assume they are
ASCII files).

Boris

That is true, but again ASCII is not the same as Windows-1252. You need
to be precise about your terminology.

john
 
Boris

[...]You need to be careful about terminology. If you really mean UCS-2
then the surrogate pairs problem that Kirit mentioned is not a problem
for the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.

On the other hand if you really mean the more modern UTF-16 then
surrogate pairs are an issue. Frankly though I'd stick with UCS-2.

Yes, I really do mean UCS-2. Kirit started to talk about UTF-16, not me. :)

Boris
 
Kirit Sælensminde

You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.

Windows now uses UTF-16 not UCS-2. Early versions of Windows used
UCS-2 and there is still a lot of documentation from that era on the
web. I don't remember which versions changed from UCS-2 to UCS-4.

If all of the strings are generated internally then you can probably
get away with assuming UCS-2 so long as you reject the surrogate
pairs. At every location that strings enter the program they will need
to be checked.
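
The check at each entry point can be small. The sketch below assumes C++11 and that incoming text has already been read into a std::u16string. The name is_valid_ucs2 is hypothetical, and what to do with rejected input (error, replacement character, and so on) is left to the caller.

```cpp
#include <cassert>  // only needed for quick checks of the helper
#include <string>

// Hypothetical helper: true if the text contains no surrogate code units
// (0xD800..0xDFFF), i.e. every unit is a complete BMP character and the
// string can be treated as UCS-2.
bool is_valid_ucs2(const std::u16string& s) {
    for (char16_t c : s)
        if (c >= 0xD800 && c <= 0xDFFF) return false;
    return true;
}
```

Text limited to characters such as U+00E9 or U+20AC passes, while any string containing a character outside the BMP, for example U+1D11E, is rejected.
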


K
 
James Kanze

You need to be careful about terminology. If you really mean UCS-2 then
the surrogate pairs problem that Kirit mentioned is not a problem for
the simple reason that UCS-2 doesn't have surrogate pairs. I believe
that Windows internally still uses UCS-2, though I could be wrong.

If you want to be really careful: it may use both, since they
refer to different things. UCS-2 specifies a character set and
its abstract encoding: the mapping between a specific character
and a numeric value. UTF-16 specifies an encoding format: the
way the numeric value is represented in a particular context
(memory, media, etc.). UTF-16 (like UTF-8) can be used to
represent both UCS-2 and UCS-4.

I think modern Windows uses UCS-4 in UTF-16 format, at least on
disk and at the API level.
On the other hand if you really mean the more modern UTF-16
then surrogate pairs are an issue. Frankly though I'd stick
with UCS-2.

The choice of the code set depends on the characters you need.
If I were writing a compiler for K&R C, I'd stick with US ASCII;
it's a lot simpler than either, and has all the necessary
characters. If I have to handle text in a far eastern
language, on the other hand, I probably need UCS-4, regardless
of what I want, because it is the only encoding which has all of
the characters I need.

Depending on what I'm doing, internally, I'll use UTF-32 or
UTF-8. Probably... I've never worked in an environment which
had any native support for UTF-16, and perhaps in some cases,
the presence of native support would win out.
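
For the UTF-32 option, a sketch of decoding UTF-16 input into one 32-bit element per code point might look like this. It assumes C++11. The name utf16_to_utf32 is hypothetical, and unpaired surrogates are copied through unchanged here, which a real program would probably want to treat as an error.

```cpp
#include <cassert>  // only needed for quick checks of the helper
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-16 into UTF-32, so that every element of
// the result is one whole code point and indexing never lands mid-character.
std::u32string utf16_to_utf32(const std::u16string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // High surrogate followed by low surrogate: combine the pair.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        }
        out.push_back(cp);  // unpaired surrogates pass through unchanged
    }
    return out;
}
```

The trade-off is memory: every character takes four bytes, but members such as operator[] and substr then operate on whole code points.
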
 
