isupper and islower for wstring


Rahul

Hi,

I have a std::wstring and I want to find which characters are upper
case and which ones are lowercase. std::isupper and std::islower seem
to work on ASCII characters only, but I want to be able to detect all
kinds of uppercase and lowercase characters,
e.g. á is "Latin small letter a with acute"
and Á is "Latin capital letter A with acute".

Is there any function (MFC, Boost, or any other library) which I can
use to make the above distinction? My application is a native
VC++ program.

Thanks in advance
Rahul
 

David Lowndes

I suspect you may have to revert to using the Windows API IsCharUpper.

Dave
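David's suggestion can be sketched as follows. IsCharUpperW is a Windows API, so this sketch falls back to the standard iswupper on other platforms; the function name UpperPositions is just an illustration, not anything from the thread.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Return the indices of the uppercase characters in a wide string.
// On Windows this uses IsCharUpperW, as suggested above; elsewhere it
// falls back to the standard iswupper so the sketch stays portable.
std::vector<std::size_t> UpperPositions(const std::wstring& s) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
#ifdef _WIN32
        if (IsCharUpperW(s[i]))
#else
        if (iswupper(static_cast<wint_t>(s[i])))
#endif
            out.push_back(i);
    }
    return out;
}
```

Unlike iswupper, IsCharUpperW does not depend on the C runtime locale, which is why it handles accented letters out of the box.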
 

Rahul

I suspect you may have to revert to using the Windows API IsCharUpper.

Dave

Hi Dave,

Basically I want a function to differentiate between "Unicode
lowercase characters" and all the other Unicode characters.
On reading the MSDN description I feel IsCharUpper is not the one, but
let me try it.
 

Rahul

Yes, it is exactly what I wanted (IsCharUpperW/IsCharLowerW).
Thanks, Dave.
 

Goran

Try GetStringType. C and C++ are horrible when it comes to Unicode.
You may also want to try the ICU library.

Goran.
 

Joseph M. Newcomer

The problem is that isupper defaults to the "C" locale, which makes it compatible with the
1975 PDP-11 implementation of the C language. You must call setlocale to choose the
target locale.

As already pointed out, the API IsCharUpper will be more robust, but note that it works in
the current user locale.
joe

Joseph M. Newcomer [MVP]
email: (e-mail address removed)
Web: http://www.flounder.com
MVP Tips: http://www.flounder.com/mvp_tips.htm
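Joe's point about the default "C" locale can be illustrated with a minimal sketch: until setlocale is called, the classifier only knows about A-Z. The helper name and the locale string in the note below are illustrative assumptions; named-locale availability varies by system.

```cpp
#include <clocale>
#include <cwctype>

// iswupper consults the current C locale. A program starts in the "C"
// locale, where only A-Z are classified as uppercase; calling setlocale
// first widens classification to whatever the chosen locale supports.
int CountUpper(const wchar_t* s) {
    int n = 0;
    for (; *s; ++s)
        if (iswupper(static_cast<wint_t>(*s)))
            ++n;
    return n;
}
```

After a call such as std::setlocale(LC_CTYPE, "en_US.UTF-8") — assuming that locale name exists on the system — CountUpper(L"Á") would typically report 1 as well, whereas in the startup "C" locale only the ASCII letters count.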
 

Joseph M. Newcomer

Note that IsCharUpper in a Unicode app becomes IsCharUpperW. I presume you have a Unicode
app. If you don't, you are already in trouble. If you have a Unicode app, there is no
need to explicitly call IsCharUpperW because that is what the IsCharUpper macro expands
to.
joe

Yes, it is exactly what I wanted (IsCharUpperW/IsCharLowerW).
Thanks, Dave.
 

Joseph M. Newcomer

See below...
This is kind of circular. About the only meaning of "Unicode app" in MSVC
is that the 'W' variants are used for string-specific functions. This is
done by some macro trickery with all the pitfalls of macros. So if I always
call the 'W' or 'A' variants explicitly as needed there is no need to mark
the application as Unicode (or not) and as a bonus I can #undef all the
conflicting macros. Another bonus is that the code will not change meaning
silently if somebody defines or undefines UNICODE.
****
But there is no need to put the suffix on if you are just calling the API. And I'm not
sure what "pitfalls" occur since these macros are not particularly sophisticated--not
like, for example, the min and max macros, which really are dangerous. I'm not sure why
you think explicitly calling the W form is going to make sense if the app is not Unicode,
because you have not said explicitly that you are worrying about exceptions in the case of
an ANSI app. I'm not sure why anyone would want to #undef all the API calls just because
you harbor some mythical fear about macros. I wouldn't consider that a bonus, I'd
consider that silly. And the point of the APIs is that they retain meaning in both ANSI
and Unicode.
joe
****
Cheers
Paavo
 

Goran

About the only meaning of "Unicode app" in MSVC
is that the 'W' variants are used for string-specific functions.

"Unicode app" rather means "my application supports text in any
language you throw at it, and can mix several languages in one single
text". The "W" APIs are a tool to get Unicode (through UTF-16LE, as
that's what "W" APIs work with), a consequence, if you will. Hardly
"about the only meaning".
This is
done by some macro trickery with all the pitfalls of macros.

Pretty much ANYTHING in programming has pitfalls, but advantages, too.
In the case of the A/W variants of APIs, the advantage is that you don't see
gibberish at the end of function names, and you can move from MBCS to
Unicode more easily. And if your own code is correct in the first
place, there is no "macro pitfall" with the A/W functions. If you think
there is, show it.
So if I always
call the 'W' or 'A' variants explicitly as needed there is no need to mark
the application as Unicode (or not) and as a bonus I can #undef all the
conflicting macros. Another bonus is that the code will not change meaning
silently if somebody defines or undefines UNICODE.

Many consider this ability to change a bonus (admittedly, a much smaller
one today; everything should just be compiled with UNICODE/_UNICODE, and
you should reach for the MBCS function variants only in rare language-specific
cases).

All in all, I think that you're doing it wrong, you are making it more
complicated for yourself, and you're making it strange for other
people who are used to A/W APIs. If your code is your personal
project, OK, but if you're in a team, I think you should step back and
reconsider your opinions.

Goran.
 

James Kanze


What's wrong with iswupper (in wctype.h)? (Like isupper, it is
locale dependent.) Or using the equivalent functionality in
<locale>?

Also, be aware that concepts such as upper case don't
necessarily have any meaning in non-latin alphabets.
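The "equivalent functionality in &lt;locale&gt;" James mentions might look like this minimal sketch (HasUpper is an illustrative name, not a standard function):

```cpp
#include <locale>
#include <string>

// <locale> provides a std::isupper overload that takes the locale as an
// explicit argument, avoiding any global setlocale state.
bool HasUpper(const std::wstring& s, const std::locale& loc) {
    for (wchar_t c : s)
        if (std::isupper(c, loc))
            return true;
    return false;
}
```

Passing the locale explicitly makes the dependency visible at the call site, instead of hiding it in process-wide state the way setlocale does.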
 

Goran

What's wrong with iswupper (in wctype.h)?  (Like isupper, it is
locale dependent.)  Or using the equivalent functionality in
<locale>?

If a character is outside the basic multilingual plane, how do you plan to
put it in a wint_t? (I don't know whether case matters for languages
outside the BMP, but why wouldn't it?)

That's why I proposed GetStringType or ICU.

Goran.
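Goran's BMP point comes down to surrogate pairs: a code point at or above U+10000 is stored as two UTF-16 units, which a single-unit classifier never sees whole. A minimal sketch of reassembling such a pair:

```cpp
#include <cstdint>

// Combine a UTF-16 surrogate pair into the code point it encodes.
// Code points at or above U+10000 occupy two wchar_t units on Windows,
// so a single-unit classifier such as iswupper never sees them whole.
std::uint32_t CombineSurrogates(std::uint16_t high, std::uint16_t low) {
    return 0x10000u
        + ((static_cast<std::uint32_t>(high) - 0xD800u) << 10)
        + (static_cast<std::uint32_t>(low) - 0xDC00u);
}
```

For example, the pair 0xD835 0xDC00 encodes U+1D400 (MATHEMATICAL BOLD CAPITAL A) — a cased letter outside the BMP, which is exactly the case Goran is raising.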
 

James Kanze

If a character is outside the basic multilingual plane, how do you plan to
put it in a wint_t? (I don't know whether case matters for languages
outside the BMP, but why wouldn't it?)

That is a general problem with all such functions which take
a single code point (even in UTF-32, although it probably only
affects very, very few characters with UTF-32).
That's why I proposed GetStringType or ICU.

GetStringType seems to have a fairly complicated interface; I'm
not too sure about ICU. But you're right. And the complicated
interface is due to the fact that the problem itself is more
complicated than it might appear at first glance. (Somewhere
floating around, I've got code which implements the functions in
ctype for UTF-8. Obviously, it takes two iterators to bytes,
rather than a single int, as argument. And the actual tables it
uses are generated from the UnicodeData.txt file. But one of
the things I learned while doing it is that some obvious
definitions, like isupper, are far from obvious once you leave
the usual Western European conventions. And that it still isn't
really correct, because it ignores composed characters, and only
treats single code points.)
 

Joseph M. Newcomer

See below...
I was specifically referring to the "Use Unicode character set" setting
present in MSVC project configurations. For supporting Unicode in general
there are of course multiple ways, I am preferring UTF-8 myself.
*****
UTF-8 is great at the edges, but it sucks internally.
*****
My programs are portable to Linux/MacOSX and all strings are internally
in UTF-8. Interfacing with Windows system calls typically looks something
like:
****
Most people have found trying to use UTF-8 internally is a nightmare.
*****
std::string name = ...;

HANDLE f = ::CreateFileW(Utf2Win(name).c_str(), GENERIC_READ,
    FILE_SHARE_READ | FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
    FILE_ATTRIBUTE_NORMAL, NULL);

-- or --

// ...
} catch (const std::exception& e) {
    ::MessageBoxW(NULL, Utf2Win(e.what()).c_str(),
        L"Error in baseutil load", MB_OK);
}

The Utf2Win() function converts from UTF-8 to UTF-16 for the Windows wide-
character APIs. This is done only at the application boundary. I do not see
any gain in defining UNICODE and relying on macro trickery to achieve the
exact same code as what I have written here.
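Utf2Win is Paavo's own helper and its implementation isn't shown in the thread; a portable sketch of the UTF-8 to UTF-16 conversion such a helper performs might look like this (on Windows one would normally just call MultiByteToWideChar with CP_UTF8). It assumes well-formed input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// A portable sketch of a UTF-8 -> UTF-16 conversion such as a helper
// like Utf2Win might perform. Assumes well-formed input; a production
// version needs validation and error handling.
std::u16string Utf8ToUtf16(const std::string& in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char b = in[i];
        std::uint32_t cp;
        if (b < 0x80)      { cp = b; i += 1; }              // 1-byte sequence
        else if (b < 0xE0) { cp = (b & 0x1Fu) << 6          // 2-byte sequence
                                | (in[i + 1] & 0x3Fu); i += 2; }
        else if (b < 0xF0) { cp = (b & 0x0Fu) << 12         // 3-byte sequence
                                | (in[i + 1] & 0x3Fu) << 6
                                | (in[i + 2] & 0x3Fu); i += 3; }
        else               { cp = (b & 0x07u) << 18         // 4-byte sequence
                                | (in[i + 1] & 0x3Fu) << 12
                                | (in[i + 2] & 0x3Fu) << 6
                                | (in[i + 3] & 0x3Fu); i += 4; }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {                       // outside the BMP: surrogate pair
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```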

Note that I'm not using MFC, this would probably change the rules of the
game.
****
Probably not much, if you can use UTF-8 comfortably. But most of us prefer the simplicity
of UTF-16 which for the bulk of locales is perfectly fine. If Microsoft introduced UTF-32
I could move to that without blinking.
joe
****
Cheers
Paavo
 

Miles Bader

Joseph M. Newcomer said:
Most people have found trying to use UTF-8 internally is a nightmare.

Not true, of course.

Given MS's full-court press to try and get people to use UTF-16, I can
understand why windows programmers might feel that way in some cases
though...

-Miles
 

Goran

My programs are portable to Linux/MacOSX and all strings are internally
in UTF-8. Interfacing with Windows system calls typically looks something
like:

std::string name = ...;

HANDLE f = ::CreateFileW(Utf2Win(name).c_str(), GENERIC_READ,
FILE_SHARE_READ|FILE_SHARE_WRITE, NULL, OPEN_EXISTING,
FILE_ATTRIBUTE_NORMAL, NULL);

That's not all that good as a general approach. As soon as you enter
windows-specific code, that string should be UTF-16. Your main-line
code does not look like the snippets you're showing here. Or at least it
shouldn't look like that. (Otherwise, you're doing inline-#define-based
platform independence, a silly thing to do.)

Instead, you should have platform-agnostic wrappers for platform-
specific stuff, and these will have platform-specific implementations.
Now... The implementation on Windows needs UTF-16, so it's most expedient
to convert your UTF-8 to UTF-16 either at wrapper entry, or when you
first pass your text to the Win API (see e.g. the implementation of _bstr_t
for an equivalent example). If you don't do this, you are carrying
UTF-8 around and converting it to UTF-16 at least once (and, as soon
as the wrapper isn't trivial, multiple times).

But of course, if platform-independence is the goal, a lot of code
would be better off using the likes of Qt and have platform-
independence cut-out for them (albeit, funnily, Qt's strings actually
use UTF-16)^^^.

And finally, once you do have said wrappers, you compile them with
UNICODE and look at Func1, Func2 instead of finger-sore-inducing and
eye-sore-inducing Func1W, Func2W.

Goran.

^^^ Use of UTF-16 (and not e.g. UTF-8) is IMO a good sign of platform
maturity when it comes to Unicode. Why? Because it kinda-sorta shows
that the platform has been around since Unicode meant BMP and UCS-2. And
indeed, Windows, Java, Qt, and ICU all picked UTF-16. That's not accident
or ignorance, it's historical convenience.
 

Joseph M. Newcomer

The problem with 8-bit apps is they use localized code pages to display text, which can be
a problem in certain locales. And locale determines some very important parameters, such
as collating sequence (sort order), and what is a "lower case" and "upper case" letter for
those languages which have the notion of case. These functions work on UTF-16 encoding
but not UTF-8. So if you need to compare two strings, you can't compare the UTF-8
encodings to determine collating sequence, and you can't do "bitwise" comparison of
UTF-16, but you can call the functions (e.g., CompareString, lstrcmp) which are
locale-aware and will sort the strings in accordance with the rules of the locale. Did
you know that CompareString does the right thing for Chinese symbols?
joe

I cannot see any benefits of UTF-16 over UTF-8. Both are packed formats and
need special care for accessing individual characters. And how is this
related to locales? Locale should specify how I want to see the dates or
numbers formatted, not which characters I can process or see on my screen.


If Microsoft introduced decent support of UTF-8 I could move to that
without blinking.

Cheers
Paavo
 

Goran

Currently I try to keep platform-dependent code in separate .cpp files,
to be compiled only for the given platform. My I/O wrapper
functions are quite trivial for Linux/MacOSX, as there the UTF-8 locales
are the de facto standard, and the Windows-specific wrappers contain UTF-8
to/from UTF-16 conversions. If I used UTF-16 internally it would just be
the other way around; no improvement in my mind.

I wanted to say that you should convert your text to UTF-16 when you
enter your win-wrapper, or eventually at first use of a win function
using the text, __not__ that you should use UTF-16 for all of your
text.

Goran.
 

Joseph M. Newcomer

See below...
Currently I try to keep platform-dependent code in separate .cpp files,
to be compiled only for the given platform. Currently my I/O wrapper
functions are quite trivial for Linux/MacOSX as there the UTF-8 locales
are the de facto standard, and Windows specific wrappers contain UTF-8
to/from UTF-16 conversions. If I used UTF-16 internally this would be the
other way around, no improvement in my mind.
****
I well and truly detest #ifdefs scattered throughout the source files--if you've ever been
a victim^H^H^H^H^H^Huser of anything from the Free Software Foundation, trying to figure
out what the 8-deep collection of complex #ifdef/#if defined(...) && defined(...) code
actually generates, or have ported it to a new platform, you realize that keeping it as
separate sources in separate directories is about the only sane approach. We adopted this
technique when porting code across six platforms, having six subdirectories (Unix, Ultrix,
Mac, Win16, VMS, and a couple of years later, Win32), and it was really a sane thing. We had
a few #ifdefs in one of the header files to define types; e.g., we would have used
something like "VCHAR", which might be defined as 'char' or 'TCHAR' or 'WCHAR'; the hardest
definitions dealt with how to represent 32-bit address arithmetic on Win16 (we had to make
the pointers HUGE, for those of you unfortunate enough to remember Win16). I highly recommend
this technique, and I still use it for porting.
****
I have got the impression Qt is mostly about GUI interfaces, but that's
not so interesting for us. Our GUIs are mostly done via web interfaces
(sometimes as embedded browser windows), and there UTF-8 is quite
widespread, again.

Cheers
Paavo
 

James Kanze

I cannot see any benefits of UTF-16 over UTF-8. Both are
packed formats and need special care for accessing individual
characters.

I'm not sure about the meaning of "packed" here, but all Unicode
encoding formats, including UTF-32, require special care for
accessing individual characters. The question is more or less
one of the cases when that special care is needed: I've dealt
with more than one application where no special care is needed
for UTF-8. And there are a lot of applications in which special
care will be needed for UTF-8, but not for UTF-16.
And how is this
related to locales? Locale should specify how I want to see the dates or
numbers formatted, not which characters I can process or see on my screen.

Locale also determines the behavior of such functions as isalpha
or toupper (or iswalpha or towupper). And rather obviously,
such functions depend on the encoding, so locales encompass
encodings.
If Microsoft introduced decent support of UTF-8 I could move to that
without blinking.

Define "decent support":). If any system introduced some sort
of comprehensive support for any of the Unicode encoding
formats, I'd jump at it. At present, if you need comprehensive
support, your best bet is ICU (which uses UTF-16). (The fact
that the only comprehensive support for Unicode uses UTF-16
suggests that programs requiring such support use UTF-16, even
on platforms where wchar_t is 32 bits. But judging from what
I've seen, such programs are the exception.)

At one time (up until about a year ago), I'd been experimenting
with implementing more or less comprehensive support for UTF-8,
but that's on hold at present.
 

James Kanze

The problem with 8-bit apps is they use localized code pages
to display text, which can be a problem in certain locales.

Use a UTF-8 locale:).
And locale determines some very important parameters, such as
collating sequence (sort order), and what is a "lower case"
and "upper case" letter for those languages which have the
notion of case. These functions work on UTF-16 encoding but
not UTF-8.

Yes and no. These functions don't work with multi-byte
characters (UTF-8), surrogates (UTF-16) or characters composed
of multiple code points (Unicode in general, regardless of the
encoding format used). And they are always locale dependent.
So if you need to compare two strings, you can't
compare the UTF-8 encodings to determine collating sequence,

You can *if* you are using the "native" collating sequence. And
otherwise, it's a question of implementation.
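The "native collating sequence" remark rests on a design property of UTF-8: comparing the raw bytes as unsigned values orders strings exactly as comparing their code points would. A minimal sketch (the function name is illustrative):

```cpp
#include <string>

// UTF-8 was designed so that comparing the raw bytes as unsigned values
// orders strings exactly as comparing their code points would. Note this
// is code-point order, not linguistic collation.
bool Utf8LessBytewise(const std::string& a, const std::string& b) {
    // std::string's operator< compares via char_traits<char>, which
    // treats the bytes as unsigned.
    return a < b;
}
```

So memcmp-style comparison of UTF-8 is self-consistent — "z" (U+007A) sorts before "é" (U+00E9) — but, as both posters agree, locale-aware collation still needs a collation-aware function.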
and you can't do "bitwise" comparison of UTF-16, but you can
call the functions (e.g., CompareString, lstrcmp) which are
locale-aware and will sort the strings in accordance with the
rules of the locale. Did you know that CompareString does the
right thing for Chinese symbols?

Which is a question of quality of implementation. There's no
reason for it to do the right thing with wchar_t, and not with
char in a UTF-8 locale. std::locale has a templated
operator()(basic_string<> const&, basic_string<> const&) which
does the right thing when used with
std::lexicographical_compare. This is the standard way of
handling the problem, and should work for both UTF-8 (with
std::string) and UTF-16 or UTF-32 (with std::wstring, depending
on the actual type of wchar_t), providing you have the
corresponding locales (which is a requirement anyway).
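The std::locale comparison James describes can be sketched like this; with std::locale::classic() the order is plain code-unit order, while a named locale (if available) would apply its collate facet:

```cpp
#include <algorithm>
#include <locale>
#include <string>
#include <vector>

// std::locale is itself a comparison functor: its operator() compares
// two basic_strings through the locale's collate facet, so a locale can
// be handed directly to std::sort as the comparator.
void SortCollated(std::vector<std::wstring>& v, const std::locale& loc) {
    std::sort(v.begin(), v.end(), loc);
}
```

This is the standard-library route to the same service CompareString provides on Windows, provided the runtime actually ships the locale you ask for.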
 
