Using std::lexicographical_compare with ignore case equality doesn't always work

Alex Buell

The short snippet below demonstrates the problem I'm having with
std::lexicographical_compare() in that it does not reliably work!

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <ctype.h>

bool compare_ignore_case_equals(char c1, char c2)
{
    return toupper(c1) == toupper(c2);
}

bool compare_ignore_case_less(char c1, char c2)
{
    return toupper(c1) < toupper(c2);
}

int main(int argc, char *argv[])
{
    std::vector<std::string> args(argv + 1, argv + argc);
    const char *words[] =
    {
        "add", "del", "new", "help"
    };

    std::vector<std::string> list(words, words + (sizeof words / sizeof words[0]));
    std::vector<std::string>::iterator word = list.begin();
    while (word != list.end())
    {
        std::cout << "Testing " << *word << " = " << args[0];
        if (std::lexicographical_compare(
                word->begin(), word->end(),
                args[0].begin(), args[0].end(),
                compare_ignore_case_equals))
        {
            std::cout << " found!\n";
            break;
        }

        std::cout << "\n";
        word++;
    }
}

Here's an example:

./quick new
Testing add = new
Testing del = new found!

That simply cannot be correct. What have I done wrong? Thanks
 
Alex Buell

if (std::lexicographical_compare(
word->begin(), word->end(),
args[0].begin(), args[0].end(),
compare_ignore_case_equals))

First, remove compare_ignore_case_equals and try again. You'll get
similar problems. Then read about lexicographical_compare and what
its return value means.
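
To make that hint concrete: std::lexicographical_compare answers "does the first range sort before the second?", not "are the two ranges equal?". The sketch below is an illustration rather than code from the thread, and the helper names (sorts_before, chars_equal, misused_with_equality) are made up for it. The last helper repeats the misuse from the original post, an equality test passed where a "less than" test is expected, and returns true as soon as any pair of characters at the same position matches.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper: true when a sorts strictly before b (default operator<).
bool sorts_before(const std::string& a, const std::string& b)
{
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

// An equality predicate like the one in the original post (case handling omitted).
bool chars_equal(char c1, char c2)
{
    return c1 == c2;
}

// Hypothetical helper reproducing the misuse: an equality test is passed where
// the algorithm expects an ordering test, so the result is not meaningful.
bool misused_with_equality(const std::string& a, const std::string& b)
{
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
                                        chars_equal);
}
```

With these, sorts_before("del", "new") is true and sorts_before("new", "new") is false, which is the opposite of an equality test. misused_with_equality("del", "new") is true only because the two 'e' characters line up, which matches the "found!" in the output above.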

I've now switched to using this:

#include <string.h>
#include <string>

inline int strcasecmp(const std::string& s1, const std::string& s2)
{
return strcasecmp(s1.c_str(), s2.c_str());
}

This leverages C++'s ability to overload functions and works better.

stricmp() isn't standard whilst strcasecmp() is standard ANSI/ISO. Some
posters have mentioned using stricmp() instead of strcasecmp(), which
happens not to be the correct answer. Why?
 
Alex Buell

No, it's not. It's Unix, if I remember correctly. But I think I didn't
make my point clearly enough. The problem isn't fundamentally in the
predicate. So drop the predicate and use the default predicate until
you understand what lexicographical_compare does.

strcasecmp() is actually defined in the POSIX standards. But I will
look again at std::lexicographical_compare() when I get some time. The
program works well enough with strcasecmp().
 
James Kanze

if (std::lexicographical_compare(
word->begin(), word->end(),
args[0].begin(), args[0].end(),
compare_ignore_case_equals))
First, remove compare_ignore_case_equals and try again.
You'll get similar problems. Then read about
lexicographical_compare and what its return value means.
I've now switched to using this:
#include <string.h>
#include <string>
inline int strcasecmp(const std::string& s1, const std::string& s2)
{
return strcasecmp(s1.c_str(), s2.c_str());
}
This leverages C++'s ability to overload functions and works
better.
stricmp() isn't standard whilst strcasecmp() is standard
ANSI/ISO.

It's not present in any version of the standard I have handy
(C++98, C99, and the latest C++ draft). The standard C++
functional object for comparing strings in a locale dependent
way is std::locale (which has an operator() which does exactly
what is needed for lexicographical_compare). And as any
comparisons involving case are locale sensitive, it's really what
you need, e.g.:

if ( std::lexicographical_compare(
word->begin(), word->end(),
args[ 0 ].begin(), args[ 0 ].end(),
std::locale() ) ) {...}

(or std::locale( "xxx" ), with whatever locale you want).
Some posters have mentioned using stricmp() instead of
strcasecmp(), which happens not to be the correct answer.
Why?

Neither is the correct answer, since neither is standard
C/C++. (strcasecmp is defined in Posix, but not very well: "In
the POSIX locale, [...]. The results are unspecified in other
locales." So unless you happen to live in POSIX, it's not very
useful.)
 
James Kanze

The short snippet below demonstrates the problem I'm having with
std::lexicographical_compare() in that it does not reliably work!

#include <iostream>
#include <vector>
#include <ctype.h>

bool compare_ignore_case_equals(char c1, char c2)
{
return toupper(c1) == toupper(c2);

Just a reminder, but this is, of course, undefined behavior.
bool compare_ignore_case_less(char c1, char c2)
{
return toupper(c1) < toupper(c2);

As is this.

(I've addressed the other issues in another posting.)
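
For readers wondering where the undefined behavior comes from: the <ctype.h> functions are only defined for EOF and for values representable as unsigned char, and a plain char may be signed, so a character outside the basic set can arrive as a negative value. A common remedy, sketched here with hypothetical function names that are not from the thread, is to convert to unsigned char before the call.

```cpp
#include <cctype>

// Hypothetical replacement for the poster's equality predicate: each char is
// converted to unsigned char first, so std::toupper never sees a negative value.
bool ignore_case_equal_char(char c1, char c2)
{
    return std::toupper(static_cast<unsigned char>(c1))
        == std::toupper(static_cast<unsigned char>(c2));
}

// The matching ordering predicate, with the same conversion.
bool ignore_case_less_char(char c1, char c2)
{
    return std::toupper(static_cast<unsigned char>(c1))
         < std::toupper(static_cast<unsigned char>(c2));
}
```

This only removes the undefined behavior. What counts as upper case still depends on the current C locale.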
 
Alex Buell

Actually, it looks more like it leverages C++'s ability to cause a
stack overflow due to infinite recursion. strcasecmp isn't part of
ISO C++, so on plenty of compilers, this function will simply call
itself.

As this snippet below shows, you're actually correct.

#include <iostream>
#include <string>

int hahaha(const std::string& s1, const std::string& s2)
{
return hahaha(s1.c_str(), s2.c_str());
}

int main()
{
std::string s1 = "hahaha";
std::string s2 = "HAHAHA";

if (hahaha(s1, s2) == 0)
std::cout << "Equal!\n";

return 0;
}
As far as I can tell, neither are part of standard C++.

Yes, at some point in time I'm going to have to change to
std::lexicographical_compare, or is there anything else I can try for
case insensitive compares on std::string objects?
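
On the recursion point: one way to keep the strcasecmp() approach without the trap is to give the wrapper its own name and call the C function with an explicit global qualifier. The sketch below assumes a POSIX system where <strings.h> declares strcasecmp; the name compare_no_case is hypothetical. If the declaration is missing, the call fails to compile instead of quietly converting the const char* arguments to std::string and calling itself.

```cpp
#include <strings.h>  // POSIX header declaring strcasecmp; not ISO C or C++
#include <string>

// Hypothetical wrapper with a name of its own, so it can never be selected
// as the callee of the strcasecmp call inside it.
inline int compare_no_case(const std::string& s1, const std::string& s2)
{
    return ::strcasecmp(s1.c_str(), s2.c_str());
}
```

As noted elsewhere in the thread, POSIX only specifies the result in the POSIX locale, so this is a sketch for plain ASCII command words rather than a general answer.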
 
Thomas J. Gritzan

James said:
stricmp() isn't standard whilst strcasecmp() is standard
ANSI/ISO.

It's not present in any version of the standard I have handy
(C++98, C99, and the latest C++ draft). The standard C++
functional object for comparing strings in a locale dependent
way is std::locale (which has an operator() which does exactly
what is needed for lexicographical_compare). And as any
comparisons involving case are locale sensitive, it's really what
you need, e.g.:

if ( std::lexicographical_compare(
word->begin(), word->end(),
args[ 0 ].begin(), args[ 0 ].end(),
std::locale() ) ) {...}

(or std::locale( "xxx" ), with whatever locale you want).

operator() of std::locale works on strings by itself. You could use
operator() directly:

/* true, if word < args[0] */
if ( std::locale()(*word, args[0]) ) {...}

But does std::locale()() really compare case insensitive?
 
James Kanze

James said:
stricmp() isn't standard whilst strcasecmp() is standard
ANSI/ISO.
It's not present in any version of the standard I have handy
(C++98, C99, and the latest C++ draft). The standard C++
functional object for comparing strings in a locale
dependent way is std::locale (which has an operator() which
does exactly what is needed for lexicographical_compare).
And as any comparisons involving case are locale sensitive,
it's really what you need, e.g.:
if ( std::lexicographical_compare(
word->begin(), word->end(),
args[ 0 ].begin(), args[ 0 ].end(),
std::locale() ) ) {...}
(or std::locale( "xxx" ), with whatever locale you want).
operator() of std::locale works on strings by itself. You could use
operator() directly:
/* true, if word < args[0] */
if ( std::locale()(*word, args[0]) ) {...}
But does std::locale()() really compare case insensitive?

The answer to that is a definite maybe. It does (or it should)
in locales where case insensitive comparison makes sense. And
it does so correctly, matching "Straße" and "STRASSE" (or
"ändern" and "Aendern", in Switzerland, but not in Germany).
And "I" and "i" won't compare equal in a Turkish locale. Since
the "C" locale is designed for parsing C code, and the POSIX
locale for working in a Posix environment (including the file
systems and filenames), the comparison in those locales will NOT
be case insensitive.

And of course, you can always define your own locale. (At
least, that's what it says. In practice, it takes a pretty high
level of C++ competence to do it reliably. More than I have, at
any rate.)
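
The "C" locale case is easy to check. The sketch below uses a hypothetical helper, classic_less, that applies std::locale's operator() to two whole strings in the classic locale. It is meant only to illustrate the point that this comparison is an ordering and that it is not case insensitive there.

```cpp
#include <locale>
#include <string>

// Hypothetical helper: true when a collates before b in the classic "C" locale.
// std::locale::operator() takes two strings, not two characters.
bool classic_less(const std::string& a, const std::string& b)
{
    return std::locale::classic()(a, b);
}
```

In the classic locale, classic_less("NEW", "new") is true while classic_less("new", "NEW") is false, so the two spellings are ordered relative to each other rather than treated as equivalent.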
 
Thomas J. Gritzan

James said:
Just a reminder, but this is, of course, undefined behavior.

#include <locale>

struct compare_ignore_case_equals
{
compare_ignore_case_equals(const std::locale& loc_ = std::locale())
: loc(loc_) {}

bool operator()(char c1, char c2) const
{
return std::tolower(c1, loc) == std::tolower(c2, loc);
}

private:
std::locale loc;
};

How about this? Doesn't depend on the user's locale, you can provide your own
locale, and isn't UB.

Why does ::toupper actually take an int?
 
Thomas J. Gritzan

James said:
On Dec 29, 3:10 pm, "Thomas J. Gritzan" <[email protected]> [...]
But does std::locale()() really compare case insensitive?

The answer to that is a definite maybe. [...]

If you want to parse commands case insensitively, like in a shell, script
interpreter or text based protocol, a maybe isn't enough.
And of course, you can always define your own locale. (At
least, that's what it says. In practice, it takes a pretty high
level of C++ competence to do it reliably. More than I have, at
any rate.)

Then it would be easier to build a comparison predicate with
std::toupper/tolower as I showed else-thread.

What do people do for multibyte encodings like UTF-8?
 
Thomas J. Gritzan

Daniel said:
Still doesn't work with lexicographical_compare...

Replace the == with < and you've got the ordering predicate needed for
lexicographical_compare.
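
Spelled out, that suggestion might look like the sketch below. The functor mirrors the equality one posted earlier with == replaced by <; the names compare_ignore_case_less (as a functor) and less_ignore_case are assumptions made for this illustration, and the helper defaults to the classic locale so the result does not depend on the global one.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Ordering predicate: the earlier equality functor with == replaced by <.
struct compare_ignore_case_less
{
    explicit compare_ignore_case_less(const std::locale& loc_ = std::locale())
        : loc(loc_) {}

    bool operator()(char c1, char c2) const
    {
        return std::tolower(c1, loc) < std::tolower(c2, loc);
    }

private:
    std::locale loc;
};

// Hypothetical helper: true when a sorts before b, ignoring case.
bool less_ignore_case(const std::string& a, const std::string& b,
                      const std::locale& loc = std::locale::classic())
{
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
                                        compare_ignore_case_less(loc));
}
```

Note that this still answers "sorts before", not "equal". Two strings are equivalent under this ordering only when neither sorts before the other, for example "new" and "NEW".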
 
James Kanze

James Kanze wrote:
On Dec 29, 3:10 pm, "Thomas J. Gritzan" <[email protected]> [...]
But does std::locale()() really compare case insensitive?
The answer to that is a definite maybe. [...]
If you want to parse commands case insensitively, like in a
shell, script interpreter or text based protocol, a maybe
isn't enough.

The problem is that case insensitive comparison is locale
dependent. So of course, you have to involve the locale
somehow. But yes, there is a gap between literal comparison
(all bytes equal) and locale dependent collating (which can
involve a number of things, e.g. "é" compares equal to "E", "ä"
collates as "ae", etc.). And there's no real support for anything
between these two extremes in the language (either C or C++).
Then it would be easier to build a comparison predicate with
std::toupper/tolower as I showed else-thread.

Probably:). You have to define what equality actually means
first (e.g. does "ß" compare equal to "SS"), but for things like
filenames and interpreter commands, you're often limited to a
small set of characters where the definition isn't too
difficult. (This is becoming less and less true with regards to
filenames, of course.)
What do people do for multibyte encodings like UTF-8?

A lot of hand written code:). In practice, you can't count on
the presence of a UTF-8 locale, and you can't count on it working
right if it's present. Note too that anything case insensitive
will still be locale dependent, even if you limit it to UTF-8;
in practice, if you want case insensitivity over the full
Unicode range, you have a lot of defining to do (although the
Unicode Consortium data files help a lot).
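
As a small illustration of the "limited set of characters" case and of why UTF-8 needs more, here is a hand-written comparison that folds only the ASCII letters. The names ascii_lower and ascii_equals_ignore_case are hypothetical, and the sketch assumes an ASCII-compatible execution character set in which 'A'..'Z' are contiguous.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: folds only 'A'..'Z'. Every other byte, including each
// byte of a multibyte UTF-8 sequence, is returned unchanged.
char ascii_lower(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
}

// Hypothetical helper: byte-wise equality after ASCII-only case folding.
bool ascii_equals_ignore_case(const std::string& a, const std::string& b)
{
    if (a.size() != b.size())
        return false;
    for (std::size_t i = 0; i < a.size(); ++i)
        if (ascii_lower(a[i]) != ascii_lower(b[i]))
            return false;
    return true;
}
```

This is enough for command words such as "HELP" and "help". It does not treat "\xC3\x89" (É in UTF-8) and "\xC3\xA9" (é in UTF-8) as equal, which is where the Unicode data files mentioned above would have to come in.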
 
James Kanze

James Kanze wrote:
#include <locale>
struct compare_ignore_case_equals
{
compare_ignore_case_equals(const std::locale& loc_ = std::locale())
: loc(loc_) {}
bool operator()(char c1, char c2) const
{
return std::tolower(c1, loc) == std::tolower(c2, loc);
}
private:
std::locale loc;
};
How about this? Doesn't depend on the user's locale, you can
provide your own locale, and isn't UB.

I'm not sure what you mean by "doesn't depend on the user's
locale". The constructor std::locale() creates a copy of the
current global locale, which if you're writing library code, is
unknown, but which will usually be the user's locale, since the
very first action in most main functions is to set the global
locale to "".
Why does ::toupper actually take an int?

So that things like:

for ( int ch = getchar() ; isspace( ch ) ; ch = getchar() )
...

work. It is defined for EOF, as well as all of the values in
the range 0...UCHAR_MAX. (The reason for toupper, of course, is
coherence---all of the functions in <ctype.h> take the same type
of argument.) It's a useful idiom; I still use it a lot (not
with ::toupper, etc., but with some of my own stuff).

The real question is why plain char is allowed to be signed, if
it is intended to contain "characters". I don't know of any
character encoding which uses negative values.
 
Alex Buell

Replace the == with < and you've got the ordering predicate needed
for lexicographical_compare.

You might want to look at the OP's question again. His complaint (as
can be seen by the subject line) was that "lexicographical_compare
with ignore case *equality* doesn't always work." [stress added]
Think about that sentence for a second... :)

If the OP hasn't already figured it out, lexicographical_compare
isn't *designed* to work with equality functors in the first place.

[pained grin]

Yeah.

Perhaps this should be a FAQ: How do we do a case insensitive equality
compare on std::string values?
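
For the simple cases, one possible answer is std::equal with an equality functor plus a length check, instead of lexicographical_compare. This is a sketch only: the names chars_equal_ignore_case and equals_ignore_case are made up here, it compares char by char through std::tolower from <locale>, and so it inherits all the locale and multibyte caveats discussed in this thread.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Equality predicate on single chars, using the given locale's case mapping.
struct chars_equal_ignore_case
{
    explicit chars_equal_ignore_case(const std::locale& loc_) : loc(loc_) {}

    bool operator()(char c1, char c2) const
    {
        return std::tolower(c1, loc) == std::tolower(c2, loc);
    }

private:
    std::locale loc;
};

// Hypothetical helper: true when a and b have the same length and match
// character by character, ignoring case.
bool equals_ignore_case(const std::string& a, const std::string& b,
                        const std::locale& loc = std::locale::classic())
{
    return a.size() == b.size()
        && std::equal(a.begin(), a.end(), b.begin(), chars_equal_ignore_case(loc));
}
```

The size check comes first because the three-iterator form of std::equal assumes the second range is at least as long as the first.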
 
James Kanze

It also won't work reliably for all languages. Personally I
don't think anything will work reliably for all languages. A
programmer is better off IMHO to ignore locales and the "upper"
and "lower" functions in <cctype>, and write his own code that
works with the languages he has to deal with.

It's supposed to work reliably for all supported locales. (A
locale is more than just a language.) Which is sort of vague:
the standard doesn't make any requirements with regards to what
locales are supported (other than "C"), and it leaves the
definition as to what the behavior is in a given locale
"implementation defined".

If you're targeting a single compiler, for a single locale or a
small set of locales, and that compiler provides them, and they
behave "correctly" (for your definition of "correctly"), there's
no problem with using locales for this. Otherwise, you're
right: it can be a bit tricky.
 
jason.cipriani

You might want to look at the OP's question again. His complaint (as
can be seen by the subject line) was that "lexicographical_compare
with ignore case *equality* doesn't always work." [stress added]
Think about that sentence for a second... :)
If the OP hasn't already figured it out, lexicographical_compare
isn't *designed* to work with equality functors in the first place.

[pained grin]

Yeah.

Perhaps this should be a FAQ: How do we do a case insensitive equality
compare on std::string values?

Why? It's easy enough to find on Google already. Here is a good
article discussing all of the issues with proposed solutions, which
everybody involved in this thread should read:

http://lafstern.org/matt/col2_new.pdf

It was linked to from GCC's page on case-insensitive strings:

http://gcc.gnu.org/onlinedocs/libstdc++/manual/bk01pt05ch13s02.html

Which was linked to in a forum post in the first Google result for
"std string case insensitive compare":

http://bytes.com/groups/c/489747-lowercase-std-string-compare

Although it did require a bit of poking around on gcc.gnu.org since
the link in the forum post was actually broken.

Jason
 
Alex Buell

Replace the == with < and you've got the ordering predicate
needed for lexicographical_compare.  
You might want to look at the OP's question again. His complaint
(as can be seen by the subject line) was that
"lexicographical_compare with ignore case *equality* doesn't
always work." [stress added] Think about that sentence for a
second... :)
If the OP hasn't already figured it out, lexicographical_compare
isn't *designed* to work with equality functors in the first
place.

[pained grin]

Yeah.

Perhaps this should be a FAQ: How do we do a case insensitive
equality compare on std::string values?

Why? It's easy enough to find on Google already. Here is a good
article discussing all of the issues with proposed solutions, which
everybody involved in this thread should read:

http://lafstern.org/matt/col2_new.pdf

It was linked to from GCC's page on case-insensitive strings:

http://gcc.gnu.org/onlinedocs/libstdc++/manual/bk01pt05ch13s02.html

Which was linked to in a forum post in the first Google result for
"std string case insensitive compare":

http://bytes.com/groups/c/489747-lowercase-std-string-compare

Although it did require a bit of poking around on gcc.gnu.org since
the link in the forum post was actually broken.

Thanks for all that, I'd already seen some of these pages.
 
Alex Buell

As this thread, and every other thread/article on the subject shows,
it is a rather complex subject. Pretty much any subject that deals
with natural language is.

I suggest you don't perform case insensitive compares in your code.

Seems a lot of thought has gone into designing the STL libraries. I've
just been playing with std::locale and std::locale::global, with
currencies. I can see how useful this can be in conjunction with glibc.
 
