Using std::lexicographical_compare with ignore case equality doesn't always work

Discussion in 'C++' started by Alex Buell, Dec 28, 2008.

  1. Alex Buell

    Alex Buell Guest

    The short snippet below demonstrates the problem I'm having with
    std::lexicographical_compare() in that it does not reliably work!

    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>
    #include <ctype.h>

    bool compare_ignore_case_equals(char c1, char c2)
    {
        return toupper(c1) == toupper(c2);
    }

    bool compare_ignore_case_less(char c1, char c2)
    {
        return toupper(c1) < toupper(c2);
    }

    int main(int argc, char *argv[])
    {
        std::vector<std::string> args(argv + 1, argv + argc);
        const char *words[] =
        {
            "add", "del", "new", "help"
        };

        std::vector<std::string> list(words, words + (sizeof words / sizeof words[0]));
        std::vector<std::string>::iterator word = list.begin();
        while (word != list.end())
        {
            std::cout << "Testing " << *word << " = " << args[0];
            if (std::lexicographical_compare(
                    word->begin(), word->end(),
                    args[0].begin(), args[0].end(),
                    compare_ignore_case_equals))
            {
                std::cout << " found!\n";
                break;
            }

            std::cout << "\n";
            word++;
        }
    }

    Here's an example:

    ./quick new
    Testing add = new
    Testing del = new found!

    That simply cannot be correct, what is it that I've done wrongly? Thanks
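
    A sketch of what the call above ends up computing, offered as an
    illustration rather than a definitive diagnosis.
    std::lexicographical_compare asks "does the first range order before
    the second?" and expects a less-than style predicate. Handed an
    equality predicate, the usual formulation of the algorithm returns
    true as soon as two characters at the same position match. The names
    eq_ignore_case and lex_compare_sketch are hypothetical, and the loop
    is a hand-written stand-in for the library algorithm so that each
    step can be followed:

```cpp
#include <cctype>
#include <string>

// Equality predicate like the one in the post, with a cast so that
// toupper() never sees a negative value.
bool eq_ignore_case(char c1, char c2)
{
    return std::toupper(static_cast<unsigned char>(c1)) ==
           std::toupper(static_cast<unsigned char>(c2));
}

// Hand-written stand-in that mirrors the usual formulation of
// std::lexicographical_compare (hypothetical helper, not library code).
template <typename Pred>
bool lex_compare_sketch(const std::string& a, const std::string& b, Pred comp)
{
    std::string::const_iterator i = a.begin();
    std::string::const_iterator j = b.begin();
    for (; i != a.end() && j != b.end(); ++i, ++j)
    {
        if (comp(*i, *j)) return true;   // treated as "a is less": stop
        if (comp(*j, *i)) return false;  // treated as "b is less": stop
    }
    // One range ran out: a orders first only if it is the shorter one.
    return i == a.end() && j != b.end();
}
```

    With the equality predicate, "del" against "new" returns true at the
    second position ('e' matches 'e'), while "add" against "new" never
    matches at any position and falls through to false, which is the
    output shown in the post.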
     
    Alex Buell, Dec 28, 2008
    #1

  2. Alex Buell

    Alex Buell Guest

    I've now switched to using this:

    #include <string.h>
    #include <string>

    inline int strcasecmp(const std::string& s1, const std::string& s2)
    {
        return strcasecmp(s1.c_str(), s2.c_str());
    }

    This leverages C++'s ability to overload functions and works better.

    stricmp() isn't standard whilst strcasecmp() is standard ANSI/ISO. Some
    posters have mentioned using stricmp() instead of strcasecmp(), which
    happens not to be the correct answer. Why?
     
    Alex Buell, Dec 28, 2008
    #2

  3. Alex Buell

    Alex Buell Guest

    strcasecmp() is actually defined in the POSIX standards. But I will
    look again at std::lexicographical_compare() when I get some time. The
    program works well enough with strcasecmp().
     
    Alex Buell, Dec 28, 2008
    #3
  4. James Kanze

    James Kanze Guest

    strcasecmp() is not present in any version of the standard I have
    handy (C++98, C99, and the latest C++ draft). The standard C++
    functional object for comparing strings in a locale dependent
    way is std::locale (which has an operator() which does exactly
    what is needed for lexicographical_compare). And as any
    comparisons involving case are locale sensitive, it's really what
    you need, e.g.:

    if ( std::lexicographical_compare(
    word->begin(), word->end(),
    args[ 0 ].begin(), args[ 0 ].end(),
    std::locale() ) ) {...}

    (or std::locale( "xxx" ), with whatever locale you want).

    Neither stricmp() nor strcasecmp() is the correct answer, since
    neither is standard C/C++. (strcasecmp is defined in Posix, but not
    very well: "In the POSIX locale, [...]. The results are unspecified
    in other locales." So unless you happen to live in POSIX, it's not
    very useful.)
     
    James Kanze, Dec 29, 2008
    #4
  5. James Kanze

    James Kanze Guest

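    Just a reminder, but calling toupper() with a plain char, as
    compare_ignore_case_equals() does, is, of course, undefined behavior
    whenever the char holds a negative value. As is the same call in
    compare_ignore_case_less().
    (I've addressed the other issues in another posting.)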
     
    James Kanze, Dec 29, 2008
    #5
  6. Alex Buell

    Alex Buell Guest

    As this snippet below shows, you're actually correct.

    #include <iostream>
    #include <string>

    int hahaha(const std::string& s1, const std::string& s2)
    {
        return hahaha(s1.c_str(), s2.c_str());
    }

    int main()
    {
        std::string s1 = "hahaha";
        std::string s2 = "HAHAHA";

        if (hahaha(s1, s2) == 0)
            std::cout << "Equal!\n";

        return 0;
    }

    Yes, at some point in time I'm going to have to change to
    std::lexicographical_compare, or is there anything else I can try for
    case insensitive compares on std::string objects?
     
    Alex Buell, Dec 29, 2008
    #6
  7. operator() of std::locale works on strings by itself. You could use
    operator() directly:

    /* true, if *word < args[0] */
    if ( std::locale()(*word, args[0]) ) {...}

    But does std::locale()() really compare case insensitive?
     
    Thomas J. Gritzan, Dec 29, 2008
    #7
  8. James Kanze

    James Kanze Guest

    The answer to that is a definite maybe. It does (or it should)
    in locales where case insensitive comparison makes sense. And
    it does so correctly, matching "Straße" and "STRASSE" (or
    "ändern" and "Aendern", in Switzerland, but not in Germany).
    And "I" and "i" won't compare equal in a Turkish locale. Since
    the "C" locale is designed for parsing C code, and the POSIX
    locale for working in a Posix environment (including the file
    systems and filenames), the comparison in those locales will NOT
    be case insensitive.
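
    A minimal sketch of that last point, assuming the classic "C" locale
    and an ASCII-compatible character set. The helper name orders_before
    is hypothetical. std::locale's operator() compares two whole strings
    through the locale's std::collate facet and returns true when the
    first orders before the second:

```cpp
#include <locale>
#include <string>

// Returns true when s1 orders before s2 according to loc's collation
// rules. The default argument is the classic "C" locale, which compares
// character codes and so keeps upper and lower case distinct.
bool orders_before(const std::string& s1, const std::string& s2,
                   const std::locale& loc = std::locale::classic())
{
    return loc(s1, s2);
}
```

    In the "C" locale, orders_before("ADD", "add") is true and the
    reverse is false, so the two spellings are not treated as equivalent.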

    And of course, you can always define your own locale. (At
    least, that's what it says. In practice, it takes a pretty high
    level of C++ competence to do it reliably. More than I have, at
    any rate.)
     
    James Kanze, Dec 29, 2008
    #8
  9. #include <locale>

    struct compare_ignore_case_equals
    {
        compare_ignore_case_equals(const std::locale& loc_ = std::locale())
            : loc(loc_) {}

        bool operator()(char c1, char c2) const
        {
            return std::tolower(c1, loc) == std::tolower(c2, loc);
        }

    private:
        std::locale loc;
    };

    How about this? It doesn't depend on the user's locale, you can provide
    your own locale, and it isn't UB.
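
    One way such a character predicate might be applied to whole strings,
    as a sketch under stated assumptions: the names
    char_equal_ignore_case and equals_ignore_case are hypothetical, the
    default locale is the classic "C" locale, and characters are compared
    one by one, so multi-character equivalences are not handled.

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Character-level equality ignoring case, in the spirit of the functor
// above (hypothetical name).
struct char_equal_ignore_case
{
    explicit char_equal_ignore_case(const std::locale& l = std::locale::classic())
        : loc(l) {}

    bool operator()(char c1, char c2) const
    {
        return std::tolower(c1, loc) == std::tolower(c2, loc);
    }

    std::locale loc;
};

// Two strings are equal ignoring case when they have the same length
// and every pair of characters matches under the predicate.
bool equals_ignore_case(const std::string& s1, const std::string& s2,
                        const std::locale& loc = std::locale::classic())
{
    return s1.size() == s2.size() &&
           std::equal(s1.begin(), s1.end(), s2.begin(),
                      char_equal_ignore_case(loc));
}
```

    The length check comes first because the three-iterator form of
    std::equal assumes the second range is at least as long as the first.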

    Why does ::toupper actually take an int?
     
    Thomas J. Gritzan, Dec 29, 2008
    #9
  10. If you want to parse commands case insensitively, like in a shell, script
    interpreter or text based protocol, a maybe isn't enough.
    Then it would be easier to build a comparison predicate with
    std::toupper/tolower as I showed else-thread.

    What do people do for multibyte encodings like UTF-8?
     
    Thomas J. Gritzan, Dec 29, 2008
    #10
  11. jason.cipriani, Dec 30, 2008
    #11
  12. Replace the == with < and you've got the ordering predicate needed for
    lexicographical_compare.
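
    A sketch of that ordering predicate in use, assuming the classic "C"
    locale by default. The names char_less_ignore_case and
    less_ignore_case are hypothetical:

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Character-level "less than" ignoring case (hypothetical name).
struct char_less_ignore_case
{
    explicit char_less_ignore_case(const std::locale& l = std::locale::classic())
        : loc(l) {}

    bool operator()(char c1, char c2) const
    {
        return std::tolower(c1, loc) < std::tolower(c2, loc);
    }

    std::locale loc;
};

// True when s1 orders before s2, ignoring case.
bool less_ignore_case(const std::string& s1, const std::string& s2,
                      const std::locale& loc = std::locale::classic())
{
    return std::lexicographical_compare(s1.begin(), s1.end(),
                                        s2.begin(), s2.end(),
                                        char_less_ignore_case(loc));
}
```

    Equality can then be derived as "neither string orders before the
    other", i.e. !less_ignore_case(a, b) && !less_ignore_case(b, a).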
     
    Thomas J. Gritzan, Dec 30, 2008
    #12
  13. James Kanze

    James Kanze Guest

    The problem is that case insensitive comparison is locale
    dependent. So of course, you have to involve the locale
    somehow. But yes, there is a gap between literal comparison
    (all bytes equal) and locale dependent collating (which can
    involve a number of things, e.g. "é" compares equal to "E", "ä"
    collates as "ae", etc.). And there's no real support for anything
    between these two extremes in the language (either C or C++).

    Probably :). You have to define what equality actually means
    first (e.g. does "ß" compare equal to "SS"), but for things like
    filenames and interpreter commands, you're often limited to a
    small set of characters where the definition isn't too
    difficult. (This is becoming less and less true with regards to
    filenames, of course.)

    A lot of hand written code :). In practice, you can't count on
    the presence of a UTF-8 locale, and you can't count on it working
    right if it's present. Note too that anything case insensitive
    will still be locale dependent, even if you limit it to UTF-8;
    in practice, if you want case insensitivity over the full
    Unicode range, you have a lot of defining to do (although the
    Unicode Consortium data files help a lot).
     
    James Kanze, Dec 30, 2008
    #13
  14. James Kanze

    James Kanze Guest

    I'm not sure what you mean by "doesn't depend on the user's
    locale". The constructor std::locale() creates a copy of the
    current global locale, which if you're writing library code, is
    unknown, but which will usually be the user's locale, since the
    very first action in most main functions is to set the global
    locale to "".

    It takes an int so that things like:

    for ( int ch = getchar() ; isspace( ch ) ; ch = getchar() )
    ...

    work. It is defined for EOF, as well as all of the values in
    the range 0...UCHAR_MAX. (The reason for toupper, of course, is
    coherence---all of the functions in <ctype.h> take the same type
    of argument.) It's a useful idiom; I still use it a lot (not
    with ::toupper, etc., but with some of my own stuff).
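
    A minimal sketch of the usual way to stay inside that defined range
    when the input is a plain char rather than the result of getchar().
    The helper names to_upper_char and to_upper_copy are hypothetical:

```cpp
#include <cctype>
#include <string>

// Converting to unsigned char first keeps the argument inside the range
// the <ctype.h> functions are defined for (0...UCHAR_MAX, plus EOF),
// even on platforms where plain char is signed.
char to_upper_char(char c)
{
    return static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
}

// Upper-cases a copy of the string one char at a time.
std::string to_upper_copy(std::string s)
{
    for (std::string::size_type i = 0; i < s.size(); ++i)
        s[i] = to_upper_char(s[i]);
    return s;
}
```

    This only addresses the undefined behavior. The conversion still
    works byte by byte in the current C locale, so it does nothing for
    multibyte encodings.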

    The real question is why plain char is allowed to be signed, if
    it is intended to contain "characters". I don't know of any
    character encoding which uses negative values.
     
    James Kanze, Dec 30, 2008
    #14
  15. Alex Buell

    Alex Buell Guest

    [pained grin]

    Yeah.

    Perhaps this should be a FAQ: How do we do a case insensitive equality
    compare on std::string values?
     
    Alex Buell, Dec 30, 2008
    #15
  16. James Kanze

    James Kanze Guest

    It's supposed to work reliably for all supported locales. (A
    locale is more than just a language.) Which is sort of vague:
    the standard doesn't make any requirements with regards to what
    locales are supported (other than "C"), and it leaves the
    definition as to what the behavior is in a given locale
    "implementation defined".

    If you're targeting a single compiler, for a single locale or a
    small set of locales, and that compiler provides them, and they
    behave "correctly" (for your definition of "correctly"), there's
    no problem with using locales for this. Otherwise, you're
    right: it can be a bit tricky.
     
    James Kanze, Dec 30, 2008
    #16
  17. Why? It's easy enough to find on Google already. Here is a good
    article discussing all of the issues with proposed solutions, which
    everybody involved in this thread should read:

    http://lafstern.org/matt/col2_new.pdf

    It was linked to from GCC's page on case-insensitive strings:

    http://gcc.gnu.org/onlinedocs/libstdc++/manual/bk01pt05ch13s02.html

    Which was linked to in a forum post in the first Google result for
    "std string case insensitive compare":

    http://bytes.com/groups/c/489747-lowercase-std-string-compare

    Although it did require a bit of poking around on gcc.gnu.org since
    the link in the forum post was actually broken.

    Jason
     
    jason.cipriani, Dec 30, 2008
    #17
  18. Alex Buell

    Alex Buell Guest

    Thanks for all that, I'd already seen some of these pages.
     
    Alex Buell, Dec 30, 2008
    #18
  19. Alex Buell

    Alex Buell Guest

    Seems a lot of thought has gone into designing the STL libraries. I've
    just been playing with std::locale and std::locale::global, with
    currencies. I can see how useful this can be in conjunction with glibc.
     
    Alex Buell, Dec 30, 2008
    #19