sloppy wchar_t comparison

Ralf Goertz

Hi,

Is there a standard way to compare wide characters c1 and c2 such that
the result is "equal" if c1 is the same character as c2 apart from an
accent? For example:

equal(L'é',L'e') => true,
equal(L'a',L'e') => false

I know this depends on the locale in use. The problem is that
std::collate::compare() only gives a different sort order but does not
return 0 for a comparison of, say, "Apfel" and "Äpfel".
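
The behaviour described above can be reproduced with a small sketch. The helper names collate_compare and german_or_classic are made up for illustration, and "de_DE.UTF-8" is only an assumed locale name that may not be installed on a given system, so the sketch falls back to the classic "C" locale in that case.

```cpp
#include <locale>
#include <stdexcept>
#include <string>

// Try to load a German locale; the name is platform-specific and only an
// assumption here. Fall back to the classic "C" locale if it is missing.
std::locale german_or_classic()
{
    try {
        return std::locale("de_DE.UTF-8");
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Compare two wide strings with the std::collate facet of the given locale.
// The result is negative, zero, or positive, like strcmp.
int collate_compare(const std::locale& loc,
                    const std::wstring& a, const std::wstring& b)
{
    const auto& coll = std::use_facet<std::collate<wchar_t>>(loc);
    return coll.compare(a.data(), a.data() + a.size(),
                        b.data(), b.data() + b.size());
}
```

Calling collate_compare(german_or_classic(), L"Apfel", L"\u00C4pfel") gives an ordering between the two words. In the classic locale the result is certainly non-zero, because the strings differ in their first code unit.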
 
Robert Fendt

Ralf said:
I know this depends on the locale in use. The problem is that
std::collate::compare() only gives a different sort order but does not
return 0 for a comparison of, say, "Apfel" and "Äpfel".

I do not think there is a mechanism for this anywhere in the
standard library. And why should there be? "Äpfel" (plural) is
something quite different from "Apfel" (singular). The decision
to treat 'ä' and 'a' (or à, á, â, ã, å, æ for that matter) as
equal is one that no general library can make for you. It would be
an arbitrary decision and thus almost always wrong.

If you want to build a typo-tolerant string matcher, you will
have to do it yourself (and perhaps use a more sophisticated
algorithm, for example one based on the Levenshtein
distance).
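
A minimal sketch of what "doing it yourself" could look like for the single-character case from the question. The function names fold_accent and sloppy_equal and the contents of the table are hypothetical; which characters count as "the same" is exactly the application-specific decision discussed above, and æ is deliberately left out because it does not map to a single base letter.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Map an accented character to the base character it should be treated as.
// The table is a hand-picked assumption, not something the standard defines.
wchar_t fold_accent(wchar_t c)
{
    static const std::unordered_map<wchar_t, wchar_t> table = {
        {L'\u00E0', L'a'}, {L'\u00E1', L'a'}, {L'\u00E2', L'a'},  // à á â
        {L'\u00E3', L'a'}, {L'\u00E4', L'a'}, {L'\u00E5', L'a'},  // ã ä å
        {L'\u00E8', L'e'}, {L'\u00E9', L'e'}, {L'\u00EA', L'e'},  // è é ê
        {L'\u00EB', L'e'},                                        // ë
        {L'\u00F6', L'o'}, {L'\u00FC', L'u'},                     // ö ü
        {L'\u00C4', L'A'}, {L'\u00D6', L'O'}, {L'\u00DC', L'U'},  // Ä Ö Ü
    };
    const auto it = table.find(c);
    return it == table.end() ? c : it->second;
}

// Two characters are "sloppily" equal if they fold to the same base character.
bool sloppy_equal(wchar_t c1, wchar_t c2)
{
    return fold_accent(c1) == fold_accent(c2);
}

// The same idea applied position by position to whole strings.
bool sloppy_equal(const std::wstring& a, const std::wstring& b)
{
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (!sloppy_equal(a[i], b[i])) return false;
    }
    return true;
}
```

With this table, sloppy_equal(L'\u00E9', L'e') is true and sloppy_equal(L'a', L'e') is false, matching the two cases in the question. It also reports "Äpfel" and "Apfel" as equal, which is the objection raised above.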

Regards,
Robert
 
Ralf Goertz

Robert said:
I do not think there is a mechanism for this anywhere in the
standard library. And why should there be? "Äpfel" (plural) is
something quite different from "Apfel" (singular).

Being German, I know that of course. I thought I could get away with
that example.

Robert said:
The decision to treat 'ä' and 'a' (or à, á, â, ã, å, æ for that
matter) as equal is one that no general library can make for you. It
would be an arbitrary decision and thus almost always wrong.

That's why I said it would be locale dependent. MySQL does exactly that.

Robert said:
If you want to build a typo-tolerant string matcher, you will have to
do it yourself (and perhaps use a more sophisticated algorithm,
for example one based on the Levenshtein distance).

That's the task, but I use the Jaro-Winkler distance (jwd) rather than
Levenshtein. The problem is that jwd("Müller","Muller") is the same as
jwd("Müller","Miller"), although "Müller" might have become "Muller"
because there were no accented characters available. (I know that in
that case "Müller" probably would have become "Mueller", but you get the
point.)
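
One way around this, sketched here under stated assumptions, is to fold accents before computing the distance. The code below is a plain textbook Jaro-Winkler with the usual prefix scale of 0.1 and a prefix limit of 4, not necessarily the variant used in the post, and fold_umlaut with its six-entry table is a hypothetical helper.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical folding: only the German umlauts, for brevity.
wchar_t fold_umlaut(wchar_t c)
{
    switch (c) {
        case L'\u00E4': return L'a';
        case L'\u00F6': return L'o';
        case L'\u00FC': return L'u';
        case L'\u00C4': return L'A';
        case L'\u00D6': return L'O';
        case L'\u00DC': return L'U';
        default:        return c;
    }
}

std::wstring fold(std::wstring s)
{
    std::transform(s.begin(), s.end(), s.begin(), fold_umlaut);
    return s;
}

// Jaro similarity in [0, 1]; 1 means identical.
double jaro(const std::wstring& a, const std::wstring& b)
{
    if (a.empty() && b.empty()) return 1.0;
    if (a.empty() || b.empty()) return 0.0;
    const std::size_t longest = std::max(a.size(), b.size());
    const std::size_t window = longest / 2 > 0 ? longest / 2 - 1 : 0;
    std::vector<bool> a_hit(a.size(), false), b_hit(b.size(), false);
    std::size_t matches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const std::size_t lo = i > window ? i - window : 0;
        const std::size_t hi = std::min(b.size(), i + window + 1);
        for (std::size_t j = lo; j < hi; ++j) {
            if (!b_hit[j] && a[i] == b[j]) {
                a_hit[i] = true;
                b_hit[j] = true;
                ++matches;
                break;
            }
        }
    }
    if (matches == 0) return 0.0;
    // Count matched characters that appear in a different order.
    std::size_t out_of_order = 0, j = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (!a_hit[i]) continue;
        while (!b_hit[j]) ++j;
        if (a[i] != b[j]) ++out_of_order;
        ++j;
    }
    const double m = static_cast<double>(matches);
    const double t = out_of_order / 2.0;
    return (m / a.size() + m / b.size() + (m - t) / m) / 3.0;
}

// Jaro-Winkler: boost the Jaro score for a common prefix of up to 4 characters.
double jaro_winkler(const std::wstring& a, const std::wstring& b)
{
    const double j = jaro(a, b);
    const std::size_t limit =
        std::min(std::min(a.size(), b.size()), std::size_t{4});
    std::size_t prefix = 0;
    while (prefix < limit && a[prefix] == b[prefix]) ++prefix;
    return j + prefix * 0.1 * (1.0 - j);
}
```

On the raw strings, jaro_winkler gives the same score for "Müller" against "Muller" and against "Miller", as described above. Comparing fold(L"M\u00FCller") with L"Muller" instead gives a perfect score, while "Miller" stays below it. A gentler alternative would be to count an accent mismatch as a partial match inside jaro, but that needs a modified algorithm rather than a preprocessing step.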
 
Robert Hairgrove

Ralf said:
Being German, I know that of course. I thought I could get away with
that example.

That's why I said it would be locale dependent. MySQL does exactly that.

That's the task, but I use the Jaro-Winkler distance (jwd) rather than
Levenshtein. The problem is that jwd("Müller","Muller") is the same as
jwd("Müller","Miller"), although "Müller" might have become "Muller"
because there were no accented characters available. (I know that in
that case "Müller" probably would have become "Mueller", but you get the
point.)

Maybe a Soundex function would do it for you? I'm not sure how well
these work with German-language strings, though.
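
For reference, a rough sketch of classic (American) Soundex on wide strings. Soundex was designed for English surnames, and treating every character outside a-z, including 'ü', as an uncoded vowel-like letter is an assumption made here, not part of the original scheme. The helper names are made up for illustration.

```cpp
#include <cstddef>
#include <string>

// ASCII-only case helpers, so the result does not depend on the locale.
wchar_t ascii_lower(wchar_t c)
{
    return (c >= L'A' && c <= L'Z') ? static_cast<wchar_t>(c - L'A' + L'a') : c;
}

wchar_t ascii_upper(wchar_t c)
{
    return (c >= L'a' && c <= L'z') ? static_cast<wchar_t>(c - L'a' + L'A') : c;
}

// Soundex digit for a lower-case letter, or L'0' for letters without a code.
// Anything outside a-z (accented letters included) is treated like a vowel.
wchar_t soundex_digit(wchar_t c)
{
    switch (c) {
        case L'b': case L'f': case L'p': case L'v': return L'1';
        case L'c': case L'g': case L'j': case L'k':
        case L'q': case L's': case L'x': case L'z': return L'2';
        case L'd': case L't': return L'3';
        case L'l': return L'4';
        case L'm': case L'n': return L'5';
        case L'r': return L'6';
        default:   return L'0';
    }
}

// First letter followed by three digits, padded with zeros.
std::wstring soundex(const std::wstring& name)
{
    if (name.empty()) return L"";
    std::wstring code(1, ascii_upper(name[0]));
    wchar_t previous = soundex_digit(ascii_lower(name[0]));
    for (std::size_t i = 1; i < name.size() && code.size() < 4; ++i) {
        const wchar_t c = ascii_lower(name[i]);
        if (c == L'h' || c == L'w') continue;  // h and w do not separate equal codes
        const wchar_t digit = soundex_digit(c);
        if (digit != L'0' && digit != previous) code += digit;
        previous = digit;  // a vowel resets this, so a repeated code counts again
    }
    code.resize(4, L'0');
    return code;
}
```

Under these assumptions "Müller", "Muller" and "Miller" all get the code M460, because Soundex ignores vowels after the first letter. It would therefore treat "Müller" and "Muller" as equal, but it would not separate "Muller" from "Miller" either.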
 
