Unicode strings

Andrew L

Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,

Andrew
 
Bob Hairgrove

Andrew L said:
Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,

Andrew

You have to implement some kind of lookup table or dictionary.

Although STL supports locales, I don't think there is a way of
comparing two strings in *different* locales ... especially for
Unicode strings, since there is no locale for Unicode -- Unicode
covers *all* locales.

Also, there are many words which mean one thing in one language (or
locale) and something else in a different language, although they are
spelled exactly the same. "Band" in German might be a different word
than "band" in English, for example.

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).
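
As a rough illustration of the lookup-table idea, here is a minimal sketch. It assumes the strings have already been decoded into UTF-32 code points (std::u32string); the names FoldToAscii and MatchesEnglish are hypothetical, and the table holds only a handful of entries, so a real one would need far more.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: replaces each code point found in a small
// hand-written table with its plain ASCII spelling.
// The table is illustrative only and far from complete.
std::u32string FoldToAscii(const std::u32string& text)
{
    static const std::unordered_map<char32_t, std::u32string> kFold = {
        {U'\u00E1', U"a"},   // a with acute
        {U'\u00ED', U"i"},   // i with acute
        {U'\u00E4', U"a"},   // a with umlaut
        {U'\u00F6', U"o"},   // o with umlaut
        {U'\u00FC', U"u"},   // u with umlaut
        {U'\u00DF', U"ss"},  // German sharp s
    };

    std::u32string out;
    for (char32_t c : text) {
        auto it = kFold.find(c);
        if (it != kFold.end()) {
            out += it->second;  // known special character: use its replacement
        } else {
            out += c;           // everything else is copied unchanged
        }
    }
    return out;
}

// Hypothetical helper: a candidate matches if it equals the original
// spelling or the folded (accent-free) spelling.
bool MatchesEnglish(const std::u32string& original,
                    const std::u32string& candidate)
{
    return candidate == original || candidate == FoldToAscii(original);
}
```

With this table, MatchesEnglish(U"reykjav\u00EDk", U"reykjavik") holds, and so does the comparison against the accented spelling itself. The table only goes one way (accented to plain), which is all the original question asks for.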
 
Karl Heinz Buchegger

Bob said:
[snip]

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).

One of the most puzzling words for German-speaking students of English
is "eventually". There is a very similar word in German, "eventuell",
which means "maybe", whereas "eventually" means "finally".
 
Andrew L

Bob said:
You have to implement some kind of lookup table or dictionary.

Although STL supports locales, I don't think there is a way of
comparing two strings in *different* locales ... especially for
Unicode strings, since there is no locale for Unicode -- Unicode
covers *all* locales.

This is what I suspected. Many thanks for that. Now, I wonder if such a
dictionary has already been implemented?

Bob said:
Also, there are many words which mean one thing in one language (or
locale) and something else in a different language, although they are
spelled exactly the same.

This won't really be a problem because the strings I'm dealing with are
geographical elements - placenames etc. I've dealt with localised versions
of these (eg Koln and Cologne are equivalent) it's essentially just a
problem with accents.

Many thanks,

Andrew
 
Ips

Karl Heinz Buchegger said:
Bob said:
[snip]

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).

One of the most puzzling words for German-speaking students of English
is "eventually". There is a very similar word in German, "eventuell",
which means "maybe", whereas "eventually" means "finally".

The same is true in Polish: 'ewentualnie' means 'maybe'.
Another puzzling example is 'aktualnie', which looks like English 'actually'
but means 'currently'. It is very often misused.
Sorry for the off-topic.

regards,
Ips
 
Rufus V. Smith

Andrew L said:
Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,

Well, were it my job, I'd just put all words into a big matching dictionary,
and each entry has a pointer (or index) to the first occurring synonym. The
first synonym can point to null or to itself.

Or if you want to be fancy, all the synonyms can point to the previous
synonym, and the first point to the last. This would allow you to print all
synonyms to a given word.

Although you have to be careful. For city names, you may not have a
problem, but for other types of synonym use you have to be careful. Just
because A is synonymous with B, and C is synonymous with B, does not imply
A is synonymous with C. (B may change meanings depending on the comparison).
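
The first variant described above (every entry refers to the first synonym of its group) might be sketched as follows. The class name SynonymTable and its members are hypothetical, and the sketch deliberately treats synonymy as transitive, which, as noted, is only safe for things like place names.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical dictionary: every name maps to the first name of its
// synonym group, which acts as the canonical entry.
class SynonymTable {
public:
    // Registers a group of equivalent names; names.front() becomes canonical.
    void AddGroup(const std::vector<std::string>& names)
    {
        if (names.empty()) return;
        for (const std::string& name : names) {
            canonical_[name] = names.front();
        }
    }

    // Two names are equivalent if they are identical, or if both are
    // known and refer to the same canonical entry.
    bool Equivalent(const std::string& a, const std::string& b) const
    {
        if (a == b) return true;
        auto ia = canonical_.find(a);
        auto ib = canonical_.find(b);
        return ia != canonical_.end() && ib != canonical_.end() &&
               ia->second == ib->second;
    }

private:
    std::unordered_map<std::string, std::string> canonical_;
};
```

After table.AddGroup({"Koln", "Cologne"}), the call table.Equivalent("Koln", "Cologne") succeeds, while an unknown name only matches itself. Storing the canonical name rather than a raw pointer avoids dangling references when the map rehashes; the circular-list variant would need a different layout.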

Rufus
 
Bob Hairgrove

Andrew L said:
This won't really be a problem because the strings I'm dealing with are
geographical elements - placenames etc. I've dealt with localised versions
of these (eg Koln and Cologne are equivalent) it's essentially just a
problem with accents.

And watch out for Paris, France vs. Paris, Texas; Moscow, Idaho vs.
Moscow, Russia; ad infinitum...
 
JKop

Andrew L posted:
Hello all,

What strategy should I use in solving the following problem? I have a
list of unicode strings which I would like to compare with its English
language 'equivalent.' eg

8-Bit chars will suffice.
"reykjavík" (note the accent above the i) should match both "reykjavík"
and "reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted
a's, o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Here's a function that checks that every 'ß' in the German string corresponds
to 's','s' in the English one, and that all other characters match exactly:

// Assumes both strings use an 8-bit encoding such as ISO-8859-1,
// in which 'ß' is the single byte 0xDF.
bool Compare(const char* pGerman, const char* pEnglish)
{
    const char kEszett = '\xDF';

    for ( ; *pGerman != '\0'; ++pGerman)
    {
        if (*pGerman == kEszett)
        {
            // 'ß' must be spelled "ss" in the English string.
            if (pEnglish[0] != 's' || pEnglish[1] != 's') return false;
            pEnglish += 2;
        }
        else
        {
            // Any other character must match exactly.
            if (*pEnglish != *pGerman) return false;
            ++pEnglish;
        }
    }

    // Both strings must end at the same point.
    return *pEnglish == '\0';
}


Or you could go through it character by character and perform tests based
upon each character; it'd be faster that way too.


-JKop
 
Meikel Weber

Andrew L said:
What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

I don't think the c++ standard libraries will help you here. You need to do
unicode normalization and comparison.

Here are a few hints:

http://oss.software.ibm.com/icu/
(open source)

http://www.roguewave.com/support/docs/leif/sourcepro/html/i18nug/5.html
(I think this one is commercial)

http://www.unicode.org/unicode/reports/tr15/
( the spec )

Of course there are many more libs out there, just google around.

Greetings from Bonn, Germany
Meikel Weber
http://www.meikel.com
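
To give a feel for what normalization-based matching involves, here is a minimal sketch that uses only the standard library. It is not an implementation of the TR15 algorithm: it assumes UTF-32 input, the decomposition table is a tiny hand-written excerpt, and the function names Decompose, IsCombiningMark and StripMarks are hypothetical. The idea is to split each precomposed letter into base letter plus combining mark, then drop the marks in the Combining Diacritical Marks block (U+0300 to U+036F).

```cpp
#include <string>
#include <unordered_map>

// Hypothetical helper: applies a few canonical decompositions.
// A real implementation would take the full data from the Unicode
// Character Database (or simply call a library such as ICU).
std::u32string Decompose(const std::u32string& text)
{
    static const std::unordered_map<char32_t, std::u32string> kDecomp = {
        {U'\u00E1', U"a\u0301"},  // a with acute  -> a + combining acute
        {U'\u00ED', U"i\u0301"},  // i with acute  -> i + combining acute
        {U'\u00E4', U"a\u0308"},  // a with umlaut -> a + combining diaeresis
        {U'\u00F6', U"o\u0308"},  // o with umlaut -> o + combining diaeresis
        {U'\u00FC', U"u\u0308"},  // u with umlaut -> u + combining diaeresis
    };

    std::u32string out;
    for (char32_t c : text) {
        auto it = kDecomp.find(c);
        if (it != kDecomp.end()) {
            out += it->second;
        } else {
            out += c;
        }
    }
    return out;
}

// True for code points in the Combining Diacritical Marks block.
bool IsCombiningMark(char32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Decomposes the text and then removes the combining marks,
// leaving only the base letters.
std::u32string StripMarks(const std::u32string& text)
{
    std::u32string out;
    for (char32_t c : Decompose(text)) {
        if (!IsCombiningMark(c)) out += c;
    }
    return out;
}
```

Note that 'ß' has no Unicode decomposition, so this step leaves it untouched; mapping it to "ss" needs a separate rule, for example the case-folding or transliteration facilities of a library like ICU.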
 
CFG

I'm afraid the C++ stdlib will be of little help.

Take a look at the ICU library. It has related functionality:
1. Transliteration
http://oss.software.ibm.com/icu/userguide/Transform.html
2. Language specific case mapping
http://oss.software.ibm.com/icu/userguide/caseMappings.html#lang_specific
3. Unicode string normalization:
http://oss.software.ibm.com/icu/userguide/normalization.html

But the task is more difficult than it might look at first. What you are
trying to do is locale and language dependent and may require some
linguistic knowledge or even consulting a dictionary. For instance, you
may have to deal with inflected forms, compound words in a foreign language,
spelling variations such as "organization" and "organisation", and so forth.

Simple & general character manipulation algorithms/rules will not be able to
handle it right.

You've been warned.
 