Unicode strings

Andrew L · Jun 22, 2004

Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with their English-language
'equivalents', e.g.

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,

Andrew

Bob Hairgrove · Jun 22, 2004

You have to implement some kind of lookup table or dictionary.

Although STL supports locales, I don't think there is a way of
comparing two strings in *different* locales ... especially for
Unicode strings, since there is no locale for Unicode -- Unicode
covers *all* locales.
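
A minimal sketch of such a lookup table, assuming the strings are held as UTF-32 code points in std::u32string (which needs C++11). The names foldTable, foldToAscii and matches are made up for illustration, and the table lists only a handful of characters; a real one would need many more entries.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical fold table: maps a few accented code points to their ASCII
// replacements. Only a small sample of characters is listed here.
inline const std::unordered_map<char32_t, std::u32string>& foldTable()
{
    static const std::unordered_map<char32_t, std::u32string> table = {
        {U'\u00E1', U"a"},   // a with acute
        {U'\u00E4', U"a"},   // a with umlaut
        {U'\u00E9', U"e"},   // e with acute
        {U'\u00ED', U"i"},   // i with acute
        {U'\u00F6', U"o"},   // o with umlaut
        {U'\u00FC', U"u"},   // u with umlaut
        {U'\u00DF', U"ss"},  // sharp s
    };
    return table;
}

// Replace every code point found in the table; copy all others unchanged.
inline std::u32string foldToAscii(const std::u32string& in)
{
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        auto it = foldTable().find(c);
        if (it != foldTable().end())
            out += it->second;
        else
            out += c;
    }
    return out;
}

// Two names match if they are equal after folding.
inline bool matches(const std::u32string& a, const std::u32string& b)
{
    return foldToAscii(a) == foldToAscii(b);
}
```

With this table, matches(U"reykjav\u00EDk", U"reykjavik") holds because both sides fold to the same ASCII string, and a sharp s on one side matches "ss" on the other.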

Also, there are many words which mean one thing in one language (or
locale) and something else in a different language, although they are
spelled exactly the same. "Band" in German might be a different word
than "band" in English, for example.

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).

Karl Heinz Buchegger · Jun 22, 2004

Bob said:
[snip]

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).

One of the most puzzling words for German-speaking students of English
is the word "eventually". There is a very similar word in German, "eventuell",
which means "maybe". But "eventually" means "finally".

Andrew L · Jun 22, 2004

Bob said:
You have to implement some kind of lookup table or dictionary.

Although STL supports locales, I don't think there is a way of
comparing two strings in *different* locales ... especially for
Unicode strings, since there is no locale for Unicode -- Unicode
covers *all* locales.

This is what I suspected. Many thanks for that. Now, I wonder if such a
dictionary has already been implemented?

Also, there are many words which mean one thing in one language (or
locale) and something else in a different language, although they are

This won't really be a problem because the strings I'm dealing with are
geographical elements - placenames etc. I've already dealt with localised
versions of these (e.g. Koln and Cologne are equivalent), so it's essentially
just a problem with accents.

Many thanks,

Andrew

Ips · Jun 22, 2004

Karl Heinz Buchegger said:
Bob said:

[snip]

Even if you get rid of the special characters, you must really watch
out (e.g. German "Präservative" and English "preservative" <g>).

One of the most puzzling words for German-speaking students of English
is the word "eventually". There is a very similar word in German, "eventuell",
which means "maybe". But "eventually" means "finally".

The same in Polish: 'ewentualnie' means 'maybe'.
Another puzzling example is 'aktualnie' (it looks like English 'actually'),
which means 'currently' and is very often misused.
Sorry for the off-topic.

regards,
Ips

Rufus V. Smith · Jun 22, 2004

Andrew L said:
Hello all,

What strategy should I use in solving the following problem? I have a list
of unicode strings which I would like to compare with its English language
'equivalent.' eg

"reykjavík" (note the accent above the i) should match both "reykjavík" and
"reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted a's,
o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Many thanks,

Well, were it my job, I'd just put all words into a big matching dictionary,
and each entry has a pointer (or index) to the first occurring synonym. The
first synonym can point to null or to itself.

Or if you want to be fancy, all the synonyms can point to the previous
synonym, and the first point to the last. This would allow you to print all
synonyms to a given word.

Although you have to be careful. For city names, you may not have a problem,
but for other types of synonym use you have to be careful. Just because A is
synonymous with B, and C is synonymous with B, does not imply A is synonymous
with C. (B may change meanings depending on the comparison).
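
A rough sketch of the first variant, where every spelling stores the index of the first synonym in its group. The class name PlaceNames and its members are made up for illustration, and the strings are compared byte for byte, so both sides are assumed to use the same encoding.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative dictionary: each spelling maps to the index of the
// first-registered (canonical) spelling of its group.
class PlaceNames {
public:
    // Register a group of equivalent spellings; the first one is canonical.
    void addGroup(const std::vector<std::string>& spellings)
    {
        if (spellings.empty()) return;
        const std::size_t id = canonical_.size();
        canonical_.push_back(spellings.front());
        // emplace keeps an existing entry, so a spelling that was already
        // registered stays in its original group.
        for (const std::string& s : spellings) index_.emplace(s, id);
    }

    // True if both spellings are known and belong to the same group.
    bool same(const std::string& a, const std::string& b) const
    {
        auto ia = index_.find(a);
        auto ib = index_.find(b);
        return ia != index_.end() && ib != index_.end()
            && ia->second == ib->second;
    }

private:
    std::vector<std::string> canonical_;                  // first synonym per group
    std::unordered_map<std::string, std::size_t> index_;  // spelling -> group index
};
```

After names.addGroup({"Koln", "Cologne"}), the call names.same("Koln", "Cologne") returns true. Because a spelling can belong to only one group here, this sketch sidesteps the A/B/C problem above rather than solving it.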

Rufus

Bob Hairgrove · Jun 22, 2004

Andrew L said:
This won't really be a problem because the strings I'm dealing with are
geographical elements - placenames etc. I've already dealt with localised
versions of these (e.g. Koln and Cologne are equivalent), so it's essentially
just a problem with accents.

And watch out for Paris, France vs. Paris, Texas; Moscow, Idaho vs.
Moscow, Russia; ad infinitum...

JKop · Jun 22, 2004

Andrew L posted:

Hello all,

What strategy should I use in solving the following problem? I have a
list of unicode strings which I would like to compare with its English
language 'equivalent.' eg

8-Bit chars will suffice.

"reykjavík" (note the accent above the i) should match both "reykjavík"
and "reykjavik" (being the English equivalent).

Similarly the German language letter 'ß' should match "ss", umlauted
a's, o's etc should match a,o etc.

How would I go about doing this using the c++ stdlib?

Here's a function that checks that every 'ß' in the German string corresponds
to 's','s' in the English one, and that all other characters are identical.
It assumes an 8-bit encoding such as ISO-8859-1, where 'ß' is the single
byte 0xDF:

// Returns true if pEnglish equals pGerman with every 'ß' replaced by "ss".
bool Compare(const char* pGerman, const char* pEnglish)
{
    const char kSharpS = '\xDF';  // 'ß' in ISO-8859-1 (assumed encoding)

    for ( ; *pGerman != '\0'; ++pGerman)
    {
        if (*pGerman == kSharpS)
        {
            // 'ß' must be matched by two consecutive 's' characters.
            if (pEnglish[0] != 's' || pEnglish[1] != 's') return false;

            pEnglish += 2;
        }
        else
        {
            // Any other character must match exactly.
            if (*pGerman != *pEnglish) return false;

            ++pEnglish;
        }
    }

    // Both strings must end at the same point.
    return *pEnglish == '\0';
}

Going through both strings character by character like this also lets you
add a test for each further special character, and it is fast.

-JKop

Meikel Weber · Jun 22, 2004

I don't think the c++ standard libraries will help you here. You need to do
unicode normalization and comparison.

Here are a few hints:

http://oss.software.ibm.com/icu/
(open source)

http://www.roguewave.com/support/docs/leif/sourcepro/html/i18nug/5.html
(I think this one is commercial)

http://www.unicode.org/unicode/reports/tr15/
( the spec )

Of course there are many more libs out there, just google around.
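
To illustrate why normalization helps, here is a minimal sketch of the step that comes after it. It assumes the string has already been converted to canonical decomposed form (NFD), for example by one of the libraries above, and is held as UTF-32 code points in std::u32string (C++11). The names isCombiningMark and stripMarks are made up for illustration, and only the basic Combining Diacritical Marks block (U+0300 to U+036F) is checked.

```cpp
#include <string>

// True for code points in the Combining Diacritical Marks block.
// Other blocks of combining marks exist and are ignored in this sketch.
inline bool isCombiningMark(char32_t c)
{
    return c >= 0x0300 && c <= 0x036F;
}

// Drop combining marks from a string that is assumed to be in NFD already.
// In NFD an accented letter is stored as base letter + combining mark,
// so removing the marks leaves the unaccented base letters.
inline std::u32string stripMarks(const std::u32string& nfd)
{
    std::u32string out;
    out.reserve(nfd.size());
    for (char32_t c : nfd) {
        if (!isCombiningMark(c)) out += c;
    }
    return out;
}
```

For example, stripMarks(U"reykjavi\u0301k") gives U"reykjavik". Note that 'ß' has no decomposition, so the mapping to "ss" still has to be handled separately.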

Greetings from Bonn, Germany
Meikel Weber
http://www.meikel.com

CFG · Jun 23, 2004

I'm afraid the C++ stdlib will be of little help.

Take a look at the ICU library. It has related functionality:
1. Transliteration
http://oss.software.ibm.com/icu/userguide/Transform.html
2. Language specific case mapping
http://oss.software.ibm.com/icu/userguide/caseMappings.html#lang_specific
3. Unicode string normalization:
http://oss.software.ibm.com/icu/userguide/normalization.html

But the task is more difficult than it might look at first. What you are
trying to do is locale and language dependent and may require some
linguistic knowledge or even consulting a dictionary. For instance, there are
inflected forms, compound words in a foreign language, spelling variations
such as "organization" and "organisation", and so forth.

Simple & general character manipulation algorithms/rules will not be able to
handle it right.

You've been warned.

CFG · Jun 23, 2004

See also:

http://www.basistech.com/products/index.html

http://www-306.ibm.com/software/globalization/topics/languageware/functionality.jsp
