Unicode strings

Discussion in 'C++' started by Andrew L, Jun 22, 2004.

  1. Andrew L

    Andrew L Guest

    Hello all,

    What strategy should I use in solving the following problem? I have a list
    of Unicode strings which I would like to compare with their English-language
    'equivalents', e.g.

    "reykjavík" (note the accent above the i) should match both "reykjavík" and
    "reykjavik" (being the English equivalent).

    Similarly the German language letter 'ß' should match "ss", umlauted a's,
    o's etc should match a,o etc.

    How would I go about doing this using the c++ stdlib?

    Many thanks,

    Andrew
    Andrew L, Jun 22, 2004
    #1

  2. On Tue, 22 Jun 2004 13:12:33 +0100, Andrew L
    <> wrote:

    >Hello all,
    >
    >What strategy should I use in solving the following problem? I have a list
    >of unicode strings which I would like to compare with its English language
    >'equivalent.' eg
    >
    >"reykjavík" (note the accent above the i) should match both "reykjavík" and
    >"reykjavik" (being the English equivalent).
    >
    >Similarly the German language letter 'ß' should match "ss", umlauted a's,
    >o's etc should match a,o etc.
    >
    >How would I go about doing this using the c++ stdlib?
    >
    >Many thanks,
    >
    >Andrew


    You have to implement some kind of lookup table or dictionary.

    Although the C++ standard library supports locales, I don't think there
    is a way of comparing two strings in *different* locales ... especially
    for Unicode strings, since there is no locale for Unicode -- Unicode
    covers *all* locales.
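
    A minimal sketch of such a lookup table, assuming the strings have
    already been decoded to UTF-32 (std::u32string). The table contents and
    the names foldTable, foldToAscii and looselyEqual are made up for
    illustration; a real table would need far more entries.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, deliberately tiny table: code point -> ASCII replacement.
static const std::unordered_map<char32_t, std::string>& foldTable()
{
    static const std::unordered_map<char32_t, std::string> table = {
        {U'\u00E1', "a"}, {U'\u00E4', "a"}, {U'\u00E9', "e"},
        {U'\u00ED', "i"}, {U'\u00F3', "o"}, {U'\u00F6', "o"},
        {U'\u00FC', "u"}, {U'\u00DF', "ss"},
    };
    return table;
}

// Fold a UTF-32 string to ASCII where the table knows how.
// ASCII passes through; unknown non-ASCII code points become '?'.
std::string foldToAscii(const std::u32string& text)
{
    std::string out;
    for (char32_t cp : text) {
        auto it = foldTable().find(cp);
        if (it != foldTable().end())
            out += it->second;
        else if (cp < 0x80)
            out += static_cast<char>(cp);
        else
            out += '?';
    }
    return out;
}

// Two strings match if their folded forms are identical.
bool looselyEqual(const std::u32string& a, const std::u32string& b)
{
    return foldToAscii(a) == foldToAscii(b);
}
```

    With this table, looselyEqual(U"reykjav\u00EDk", U"reykjavik") holds,
    and U"stra\u00DFe" folds to "strasse".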

    Also, there are many words which mean one thing in one language (or
    locale) and something else in a different language, although they are
    spelled exactly the same. "Band" in German might be a different word
    than "band" in English, for example.

    Even if you get rid of the special characters, you must really watch
    out (e.g. German "Präservative" and English "preservative" <g>).


    --
    Bob Hairgrove
    Bob Hairgrove, Jun 22, 2004
    #2

  3. Bob Hairgrove wrote:
    >

    [snip]
    >
    > Even if you get rid of the special characters, you must really watch
    > out (e.g. German "Präservative" and English "preservative" <g>).


    One of the most puzzling words for German-speaking students of English
    is the word: eventually. There is a very similar word in German: eventuell,
    which means: maybe. But eventually means: finally

    --
    Karl Heinz Buchegger
    Karl Heinz Buchegger, Jun 22, 2004
    #3
  4. Andrew L

    Andrew L Guest

    Bob Hairgrove wrote:
    > You have to implement some kind of lookup table or dictionary.
    >
    > Although STL supports locales, I don't think there is a way of
    > comparing two strings in *different* locales ... especially for
    > Unicode strings, since there is no locale for Unicode -- Unicode
    > covers *all* locales.


    This is what I suspected. Many thanks for that. Now, I wonder if such a
    dictionary has already been implemented?

    > Also, there are many words which mean one thing in one language (or
    > locale) and something else in a different language, although they are


    This won't really be a problem because the strings I'm dealing with are
    geographical elements - placenames etc. I've dealt with localised versions
    of these (eg Koln and Cologne are equivalent); it's essentially just a
    problem with accents.

    Many thanks,

    Andrew
    Andrew L, Jun 22, 2004
    #4
  5. Ips

    Ips Guest

    "Karl Heinz Buchegger" <> wrote in message
    news:...
    > Bob Hairgrove wrote:
    > >

    > [snip]
    > >
    > > Even if you get rid of the special characters, you must really watch
    > > out (e.g. German "Präservative" and English "preservative" <g>).

    >
    > One of the most puzzling words for German-speaking students of English
    > is the word: eventually. There is a very similar word in German:
    > eventuell, which means: maybe. But eventually means: finally
    >

    The same in Polish: 'ewentualnie' means 'maybe'.
    Another puzzling example: 'aktualnie' (like English 'actually') means
    'currently' - a very often misused word.
    Sorry for the off-topic.

    regards,
    Ips
    Ips, Jun 22, 2004
    #5
  6. "Andrew L" <> wrote in message
    news:cb97lt$3qr$1$...
    > Hello all,
    >
    > What strategy should I use in solving the following problem? I have a list
    > of unicode strings which I would like to compare with its English language
    > 'equivalent.' eg
    >
    > "reykjavík" (note the accent above the i) should match both "reykjavík"
    > and "reykjavik" (being the English equivalent).
    >
    > Similarly the German language letter 'ß' should match "ss", umlauted a's,
    > o's etc should match a,o etc.
    >
    > How would I go about doing this using the c++ stdlib?
    >
    > Many thanks,
    >


    Well, were it my job, I'd just put all the words into one big matching
    dictionary, where each entry has a pointer (or index) to the first
    occurring synonym. The first synonym can point to null or to itself.

    Or, if you want to be fancy, each synonym can point to the previous
    synonym, with the first pointing to the last. This would allow you to
    print all synonyms of a given word.
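
    The first variant (every entry refers to the first synonym of its group)
    could be sketched roughly like this. The class name SynonymTable and its
    members are hypothetical, and std::string keys are assumed to hold the
    names in whatever single encoding the application uses.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of a synonym dictionary: every word maps to the word that was
// listed first in its group (the first synonym maps to itself).
class SynonymTable
{
public:
    // Register a group of equivalent names; the first one is canonical.
    void addGroup(const std::vector<std::string>& names)
    {
        if (names.empty()) return;
        for (const std::string& name : names)
            canonical_[name] = names.front();
    }

    // Two words match if they are identical or share a canonical entry.
    bool equivalent(const std::string& a, const std::string& b) const
    {
        return canonicalOf(a) == canonicalOf(b);
    }

private:
    // Unknown words are treated as their own canonical form.
    const std::string& canonicalOf(const std::string& word) const
    {
        auto it = canonical_.find(word);
        return it == canonical_.end() ? word : it->second;
    }

    std::unordered_map<std::string, std::string> canonical_;
};
```

    After addGroup({"Cologne", "Koln", "Koeln"}), equivalent("Koln",
    "Cologne") is true. Note that this design makes every group transitive
    by construction, which is fine for place names but not for synonyms in
    general.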

    You do have to be careful, though. For city names you may not have a
    problem, but for other kinds of synonyms, just because A is synonymous
    with B and C is synonymous with B does not imply that A is synonymous
    with C (B may change meaning depending on the comparison).

    Rufus
    Rufus V. Smith, Jun 22, 2004
    #6
  7. On Tue, 22 Jun 2004 13:49:10 +0100, Andrew L
    <> wrote:

    >This won't really be a problem because the strings I'm dealing with are
    >geographical elements - placenames etc. I've dealt with localised versions
    >of these (eg Koln and Cologne are equivalent) it's essentially just a
    >problem with accents.


    And watch out for Paris, France vs. Paris, Texas; Moscow, Idaho vs.
    Moscow, Russia; ad infinitum...


    --
    Bob Hairgrove
    Bob Hairgrove, Jun 22, 2004
    #7
  8. JKop

    JKop Guest

    Andrew L posted:

    > Hello all,
    >
    > What strategy should I use in solving the following problem? I have a
    > list of unicode strings which I would like to compare with its English
    > language 'equivalent.' eg


    8-Bit chars will suffice.

    > "reykjavík" (note the accent above the i) should match both "reykjavík"
    > and "reykjavik" (being the English equivalent).
    >
    > Similarly the German language letter 'ß' should match "ss", umlauted
    > a's, o's etc should match a,o etc.
    >
    > How would I go about doing this using the c++ stdlib?


    Here's a function that checks that every 'ß' in the German string lines up
    with 's','s' in the English one. Other characters are stepped over but not
    compared:

    // Assumes an 8-bit encoding such as ISO-8859-1, where 'ß' is the
    // single byte 0xDF.
    bool Compare(const char* pGerman, const char* pEnglish)
    {
        const char kSharpS = '\xDF';

        for ( ; *pGerman != '\0'; ++pGerman)
        {
            if (*pGerman == kSharpS)
            {
                // 'ß' must line up with two consecutive 's' characters.
                if (pEnglish[0] != 's' || pEnglish[1] != 's') return false;
                pEnglish += 2;
            }
            else
            {
                // Any other character: just step past its counterpart.
                if (*pEnglish == '\0') return false;
                ++pEnglish;
            }
        }

        // Further tests (e.g. for umlauts) could be added in the loop above.
        return true;
    }


    Or you could go through it character by character and perform tests based
    upon each character; it'd be faster that way too.


    -JKop
    JKop, Jun 22, 2004
    #8
  9. Meikel Weber

    Meikel Weber Guest

    > What strategy should I use in solving the following problem? I have a list
    > of unicode strings which I would like to compare with its English language
    > 'equivalent.' eg
    >
    > "reykjavík" (note the accent above the i) should match both "reykjavík"
    > and "reykjavik" (being the English equivalent).
    >
    > Similarly the German language letter 'ß' should match "ss", umlauted a's,
    > o's etc should match a,o etc.
    >
    > How would I go about doing this using the c++ stdlib?


    I don't think the C++ standard libraries will help you here. You need to do
    Unicode normalization and comparison.

    Here are a few hints:

    http://oss.software.ibm.com/icu/
    (open source)

    http://www.roguewave.com/support/docs/leif/sourcepro/html/i18nug/5.html
    (I think this one is commercial)

    http://www.unicode.org/unicode/reports/tr15/
    ( the spec )

    Of course there are many more libs out there, just google around.
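
    To illustrate the idea behind normalization-based matching without any
    library: decompose each precomposed letter into base letter plus
    combining mark, then drop the marks. This is only a sketch. The
    three-entry table is a hand-written excerpt (a real implementation takes
    the decompositions from the Unicode Character Database), the function
    names are made up, and UTF-32 input is assumed.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical excerpt of canonical decompositions (precomposed -> base + mark).
static const std::unordered_map<char32_t, std::u32string>& decompositions()
{
    static const std::unordered_map<char32_t, std::u32string> table = {
        {U'\u00E4', U"a\u0308"},  // a with diaeresis = a + COMBINING DIAERESIS
        {U'\u00ED', U"i\u0301"},  // i with acute     = i + COMBINING ACUTE ACCENT
        {U'\u00F6', U"o\u0308"},  // o with diaeresis = o + COMBINING DIAERESIS
    };
    return table;
}

// Combining Diacritical Marks block, U+0300..U+036F.
static bool isCombiningMark(char32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// Decompose each code point where the table allows, then drop combining marks.
std::u32string stripAccents(const std::u32string& text)
{
    std::u32string out;
    for (char32_t cp : text) {
        auto it = decompositions().find(cp);
        const std::u32string piece =
            (it != decompositions().end()) ? it->second : std::u32string(1, cp);
        for (char32_t part : piece)
            if (!isCombiningMark(part)) out += part;
    }
    return out;
}
```

    Both the precomposed U"reykjav\u00EDk" and the already decomposed
    U"reykjavi\u0301k" come out as U"reykjavik". 'ß' has no canonical
    decomposition, so it would still need a separate mapping to "ss".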

    Greetings from Bonn, Germany
    Meikel Weber
    http://www.meikel.com
    Meikel Weber, Jun 22, 2004
    #9
  10. CFG

    CFG Guest

    I'm afraid the C++ stdlib will be of little help.

    Take a look at the ICU library. It has related functionality:
    1. Transliteration
    http://oss.software.ibm.com/icu/userguide/Transform.html
    2. Language specific case mapping
    http://oss.software.ibm.com/icu/userguide/caseMappings.html#lang_specific
    3. Unicode string normalization:
    http://oss.software.ibm.com/icu/userguide/normalization.html

    But the task is more difficult than it might look at first. What you are
    trying to do is locale and language dependent and may require some
    linguistic knowledge or even consulting a dictionary: for instance,
    handling inflected forms, handling compound words in a foreign language,
    matching spelling variations such as "organization" and "organisation",
    and so on.

    Simple & general character manipulation algorithms/rules will not be able to
    handle it right.

    You've been warned.
    CFG, Jun 23, 2004
    #10