Is there a better way to convert foreign characters?

Discussion in 'Perl Misc' started by Guy, Apr 19, 2009.

  1. Guy

    Guy Guest

    I'm sure there are many ways to do this, but is there a much better way?

    $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    $word=lc($value);

    I want $word to equal the English version of $value. So if
    $value="Théodore", I want $word="theodore". I'd like to do it in one
    statement if possible but I think I have to convert $value in one statement
    and then assign it to $word in another statement.

    Cheers!
    Guy
    Guy, Apr 19, 2009
    #1

  2. Guy wrote:
    > I'm sure there are many ways to do this, but is there a much better way?
    >
    > $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    > $word=lc($value);
    >
    > I want $word to equal the english version of $value. So if
    > $value="Théodore", I want $word="theodore". I'd like to do it in one
    > statement if possible


    ( $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Apr 19, 2009
    #2
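
The one-statement form above works because a parenthesised assignment is itself an lvalue, so tr/// edits the fresh copy rather than $value. A minimal sketch follows, assuming the script is saved as UTF-8 (hence `use utf8`); the helper name `fold_french` is made up for illustration. Without `use utf8`, a UTF-8 source file would make tr/// operate on individual bytes and mangle the text.

```perl
use strict;
use warnings;
use utf8;                               # assumption: this file is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper wrapping the idiom from the post above.
sub fold_french {
    my ($value) = @_;
    # The parenthesised assignment yields $word as an lvalue;
    # tr/// then modifies $word in place and leaves $value alone.
    ( my $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    return $word;
}

print fold_french("Théodore"), "\n";    # prints "theodore"
```

The uppercase letters stay in the search list on purpose: depending on the Perl version and on how the string is stored internally, lc may leave characters such as É untouched. On Perl 5.14 or later, `my $word = lc($value) =~ tr/.../.../r;` is an alternative, since the /r flag returns the modified copy.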

  3. Guy

    Guy Guest

    "Gunnar Hjalmarsson" <> wrote in message news:
    ...
    > Guy wrote:
    >> I'm sure there are many ways to do this, but is there a much better way?
    >>
    >> $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    >> $word=lc($value);
    >>
    >> I want $word to equal the english version of $value. So if
    >> $value="Théodore", I want $word="theodore". I'd like to do it in one
    >> statement if possible

    >
    > ( $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;


    Perfect, thank you muchly!
    Guy

    >
    > --
    > Gunnar Hjalmarsson
    > Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Guy, Apr 19, 2009
    #3
  4. "Guy" <> wrote:
    >I'm sure there are many ways to do this, but is there a much better way?
    >
    >$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    >$word=lc($value);
    >
    >I want $word to equal the english version of $value. So if
    >$value="Théodore", I want $word="theodore".


    This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
    am surprised that people still want to make the same mistakes.

    First of all how would you react, if someone is mangling your name?
    There is no "English version" of my first name.

    And second there are cases, where your "English version" actually has a
    very different meaning, like in Swedish your method would rename Mrs.
    Hear into Mrs. Whore. Are you sure you want to do that?

    And last UTF-8 is such a nice character set, there is really, really no
    excuse any more to not use it. 10 years ago the story was somewhat
    different, because many programs didn't support it yet at that time.

    jue
    Jürgen Exner, Apr 19, 2009
    #4
  5. On 2009-04-19, Jürgen Exner <> wrote:
    > "Guy" <> wrote:
    >>I'm sure there are many ways to do this, but is there a much better way?
    >>
    >>$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    >>$word=lc($value);
    >>
    >>I want $word to equal the english version of $value. So if
    >>$value="Théodore", I want $word="theodore".


    > This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
    > am surprised that people still want to make the same mistakes.


    People do not necessarily eat bullshit without objection. What a person
    wants is A Perfect Idea if *what one wants to do* is exactly this.

    And quite often it is.

    > First of all how would you react, if someone is mangling your name?


    > There is no "English version" of my first name.


    Try to tell this to somebody issuing IDs in an English-speaking country...

    > And second there are cases, where your "English version" actually has a
    > very different meaning, like in Swedish your method would rename Mrs.
    > Hear into Mrs. Whore. Are you sure you want to do that?


    And what do you propose to do if you need to create a file name on a
    filesystem which supports ASCII only?!

    > And last UTF-8 is such a nice character set, there is really, really no
    > excuse any more to not use it.


    You are joking, really? (There is a small set of tasks which allows use
    of Unicode; but it is VERY far from being universal...)

    [And I even ignore the fact that UTF-8 is not a charset... ;-]

    > 10 years ago the story was somewhat
    > different, because many programs didn't support it yet at that time.


    Try to explain this to my DVD player...

    Hope this helps,
    Ilya
    Ilya Zakharevich, Apr 20, 2009
    #5
  6. Jürgen Exner wrote:
    > "Guy" <> wrote:
    >> I'm sure there are many ways to do this, but is there a much better way?
    >>
    >> $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    >> $word=lc($value);
    >>
    >> I want $word to equal the english version of $value. So if
    >> $value="Théodore", I want $word="theodore".

    >
    > This is A Very Bad Idea(TM).


    It's probably not the OP's idea, it's just homework.

    > First of all how would you react, if someone is mangling your name?
    > There is no "English version" of my first name.


    Agreed.

    > And second there are cases, where your "English version" actually has a
    > very different meaning, like in Swedish your method would rename Mrs.
    > Hear into Mrs. Whore.


    Confirmed.

    > And last UTF-8 is such a nice character set, there is really, really no
    > excuse any more to not use it.


    Well, personally I usually stick to latin1. I suppose all the characters
    in the above tr/// are recognized by that charset, which is also kind of
    Internet standard in the Western world, isn't it?

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Apr 20, 2009
    #6
  7. Guy

    Guest

    On Mon, 20 Apr 2009 02:00:49 +0200, Gunnar Hjalmarsson <> wrote:

    >Jürgen Exner wrote:
    >> "Guy" <> wrote:
    >>> I'm sure there are many ways to do this, but is there a much better way?
    >>>
    >>> $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    >>> $word=lc($value);
    >>>
    >>> I want $word to equal the english version of $value. So if
    >>> $value="Théodore", I want $word="theodore".

    >>
    >> This is A Very Bad Idea(TM).

    >
    >It's probably not the OP's idea, it's just homework.
    >
    >> First of all how would you react, if someone is mangling your name?
    >> There is no "English version" of my first name.

    >
    >Agreed.
    >
    >> And second there are cases, where your "English version" actually has a
    >> very different meaning, like in Swedish your method would rename Mrs.
    >> Hear into Mrs. Whore.

    >
    >Confirmed.
    >
    >> And last UTF-8 is such a nice character set, there is really, really no
    >> excuse any more to not use it.

    >
    >Well, personally I usually stick to latin1. Suppose all the characters
    >in the above tr/// are recognized by that charset, which is also kind of
    >Internet standard in the Western world, isn't it?


    Ahh... Tweedledee and Tweedledum

    -sln
    , Apr 20, 2009
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    Guy wrote:

    > I'm sure there are many ways to do this, but is there a much better way?
    >
    > $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    > $word=lc($value);
    >
    > I want $word to equal the english version of $value. So if
    > $value="Théodore", I want $word="theodore". I'd like to do it in one
    > statement if possible but I think I have to convert $value in one statement
    > and then assign it to $word in another statement.


    perl -Mstrict -Mutf8 -MText::Unidecode -wle '
    my $s = "àâÀéèëêÉÊçÇîïôÔùû";
    print Text::Unidecode::unidecode( $s );
    '
    aaAeeeeEEcCiioOuu

    --
    Ruud
    Dr.Ruud, Apr 20, 2009
    #8
  9. Dr.Ruud wrote:
    >
    > perl -Mstrict -Mutf8 -MText::Unidecode -wle '
    > my $s = "àâÀéèëêÉÊçÇîïôÔùû";
    > print Text::Unidecode::unidecode( $s );
    > '
    > aaAeeeeEEcCiioOuu


    The purpose of that module is to handle non-Roman characters. What makes
    you believe those characters are Unicode?

    $ perl -MEncode -le '
    $octets = "àâÀéèëêÉÊçÇîïôÔùû";
    print "Raw: ", $octets;
    print "Latin-1: ", decode "ISO-8859-1", $octets;
    print "ANSI: ", decode "Windows-1252", $octets;
    '
    Raw: àâÀéèëêÉÊçÇîïôÔùû
    Latin-1: àâÀéèëêÉÊçÇîïôÔùû
    ANSI: àâÀéèëêÉÊçÇîïôÔùû
    $

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, Apr 20, 2009
    #9
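
The distinction this exchange turns on is between octets and decoded characters. A small sketch, using only the core Encode module and writing the bytes with escapes so that it does not depend on how the file is saved, shows the same name arriving in two encodings and decoding to the same text. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same name as raw octets in two different encodings.
my $utf8_octets   = "Th\xC3\xA9odore";   # "é" stored as two bytes (UTF-8)
my $latin1_octets = "Th\xE9odore";       # "é" stored as one byte (ISO-8859-1)

# decode() turns octets into a string of characters.
my $from_utf8   = decode('UTF-8',      $utf8_octets);
my $from_latin1 = decode('ISO-8859-1', $latin1_octets);

printf "octets:     %d and %d\n", length($utf8_octets), length($latin1_octets);
printf "characters: %d and %d\n", length($from_utf8),   length($from_latin1);
print  "same text after decoding\n" if $from_utf8 eq $from_latin1;
```

A tr/// or lc applied before decoding sees the bytes, not the letters, which is why the encoding of the input has to be known before any accent folding is attempted.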
  10. On 2009-04-20, Gunnar Hjalmarsson <> wrote:
    > Dr.Ruud wrote:
    >>
    >> perl -Mstrict -Mutf8 -MText::Unidecode -wle '
    >> my $s = "àâÀéèëêÉÊçÇîïôÔùû";
    >> print Text::Unidecode::unidecode( $s );
    >> '
    >> aaAeeeeEEcCiioOuu

    >
    > The purpose of that module is to handle non-Roman characters. What makes
    > you believe those characters are Unicode?


    Since this code is not in scope of `use locale', the characters are,
    by Perl semantic, in Unicode...

    Hope this helps,
    Ilya
    Ilya Zakharevich, Apr 21, 2009
    #10
  11. On 2009-04-19, Jürgen Exner <> wrote:
    > This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
    > am surprised that people still want to make the same mistakes.


    Right.

    You have read a great book by an author called Żmiwór Ściepełkowski. You
    want to look up in your favorite library database what else he has written.
    What do you do?

    a) you enter "Zmiwor Sciepelkowski"
    b) you figure out what these characters are, from what set, and in the
    end you spend half an hour trying to locate "Ś" and other characters (and
    not Ŝ, Š, Ṥ, Ṧ or Ṩ or one of a dozen of other variants, which are, in
    fact, all very different, although they might look very similar).

    You guys from the former Latin 1 set have it easy talking. Latin 1
    characters (e.g. umlauts) can be found and easily inserted almost anywhere
    and on any system.

    Having a clever module that can uniquely assign various weird characters to
    the basic ASCII set would be a really great thing, and I would be really
    grateful to anyone who could offer a better solution than that of the OP.

    > And last UTF-8 is such a nice character set, there is really, really no
    > excuse any more to not use it. 10 years ago the story was somewhat
    > different, because many programs didn't support it yet at that time.


    Only from a (former) Latin-1 perspective.

    j.
    January Weiner, Apr 21, 2009
    #11
  12. I'm assuming the goal is something like ensuring straße.txt doesn't
    overwrite strase.txt nor strasse.txt on a filesystem where file names
    can only use the printable ASCII character repertoire.

    You want the target characters to resemble the originals for mnemonic
    purposes?

    January Weiner wrote:
    > Having a clever module that can uniquely assign various weird characters to
    > the basic ASCII set would be a really great thing,


    When Unicode has over 100,000 assigned code points and ASCII only 128, I
    don't see how you could do this "uniquely" on a single character to
    single character basis.

    If you are replacing single Unicode characters with multiple ASCII
    characters then you might as well either

    a) Substitute non-ASCII characters with their Unicode names from
    http://www.unicode.org/Public/UNIDATA/NamesList.txt

    b) Substitute the hex or base-64 representation of the Unicode code-point.

    Any mnemonic scheme will probably only cope with a tiny subset of the
    "weird characters".

    Would this clever module also be able to represent Sanskrit in file
    names that are mnemonic for Mandarin speakers on computers with
    file-systems where file names are restricted to Big-5 encoded characters?

    --
    RGB
    RedGrittyBrick, Apr 21, 2009
    #12
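
Option (b) above can be sketched in a few lines. The `_xHEX_` format and the helper name `escape_non_ascii` are assumptions made for this example, not an established convention: ASCII passes through unchanged and every other character is replaced by its code point in hexadecimal.

```perl
use strict;
use warnings;
use utf8;                               # assumption: this file is saved as UTF-8

# Hypothetical scheme: keep ASCII as is, write anything else as _xHEX_.
sub escape_non_ascii {
    my ($name) = @_;
    ( my $safe = $name ) =~ s/([^\x00-\x7F])/sprintf('_x%04X_', ord $1)/ge;
    return $safe;
}

# The three names from the post stay distinct:
# stra_x00DF_e.txt, strase.txt, strasse.txt
print escape_non_ascii($_), "\n" for "straße.txt", "strase.txt", "strasse.txt";
```

The result is unambiguous but not mnemonic, and a name that already contains a literal `_x00DF_` would collide with the escaped form unless the underscore is escaped as well.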
  13. bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    >Jürgen Exner wrote:
    >> First of all how would you react, if someone is mangling your name?
    >> There is no "English version" of my first name.

    >
    >But an English speaker might well search for "Jurgen Exner"
    >and hope to find you.


    And my name may come up as the closest hit with a 91% match.

    >Accent folding is a key component of "loose" matching.


    Having a second, closer look you are right. The OP's character set is
    indeed very restricted to just simple accented characters and doesn't
    include any of the more complex or additional characters found in the
    other Latin-X sets.

    jue
    Jürgen Exner, Apr 21, 2009
    #13
  14. Jürgen Exner wrote:
    > "Guy" <> wrote:
    >> I'm sure there are many ways to do this, but is there a much better way?
    >>
    >> $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    >> $word=lc($value);
    >>
    >> I want $word to equal the english version of $value. So if
    >> $value="Théodore", I want $word="theodore".

    >
    > This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
    > am surprised that people still want to make the same mistakes.


    It's not a mistake, if you know what you are doing.

    There are reasons to 'unaccent' a string, e.g. in fault tolerant
    matching, similarity of queries, e-mail accounts like
    ''.

    The 'very bad idea' is the small selection of accented characters.

    > First of all how would you react, if someone is mangling your name?


    Will the name of Russian or Japanese people in a German phone book be in
    Cyrillic or Katakana?

    > There is no "English version" of my first name.


    There are transliterations to ASCII.

    Helmut Wollmersdorfer
    Helmut Wollmersdorfer, Apr 21, 2009
    #14
  15. Guy wrote:
    > I'm sure there are many ways to do this, but is there a much better way?


    > $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
    > $word=lc($value);


    use Text::Undiacritic qw(undiacritic);
    $ascii_string = lc(undiacritic( $value ));

    This is a general solution not restricted to your few characters.

    Helmut Wollmersdorfer
    Helmut Wollmersdorfer, Apr 21, 2009
    #15
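
For readers without that CPAN module installed, a similar effect for many Latin letters can be had with the core Unicode::Normalize module: decompose each character, then delete the combining marks. This is a sketch of that general technique, not a description of how Text::Undiacritic works internally; `strip_marks` is a made-up name.

```perl
use strict;
use warnings;
use utf8;                               # assumption: this file is saved as UTF-8
use Unicode::Normalize qw(NFD);
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper, not part of any module mentioned in the thread.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);        # "é" becomes "e" + U+0301 COMBINING ACUTE ACCENT
    $decomposed =~ s/\p{Mn}+//g;        # drop the nonspacing (combining) marks
    return $decomposed;
}

print lc( strip_marks("Théodore") ), "\n";           # prints "theodore"
print strip_marks("Żmiwór Ściepełkowski"), "\n";     # prints "Zmiwor Sciepełkowski"
```

The second line shows the limit of the approach: letters such as ł, ø and ß have no canonical decomposition, so they pass through unchanged and need an explicit mapping if plain ASCII is required.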
  16. Helmut Wollmersdorfer <> wrote:
    >Jürgen Exner wrote:
    >> First of all how would you react, if someone is mangling your name?

    >
    >Will the name of Russian or Japanese people in a German phone book be in
    >Cyrillic or Katakana?


    They will be as the _person_ wrote them in German characters. That is
    very different from a computer program deciding how to change the name,
    based on some programmer's ideas, whose internationalization expertise
    typically is very questionable.

    I have seen variations of my first name ranging from 'Juergen' and
    'Jurgen' over 'Jrgen' and 'J Rgen' all the way to 'J¼Ãrgen', usually
    because some programmer decided to accept non-ASCII input but then
    didn't deal with it properly.

    >> There is no "English version" of my first name.

    >
    >There are transliterations to ASCII.


    Yes. And to do them properly you need much, much more than a tr///
    command!

    jue
    Jürgen Exner, Apr 21, 2009
    #16
  17. Ted Zlatanov

    Ted Zlatanov Guest

    On Tue, 21 Apr 2009 11:58:00 +0200 (CEST) January Weiner <> wrote:

    JW> You have read a great book by an author called Żmiwór Ściepełkowski. You
    JW> want to look up in your favorite library database what else he has written.
    JW> What do you do?

    JW> a) you enter "Zmiwor Sciepelkowski"
    JW> b) you figure out what these characters are, from what set, and in the
    JW> end you spend half an hour trying to locate "Ś" and other characters (and
    JW> not Ŝ, Š, Ṥ, Ṧ or Ṩ or one of a dozen of other variants, which are, in
    JW> fact, all very different, although they might look very similar).
    ....
    JW> Having a clever module that can uniquely assign various weird characters to
    JW> the basic ASCII set would be a really great thing, and I would be really
    JW> grateful to anyone who could offer a better solution than that of the OP.

    Unicode::Transliterate does at least some of this. It uses the IBM ICU
    project; the ICU documentation section on transforms may be particularly
    useful. For example, see the "Any->Accents" transliteration:

    http://userguide.icu-project.org/transforms/general

    That may be a better solution in the long run, depending on the OP's
    goals, but a simple regex is not a bad thing as long as it's used
    carefully and documented sufficiently.

    Ted
    Ted Zlatanov, Apr 21, 2009
    #17
  18. On 2009-04-21, Jürgen Exner <> wrote:
    >>Will the name of Russian or Japanese people in a German phone book be in
    >>Cyrillic or Katakana?


    > They will be as the _person_ wrote them in German characters.


    LOL! You behave as if you had never visited German bureaucratic
    establishments. E.g., in Soviet times Soviet foreign passports were
    transliterated using a Russian-->French scheme of latinization.

    Anybody who has spent some time in Germany should immediately guess what
    German-issued IDs looked like for people with such passports...

    [But in the US things happen in "your" way - at least if you have
    the intelligence to *ask*...]

    >>There are transliterations to ASCII.


    > Yes. And to do them properly you need much, much more than a tr///
    > command!


    True. But the number of tasks which computers can do "properly" is
    minuscule anyway... Most of the time "good enough" is good enough.
    E.g., in many situations `convert what you can, and replace the rest
    by "_"' is good enough...

    Yours,
    Ilya
    Ilya Zakharevich, Apr 22, 2009
    #18
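
The "convert what you can, and replace the rest" policy can be sketched with the core Unicode::Normalize module plus one tr///. The helper name `to_ascii_lossy` is an assumption for this example, and the result is deliberately lossy.

```perl
use strict;
use warnings;
use utf8;                               # assumption: this file is saved as UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper illustrating "convert what you can, replace the rest".
sub to_ascii_lossy {
    my ($text) = @_;
    my $out = NFD($text);               # split letters from their combining accents
    $out =~ s/\p{Mn}+//g;               # accents that decompose are simply dropped
    $out =~ tr/\x00-\x7F/_/c;           # every remaining non-ASCII character becomes "_"
    return $out;
}

print to_ascii_lossy("Żmiwór Ściepełkowski"), "\n";  # prints "Zmiwor Sciepe_kowski"
```

Whether an underscore in the middle of a name is "good enough" depends on the task: it may do for a file name on an ASCII-only filesystem, but hardly for anything shown back to the person whose name it is.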
  19. bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    >Jürgen Exner wrote:
    >> bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    >>> Jürgen Exner wrote:
    >>>> First of all how would you react, if someone is mangling your name?
    >>>> There is no "English version" of my first name.
    >>> But an English speaker might well search for "Jurgen Exner"
    >>> and hope to find you.

    >>
    >> And my name may come up as the closest hit with a 91% match.
    >>
    >>> Accent folding is a key component of "loose" matching.

    >>
    >> Having a second, closer look you are right. The OP's character set is
    >> indeed very restricted to just simple accented characters and doesn't
    >> include any of the more complex or additional characters found in the
    >> other Latin-X sets.

    >
    >Of course, accent folding only helps searching in a limited context.
    >
    >If you have (e.g.) Japanese, Thai, Arabic data,
    >you're stuffed.


    Not even talking about those but simple Scandinavian, Baltic, and even
    German or Polish letters.

    jue
    Jürgen Exner, Apr 22, 2009
    #19
  20. On 2009-04-22, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:

    > Unicode only helps with representation of data; manipulation
    > is still down to the application.


    I strongly disagree. Unicode has its weak points, but it is still
    incomparably better than any scheme a Joe Xispack would invent
    herself.... Witness the disaster with Emacs Internationalization.

    Just:

    existence of the notion of "Unicode character",

    a possibility of specifying a character unambiguously (with some
    minor hair-splitting needed sometimes, as in o-trema vs o-umlaut, or
    in CJK), and

    having a list of "property" *names* (which is, basically, the
    information about how other people look at individual characters)

    should be, IMO, an enormous help in the design of what you call
    "manipulations". And I did not even touch "tables", i.e., the *values*
    of these properties: it is a major work in itself...

    Yours,
    Ilya
    Ilya Zakharevich, Apr 22, 2009
    #20
