Is there a better way to convert foreign characters?

Guy · Apr 19, 2009

I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the english version of $value. So if
$value="Théodore", I want $word="theodore". I'd like to do it in one
statement if possible but I think I have to convert $value in one statement
and then assign it to $word in another statement.

Cheers!
Guy

Gunnar Hjalmarsson · Apr 19, 2009

Guy said:
I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the english version of $value. So if
$value="Théodore", I want $word="theodore". I'd like to do it in one
statement if possible

( $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
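
For anyone trying this at home, here is a minimal sketch of the same one-statement idiom as a complete script. It assumes the source file is saved as UTF-8 and that the string holds decoded characters (hence `use utf8`), which the posts above do not state. Because `lc` runs first, the upper-case entries in the tr/// list are no longer needed, so the list here is shortened.

```perl
use strict;
use warnings;
use utf8;                            # assumption: this file is saved as UTF-8
binmode STDOUT, ':encoding(UTF-8)';

my $value = "Théodore";

# Copy the lower-cased string into $word, then transliterate $word in place.
# The parentheses make tr/// act on $word; $value is left untouched.
( my $word = lc $value ) =~ tr/àâéèëêçîïôùû/aaeeeeciiouu/;

print "$word\n";
```

Without `use utf8` (or an explicit decode), tr/// would see the individual bytes of each UTF-8 sequence rather than characters, so the two lists would no longer line up.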

Guy · Apr 19, 2009

Gunnar Hjalmarsson said:
( $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;

Perfect, thank you muchly!
Guy

Jürgen Exner · Apr 19, 2009

Guy said:
I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the english version of $value. So if
$value="Théodore", I want $word="theodore".

This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
am surprised that people still want to make the same mistakes.

First of all how would you react, if someone is mangling your name?
There is no "English version" of my first name.

And second there are cases, where your "English version" actually has a
very different meaning, like in Swedish your method would rename Mrs.
Hear into Mrs. Whore. Are you sure you want to do that?

And last UTF-8 is such a nice character set, there is really, really no
excuse any more to not use it. 10 years ago the story was somewhat
different, because many programs didn't support it yet at that time.

jue

Ilya Zakharevich · Apr 20, 2009

This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
am surprised that people still want to make the same mistakes.

People do not necessarily eat bullshit without objection. What a person
wants is A Perfect Idea if *what one wants to do* is exactly this.

And quite often it is.

First of all how would you react, if someone is mangling your name?

There is no "English version" of my first name.

Try to tell this to somebody issuing IDs in an English-speaking country...

And second there are cases, where your "English version" actually has a
very different meaning, like in Swedish your method would rename Mrs.
Hear into Mrs. Whore. Are you sure you want to do that?

And what do you propose to do if you need to create a file name on a
filesystem which supports ASCII only?!

And last UTF-8 is such a nice character set, there is really, really no
excuse any more to not use it.

You are joking, really? (There is a small set of tasks which allows use
of Unicode; but it is VERY far from being universal...)

[And I even ignore the fact that UTF-8 is not a charset... ;-]

10 years ago the story was somewhat
different, because many programs didn't support it yet at that time.

Try to explain this to my DVD player...

Hope this helps,
Ilya

Gunnar Hjalmarsson · Apr 20, 2009

Jürgen Exner said:
This is A Very Bad Idea(TM).

It's probably not the OP's idea, it's just homework.

First of all how would you react, if someone is mangling your name?
There is no "English version" of my first name.
Agreed.

And second there are cases, where your "English version" actually has a
very different meaning, like in Swedish your method would rename Mrs.
Hear into Mrs. Whore.
Confirmed.

And last UTF-8 is such a nice character set, there is really, really no
excuse any more to not use it.

Well, personally I usually stick to latin1. Suppose all the characters
in the above tr/// are recognized by that charset, which is also kind of
Internet standard in the Western world, isn't it?

sln · Apr 20, 2009

It's probably not the OP's idea, it's just homework.

Well, personally I usually stick to latin1. Suppose all the characters
in the above tr/// are recognized by that charset, which is also kind of
Internet standard in the Western world, isn't it?

Ahh... Tweedledee and Tweedledum

-sln

Dr.Ruud · Apr 20, 2009

Guy said:
I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the english version of $value. So if
$value="Théodore", I want $word="theodore". I'd like to do it in one
statement if possible but I think I have to convert $value in one statement
and then assign it to $word in another statement.

perl -Mstrict -Mutf8 -MText::Unidecode -wle '
my $s = "àâÀéèëêÉÊçÇîïôÔùû";
print Text::Unidecode::unidecode( $s );
'
aaAeeeeEEcCiioOuu

Gunnar Hjalmarsson · Apr 20, 2009

Dr.Ruud said:
perl -Mstrict -Mutf8 -MText::Unidecode -wle '
my $s = "àâÀéèëêÉÊçÇîïôÔùû";
print Text::Unidecode::unidecode( $s );
'
aaAeeeeEEcCiioOuu

The purpose of that module is to handle non-Roman characters. What makes
you believe those characters are Unicode?

$ perl -MEncode -le '
$octets = "àâÀéèëêÉÊçÇîïôÔùû";
print "Raw: ", $octets;
print "Latin-1: ", decode "ISO-8859-1", $octets;
print "ANSI: ", decode "Windows-1252", $octets;
'
Raw: àâÀéèëêÉÊçÇîïôÔùû
Latin-1: àâÀéèëêÉÊçÇîïôÔùû
ANSI: àâÀéèëêÉÊçÇîïôÔùû
$

Ilya Zakharevich · Apr 21, 2009

The purpose of that module is to handle non-Roman characters. What makes
you believe those characters are Unicode?

Since this code is not in scope of `use locale', the characters are,
by Perl semantic, in Unicode...

Hope this helps,
Ilya

January Weiner · Apr 21, 2009

This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
am surprised that people still want to make the same mistakes.

Right.

You have read a great book by an author called Żmiwór Ściepełkowski. You
want to look up in your favorite library database what else he has written.
What do you do?

a) you enter "Zmiwor Sciepelkowski"
b) you figure out what these characters are, from what set, and in the
end you spend half an hour trying to locate "Ś" and other characters (and
not Ŝ, Š, Ṥ, Ṧ or Ṩ or one of a dozen of other variants, which are, in
fact, all very different, although they might look very similar).

You guys from the former Latin 1 set have it easy talking. Latin 1
characters (e.g. umlauts) can be found and easily inserted almost anywhere
and on any system.

Having a clever module that can uniquely assign various weird characters to
the basic ASCII set would be a really great thing, and I would be really
grateful to anyone who could offer a better solution than that of the OP.

And last UTF-8 is such a nice character set, there is really, really no
excuse any more to not use it. 10 years ago the story was somewhat
different, because many programs didn't support it yet at that time.

Only from a (former) Latin-1 perspective.

j.

RedGrittyBrick · Apr 21, 2009

I'm assuming the goal is something like ensuring straße.txt doesn't
overwrite strase.txt nor strasse.txt on a filesystem where file names
can only use the printable ASCII character repertoire.

You want the target characters to resemble the originals for mnemonic
purposes?

January said:
Having a clever module that can uniquely assign various weird characters to
the basic ASCII set would be a really great thing,

When Unicode has over 100,000 assigned code points and ASCII only 128 I
don't see how you could do this "uniquely" on a single character to
single character basis.

If you are replacing single Unicode characters with multiple ASCII
characters then you might as well either

a) Substitute non-ASCII characters with their Unicode names from
http://www.unicode.org/Public/UNIDATA/NamesList.txt

b) Substitute the hex or base-64 representation of the Unicode code-point.
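
Option (b) can be sketched in a few lines. The helper name `escape_non_ascii` and the `_XXXX_` escape format are made up for illustration; the sketch assumes the input is a decoded character string.

```perl
use strict;
use warnings;
use utf8;                            # assumption: this file is saved as UTF-8

# Hypothetical helper: replace every non-ASCII character with its
# code point in hex, wrapped in underscores.
sub escape_non_ascii {
    my ($name) = @_;
    $name =~ s/([^\x00-\x7F])/sprintf('_%04X_', ord $1)/ge;
    return $name;
}

# The three names from the example above stay distinct.
print escape_non_ascii($_), "\n" for 'straße.txt', 'strase.txt', 'strasse.txt';
```

The result is unambiguous only as long as the escape format cannot occur in an original name; a literal "_00DF_" in the input would collide, so a real scheme would also have to escape the underscore itself.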

Any mnemonic scheme will probably only cope with a tiny subset of the
"weird characters".

Would this clever module also be able to represent Sanskrit in file
names that are mnemonic for Mandarin speakers on computers with
file-systems where file names are restricted to Big-5 encoded characters?

Jürgen Exner · Apr 21, 2009

bugbear said:
But an English speaker might well search for "Jurgen Exner"
and hope to find you.

And my name may come up as the closest hit with a 91% match.

Accent folding is a key component of "loose" matching.

Having a second, closer look you are right. The OP's character set is
indeed very restricted to just simple accented characters and doesn't
include any of the more complex or additional characters found in the
other Latin-X sets.

jue

Helmut Wollmersdorfer · Apr 21, 2009

Jürgen Exner said:
This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
am surprised that people still want to make the same mistakes.

It's not a mistake, if you know what you are doing.

There are reasons to 'unaccent' a string, e.g. in fault tolerant
matching, similarity of queries, e-mail accounts like
'(e-mail address removed)'.

The 'very bad idea' is the small selection of accented characters.

First of all how would you react, if someone is mangling your name?

Will the name of Russian or Japanese people in a German phone book be in
Cyrillic or Katakana?

There is no "English version" of my first name.

There are transliterations to ASCII.

Helmut Wollmersdorfer

Helmut Wollmersdorfer · Apr 21, 2009

Guy said:
I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

use Text::Undiacritic qw(undiacritic);
$ascii_string = lc(undiacritic( $value ));

This is a general solution not restricted to your few characters.
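
If installing a CPAN module is not an option, a rough equivalent can be sketched with the core Unicode::Normalize module: decompose each character and drop the combining marks. This is not how Text::Undiacritic is documented to work internally; it is a swapped-in technique, and the helper name `strip_marks` is made up for illustration.

```perl
use strict;
use warnings;
use utf8;                            # assumption: this file is saved as UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: canonical decomposition splits "é" into "e" plus a
# combining acute accent; deleting the nonspacing marks leaves the base letter.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

my $word = lc strip_marks("Théodore");
print "$word\n";
```

Letters that have no canonical decomposition, such as ß, ł or ø, pass through unchanged, so this covers accents only and is not a full transliteration.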

Helmut Wollmersdorfer

Jürgen Exner · Apr 21, 2009

Helmut Wollmersdorfer said:
Will the name of Russian or Japanese people in a German phone book be in
Cyrillic or Katakana?

They will be as the _person_ wrote them in German characters. That is
very different from a computer program deciding how to change the name,
based on some programmer's ideas, whose internationalization expertise
typically is very questionable.

I have seen variations of my first name ranging from 'Juergen' and
'Jurgen' over 'Jrgen' and 'J Rgen' all the way to 'J¼Ãrgen', usually
because some programmer decided to accept non-ASCII input but then
didn't deal with it properly.

There are transliterations to ASCII.

Yes. And to do them properly you need much, much more than a tr///
command!

jue

Ted Zlatanov · Apr 21, 2009

JW> You have read a great book by an author called Żmiwór Ściepełkowski. You
JW> want to look up in your favorite library database what else he has written.
JW> What do you do?

JW> a) you enter "Zmiwor Sciepelkowski"
JW> b) you figure out what these characters are, from what set, and in the
JW> end you spend half an hour trying to locate "Ś" and other characters (and
JW> not Ŝ, Š, Ṥ, Ṧ or Ṩ or one of a dozen of other variants, which are, in
JW> fact, all very different, although they might look very similar).
....
JW> Having a clever module that can uniquely assign various weird characters to
JW> the basic ASCII set would be a really great thing, and I would be really
JW> grateful to anyone who could offer a better solution than that of the OP.

Unicode::Transliterate does at least some of this. It uses the IBM ICU
project; the ICU documentation section on transforms may be particularly
useful. For example, see the "Any->Accents" transliteration:

http://userguide.icu-project.org/transforms/general

That may be a better solution in the long run, depending on the OP's
goals, but a simple regex is not a bad thing as long as it's used
carefully and documented sufficiently.

Ted

Ilya Zakharevich · Apr 22, 2009

They will be as the _person_ wrote them in German characters.

LOL! You behave as if never visited German bureaucratic
establishments. E.g., in Soviet time Soviet foreign passports were
transliterated using a Russian-->French scheme of latinization.

Anybody who spent some time in Germany should immediately guess how
German-issued IDs looked like for people with such passports...

[But in US things happen in "your" way - at least if you have
intelligence to *ask*...]

Yes. And to do them properly you need much, much more than a tr///
command!

True. But the number of tasks which computers can do "properly" is
minuscule anyway... Most of the time "good enough" is good enough.
E.g., in many situations `convert what you can, and replace the rest
by "_"' is good enough...
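
A minimal sketch of that "convert what you can, replace the rest" policy, using the core Unicode::Normalize module. The helper name `ascii_or_underscore` is made up for illustration, and "what you can" is taken here to mean only characters that decompose into an ASCII letter plus combining marks.

```perl
use strict;
use warnings;
use utf8;                            # assumption: this file is saved as UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: strip accents where possible, then replace
# every character that is still outside ASCII with "_".
sub ascii_or_underscore {
    my ($text) = @_;
    my $out = NFD($text);            # split letters from their accents
    $out =~ s/\p{Mn}//g;             # drop the combining marks
    $out =~ s/[^\x00-\x7F]/_/g;      # whatever is left over becomes "_"
    return $out;
}

print ascii_or_underscore("Żmiwór Ściepełkowski"), "\n";   # Zmiwor Sciepe_kowski
```

The "ł" ends up as "_" because it has no canonical decomposition; a longer hand-written table, or one of the modules mentioned in this thread, would be needed to map it to "l".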

Yours,
Ilya

Jürgen Exner · Apr 22, 2009

bugbear said:
Of course, accent folding only helps searching in a limited context.

If you have (e.g.) Japanese, Thai, Arabic data,
you're stuffed.

Not even talking about those but simple Scandinavian, Baltic, and even
German or Polish letters.

jue

Ilya Zakharevich · Apr 22, 2009

Unicode only helps with representation of data; manipulation
is still down to the application.

I strongly disagree. Unicode has its weak points, but it is still
incomparably better than any scheme a Joe Xispack would invent
herself.... Witness the disaster with Emacs Internationalization.

Just:

existence of the notion of "Unicode character",

a possibility of specifying a character unambiguously (with some
minor hair-splitting needed sometimes, as in o-trema vs o-umlaut, or
in CJK), and

having a list of "property" *names* (which is, basically, the
information about how other people look at individual characters)

should be, IMO, an enormous help in the design of what you call
"manipulations". And I did not even touch "tables", i.e., the *values*
of these properties: it is a major work in itself...

Yours,
Ilya
