Is there a better way to convert foreign characters?

Guy

I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the english version of $value. So if
$value="Théodore", I want $word="theodore". I'd like to do it in one
statement if possible but I think I have to convert $value in one statement
and then assign it to $word in another statement.

Cheers!
Guy
 
Gunnar Hjalmarsson

Guy said:
I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the english version of $value. So if
$value="Théodore", I want $word="theodore". I'd like to do it in one
statement if possible

( $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
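
A minimal runnable sketch of that one-statement form, assuming the source file is saved as UTF-8 and declares `use utf8` so that tr/// sees characters rather than bytes. The second form relies on the /r modifier, which needs Perl 5.14 or later; `$word_r` is a name made up for the illustration.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

my $value = "Théodore";

# Assign the lowercased copy to $word, then transliterate $word in place.
# $value itself is left untouched.
( my $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;

# Perl 5.14+ only: /r returns the transliterated copy instead of
# modifying its operand.
my $word_r = lc($value) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/r;
```

Both forms still need a second variable; what they save is the separate statement that modifies `$value`.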
 
Jürgen Exner

Guy said:
I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the english version of $value. So if
$value="Théodore", I want $word="theodore".

This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
am surprised that people still want to make the same mistakes.

First of all how would you react, if someone is mangling your name?
There is no "English version" of my first name.

And second there are cases, where your "English version" actually has a
very different meaning, like in Swedish your method would rename Mrs.
Hear into Mrs. Whore. Are you sure you want to do that?

And last UTF-8 is such a nice character set, there is really, really no
excuse any more to not use it. 10 years ago the story was somewhat
different, because many programs didn't support it yet at that time.

jue
 
Ilya Zakharevich

> This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
> am surprised that people still want to make the same mistakes.

People do not necessarily eat bullshit without objection. What a person
wants is A Perfect Idea if *what one wants to do* is exactly this.

And quite often it is.

> First of all how would you react, if someone is mangling your name?
> There is no "English version" of my first name.

Try to tell this to somebody issuing IDs in an English-speaking country...

> And second there are cases, where your "English version" actually has a
> very different meaning, like in Swedish your method would rename Mrs.
> Hear into Mrs. Whore. Are you sure you want to do that?

And what do you propose to do if you need to create a file name on a
filesystem which supports ASCII only?!

> And last UTF-8 is such a nice character set, there is really, really no
> excuse any more to not use it.

You are joking, really? (There is a small set of tasks which allows use
of Unicode; but it is VERY far from being universal...)

[And I even ignore the fact that UTF-8 is not a charset... ;-]

> 10 years ago the story was somewhat
> different, because many programs didn't support it yet at that time.

Try to explain this to my DVD player...

Hope this helps,
Ilya
 
Gunnar Hjalmarsson

Jürgen Exner said:
> This is A Very Bad Idea(TM).

It's probably not the OP's idea, it's just homework.

> First of all how would you react, if someone is mangling your name?
> There is no "English version" of my first name.

Agreed.

> And second there are cases, where your "English version" actually has a
> very different meaning, like in Swedish your method would rename Mrs.
> Hear into Mrs. Whore.

Confirmed.

> And last UTF-8 is such a nice character set, there is really, really no
> excuse any more to not use it.

Well, personally I usually stick to latin1. I suppose all the characters
in the above tr/// are recognized by that charset, which is also kind of
an Internet standard in the Western world, isn't it?
 
sln

It's probably not the OP's idea, it's just homework.


Well, personally I usually stick to latin1. Suppose all the characters
in the above tr/// are recognized by that charset, which is also kind of
Internet standard in the Western world, isn't it?

Ahh... Tweedledee and Tweedledum

-sln
 
Dr.Ruud

Guy said:
I'm sure there are many ways to do this, but is there a much better way?

$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

I want $word to equal the english version of $value. So if
$value="Théodore", I want $word="theodore". I'd like to do it in one
statement if possible but I think I have to convert $value in one statement
and then assign it to $word in another statement.

perl -Mstrict -Mutf8 -MText::Unidecode -wle '
my $s = "àâÀéèëêÉÊçÇîïôÔùû";
print Text::Unidecode::unidecode( $s );
'
aaAeeeeEEcCiioOuu
 
Gunnar Hjalmarsson

Dr.Ruud said:
perl -Mstrict -Mutf8 -MText::Unidecode -wle '
my $s = "àâÀéèëêÉÊçÇîïôÔùû";
print Text::Unidecode::unidecode( $s );
'
aaAeeeeEEcCiioOuu

The purpose of that module is to handle non-Roman characters. What makes
you believe those characters are Unicode?

$ perl -MEncode -le '
$octets = "àâÀéèëêÉÊçÇîïôÔùû";
print "Raw: ", $octets;
print "Latin-1: ", decode "ISO-8859-1", $octets;
print "ANSI: ", decode "Windows-1252", $octets;
'
Raw: àâÀéèëêÉÊçÇîïôÔùû
Latin-1: àâÀéèëêÉÊçÇîïôÔùû
ANSI: àâÀéèëêÉÊçÇîïôÔùû
$
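
Whether such a literal holds characters or raw octets depends on how the source file is encoded and on whether `use utf8` is in effect. A small sketch of the difference, written with explicit escapes so it does not depend on the file encoding. It assumes the input octets are UTF-8, which the thread does not establish.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "é" as the two octets UTF-8 uses for it (assumed input encoding).
my $octets = "\xC3\xA9";

# Decoding with the right encoding yields one character, U+00E9.
my $chars = decode( 'UTF-8', $octets );

# Decoding the same octets as Latin-1 yields two characters (U+00C3 and
# U+00A9); that is how mangled forms of accented names come about.
my $mangled = decode( 'ISO-8859-1', $octets );

printf "octets: %d, decoded: %d, mis-decoded: %d\n",
    length $octets, length $chars, length $mangled;
```

A tr/// or a module such as Text::Unidecode can only map "é" if it sees the single decoded character, not the two octets.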
 
Ilya Zakharevich

The purpose of that module is to handle non-Roman characters. What makes
you believe those characters are Unicode?

Since this code is not in the scope of `use locale', the characters are,
by Perl semantics, in Unicode...

Hope this helps,
Ilya
 
January Weiner

> This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
> am surprised that people still want to make the same mistakes.

Right.

You have read a great book by an author called Żmiwór Ściepełkowski. You
want to look up in your favorite library database what else he has written.
What do you do?

a) you enter "Zmiwor Sciepelkowski"
b) you figure out what these characters are, from what set, and in the
end you spend half an hour trying to locate "Ś" and other characters (and
not Ŝ, Š, Ṥ, Ṧ or Ṩ or one of a dozen other variants, which are, in
fact, all very different, although they might look very similar).

It is easy for you guys from the former Latin 1 set to talk. Latin 1
characters (e.g. umlauts) can be found and easily inserted almost anywhere
and on any system.

Having a clever module that can uniquely assign various weird characters to
the basic ASCII set would be a really great thing, and I would be really
grateful to anyone who could offer a better solution than that of the OP.

> And last UTF-8 is such a nice character set, there is really, really no
> excuse any more to not use it. 10 years ago the story was somewhat
> different, because many programs didn't support it yet at that time.

Only from a (former) Latin-1 perspective.

j.
 
RedGrittyBrick

I'm assuming the goal is something like ensuring straße.txt doesn't
overwrite strase.txt nor strasse.txt on a filesystem where file names
can only use the printable ASCII character repertoire.

You want the target characters to resemble the originals for mnemonic
purposes?

January said:
Having a clever module that can uniquely assign various weird characters to
the basic ASCII set would be a really great thing,

When Unicode has over 100,000 assigned code points and ASCII only 128, I
don't see how you could do this "uniquely" on a single character to
single character basis.

If you are replacing single Unicode characters with multiple ASCII
characters then you might as well either

a) Substitute non-ASCII characters with their Unicode names from
http://www.unicode.org/Public/UNIDATA/NamesList.txt

b) Substitute the hex or base-64 representation of the Unicode code-point.
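
Option b) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `escape_name` and `unescape_name` and the `%HEX;` escape format are invented here, and characters that a particular filesystem reserves (such as "/") are not handled.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical helper: replace "%" and every character outside printable
# ASCII with "%" + hex code point + ";", e.g. U+00DF becomes "%DF;".
sub escape_name {
    my ($name) = @_;
    $name =~ s/([^\x20-\x7E]|%)/sprintf '%%%X;', ord $1/ge;
    return $name;
}

# Hypothetical inverse: turn each "%HEX;" group back into its character.
sub unescape_name {
    my ($name) = @_;
    $name =~ s/%([0-9A-F]+);/chr hex $1/ge;
    return $name;
}

my $escaped = escape_name("straße.txt");
```

Because "%" itself is escaped, the mapping is reversible, so straße.txt, strase.txt and strasse.txt all get distinct names. The price is that the result is no longer mnemonic.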

Any mnemonic scheme will probably only cope with a tiny subset of the
"weird characters".

Would this clever module also be able to represent Sanskrit in file
names that are mnemonic for Mandarin speakers on computers with
file-systems where file names are restricted to Big-5 encoded characters?
 
Jürgen Exner

bugbear said:
But an English speaker might well search for "Jurgen Exner"
and hope to find you.

And my name may come up as the closest hit with a 91% match.
Accent folding is a key component of "loose" matching.
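
For loose matching specifically, one option that avoids rewriting the stored name at all is to compare at a reduced collation strength. A minimal sketch using the core module Unicode::Collate, assuming the source file is saved as UTF-8. The variable names are illustrative, and this is not what anyone in the thread proposed.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8
use Unicode::Collate;

# Level 1 compares base letters only, ignoring accents and case.
my $loose  = Unicode::Collate->new( level => 1 );
# The default level also distinguishes accents and case.
my $strict = Unicode::Collate->new;

my $loose_hit  = $loose->eq( "Jürgen Exner", "jurgen exner" );
my $strict_hit = $strict->eq( "Jürgen Exner", "jurgen exner" );
```

Here `$loose_hit` is true and `$strict_hit` is false. A spelling such as "Juergen" would still not match; that needs language-specific transliteration rules.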

Having a second, closer look you are right. The OPs character set is
indeed very restricted to just simple accented characters and doesn't
include any of the more complex or additional characters found in the
other Latin-X sets.

jue
 
Helmut Wollmersdorfer

Jürgen Exner said:
> This is A Very Bad Idea(TM). We had those discussions 10 years ago and I
> am surprised that people still want to make the same mistakes.

It's not a mistake, if you know what you are doing.

There are reasons to 'unaccent' a string, e.g. in fault tolerant
matching, similarity of queries, e-mail accounts like
'(e-mail address removed)'.

The 'very bad idea' is the small selection of accented characters.

> First of all how would you react, if someone is mangling your name?

Will the name of Russian or Japanese people in a German phone book be in
Cyrillic or Katakana?

> There is no "English version" of my first name.

There are transliterations to ASCII.

Helmut Wollmersdorfer
 
Helmut Wollmersdorfer

Guy said:
I'm sure there are many ways to do this, but is there a much better way?
$value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
$word=lc($value);

use Text::Undiacritic qw(undiacritic);
$ascii_string = lc(undiacritic( $value ));

This is a general solution not restricted to your few characters.
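
For readers without that CPAN module, a rough approximation of general accent stripping using only core modules: decompose each character (NFD) and delete the combining marks. This is a swapped-in technique shown as a sketch, not a description of how Text::Undiacritic works internally; `strip_marks` is a hypothetical helper name, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8
use Unicode::Normalize qw(NFD);

# Hypothetical helper: drop combining marks after canonical decomposition.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);    # "é" becomes "e" followed by U+0301
    $decomposed =~ s/\p{Mn}+//g;    # remove the nonspacing marks
    return $decomposed;
}

my $ascii_string = lc strip_marks("Théodore");
```

Letters that have no canonical decomposition, such as "ł", "ø" or "ß", pass through unchanged, so the result is not guaranteed to be ASCII.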

Helmut Wollmersdorfer
 
Jürgen Exner

Helmut Wollmersdorfer said:
> Will the name of Russian or Japanese people in a German phone book be in
> Cyrillic or Katakana?

They will be as the _person_ wrote them in German characters. That is
very different from a computer program deciding how to change the name,
based on the ideas of some programmer whose internationalization expertise
typically is very questionable.

I have seen variations of my first name ranging from 'Juergen' and
'Jurgen' over 'Jrgen' and 'J Rgen' all the way to 'J¼Ãrgen', usually
because some programmer decided to accept non-ASCII input but then
didn't deal with it properly.

> There are transliterations to ASCII.

Yes. And to do them properly you need much, much more than a tr///
command!

jue
 
Ted Zlatanov

JW> You have read a great book by an author called Żmiwór Ściepełkowski. You
JW> want to look up in your favorite library database what else he has written.
JW> What do you do?

JW> a) you enter "Zmiwor Sciepelkowski"
JW> b) you figure out what these characters are, from what set, and in the
JW> end you spend half an hour trying to locate "Ś" and other characters (and
JW> not Ŝ, Š, Ṥ, Ṧ or Ṩ or one of a dozen of other variants, which are, in
JW> fact, all very different, although they might look very similar).
....
JW> Having a clever module that can uniquely assign various weird characters to
JW> the basic ASCII set would be a really great thing, and I would be really
JW> grateful to anyone who could offer a better solution than that of the OP.

Unicode::Transliterate does at least some of this. It uses the IBM ICU
project; the ICU documentation section on transforms may be particularly
useful. For example, see the "Any->Accents" transliteration:

http://userguide.icu-project.org/transforms/general

That may be a better solution in the long run, depending on the OP's
goals, but a simple regex is not a bad thing as long as it's used
carefully and documented sufficiently.

Ted
 
Ilya Zakharevich

> They will be as the _person_ wrote them in German characters.

LOL! You behave as if you had never visited German bureaucratic
establishments. E.g., in Soviet times, Soviet foreign passports were
transliterated using a Russian-->French scheme of latinization.

Anybody who has spent some time in Germany should immediately guess what
German-issued IDs looked like for people with such passports...

[But in the US things happen in "your" way - at least if you have
the intelligence to *ask*...]

> Yes. And to do them properly you need much, much more than a tr///
> command!

True. But the number of tasks which computers can do "properly" is
minuscule anyway... Most of the time "good enough" is good enough.
E.g., in many situations `convert what you can, and replace the rest
by "_"' is good enough...
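
A minimal sketch of that "good enough" approach, reusing the OP's character table (with case preserved) as the "convert what you can" step. The helper name `to_ascii_lossy` is hypothetical, and the source file is assumed to be saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;    # assumption: this source file is saved as UTF-8

# Hypothetical helper: lossy, one-way conversion to ASCII.
sub to_ascii_lossy {
    my ($text) = @_;

    # Step 1: convert what the table knows about.
    $text =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaAeeeeEEcCiioOuu/;

    # Step 2: replace every remaining non-ASCII character by "_".
    $text =~ s/[^\x00-\x7F]/_/g;

    return $text;
}

my $name = to_ascii_lossy("Théodore Żmiwór");
```

With this table, "Théodore" survives intact while "Żmiwór" loses two letters to "_", which shows how much the quality of the result depends on the size of the table.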

Yours,
Ilya
 
Jürgen Exner

bugbear said:
Of course, accent folding only helps searching in a limited context.

If you have (e.g.) Japanese, Thai, Arabic data,
you're stuffed.

Not even talking about those, but simple Scandinavian, Baltic, and even
German or Polish letters.

jue
 
Ilya Zakharevich

Unicode only helps with representation of data; manipulation
is still down to the application.

I strongly disagree. Unicode has its weak points, but it is still
incomparably better than any scheme a Joe Xispack would invent
herself.... Witness the disaster with Emacs Internationalization.

Just:

a) existence of the notion of "Unicode character",

b) a possibility of specifying a character unambiguously (with some
minor hair-splitting needed sometimes, as in o-trema vs o-umlaut, or
in CJK), and

c) having a list of "property" *names* (which is, basically, the
information about how other people look at individual characters)

should be, IMO, an enormous help in the design of what you call
"manipulations". And I did not even touch "tables", i.e., the *values*
of these properties: it is a major work in itself...

Yours,
Ilya
 
