Is there a better way to convert foreign characters?


sln

Ilya Zakharevich wrote:
> I strongly disagree. Unicode has its weak points, but it is still
> incomparably better than any scheme a Joe Xispack would invent
> themselves... Witness the disaster with Emacs Internationalization.
>
> Just:
>
> the existence of the notion of a "Unicode character",
>
> a possibility of specifying a character unambiguously (with some
> minor hair-splitting needed sometimes, as in o-trema vs o-umlaut, or
> in CJK), and
>
> having a list of "property" *names* (which is, basically, the
> information about how other people look at individual characters)
>
> should be, IMO, an enormous help in the design of what you call
> "manipulations". And I did not even touch "tables", i.e., the *values*
> of these properties: it is a major work in itself...
>
> Yours,
> Ilya

Unicode is a nightmare. Encoding 1-6 bytes (or more) to represent the
whole range of possible multiple code rendering(s) of character(s) of all
the languages in the world is just out of control.

Internal data manipulation is a nightmare, a hog, and slow as hell.
Is it a byte, a word, int or more? 0 .. (2**32-1) or more! Optimizations?
Encode/Decoding, back and forth. Just a nightmare. And what is it, what
is the encoding of that? Dunno, take a guess! "L,that sucks man!";

Unicode, the expression of everything that does nothing (good).

-sln
 

Helmut Wollmersdorfer

Ilya Zakharevich wrote:
[Unicode]
> a possibility of specifying a character unambiguously (with some
> minor hair-splitting needed sometimes, as in o-trema vs o-umlaut, or
> in CJK), and

.... though it cannot decompose 'overlay diacritics' like l-stroke or o-stroke.

> having a list of "property" *names* (which is, basically, the
> information about how other people look at individual characters)

E.g. it lets you distinguish 'confusables' like Cyrillic A versus Latin A.

> should be, IMO, an enormous help in the design of what you call
> "manipulations". And I did not even touch "tables", i.e., the *values*
> of these properties: it is a major work in itself...

Of course. Matching Unicode properties may be slow, but it's far better
than maintaining a table myself (for a language or script I do not know).

Also, the tables in the Unicode locale data are a great work - still very
incomplete in places (e.g. transliteration), but they save time on many
other topics.

Helmut Wollmersdorfer
 

Helmut Wollmersdorfer

> Unicode is a nightmare.

Writing systems of the world are a nightmare. Unicode just documents them.

> Encoding 1-6 bytes (or more) to represent the
> whole range of possible multiple code rendering(s) of character(s) of all
> the languages in the world is just out of control.

Unicode defines a character *set*, not a character *encoding*.

> Internal data manipulation is a nightmare, a hog, and slow as hell.

I disagree. Maybe slow, if you use property matching in Perl 5.

> Is it a byte, a word, int or more? 0 .. (2**32-1) or more!

It is a character - nice to handle in Perl 5.8.

> Encode/Decoding, back and forth.

Every system needs to encode/decode between the internal and external
representation of characters if the encodings differ. If your Perl programs
are well designed, you need it in just one place: the open statement.

> Just a nightmare. And what is it, what
> is the encoding of that?

That's the problem which Unicode helps to solve. There are hundreds of
non-Unicode encodings in the wild, some very exotic like
7-bit-ASCII-German, some undocumented.

> Unicode, the expression of everything that does nothing (good).

It's your responsibility to use it in a good or a bad way.

Helmut Wollmersdorfer
 

Tim McDaniel

> ( $word = lc $value ) =~ tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;

I don't combine s///, tr///, or chomp with assignments -- personal
idiom and I'm not familiar with the Perl effects. The above assigns
the lowercase translation of $value to $word, and then does a tr/// on
$word, right? Then there should be no need for the capitalized
characters in the tr///, because there shouldn't be any to match.

I agree with the other posters who suggest using standard modules,
like Undiacritical or whatever it was.
 

Guy

Guy said:
> I'm sure there are many ways to do this, but is there a much better way?
>
> $value=~tr/àâÀéèëêÉÊçÇîïôÔùû/aaaeeeeeecciioouu/;
> $word=lc($value);
>
> I want $word to equal the English version of $value. So if
> $value="Théodore", I want $word="theodore". I'd like to do it in one
> statement if possible, but I think I have to convert $value in one
> statement and then assign it to $word in another statement.
>
> Cheers!
> Guy

Just to explain a little: I have a few hundred old pictures of this city
from 1900 to 1940, when the city was just a town of about 100 houses. I
want to allow the local population to search through the photos, perhaps
find their grandparents or even great-grandparents. Like today, many of
the folks back then were French, with names like "Roméo" or "Théodore".
Despite this, most people here have English keyboards, and I suspect that
many don't even know how to type French characters like "é". Therefore, I
suspect that people will just search for "Theodore" or "Romeo", perhaps
in lowercase too, such as "theodore" or "romeo". The names are just English
and French, nothing else, at least not in this project. Thanks for all,
Guy
 

Peter J. Holzer

> Finding out the effects is trivial, isn't it?

Yes. So you do know about the effects, after all. ;-)

> That's true only if a suitable locale is enabled.

Or if the $value is a character string.

> If a programmer wants to do that kind of transliteration, there is a
> great chance that s/he doesn't care about any kind of i18n or l10n.

The simple fact that he does specifically operate on accented characters
shows that he *does* care.

If $value is a byte string and no locale is in effect, lc on a non-ASCII
string is poorly defined. If the string is in a multi-byte encoding lc
might convert a byte which happens to be part of a character, which is
almost certainly wrong. Also, tr almost certainly doesn't work as
intended.

In a single-byte encoding which is a superset of ASCII (e.g. ISO-8859-X)
the code works, because lc is a noop on all accented characters. But I
still think this is unclean. You should convert to ASCII first and then
case-fold.

(of course I really think you should use character strings if you do
operations on characters, and not muck around with byte strings)

hp
 

Tim McDaniel

> In a single-byte encoding which is a superset of ASCII
> (e.g. ISO-8859-X) the code works, because lc is a noop on all
> accented characters.

!?!?! So, even if the locale is set to Latin-1, lc('A') produces 'a',
but lc([A with acute accent]) is [A with acute accent]?! What sort of
nonsense is that?
 

Gunnar Hjalmarsson

Peter said:
> Or if the $value is a character string.

Hmm.. Yes, so it seems. I wasn't aware of that.

> (of course I really think you should use character strings if you do
> operations on characters, and not muck around with byte strings)

So far, I haven't bothered with encoding/decoding when I have been
working with Latin-1. Are you saying that encoding/decoding is advisable
even if you are not dealing with UTF-8 or some other encoding with wide
characters?
 

Gunnar Hjalmarsson

Tim said:
>> In a single-byte encoding which is a superset of ASCII
>> (e.g. ISO-8859-X) the code works, because lc is a noop on all
>> accented characters.
>
> !?!?! So, even if the locale is set to Latin-1, lc('A') produces 'a',
> but lc([A with acute accent]) is [A with acute accent]?!

No.

$ perl -MPOSIX -le '
setlocale LC_CTYPE, "sv_SE.iso88591";
print lc "ÀÉÊÇÔ";
use locale;
print lc "ÀÉÊÇÔ";
'
ÀÉÊÇÔ
àéêçô
$

But there was no playing with locales in the code we were discussing.

> What sort of nonsense is that?

Quoting out of context?
 

Jürgen Exner

> !?!?! So, even if the locale is set to Latin-1, lc('A') produces 'a',
> but lc([A with acute accent]) is [A with acute accent]?! What sort of
> nonsense is that?

Aside from the other replies, please keep in mind that for some letters
there is no upper-case or lower-case equivalent letter.
Just one example is the German sharp s: ß, which never occurs at
the beginning of a word and, if capitalized in an all-uppercase word,
is written as a double S: SS.
There are also cases where two lower-case letters map to the
same upper-case letter. How do you map the upper-case letter back to
lower case without knowing the context, i.e. the word it is used in?

jue
 

Ilya Zakharevich

> If $value is a byte string and no locale is in effect, lc on a non-ASCII
> string is poorly defined.

??? In the absence of `use locale', lc should convert to lower case using
the Unicode case-conversion tables. What is "poorly defined" in this semantic?

> In a single-byte encoding which is a superset of ASCII (e.g. ISO-8859-X)
> the code works, because lc is a noop on all accented characters.

What exactly do you mean here?

perl -Mcharnames=latin -wle "print qq(\N{AE})"
Æ
perl -Mcharnames=latin -wle "print lc qq(\N{AE})"
æ

I must be missing something...

Yours,
Ilya
 

Ilya Zakharevich

> Peter is talking about byte strings. \N produces utf8 strings, even if
> it didn't need to.

There is no such thing as "byte strings" or "utf8 strings". Strings
are strings...

However, there are such things as bugs in perl:

perl -Mcharnames=latin -wle "print ord chr ord qq(\N{AE})"
198
perl -Mcharnames=latin -wle "print ord lc chr ord qq(\N{AE})"
198

Just a bug,
Ilya
 

Peter J. Holzer

>> In a single-byte encoding which is a superset of ASCII
>> (e.g. ISO-8859-X) the code works, because lc is a noop on all
>> accented characters.
>
> !?!?! So, even if the locale is set to Latin-1, lc('A') produces 'a',
> but lc([A with acute accent]) is [A with acute accent]?!

Gunnar and I were specifically talking about the case where *no* locale
is active.

> What sort of nonsense is that?

The sort of nonsense you when when you don't read postings carefully
enough.

hp
 

Peter J. Holzer

> Peter said:
>> Tim McDaniel wrote: [after calling lc]
>>>> Then there should be no need for the capitalized characters in the
>>>> tr///, because there shouldn't be any to match.
>>>
>>> That's true only if a suitable locale is enabled.
>>
>> Or if the $value is a character string.
>
> Hmm.. Yes, so it seems. I wasn't aware of that.
>
>> (of course I really think you should use character strings if you do
>> operations on characters, and not muck around with byte strings)
>
> So far, I haven't bothered with encoding/decoding when I have been
> working with Latin-1. Are you saying that encoding/decoding is advisable
> even if you are not dealing with UTF-8 or some other encoding with wide
> characters?

Yes.

* Perl knows that a character string is a character string, so matching
  against character classes, lc, uc, etc. works automatically.
* You don't have to care about the encoding within your program.
  Only for I/O do you have to decode/encode, and that can usually be done
  with an I/O layer, so all the encoding-specific stuff is centralized
  in one place: where the file is opened.

For me the rule of thumb is:

* When you read character data from an external source, decode it
  immediately. If there is an automatic way to do that (I/O layer,
  option for the DBD, etc.), use it.
* When you write character data to an external sink, encode it as
  late as possible. Again, use an automatic way if there is one.

Then, within my program, I know that all character data is in character
strings and everything "just works", whether the data came from a Latin-1
file or a UTF-8 file or a database in Big-5. And all the byte data (e.g.
blobs, images, etc.) is in byte strings, and that also just works.

hp
 

Peter J. Holzer

> ??? In the absence of `use locale', lc should convert to lower case using
> the Unicode case-conversion tables.

For a byte string? No, it doesn't, and I think it shouldn't (I know some
people disagree on the latter).

> What is "poorly defined" in this semantic?

You don't know whether an octet is a character. For example, IIRC the
ISO-2022-JP encoding uses octets in the range 0x20-0x7F in multi-byte
sequences. If you apply lc (without locale information) to an
ISO-2022-JP encoded string, it will blindly replace all octets 0x41-0x5A with
0x61-0x7A, thereby replacing Japanese characters with completely
unrelated Japanese characters.

> What exactly do you mean here?
>
> æ

bernon:~/tmp 12:27 :) 117% perl -CO -wle 'print qq(\x{C6})'
Æ
bernon:~/tmp 12:27 :) 118% perl -CO -wle 'print lc qq(\x{C6})'
Æ

hp
 

Peter J. Holzer

> There is no such thing as "byte strings" or "utf8 strings".

There is. You may wish that this wasn't the case, but that's just wishful
thinking. There *are* differences between byte strings and character
strings. These differences are documented. So they exist in both "perl"
and "Perl".

hp
 

Peter J. Holzer

> The sort of nonsense you when when you don't read postings carefully
> enough.

And this sentence is the sort of nonsense you get when you press the
send button before proof-reading :-(. s/when/get/

hp
 

Ilya Zakharevich

> For a byte-string? No, it doesn't, and I think it shouldn't

I'd like to hear the logic behind this...

> You don't know whether an octet is a character.

If lc is applied to a string, it consists of characters.

> For example, IIRC the ISO-2022-JP encoding uses octets in the range
> 0x20-0x7F in multi-byte encodings. If you apply lc (without locale
> information)

So do not... You do not apply lc() to gzipped strings, right? 1/3 ;-)

Yours,
Ilya
 

Gunnar Hjalmarsson

Peter said:
> Yes.
>
> * Perl knows that a character string is a character string. So matching
>   against character classes, lc, uc, etc. works automatically.
> * You don't have to care about the encoding within your program.
>   Only for I/O you have to decode/encode, and that can usually be done
>   with an I/O-layer. So all the encoding-specific stuff is centralized
>   in one place: Where the file is opened.
>
> For me the rule of thumb is
>
> * When you read character data from an external source, decode it
>   immediately. If there is an automatic way to do that (I/O layer,
>   option for the DBD, etc.) use that.
> * When you write character data to an external source, encode it as
>   late as possible. Again, use an automatic way if there is one.
>
> Then, within my program, I know that all character data is in character
> strings and everything "just works" whether the data came from a latin-1
> file or a utf-8 file or database in big-5. And all the byte data (e.g.,
> blobs, images, etc.) is in byte strings, and that also just works.

Thanks for those useful comments, Peter. You gave me something to think
about.

I suppose, though, that your rule of thumb is only applicable as long as
you don't want backwards compatibility with pre-5.8 perl versions.
 
