Regex and UTF-8 characters (German umlauts)

Dirk Heinrichs

Hi,

the following little perl snippet

perl -e '($string = "AAA ÄÄÄ BBB CCC DDD") =~ s/(\p{IsUpper}+)/\L\u\1\E/g;
print $string . "\n"'

gives this result:

Aaa ÄÄÄ Bbb Ccc Ddd

How do I turn those umlauts into "Äää" also? I tried adding "use utf8;", but
that didn't help.

Thanks...

Dirk
--
Dirk Heinrichs | Tel: +49 (0)162 234 3408
Configuration Manager | Fax: +49 (0)211 47068 111
Capgemini Deutschland | Mail: (e-mail address removed)
Hambornerstraße 55 | Web: http://www.capgemini.com
D-40472 Düsseldorf | ICQ#: 110037733
GPG Public Key C2E467BB | Keyserver: www.keyserver.net
 
Dave

Are you running this in a unicode shell with unicode input? Also what
version of perl?
 
Dirk Heinrichs

Dave said:
Are you running this in a unicode shell with unicode input? Also what
version of perl?

Yes, in KDE's Konsole configured for UTF-8 and LANG set to de_DE.utf8. Perl
version is 5.8.8, OS is (Gentoo) Linux.

Bye...

Dirk
 
Ted Zlatanov

Dirk Heinrichs said:
How do I turn those umlauts into "Äää" also? I tried adding "use utf8;", but
that didn't help.

The utf8 pragma won't make a difference. Ä is ASCII code 196.

Try this:

perl -MPOSIX -e '$loc = setlocale( LC_ALL, "" ); print "$loc => ", lc(chr(196))'
en_US => Ä
perl -MPOSIX -e '$loc = setlocale( LC_ALL, "de_AT" ); print "$loc => ", lc(chr(196))'
=> Ä

(or whatever locale is appropriate for you)

I don't have the German locales installed here so I can't test it, but
it's supposed to work :) That's why the second line doesn't show
anything for $loc with my test.

Ted
 
Ben Morrow

Posting 8bit data on Usenet is not a good idea. There is no way of
indicating its encoding. In what appears below, I have replaced the
literal byte "\xc4" with "<c4>", and re-wrapped the result.

Dirk Heinrichs said:
s/(\p{IsUpper}+)/\L\u\1\E/g;

The \1 in the replacement is a sed-ism. In Perl, backreferences (outside
of the pattern itself) are spelt $1.

Also, I would consider it much clearer to write this as

s/(\p{IsUpper}+)/ucfirst lc $1/ge;

Ted Zlatanov said:
The utf8 pragma won't make a difference. <e4> is ASCII code 196.

There is No Such Thing as 'ASCII code 196'. ASCII only goes up to 127.

As the post arrived here, the section of code represented above by
'<c4><c4><c4>' is 3 bytes long. This is not valid UTF8, so if these
three bytes are actually in your file you have a problem. I suspect your
file is actually encoded in ISO8859-1; you can tell Perl this by putting

use encoding 'iso8859-1';

before any 8bit bytes occur. You may also want to tell Perl what
encoding you expect the output in; for this you need to use the
:encoding() PerlIO layer.

The behaviour of perl's builtins on strings containing bytes \x80-\xff
but which don't have the internal utf8 flag set can be somewhat weird.
This is the result of perl trying to reconcile the (basically
irreconcilable (sp?)) conditions of behaving properly Unicode-y if you
use Unicode and behaving the same as 5.6 used to if you don't. If you
always stick to properly en/decoding your data (with the encoding
pragma, the :encoding() layer and Encode::{en,de}code) you should be OK.
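
A minimal sketch of the decode, transform, encode pattern described above, using Encode directly. It assumes the input really is UTF-8; the variable names are illustrative and not from the original posts, and the sample bytes are written as escapes so the result does not depend on how this file is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: raw UTF-8 bytes, where C3 84 is one A-umlaut.
my $bytes = "AAA \xC3\x84\xC3\x84\xC3\x84 BBB";

# Decode the bytes into Perl characters before any case mapping.
my $text = decode('UTF-8', $bytes);

# Lower-case each upper-case run, then capitalise its first letter.
(my $fixed = $text) =~ s/(\p{IsUpper}+)/ucfirst lc $1/ge;

# Encode back to UTF-8 bytes for output.
my $out = encode('UTF-8', $fixed);
print $out, "\n";
```

On a UTF-8 terminal this should print the umlaut word with only its first letter capitalised, because lc and ucfirst now see characters rather than bytes.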

You probably also want to avoid using non-ascii chars from the shell.
What your terminal/shell do with the data is distinctly unpredictable.

Ben
 
Dirk Heinrichs

Ben said:
As the post arrived here, the section of code represented above by
'<c4><c4><c4>' is 3 bytes long. This is not valid UTF8, so if these
three bytes are actually in your file you have a problem. I suspect your
file is actually encoded in ISO8859-1; you can tell Perl this by putting

This was just the sample code I typed into the shell to test the regex. The
actual input file I want to process is indeed utf-8.

What I've seen was that umlauts and the following character were not
converted to lower case. So it seems umlauts were considered word
boundaries.

However, I finally solved it by adding

use open ':utf8';
binmode(STDOUT, ":utf8");

to my program.
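
For anyone landing here later, a self-contained sketch of the same idea: decode on input, encode on output. It is not Dirk's actual program. It uses explicit :encoding(UTF-8) layers instead of "use open ':utf8'", and the temporary file and variable names are assumptions made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical input file: one line of the thread's sample text,
# written as raw UTF-8 bytes (C3 84 is one A-umlaut).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode($tmp);
print {$tmp} "AAA \xC3\x84\xC3\x84\xC3\x84 BBB CCC DDD\n";
close($tmp) or die "Cannot close $path: $!";

# Decode UTF-8 while reading, encode UTF-8 while writing.
open(my $in, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
binmode(STDOUT, ':encoding(UTF-8)');

my @lines;
while (my $line = <$in>) {
    # Lower-case each upper-case run, then capitalise its first letter.
    $line =~ s/(\p{IsUpper}+)/ucfirst lc $1/ge;
    push @lines, $line;
}
close($in);

print @lines;
```

With the layers in place, each A-umlaut reaches the regex as a single character, so the whole word is matched and case-mapped as one run.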

Thanks to anybody for your efforts.

Bye...

Dirk
 
Dirk Heinrichs

Dirk said:
Thanks to anybody for your efforts.

s/any/every/

Bye...

Dirk
 
Ted Zlatanov

Ben Morrow said:
There is No Such Thing as 'ASCII code 196'. ASCII only goes up to 127.

As the post arrived here, the section of code represented above by
'<c4><c4><c4>' is 3 bytes long. This is not valid UTF8, so if these
three bytes are actually in your file you have a problem.

The OP had a word made of three A-umlaut characters, to indicate that
the second and third were not lowercased automatically. The ord() of
those is 196, which is 0xC4 in hex. The OP wants the second and third
to become 0xE4 which is a-umlaut. Did I misunderstand something?
Where is it implied that utf8 encoding matters? I really think this
is a locale issue.

Ted
 
Ben Morrow

Quoth Ted Zlatanov said:
The OP had a word made of three A-umlaut characters, to indicate that
the second and third were not lowercased automatically.

The OP had three bytes 0xc4. Whether or not this is three A-umlaut
characters depends on what encoding you are assuming the source is
written in. In UTF-8, these three bytes are invalid. In ASCII, these
three bytes are invalid. In ISO8859-1 they are three A-umlaut
characters. In ISO8859-7 (to pick a random example) it is three capital
deltas.
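
To make the point concrete, here is a small sketch (not from the original posts) that decodes the same three 0xC4 bytes under different assumed encodings with the Encode module. The variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# The three bytes under discussion.
my $bytes = "\xC4\xC4\xC4";

# The same bytes decode to different characters per encoding.
my $latin1 = decode('ISO-8859-1', $bytes);   # A-umlaut, U+00C4
my $greek  = decode('ISO-8859-7', $bytes);   # capital delta, U+0394

printf "ISO-8859-1: U+%04X\n", ord $latin1;
printf "ISO-8859-7: U+%04X\n", ord $greek;

# Strict UTF-8 decoding rejects the sequence. Decode a copy, because
# Encode may modify its source argument when a CHECK value is given.
my $copy = $bytes;
my $utf8_ok = eval { decode('UTF-8', $copy, FB_CROAK); 1 };
print "Valid UTF-8? ", ($utf8_ok ? "yes" : "no"), "\n";
```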

Ted Zlatanov said:
The ord() of those is 196, which is 0xC4 in hex. The OP wants the
second and third to become 0xE4 which is a-umlaut. Did I
misunderstand something?

The ord of A-umlaut is 0xc4, yes. This is not relevant here: which bytes
are used to represent a character depends on which encoding is in use.

This is not just irrelevant nit-picking: it really matters. See
http://www.joelonsoftware.com/articles/Unicode.html .

Ted Zlatanov said:
Where is it implied that utf8 encoding matters?

The OP stated that he tried adding 'use utf8;'. This is a statement to
Perl that his source is in UTF8, which in this case is not true. What he
should have done was add the statement 'use encoding "iso8859-1";',
which is true. Lying to Perl is almost never a good idea :).

Ted Zlatanov said:
I really think this is a locale issue.

It's not. It's to do with perl's rather nasty[0] bytewards-compatibility
mode.

Ben

[0] In case anyone gets the wrong idea, this is not a criticism. The
problem required to be solved (work both with people who want proper
Unicode handling and people who want to carry on assuming all charsets
are single-byte supersets of ASCII, without anyone noticing anything
weird's going on) is ultimately insoluble, and perl generally does a
good job. When it doesn't it can always be persuaded to by the addition
of appropriate calls to Encode.
 
Ted Zlatanov

Ben Morrow said:
The OP had three bytes 0xc4. Whether or not this is three A-umlaut
characters depends on what encoding you are assuming the source is
written in. In UTF-8, these three bytes are invalid. In ASCII, these
three bytes are invalid. In ISO8859-1 they are three A-umlaut
characters. In ISO8859-7 (to pick a random example) it is three capital
deltas.

I checked the original article. It is encoded in utf-8. I don't know
where you got the <c4> from in your followup, but the text between
"AAA" and "BBB" correctly decodes to three A-umlauts in my newsreader
and to a UTF-8 capable terminal. I think your newsreader transformed it
to 8859-1 encoding somehow. What I saw is three occurrences of the
2-byte sequence 0xC384 that you can actually find at
http://home.tiscali.nl/t876506/utf8tbl.html as the first entry for
A-umlaut (Adieresis is the PostScript name for it, I guess). So
you're right in general terms, 0xC4 can mean many things, but here the
OP provided the correct text in the correct encoding.
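
A quick way to check that byte sequence, sketched with Encode (variable names are illustrative): encoding U+00C4 as UTF-8 should give the two bytes C3 84, so three A-umlauts take six bytes.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+00C4 is LATIN CAPITAL LETTER A WITH DIAERESIS (A-umlaut).
my $bytes = encode('UTF-8', "\x{C4}");

# Show the encoded bytes as upper-case hex.
my $hex = uc unpack('H*', $bytes);
print "UTF-8 bytes for U+00C4: $hex\n";

# Three A-umlauts therefore occupy six bytes in UTF-8.
my $three = encode('UTF-8', "\x{C4}" x 3);
print "Byte length of three A-umlauts: ", length($three), "\n";
```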

Ben Morrow said:
The ord of A-umlaut is 0xc4, yes. This is not relevant here: which bytes
are used to represent a character depends on which encoding is in use.

This is not just irrelevant nit-picking: it really matters. See
http://www.joelonsoftware.com/articles/Unicode.html .

Thanks for the pointer. I'm pretty conversant with Unicode and
character encodings. I think you were looking at something strange in
your newsreader, hence the confusion. I compounded it by assuming you
actually saw three of the 0xC4 bytes in the original message. Sorry.

Ben Morrow said:
[0] In case anyone gets the wrong idea, this is not a criticism. The
problem required to be solved (work both with people who want proper
Unicode handling and people who want to carry on assuming all charsets
are single-byte supersets of ASCII, without anyone noticing anything
weird's going on) is ultimately insoluble, and perl generally does a
good job. When it doesn't it can always be persuaded to by the addition
of appropriate calls to Encode.

Good advice. I advocate UTF-8 wherever possible, since it's compact,
unambiguous, and can cover the whole UCS.

Ted
 
Ben Morrow

Quoth Ted Zlatanov said:
I checked the original article. It is encoded in utf-8. I don't know
where you got the <c4> from in your followup, but the text between
"AAA" and "BBB" correctly decodes to three A-umlauts in my newsreader
and to a UTF-8 capable terminal.

Yes, I went back and did the same, and, as they arrived here,

The original article appears to be in UTF8, with the string in
question represented by six bytes.

Your first reply (that I was replying to) recoded it as ISO8859-1,
with the string in question in three bytes.

This just re-emphasises what I said in my first reply: Usenet is an
ASCII medium. All posts are assumed to be in ASCII, and there is no way
to specify otherwise. So don't try to post in other character sets.

Ben
 
Ted Zlatanov

Ben Morrow said:
Yes, I went back and did the same, and, as they arrived here,

The original article appears to be in UTF8, with the string in
question represented by six bytes.

Your first reply (that I was replying to) recoded it as ISO8859-1,
with the string in question in three bytes.

I think this was a decision made by Gnus automatically,
unfortunately. I thought it was preserving the original encoding. As
I said, sorry for the confusion.

Ted
 
