Regex and UTF-8 characters (German umlauts)

Discussion in 'Perl Misc' started by Dirk Heinrichs, Aug 10, 2006.

  1. Hi,

    the following little perl snippet

    perl -e '($string = "AAA ÄÄÄ BBB CCC DDD") =~ s/(\p{IsUpper}+)/\L\u\1\E/g;
    print $string . "\n"'

    gives this result:

    Aaa ÄÄÄ Bbb Ccc Ddd

    How do I turn those umlauts into "Äää" also? I tried adding "use utf8;", but
    that didn't help.

    Thanks...

    Dirk
    --
    Dirk Heinrichs | Tel: +49 (0)162 234 3408
    Configuration Manager | Fax: +49 (0)211 47068 111
    Capgemini Deutschland | Mail:
    Hambornerstraße 55 | Web: http://www.capgemini.com
    D-40472 Düsseldorf | ICQ#: 110037733
    GPG Public Key C2E467BB | Keyserver: www.keyserver.net
     
    Dirk Heinrichs, Aug 10, 2006
    #1

  2. Dave Guest

    "Dirk Heinrichs" <> wrote in message
    news:mTECg.34879$...
    > Hi,
    >
    > the following little perl snippet
    >
    > perl -e '($string = "AAA ÄÄÄ BBB CCC DDD") =~ s/(\p{IsUpper}+)/\L\u\1\E/g;
    > print $string . "\n"'
    >
    > gives this result:
    >
    > Aaa ÄÄÄ Bbb Ccc Ddd
    >
    > How do I turn those umlauts into "Äää" also? I tried adding "use utf8;",
    > but
    > that didn't help.
    >
    > Thanks...
    >
    > Dirk
    > --
    > Dirk Heinrichs | Tel: +49 (0)162 234 3408
    > Configuration Manager | Fax: +49 (0)211 47068 111
    > Capgemini Deutschland | Mail:
    > Hambornerstraße 55 | Web: http://www.capgemini.com
    > D-40472 Düsseldorf | ICQ#: 110037733
    > GPG Public Key C2E467BB | Keyserver: www.keyserver.net


    Are you running this in a Unicode shell with Unicode input? Also, what
    version of Perl?
     
    Dave, Aug 10, 2006
    #2

  3. Dave wrote:

    > Are you running this in a unicode shell with unicode input? Also what
    > version of perl?


    Yes, in KDE's Konsole configured for UTF-8 and LANG set to de_DE.utf8. Perl
    version is 5.8.8, OS is (Gentoo) Linux.

    Bye...

    Dirk
     
    Dirk Heinrichs, Aug 10, 2006
    #3
  4. Ted Zlatanov Guest

    On 10 Aug 2006, wrote:

    > the following little perl snippet
    >
    > perl -e '($string = "AAA ÄÄÄ BBB CCC DDD") =~ s/(\p{IsUpper}+)/\L\u\1\E/g;
    > print $string . "\n"'
    >
    > gives this result:
    >
    > Aaa ÄÄÄ Bbb Ccc Ddd
    >
    > How do I turn those umlauts into "Äää" also? I tried adding "use utf8;", but
    > that didn't help.


    The utf8 pragma won't make a difference. Ä is ASCII code 196.

    Try this:

    perl -MPOSIX -e '$loc = setlocale( LC_ALL, "" ); print "$loc => ", lc(chr(196))'
    en_US => Ä
    perl -MPOSIX -e '$loc = setlocale( LC_ALL, "de_AT" ); print "$loc => ", lc(chr(196))'
    => Ä

    (or whatever locale is appropriate for you)

    I don't have the German locales installed here so I can't test it, but
    it's supposed to work :) That's why the second line doesn't show
    anything for $loc with my test.

    Ted
     
    Ted Zlatanov, Aug 10, 2006
    #4
  5. Ben Morrow Guest

    Posting 8bit data on Usenet is not a good idea. There is no way of
    indicating its encoding. In what appears below, I have replaced the
    literal byte "\xc4" with "<c4>", and re-wrapped the result.

    Quoth Ted Zlatanov <>:
    > On 10 Aug 2006, wrote:
    >
    > > the following little perl snippet
    > >
    > > perl -e '($string = "AAA <c4><c4><c4> BBB CCC DDD") =~
    > > s/(\p{IsUpper}+)/\L\u\1\E/g; print $string . "\n"'

    ^^ (the "\1" in the replacement)
    This is a sed-ism. In Perl, backreferences (outside of the pattern
    itself) are spelt $1.

    Also, I would consider it much clearer to write this as

    s/(\p{IsUpper}+)/ucfirst lc $1/ge;

    > > gives this result:
    > >
    > > Aaa <c4><c4><c4> Bbb Ccc Ddd
    > >
    > > How do I turn those umlauts into "<c4><e4><e4>" also? I tried adding
    > > "use utf8;", but that didn't help.

    >
    > The utf8 pragma won't make a difference. <e4> is ASCII code 196.


    There is No Such Thing as 'ASCII code 196'. ASCII only goes up to 127.

    As the post arrived here, the section of code represented above by
    '<c4><c4><c4>' is 3 bytes long. This is not valid UTF8, so if these
    three bytes are actually in your file you have a problem. I suspect your
    file is actually encoded in ISO8859-1; you can tell Perl this by putting

    use encoding 'iso8859-1';

    before any 8bit bytes occur. You may also want to tell Perl what
    encoding you expect the output in; for this you need to use the
    :encoding() PerlIO layer.

    The behaviour of perl's builtins on strings containing bytes \x80-\xff
    but which don't have the internal utf8 flag set can be somewhat weird.
    This is the result of perl trying to reconcile the (basically
    irreconcilable (sp?)) conditions of behaving properly Unicode-y if you
    use Unicode and behaving the same as 5.6 used to if you don't. If you
    always stick to properly en/decoding your data (with the encoding
    pragma, the :encoding() layer and Encode::{en,de}code) you should be OK.
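
    As a minimal sketch of the decode/encode discipline described above,
    assuming the input really is UTF-8 (the sample bytes are made up for
    illustration). It uses Encode::decode and Encode::encode rather than the
    encoding pragma, which was deprecated in later Perl releases:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Six UTF-8 bytes for three A-umlauts, as they might arrive from a file or shell.
my $bytes = "AAA \xC3\x84\xC3\x84\xC3\x84 BBB";

# Decode bytes into characters so that lc/ucfirst and \p{...} see U+00C4.
my $text = decode('UTF-8', $bytes);

# Same substitution as suggested above, now applied to a character string.
$text =~ s/(\p{IsUpper}+)/ucfirst lc $1/ge;

# Encode back to UTF-8 bytes before printing.
my $out = encode('UTF-8', $text);
print $out, "\n";
```

    On a UTF-8 terminal this should print the umlaut word with only its
    first letter in upper case.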

    You probably also want to avoid using non-ASCII chars from the shell.
    What your terminal/shell do with the data is distinctly unpredictable.

    Ben

    --
    Razors pain you / Rivers are damp
    Acids stain you / And drugs cause cramp. [Dorothy Parker]
    Guns aren't lawful / Nooses give
    Gas smells awful / You might as well live.
     
    Ben Morrow, Aug 10, 2006
    #5
  6. Ben Morrow wrote:

    > As the post arrived here, the section of code represented above by
    > '<c4><c4><c4>' is 3 bytes long. This is not valid UTF8, so if these
    > three bytes are actually in your file you have a problem. I suspect your
    > file is actually encoded in ISO8859-1; you can tell Perl this by putting


    This was just the sample code I typed into the shell to test the regex. The
    actual input file I want to process is indeed utf-8.

    What I saw was that umlauts and the character following them were not
    converted to lower case, so it seems the umlauts were treated as word
    boundaries.

    However, I finally solved it by adding

    use open ':utf8';
    binmode(STDOUT, ":utf8");

    to my program.
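
    A rough, self-contained sketch of the same idea (decode on input, encode
    on output). Two things here are substitutions and not part of the
    program described above: an in-memory filehandle stands in for the real
    input file, and the stricter :encoding(UTF-8) layer is used in place of
    :utf8.

```perl
use strict;
use warnings;

# Stand-in for the real input file: UTF-8 bytes held in memory (hypothetical data).
my $file_bytes = "AAA \xC3\x84\xC3\x84\xC3\x84 BBB CCC DDD\n";

# Decode on input: the layer turns UTF-8 bytes into characters as lines are read.
open my $in, '<:encoding(UTF-8)', \$file_bytes or die "cannot open input: $!";

# Encode on output, so characters are written as UTF-8 rather than Latin-1 bytes.
binmode STDOUT, ':encoding(UTF-8)';

my @lines;
while (my $line = <$in>) {
    # Title-case every run of upper-case letters, umlauts included.
    $line =~ s/(\p{IsUpper}+)/ucfirst lc $1/ge;
    push @lines, $line;
    print $line;
}
close $in;
```

    Note that "use open ':utf8';" on its own only sets default layers for
    handles opened in its scope; the standard handles need the separate
    binmode call (or the ':std' argument to the pragma).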

    Thanks to anybody for your efforts.

    Bye...

    Dirk
     
    Dirk Heinrichs, Aug 11, 2006
    #6
  7. Dirk Heinrichs wrote:

    > Thanks to anybody for your efforts.


    s/any/every/

    Bye...

    Dirk
     
    Dirk Heinrichs, Aug 11, 2006
    #7
  8. Ted Zlatanov Guest

    On 10 Aug 2006, wrote:


    > Posting 8bit data on Usenet is not a good idea. There is no way of
    > indicating its encoding. In what appears below, I have replaced the
    > literal byte "\xc4" with "<c4>", and re-wrapped the result.
    >
    > Quoth Ted Zlatanov <>:
    >> On 10 Aug 2006, wrote:
    >>
    >>> the following little perl snippet
    >>>
    >>> perl -e '($string = "AAA <c4><c4><c4> BBB CCC DDD") =~
    >>> s/(\p{IsUpper}+)/\L\u\1\E/g; print $string . "\n"'

    > ^^
    > This is a sed-ism. In Perl backreferences (outside of the pattern
    > itself) are spelt $1.
    >
    > Also, I would consider it much clearer to write this as
    >
    > s/(\p{IsUpper}+)/ucfirst lc $1/ge;
    >
    >>> gives this result:
    >>>
    >>> Aaa <c4><c4><c4> Bbb Ccc Ddd
    >>>
    >>> How do I turn those umlauts into "<c4><e4><e4>" also? I tried adding
    >>> "use utf8;", but that didn't help.

    >>
    >> The utf8 pragma won't make a difference. <e4> is ASCII code 196.

    >
    > There is No Such Thing as 'ASCII code 196'. ASCII only goes up to 127.
    >
    > As the post arrived here, the section of code represented above by
    > '<c4><c4><c4>' is 3 bytes long. This is not valid UTF8, so if these
    > three bytes are actually in your file you have a problem.


    The OP had a word made of three A-umlaut characters, to indicate that
    the second and third were not lowercased automatically. The ord() of
    those is 196, which is 0xC4 in hex. The OP wants the second and third
    to become 0xE4 which is a-umlaut. Did I misunderstand something?
    Where is it implied that utf8 encoding matters? I really think this
    is a locale issue.

    Ted
     
    Ted Zlatanov, Aug 11, 2006
    #8
  9. Ben Morrow Guest

    Quoth Ted Zlatanov <>:
    > On 10 Aug 2006, wrote:
    > > Quoth Ted Zlatanov <>:
    > >> On 10 Aug 2006, wrote:
    > >>
    > >>> the following little perl snippet
    > >>>
    > >>> perl -e '($string = "AAA <c4><c4><c4> BBB CCC DDD") =~
    > >>> s/(\p{IsUpper}+)/\L\u\1\E/g; print $string . "\n"'

    <snip my comment>
    > >>> gives this result:
    > >>>
    > >>> Aaa <c4><c4><c4> Bbb Ccc Ddd
    > >>>
    > >>> How do I turn those umlauts into "<c4><e4><e4>" also? I tried adding
    > >>> "use utf8;", but that didn't help.
    > >>
    > >> The utf8 pragma won't make a difference. <e4> is ASCII code 196.

    > >
    > > There is No Such Thing as 'ASCII code 196'. ASCII only goes up to 127.
    > >
    > > As the post arrived here, the section of code represented above by
    > > '<c4><c4><c4>' is 3 bytes long. This is not valid UTF8, so if these
    > > three bytes are actually in your file you have a problem.

    >
    > The OP had a word made of three A-umlaut characters, to indicate that
    > the second and third were not lowercased automatically.


    The OP had three bytes 0xc4. Whether or not this is three A-umlaut
    characters depends on what encoding you are assuming the source is
    written in. In UTF-8, these three bytes are invalid. In ASCII, these
    three bytes are invalid. In ISO8859-1 they are three A-umlaut
    characters. In ISO8859-7 (to pick a random example) it is three capital
    deltas.
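
    The point can be checked with a short sketch using Encode::decode on
    the same three bytes (the variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three bytes under discussion.
my $bytes = "\xC4\xC4\xC4";

# The same bytes decode to different characters under different encodings.
my $latin1 = decode('iso-8859-1', $bytes);   # three times U+00C4 (A-umlaut)
my $greek  = decode('iso-8859-7', $bytes);   # three times U+0394 (capital delta)

printf "ISO-8859-1: U+%04X\n", ord $latin1;
printf "ISO-8859-7: U+%04X\n", ord $greek;

# Strict UTF-8 decoding rejects them; FB_CROAK makes decode die on bad input.
# decode may modify its input when a check value is given, so work on a copy.
my $copy = $bytes;
my $ok = eval { decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
print "UTF-8: ", ($ok ? "valid" : "invalid"), "\n";
```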

    > The ord() of those is 196, which is 0xC4 in hex. The OP wants the
    > second and third to become 0xE4 which is a-umlaut. Did I
    > misunderstand something?


    The ord of A-umlaut is 0xc4, yes. This is not relevant here: which bytes
    are used to represent a character depends on which encoding is in use.

    This is not just irrelevant nit-picking: it really matters. See
    http://www.joelonsoftware.com/articles/Unicode.html .

    > Where is it implied that utf8 encoding matters?


    The OP stated that he tried adding 'use utf8;'. This is a statement to
    Perl that his source is in UTF8, which in this case is not true. What he
    should have done was add the statement 'use encoding "iso8859-1";',
    which is true. Lying to Perl is almost never a good idea :).

    > I really think this is a locale issue.


    It's not. It's to do with perl's rather nasty[0] bytewards-compatibility
    mode.

    Ben

    [0] In case anyone gets the wrong idea, this is not a criticism. The
    problem required to be solved (work both with people who want proper
    Unicode handling and people who want to carry on assuming all charsets
    are single-byte supersets of ASCII, without anyone noticing anything
    weird's going on) is ultimately insoluble, and perl generally does a
    good job. When it doesn't it can always be persuaded to by the addition
    of appropriate calls to Encode.

    --
    Heracles: Vulture! Here's a titbit for you / A few dried molecules of the gall
    From the liver of a friend of yours. / Excuse the arrow but I have no spoon.
    (Ted Hughes, [ Heracles shoots Vulture with arrow. Vulture bursts into ]
    'Alcestis') [ flame, and falls out of sight. ]
     
    Ben Morrow, Aug 11, 2006
    #9
  10. Ted Zlatanov Guest

    On 11 Aug 2006, wrote:

    > Quoth Ted Zlatanov <>:
    >> On 10 Aug 2006, wrote:
    >>> Quoth Ted Zlatanov <>:
    >>>> On 10 Aug 2006, wrote:
    >>>>
    >>>>> the following little perl snippet
    >>>>>
    >>>>> perl -e '($string = "AAA <c4><c4><c4> BBB CCC DDD") =~
    >>>>> s/(\p{IsUpper}+)/\L\u\1\E/g; print $string . "\n"'

    > <snip my comment>
    >>>>> gives this result:
    >>>>>
    >>>>> Aaa <c4><c4><c4> Bbb Ccc Ddd
    >>>>>
    >>>>> How do I turn those umlauts into "<c4><e4><e4>" also? I tried adding
    >>>>> "use utf8;", but that didn't help.
    >>>>
    >>>> The utf8 pragma won't make a difference. <e4> is ASCII code 196.
    >>>
    >>> There is No Such Thing as 'ASCII code 196'. ASCII only goes up to 127.
    >>>
    >>> As the post arrived here, the section of code represented above by
    >>> '<c4><c4><c4>' is 3 bytes long. This is not valid UTF8, so if these
    >>> three bytes are actually in your file you have a problem.

    >>
    >> The OP had a word made of three A-umlaut characters, to indicate that
    >> the second and third were not lowercased automatically.

    >
    > The OP had three bytes 0xc4. Whether or not this is three A-umlaut
    > characters depends on what encoding you are assuming the source is
    > written in. In UTF-8, these three bytes are invalid. In ASCII, these
    > three bytes are invalid. In ISO8859-1 they are three A-umlaut
    > characters. In ISO8859-7 (to pick a random example) it is three capital
    > deltas.


    I checked the original article. It is encoded in utf-8. I don't know
    where you got the <c4> from in your followup, but the text between
    "AAA" and "BBB" correctly decodes to three A-umlauts in my newsreader
    and on a UTF-8 capable terminal. I think your newsreader transformed
    it to 8859-1 encoding somehow. What I saw was three occurrences of the
    2-byte sequence 0xC384 that you can actually find at
    http://home.tiscali.nl/t876506/utf8tbl.html as the first entry for
    A-umlaut (Adieresis is the PostScript name for it, I guess). So
    you're right in general terms, 0xC4 can mean many things, but here the
    OP provided the correct text in the correct encoding.
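
    For illustration, a small sketch (using Encode::encode; the variable
    names are made up) showing how the single character U+00C4 becomes two
    bytes in UTF-8 but one byte in ISO-8859-1:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character: A-umlaut, code point U+00C4 (decimal 196).
my $a_umlaut = chr(0xC4);

my $utf8   = encode('UTF-8',      $a_umlaut);   # two bytes
my $latin1 = encode('iso-8859-1', $a_umlaut);   # one byte

# Show each encoded form as hex digits.
printf "UTF-8:      %s\n", unpack('H*', $utf8);     # c384
printf "ISO-8859-1: %s\n", unpack('H*', $latin1);   # c4
```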

    > The ord of A-umlaut is 0xc4, yes. This is not relevant here: which bytes
    > are used to represent a character depends on which encoding is in use.
    >
    > This is not just irrelevant nit-picking: it really matters. See
    > http://www.joelonsoftware.com/articles/Unicode.html .


    Thanks for the pointer. I'm pretty conversant with Unicode and
    character encodings. I think you were looking at something strange in
    your newsreader, hence the confusion. I compounded it by assuming you
    actually saw three of the 0xC4 bytes in the original message. Sorry.

    > [0] In case anyone gets the wrong idea, this is not a criticism. The
    > problem required to be solved (work both with people who want proper
    > Unicode handling and people who want to carry on assuming all charsets
    > are single-byte supersets of ASCII, without anyone noticing anything
    > weird's going on) is ultimately insoluble, and perl generally does a
    > good job. When it doesn't it can always be persuaded to by the addition
    > of appropriate calls to Encode.


    Good advice. I advocate UTF-8 wherever possible, since it's compact,
    unambiguous, and can cover the whole UCS.

    Ted
     
    Ted Zlatanov, Aug 14, 2006
    #10
  11. Ben Morrow Guest

    Quoth Ted Zlatanov <>:
    > On 11 Aug 2006, wrote:
    >
    > > The OP had three bytes 0xc4. Whether or not this is three A-umlaut
    > > characters depends on what encoding you are assuming the source is
    > > written in. In UTF-8, these three bytes are invalid. In ASCII, these
    > > three bytes are invalid. In ISO8859-1 they are three A-umlaut
    > > characters. In ISO8859-7 (to pick a random example) it is three capital
    > > deltas.

    >
    > I checked the original article. It is encoded in utf-8. I don't know
    > where you got the <c4> from in your followup, but the text between
    > "AAA" and "BBB" correctly decodes to three A-umlauts in my newsreader
    > and to a UTF-8 capable terminal.


    Yes, I went back and did the same. As they arrived here:

    The original article appears to be in UTF8, with the string in
    question represented by six bytes.

    Your first reply (that I was replying to) recoded it as ISO8859-1,
    with the string in question in three bytes.

    This just re-emphasises what I said in my first reply: Usenet is an
    ASCII medium. All posts are assumed to be in ASCII, and there is no way
    to specify otherwise. So don't try to post in other character sets.

    Ben

    --
    Every twenty-four hours about 34k children die from the effects of poverty.
    Meanwhile, the latest estimate is that 2800 people died on 9/11, so it's like
    that image, that ghastly, grey-billowing, double-barrelled fall, repeated
    twelve times every day. Full of children. [Iain Banks]
     
    Ben Morrow, Aug 14, 2006
    #11
  12. Ted Zlatanov Guest

    On 14 Aug 2006, wrote:

    > Quoth Ted Zlatanov <>:
    >> On 11 Aug 2006, wrote:
    >>
    >>> The OP had three bytes 0xc4. Whether or not this is three A-umlaut
    >>> characters depends on what encoding you are assuming the source is
    >>> written in. In UTF-8, these three bytes are invalid. In ASCII, these
    >>> three bytes are invalid. In ISO8859-1 they are three A-umlaut
    >>> characters. In ISO8859-7 (to pick a random example) it is three capital
    >>> deltas.

    >>
    >> I checked the original article. It is encoded in utf-8. I don't know
    >> where you got the <c4> from in your followup, but the text between
    >> "AAA" and "BBB" correctly decodes to three A-umlauts in my newsreader
    >> and to a UTF-8 capable terminal.

    >
    > Yes, I went back and did the same, and, as they arrived here,
    >
    > The original article appears to be in UTF8, with the string in
    > question represented by six bytes.
    >
    > Your first reply (that I was replying to) recoded it as ISO8859-1,
    > with the string in question in three bytes.


    I think this was a decision made by Gnus automatically,
    unfortunately. I thought it was preserving the original encoding. As
    I said, sorry for the confusion.

    Ted
     
    Ted Zlatanov, Aug 14, 2006
    #12
