HTML in utf8 and perl


Pawel Niewiadomski

I have been looking all over for an answer to this and haven't found a
satisfactory one. Please tell me what's going on. I want to write a perl
script generating an html page encoded in utf8. I was wondering why the
following code

#!/usr/bin/perl
binmode (STDOUT, ":utf8");
use charnames ':full';
printf "\N{CYRILLIC SMALL LETTER EF}\n";
printf "\x{d184}\n";

produces two characters encoded differently, although theoretically it
should generate two Russian ef's identically encoded. The first character
is normally visible in a browser (provided I set utf8 encoding) and the
second is not. Other than that, the second character is encoded in three
bytes, not two as I would expect. Changing :utf8 to :raw in the second
line only produces additional "Wide character in print at..." warnings
but doesn't change the general output. Writing printf "\xd1\x84\n" would
be a solution, but I am wondering what the problem here is with
"\x{d184}". If what I am asking has an obvious answer, please be so kind
and refer me to a sensible source of information.
Thanks very much in advance,
Pawel
 
Alan J. Flavell

#!/usr/bin/perl

Where's your strict and warnings? Please read the group guidelines
and help yourself before asking others to help you. (Even if it isn't
the actual issue here).

binmode (STDOUT, ":utf8");
use charnames ':full';
printf "\N{CYRILLIC SMALL LETTER EF}\n";

Why on Earth use printf() instead of print() here? Again not the
specific issue here - but one day that's going to bite.
http://www.perldoc.com/perl5.8.0/pod/func/printf.html

printf "\x{d184}\n";

The character U+D184 is in the Hangul Syllables area, not Cyrillic
http://www.unicode.org/charts/PDF/UAC00.pdf

Your CYRILLIC SMALL LETTER EF character is \x{0444}

produces two characters encoded differently, although theoretically it
should generate two russian ef's identically encoded.

No, theoretically the second one should generate the Unicode character
which you specified. You're confusing Unicode values with their utf-8
encodings.

Other than that, the second character is coded by three,
not two bytes,

Indeed it is, although you shouldn't normally be concerned with that
if you are using Perl 5.8 Unicode characters as intended. It's not
the character that you wanted.

refer me to a sensible source of information.

perldoc perluniintro, perldoc perlunicode, and
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
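To make the distinction concrete, here is a minimal sketch (not part of the original reply) that reports, for each of the two characters in the question, its code point, its length in characters, and its length in bytes once encoded as UTF-8. It assumes the core Encode module is available; the helper name char_and_byte_counts is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns (number of characters, number of UTF-8 bytes).
sub char_and_byte_counts {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);   # byte string, one element per octet
    return (length($string), length($bytes));
}

for my $char ("\x{0444}", "\x{d184}") {
    my ($chars, $bytes) = char_and_byte_counts($char);
    printf "U+%04X: %d character, %d bytes in UTF-8\n", ord($char), $chars, $bytes;
}
# prints:
# U+0444: 1 character, 2 bytes in UTF-8
# U+D184: 1 character, 3 bytes in UTF-8
```

Both strings are a single character as far as Perl is concerned; only the encoded form differs in size.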

good luck
 
Andy Hassall

\x{} produces a 'wide hex char' (see perlop). The main point here, I think, is
that it is a char, and not just dumping a series of bytes out.

CYRILLIC SMALL LETTER EF is U+0444, which in UTF-8 encoding is represented in
two bytes by 0xd1 0x84.

printf "\N{CYRILLIC SMALL LETTER EF}\n";

You'd expect that to output 0xd1 0x84, no surprises here.

printf "\x{d184}\n";

This outputs 0xed 0x86 0x84.
This is the UTF-8 representation of U+D184, HANGUL SYLLABLE TYE.

Do you really mean:

printf "\x{444}\n";

i.e. print U+0444 CYRILLIC SMALL LETTER EF, which gets encoded by the :utf8
specification as two bytes 0xd1 0x84?
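The byte values quoted above can be checked by hand. The following is a rough sketch of the UTF-8 bit layout for code points up to U+FFFF only, written out with shifts and masks instead of relying on the :utf8 layer. The function name utf8_bytes is an assumption for illustration, and the sketch does not reject surrogates or handle code points above U+FFFF.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: encode one code point (U+0000..U+FFFF only) by hand
# and return the resulting byte values as a list.
sub utf8_bytes {
    my ($cp) = @_;
    return ($cp) if $cp < 0x80;                       # 0xxxxxxx
    return (0xC0 | ($cp >> 6),                        # 110xxxxx
            0x80 | ($cp & 0x3F))                      # 10xxxxxx
        if $cp < 0x800;
    return (0xE0 | ($cp >> 12),                       # 1110xxxx
            0x80 | (($cp >> 6) & 0x3F),               # 10xxxxxx
            0x80 | ($cp & 0x3F));                     # 10xxxxxx
}

for my $cp (0x0444, 0xD184) {
    printf "U+%04X -> %s\n", $cp,
        join ' ', map { sprintf '%02x', $_ } utf8_bytes($cp);
}
# prints:
# U+0444 -> d1 84
# U+D184 -> ed 86 84
```

U+0444 falls below 0x800 and so fits the two-byte pattern, while U+D184 needs the three-byte pattern, which is why the second printf in the question emitted one byte more than expected.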

#!/usr/bin/perl
use strict;
use warnings;

binmode (STDOUT, ":utf8");
use charnames ':full';
printf "\N{CYRILLIC SMALL LETTER EF}\n";
printf "\x{444}\n";
printf "\x{d184}\n";

__END__

$ test.pl | hexdump -C
00000000 d1 84 0a d1 84 0a ed 86 84 0a |..........|
 
Pawel Niewiadomski

On Sat, 28 Feb 2004, Pawel Niewiadomski wrote:

Where's your strict and warnings? Please read the group guidelines
and help yourself before asking others to help you. (Even if it isn't
the actual issue here).

Indeed. Thanks for your advice.

Why on Earth use printf() instead of print() here? Again not the
specific issue here - but one day that's going to bite.
http://www.perldoc.com/perl5.8.0/pod/func/printf.html

Just the old C habits. You're absolutely right.

The character U+D184 is in the Hangul Syllables area, not Cyrillic
http://www.unicode.org/charts/PDF/UAC00.pdf

Your CYRILLIC SMALL LETTER EF character is \x{0444}

No, theoretically the second one should generate the Unicode character
which you specified. You're confusing Unicode values with their utf-8
encodings.

That was the answer I was looking for. I didn't quite understand
the difference between the encoding of the character in utf8 and its
value in Unicode. I swear I have searched at least 3 FAQs and looked
through the newsgroup archives. I must have been looking in the wrong
places :)

perldoc perluniintro, perldoc perlunicode, and

Been there, seen that. Again, it didn't give me much as I was mixing up
the idea of Unicode value and utf8 encoding. I was using a UTF-8
translation table instead of the original Unicode tables.
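For anyone starting from a UTF-8 byte table in the same way, a small sketch along these lines converts the bytes listed there back into the code point that \x{...} expects. It assumes the core Encode module; the function name escape_for_utf8_hex is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take hex bytes as listed in a UTF-8 table (e.g. "d1 84")
# and return the \x{...} escape for the single code point they encode.
sub escape_for_utf8_hex {
    my ($hex) = @_;
    my $bytes = pack 'H*', join '', split ' ', $hex;          # "d1 84" -> 2 raw bytes
    my $char  = decode('UTF-8', $bytes, Encode::FB_CROAK);    # dies on malformed input
    die "expected exactly one character\n" unless length($char) == 1;
    return sprintf '\x{%04x}', ord $char;
}

print escape_for_utf8_hex('d1 84'), "\n";      # prints \x{0444}
print escape_for_utf8_hex('ed 86 84'), "\n";   # prints \x{d184}
```

The first call shows that the bytes d1 84 from the table correspond to \x{0444}, not \x{d184}.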

Thanks a lot,
Pawel
 
Alan J. Flavell

That was the answer I was looking for. I didn't really quite understand
the difference between the encoding of the character in utf8 and its
value in Unicode.

Glad it helped.

Of course, now that you know the answer, it should be easy to find it in
the documentation. :-}

The Unicode "code points" (the term used in the perluniintro) are
encoded in different ways (different bit-patterns) in utf-8, utf-16 or
indeed other applicable Unicode encodings, but they still represent
the same "code point". It just so happens that Perl chose internally
to represent characters by using utf-8 representation, but the ord()
values of the Unicode characters are still their code point values,
and, as you've now seen, the wide character constant is represented by
\x{...} using its code point value, the same as is tabulated in the
character code charts at the Unicode site,
http://www.unicode.org/charts/
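As a quick illustration of that point, the sketch below encodes the same character in several Unicode encodings and shows that ord() reports the same code point regardless. It assumes the core Encode module and its "UTF-8", "UTF-16BE" and "UTF-16LE" encoding names; encoded_hex is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of a string encoded in the named encoding.
sub encoded_hex {
    my ($encoding, $string) = @_;
    my $bytes = encode($encoding, $string);
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

my $ef = "\x{0444}";                       # CYRILLIC SMALL LETTER EF
printf "code point: U+%04X\n", ord $ef;    # same value whatever the encoding
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE)) {
    printf "%-8s %s\n", $encoding, encoded_hex($encoding, $ef);
}
# prints:
# code point: U+0444
# UTF-8    d1 84
# UTF-16BE 04 44
# UTF-16LE 44 04
```

Only the byte patterns change from one encoding to the next; the value used inside \x{...} stays 0444.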

all the best
 
