HTML in utf8 and perl

Pawel Niewiadomski · Feb 28, 2004

I have been looking all over for an answer to this and haven't found a
satisfactory one. Please tell me what's going on. I want to write a perl
script generating an html page encoded in utf8. I was wondering why the
following code

#!/usr/bin/perl
binmode (STDOUT, ":utf8");
use charnames ':full';
printf "\N{CYRILLIC SMALL LETTER EF}\n";
printf "\x{d184}\n";

produces two characters encoded differently, although theoretically it
should generate two Russian ef's identically encoded. The first character
is normally visible in a browser (provided I set utf8 encoding) and the
second is not. Other than that, the second character is coded by three,
not two bytes, as I would expect. Changing :utf8 to :raw in the second
line only produces additional "Wide character in print at..." warnings
but doesn't change the general output. Writing printf "\xd1\x84\n" would
be a solution, but I am wondering what the problem here is with
"\x{d184}". If what I am asking has an obvious answer, please be so kind as to
refer me to a sensible source of information.
Thanks very much in advance,
Pawel

Alan J. Flavell · Feb 28, 2004

#!/usr/bin/perl

Where's your strict and warnings? Please read the group guidelines
and help yourself before asking others to help you. (Even if it isn't
the actual issue here).

binmode (STDOUT, ":utf8");
use charnames ':full';
printf "\N{CYRILLIC SMALL LETTER EF}\n";

Why on Earth use printf() instead of print() here? Again not the
specific issue here - but one day that's going to bite.
http://www.perldoc.com/perl5.8.0/pod/func/printf.html

printf "\x{d184}\n";

The character U+D184 is in the Hangul Syllables area, not Cyrillic
http://www.unicode.org/charts/PDF/UAC00.pdf

Your CYRILLIC SMALL LETTER EF character is \x{0444}

produces two characters encoded differently, although theoretically it
should generate two Russian ef's identically encoded.

No, theoretically the second one should generate the Unicode character
which you specified. You're confusing Unicode values with their utf-8
encodings.

Other than that, the second character is coded by three,
not two bytes,

Indeed it is, although you shouldn't normally be concerned with that
if you are using Perl 5.8 Unicode characters as intended. It's not
the character that you wanted.

refer me to a sensible source of information.

perldoc perluniintro, perldoc perlunicode, and
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

good luck

Andy Hassall · Feb 28, 2004

I have been looking all over for an answer to this and haven't found a
satisfactory one. Please tell me what's going on. I want to write a perl
script generating an html page encoded in utf8. I was wondering why the
following code

#!/usr/bin/perl
binmode (STDOUT, ":utf8");
use charnames ':full';
printf "\N{CYRILLIC SMALL LETTER EF}\n";
printf "\x{d184}\n";

produces two characters encoded differently, although theoretically it
should generate two Russian ef's identically encoded. The first character
is normally visible in a browser (provided I set utf8 encoding) and the
second is not. Other than that, the second character is coded by three,
not two bytes, as I would expect. Changing :utf8 to :raw in the second
line only produces additional "Wide character in print at..." warnings
but doesn't change the general output. Writing printf "\xd1\x84\n" would
be a solution, but I am wondering what the problem here is with
"\x{d184}". If what I am asking has an obvious answer, please be so kind as to
refer me to a sensible source of information.

\x{} produces a 'wide hex char' (see perlop). The main point here, I think, is
that it denotes a single character by its code point; it does not just dump a
series of bytes to the output.

CYRILLIC SMALL LETTER EF is U+0444, which in UTF-8 encoding is represented in
two bytes by 0xd1 0x84.

printf "\N{CYRILLIC SMALL LETTER EF}\n";

You'd expect that to output 0xd1 0x84, no surprises here.

printf "\x{d184}\n";

This outputs 0xed 0x86 0x84.
This is the UTF-8 representation of U+D184, HANGUL SYLLABLE TYE.

Do you really mean:

printf "\x{444}\n";

i.e. print U+0444 CYRILLIC SMALL LETTER EF, which gets encoded by the :utf8
specification as two bytes 0xd1 0x84?

#!/usr/bin/perl
use strict;
use warnings;

binmode (STDOUT, ":utf8");
use charnames ':full';
printf "\N{CYRILLIC SMALL LETTER EF}\n";
printf "\x{444}\n";
printf "\x{d184}\n";

__END__

andyh@server:~$ test.pl | hexdump -C
00000000 d1 84 0a d1 84 0a ed 86 84 0a |..........|
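
As a rough sketch of why "\xd1\x84" looks like a solution while "\x{d184}" does not, the snippet below writes strings through in-memory filehandles and reports the bytes that come out. The helper name bytes_written is made up for this illustration and is not from the thread; it assumes Perl 5.8 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write $string to an in-memory filehandle opened
# with the given I/O layer and return the resulting bytes as hex.
sub bytes_written {
    my ($layer, $string) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer)
        or die "cannot open in-memory handle: $!";
    print {$fh} $string;
    close($fh) or die "cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

# One character, U+0444, encoded by the :utf8 layer.
print bytes_written(':utf8', "\x{444}"), "\n";     # d1 84

# Two characters, U+00D1 and U+0084, passed through untouched by :raw.
print bytes_written(':raw', "\xd1\x84"), "\n";     # d1 84

# The same two characters sent through :utf8 are each encoded again.
print bytes_written(':utf8', "\xd1\x84"), "\n";    # c3 91 c2 84
```

So "\xd1\x84" only yields the right bytes when no encoding layer is applied; with :utf8 on STDOUT it would come out as four bytes. Writing the character as "\x{444}" and letting the layer do the encoding avoids that.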

Pawel Niewiadomski · Feb 28, 2004

On Sat, 28 Feb 2004, Pawel Niewiadomski wrote:

Where's your strict and warnings? Please read the group guidelines
and help yourself before asking others to help you. (Even if it isn't
the actual issue here).

Indeed. Thanks for your advice.

Why on Earth use printf() instead of print() here? Again not the
specific issue here - but one day that's going to bite.
http://www.perldoc.com/perl5.8.0/pod/func/printf.html

Just the old C habits. You're absolutely right.

The character U+D184 is in the Hangul Syllables area, not Cyrillic
http://www.unicode.org/charts/PDF/UAC00.pdf

Your CYRILLIC SMALL LETTER EF character is \x{0444}

No, theoretically the second one should generate the Unicode character
which you specified. You're confusing Unicode values with their utf-8
encodings.

That was the answer I was looking for. I didn't really quite understand
the difference between the encoding of the character in utf8 and its
value in Unicode. I swear I have searched at least 3 FAQs and looked
through the newsgroup archives. I must have been looking in the wrong
places.

perldoc perluniintro, perldoc perlunicode, and

Been there, seen that. Again it didn't give me much as I was mixing up
the idea of Unicode value and utf8 encoding. I was using a UTF-8
translation table instead of the original Unicode tables.
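
For anyone who starts from a UTF-8 byte table the same way, here is a minimal sketch of going back from the UTF-8 bytes to the code point that \x{...} expects. It uses the Encode module that ships with Perl 5.8; the function name utf8_hex_to_code_point is hypothetical and chosen only for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: take the UTF-8 bytes of ONE character written as a
# hex string (for example "d184") and return its Unicode code point.
sub utf8_hex_to_code_point {
    my ($hex) = @_;
    my $bytes = pack 'H*', $hex;                       # hex digits -> raw bytes
    my $char  = decode('UTF-8', $bytes, Encode::FB_CROAK());
    die "expected exactly one character\n" unless length($char) == 1;
    return sprintf 'U+%04X', ord($char);
}

print utf8_hex_to_code_point('d184'), "\n";     # U+0444, the Cyrillic ef
print utf8_hex_to_code_point('ed8684'), "\n";   # U+D184, the Hangul syllable
```

The value after "U+" is what goes inside \x{...}, so the bytes d1 84 from a UTF-8 table correspond to "\x{444}" in the script.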

http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

good luck

Thanks a lot,
Pawel

Alan J. Flavell · Feb 28, 2004

That was the answer I was looking for. I didn't really quite understand
the difference between the encoding of the character in utf8 and its
value in Unicode.

Glad it helped.

Of course, now that you know the answer, it should be easy to find it in
the documentation. :-}

The Unicode "code points" (the term used in the perluniintro) are
encoded in different ways (different bit-patterns) in utf-8, utf-16 or
indeed other applicable Unicode encodings, but they still represent
the same "code point". It just so happens that Perl chose internally
to represent characters by using utf-8 representation, but the ord()
values of the Unicode characters are still their code point values,
and, as you've now seen, the wide character constant is represented by
\x{...} using its code point value, the same as is tabulated in the
character code charts at the Unicode site,
http://www.unicode.org/charts/
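
A small sketch of that point, assuming the Encode module from Perl 5.8: the same code point yields different byte sequences in UTF-8 and UTF-16, while ord() and length() see only the character. The helper hex_bytes is a made-up name for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: show a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

my $ef = "\x{444}";    # CYRILLIC SMALL LETTER EF

printf "code point: U+%04X\n", ord($ef);                        # U+0444
printf "length:     %d\n",     length($ef);                     # 1
printf "UTF-8:      %s\n", hex_bytes(encode('UTF-8', $ef));     # d1 84
printf "UTF-16BE:   %s\n", hex_bytes(encode('UTF-16BE', $ef));  # 04 44
```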

all the best
