utf8 pragma - strange behavior


ryang

I am trying to understand how to work with Unicode in Perl. I have
read the relevant man pages (perluniintro, perlunicode, etc.) and have
written several scripts to test and verify my understanding. However,
one of those scripts produces output I did not expect. The script is
below; it contains UTF-8 encoded characters representing all five
Spanish accented vowels plus the enye (n with a tilde over it), in
upper and lower case. I hope this post comes through UTF-8 encoded, as
the source code is. I am posting from Google Groups, which does use
UTF-8 encoding.

BEGIN CODE >>
#!/usr/bin/perl

use warnings;
use strict;
#use utf8;
use Encode;

# using utf8 causes the characters to be printed in latin-1 encoding

my %table = (
    # Spanish
    # hexadecimal UTF-8 => actual UTF-8
    '0xc381' => chr(hex('c3')) . chr(hex('81')), # 'Á',
    '0xc389' => encode("utf8", "\x{00c9}"), # 'É',
    '0xc38d' => 'Í',
    '0xc393' => 'Ó',
    '0xc391' => 'Ñ',
    '0xc39a' => 'Ú',
    '0xc3a1' => 'á',
    '0xc3a9' => 'é',
    '0xc3ad' => 'í',
    '0xc3b3' => 'ó',
    '0xc3b1' => 'ñ',
    '0xc3ba' => 'ú',
);

foreach (sort keys %table) {
    print "$_ = $table{$_}\n";
}
<< END CODE

When the 'use utf8' line is commented out, the script outputs the UTF-8
characters correctly. However, when the utf8 pragma is enabled, the
characters that are hard-coded into the hash as UTF-8 literals (all
except Á and É, which are built with chr and encode) are printed in
Latin-1. To my understanding, in Perl 5.8.x the only effect of the utf8
pragma is to tell the parser that literals and variables may contain
UTF-8 encoded characters. In practice, however, the utf8 pragma is
affecting the script's output.

I have tested the script on Mac OS X 10.3.8 with Perl 5.8.1 and on
Fedora Core (not sure which version) running Perl 5.8.3.

Can anyone explain why the utf8 pragma affects the output of the script?
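
A possible explanation, offered as a sketch rather than a definitive
answer: with 'use utf8' in effect, the parser decodes each UTF-8 literal
into a character string (one character per accented letter) instead of
leaving it as a two-byte string. When such a string is printed to a
filehandle that has no encoding layer, and every character fits in one
byte, Perl writes one byte per character, which is Latin-1. The entries
built with chr() and encode() are byte strings either way, which would
explain why Á and É are unaffected. The snippet below uses escapes
instead of literal characters so that it does not depend on the file's
encoding. The variable names and the hex_dump helper are illustrative,
and decode()/encode() only stand in for what the parser and print do.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Illustrative helper: show a string as space-separated hex values.
sub hex_dump { join ' ', map { sprintf '%02x', ord } split //, shift }

# The two bytes the source file holds for the literal I-acute (U+00CD).
my $bytes = "\xc3\x8d";

# Without 'use utf8' the literal stays this 2-byte string.
# With 'use utf8' the parser decodes it, roughly like this:
my $chars = decode('UTF-8', $bytes);

printf "byte string: length %d (%s)\n", length $bytes, hex_dump($bytes);
printf "char string: length %d (U+%04X)\n", length $chars, ord $chars;

# Emulate print on a handle with no encoding layer: one byte per character.
my $as_printed = encode('iso-8859-1', $chars);
printf "unlayered print writes: %s\n", hex_dump($as_printed);

# Encoding explicitly gives back the UTF-8 bytes the terminal expects.
my $utf8_out = encode('UTF-8', $chars);
printf "explicit UTF-8 encode:  %s\n", hex_dump($utf8_out);
```

If this reading is right, the pragma is not changing print at all. It
changes what kind of string the literals are, and print then treats
character strings and byte strings differently.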
 
Wes Groleau

ryang said:
I am trying to understand how to work with Unicode in Perl. I have
read the relevant man pages (perluniintro, perlunicode, etc.) and have
written several scripts to test and verify my understanding. However,
one of those scripts produces output I did not expect. The script is below and

Welcome to the club. :)
Can anyone explain why the utf8 pragma affects the output of the script?

My problem (different post) is slightly different, but
I'm going to try commenting out the pragma to see what happens.
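
If commenting out the pragma is not an option, another thing to try is
keeping 'use utf8' and putting an encoding layer on the output handle,
so that character strings are encoded as UTF-8 on the way out. A minimal
sketch, assuming the output is meant to be UTF-8. It writes to in-memory
filehandles (a stand-in for STDOUT) so the resulting bytes can be
inspected, and the variable names are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One character, U+00CD (I with acute), as 'use utf8' would produce
# from the literal in the source file.
my $char = "\x{cd}";

# Handle without an encoding layer: the character goes out as a single
# Latin-1 byte.
my $raw_buf = '';
open(my $raw, '>', \$raw_buf) or die "cannot open in-memory handle: $!";
print {$raw} $char;
close $raw;

# Handle with a UTF-8 encoding layer: the same character is encoded to
# two bytes on output.
my $enc_buf = '';
open(my $enc, '>:encoding(UTF-8)', \$enc_buf)
    or die "cannot open in-memory handle: $!";
print {$enc} $char;
close $enc;

printf "no layer:    %s\n", unpack 'H*', $raw_buf;   # prints "cd"
printf "UTF-8 layer: %s\n", unpack 'H*', $enc_buf;   # prints "c38d"

# In a real script the equivalent for standard output would be:
#   binmode(STDOUT, ':encoding(UTF-8)');
```

One caveat for the script in the original post: with such a layer in
place, the two entries built from raw bytes (the chr and encode ones)
would be encoded a second time, so they would need to be character
strings as well.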
 
