Bëelphazoar said:
I am working on a problem, I have text in a database which
includes the word "más". The "á" is ASCII value 225/E1.
Dear Joe,
It will help a lot if you give us the output of "perl -v". I'm
sure Unicode has something to do with your problem, but Unicode
support has been changing (updating) in recent versions of Perl.
Without knowing the version of Perl you're using and the platform
you're using it on, we can only guess as to what the problem is.
By the way, are you SURE that "á" is the extended ASCII value 225?
It is 225 in ISO-8859-1 (Latin-1), but in the old DOS code page 437
it is 160. Maybe we're using different code pages, but it's worth
checking.
ASCII only defines the low 7 bits, which are the same
character representations in most English-based code
pages.
In addition to ASCII there is unicode, which is 16-bit,
and which, somewhere in my application, is apparently
being used when the "á" is used because it is greater
than 127.
You're wrong about Unicode being 16-bit. That's a myth. It CAN be
encoded in 16-bit units (UTF-16, where characters above 0xFFFF take
two units), but it can also be encoded using a different method
called UTF-8 (which is what Perl normally uses internally). The UTF-8
encoding uses variable-length character encoding, which means that a
character can be encoded in one to four bytes (the original design
allowed up to six). In your case, the character whose value is
greater than 127 is being encoded in two bytes, whereas the other
characters (< 128) are being encoded in one byte.
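If you want to see the variable length for yourself, you can encode a few characters and count the bytes. This is only a sketch using the core Encode module (so it assumes Perl 5.8 or later); the helper name utf8_length and the sample characters are my own choices, not anything from your code.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the number of bytes one code point occupies in UTF-8.
sub utf8_length {
    my ($code_point) = @_;
    return length(encode('UTF-8', chr($code_point)));
}

# 'a' (97), 'á' (225), the euro sign (0x20AC), and a character
# beyond the 16-bit range (0x1F600).
for my $cp (97, 225, 0x20AC, 0x1F600) {
    printf "U+%04X takes %d byte(s)\n", $cp, utf8_length($cp);
}
```

The four characters come out as 1, 2, 3 and 4 bytes respectively.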
Understand? If you don't, here's a great link to an FAQ I used to
understand more about how Unicode is encoded:
http://www.cl.cam.ac.uk/~mgk25/unicode.html
You may also want to check the following perldocs (which, depending on
your version of Perl, you may or may not have all of):
perldoc Encode
perldoc perluniintro
perldoc Unicode::String
The code pulls the text out of the database and
assigns it to a variable, but when I print the
variable it is now "mÃ¡s", the "á" has been
replaced by C3A1.
This certainly looks to me like UTF-8 Unicode encoding, but let's
check just to make sure:
According to the FAQ (whose link I mentioned above), a Unicode
character value could originally be UTF-8 encoded using one to six
bytes (the current standard, RFC 3629, stops at four):
1: 0xxxxxxx
2: 110xxxxx 10xxxxxx
3: 1110xxxx 10xxxxxx 10xxxxxx
4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
where "x" is a bit that stands for the Unicode value.
0xC3A1 is two bytes long. Its bit representation is:
11000011 10100001
So when you apply the 2-byte bit pattern to it:
110xxxxx 10xxxxxx
the "x"s represent the bits: 00011 100001
Put them together and you get 00011100001, or 11100001 once the
leading zeros are dropped, which is the binary
representation of 225. Therefore, we now know that character number
225, when encoded into UTF-8 encoding, results in the two bytes 0xc3
and 0xa1, which is exactly what you're seeing.
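The same bit-picking can be done in a few lines of Perl, in case you want to check other byte pairs. This is just a sketch of the arithmetic above; the variable names are mine, and it only handles the two-byte pattern.

```perl
use strict;
use warnings;

# The two bytes seen in the output.
my ($lead, $cont) = (0xC3, 0xA1);

# Make sure the bytes really match 110xxxxx and 10xxxxxx.
die "not a 2-byte UTF-8 sequence\n"
    unless ($lead & 0xE0) == 0xC0 && ($cont & 0xC0) == 0x80;

# Keep the low 5 bits of the lead byte and the low 6 bits of the
# continuation byte, then join them into one value.
my $code_point = (($lead & 0x1F) << 6) | ($cont & 0x3F);

printf "code point: %d (0x%02X)\n", $code_point, $code_point;
```

This prints "code point: 225 (0xE1)", matching the hand calculation.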
I am PRETTY sure that this is not happening
within the code I am working on, if I am following
the code flow correctly it looks like it does
nothing but pull the text from the database and
pass it back.
SOMEWHERE in the code the characters greater than 127 are being
converted from extended-ASCII to UTF-8 encoding, but it's hard to say
exactly where unless I have access to the code. Therefore, I'll leave
it up to you to figure out where it's happening.
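One thing that may help you hunt for the spot: dump the string as hex at each stage (right after the database fetch, just before the print, and so on) and watch for the point where E1 turns into C3 A1. Here is a minimal sketch; hex_dump is a made-up helper name, and the two sample strings simply stand in for your real data.

```perl
use strict;
use warnings;

# Show each element of a string as two hex digits, so the contents
# can be compared no matter how the terminal renders them.
sub hex_dump {
    my ($string) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $string;
}

my $latin1_bytes = "m\xE1s";        # what the database is said to hold
my $utf8_bytes   = "m\xC3\xA1s";    # what is being printed

print hex_dump($latin1_bytes), "\n";   # 6D E1 73
print hex_dump($utf8_bytes),   "\n";   # 6D C3 A1 73
```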
But even if you do find where this is happening, you will still
have to deal with the problem of converting the two-byte UTF-8
representation (of characters greater than 127) to their one-byte
extended-ASCII equivalent. ¿Comprende?
I'm not sure how to do this, but here are three things you can try.
Whether or not each one works may depend on the version of Perl you
are using, so letting me know your "perl -v" output may help me out.
----------------------------------------
# Method 1: Decode the UTF-8 bytes, then
# re-encode them as Latin-1:
use Encode;
my $string = pullTextFromDb();   # stands in for your own database call
# Decode the UTF-8 bytes into Perl characters:
$string = decode_utf8($string);
# Convert the characters to extended-ASCII (Latin-1) bytes:
$string = encode("iso-8859-1", $string);
----------------------------------------
# Method 2: Tell Perl that $string is UTF-8
# encoded and that you want its
# latin1 equivalent:
use Unicode::String qw(utf8 latin1);
$string = pullTextFromDb();
$string = utf8($string)->latin1();
----------------------------------------
# Method 3: Tell Perl to pack each character's
# Unicode value into just one byte
# of a larger string. This can only help
# if Perl already treats $string as
# characters rather than raw UTF-8 bytes:
$string = pullTextFromDb();
$string = pack "C*", map ord, split //, $string;
----------------------------------------
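If you'd like to try the idea without touching the database, here is a small self-contained sketch that feeds the literal bytes 6D C3 A1 73 through the Encode approach and the pack approach. The variable names are mine, and the literal string is only a stand-in for whatever your database call returns.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the database value: "más" as UTF-8 bytes.
my $from_db = "m\xC3\xA1s";

# Decode the UTF-8 bytes into characters, then encode as Latin-1.
my $chars   = decode('UTF-8', $from_db);
my $method1 = encode('iso-8859-1', $chars);

# The pack approach, applied to the decoded characters.
my $method3 = pack 'C*', map { ord($_) } split //, $chars;

# The pack approach applied to the raw bytes changes nothing,
# because each element is already a single byte.
my $unchanged = pack 'C*', map { ord($_) } split //, $from_db;

printf "method 1:  %s\n", unpack('H*', $method1);     # 6de173
printf "method 3:  %s\n", unpack('H*', $method3);     # 6de173
printf "raw bytes: %s\n", unpack('H*', $unchanged);   # 6dc3a173
```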
Try all these and see if any of them work. Again, what works and
what doesn't work might very well depend on the version of Perl that
you're using. Also, even if one of them does work, some other part of
your code might be converting it back to UTF-8 encoding, undo-ing the
conversion you just made.
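One place where that can happen is on output, if the filehandle you print to has a UTF-8 layer on it. I can't tell whether that applies to your program, so treat the following as a sketch only: it writes a character string through an explicit Latin-1 layer, using an in-memory file so the resulting bytes can be inspected.

```perl
use strict;
use warnings;

# A character string containing U+00E1.
my $text = "m\x{E1}s";

# Write it through an explicit Latin-1 layer. The in-memory file is
# used here only so that the bytes written can be examined.
my $buffer = '';
open my $out, '>:encoding(iso-8859-1)', \$buffer
    or die "cannot open in-memory file: $!";
print {$out} $text;
close $out;

print unpack('H*', $buffer), "\n";   # 6de173
```

For a real filehandle such as STDOUT, the equivalent would be binmode(STDOUT, ':encoding(iso-8859-1)') before printing.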
But it's still worth a shot to try them out. Hopefully one of the
above three methods will work for you, and your problem will be "no
más."
I hope this helps, Joe.
-- Jean-Luc