Perl opting for double-byte chars?

Bëelphazoar · Sep 12, 2004

I am working on a problem, I have text in a database which includes
the word "más". The "á" is ASCII value 225/E1 .

It is definitely this inside the database.

The code pulls the text out of the database and assigns it to a
variable, but when I print the variable it is now "mÃ¡s", the "á" has
been replaced by C3A1 .

I am PRETTY sure that this is not happening within the code I am
working on, if I am following the code flow correctly it looks like it
does nothing but pull the text from the database and pass it back.

Digging around in various Perl docs, I found some references which say
that Perl will decide whether to use double-byte for chars > 127, it
looks like that is what's happening here.

I tried doing this:

use bytes;
$myVar = pullTextFromDb();
no bytes;

but I still got the double-byte translation.

Does anybody have any pointers about how to proceed further debugging
this?

Should the use bytes pragma affect code that is not in the current
module? That is, the pullTextFromDb() function call goes through
several modules object-oriented style, should the pragma still be in
effect for that code, or is it only useful in the current module?

Thanks for any help.
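
On the scoping question: `use bytes` is a lexically scoped pragma, so it affects only the code compiled inside the enclosing block or file, not subroutines defined elsewhere that happen to be called from there. It also changes how operations such as `length` view a string, not the data that gets assigned. A minimal sketch, assuming a string that is already stored in Perl's internal UTF-8 form (the `char_length` helper is hypothetical):

```perl
use strict;
use warnings;

# Compiled outside any "use bytes" scope, so it keeps character semantics.
sub char_length { return length $_[0] }

my $text = "m\xE1s";     # three characters; \xE1 is a-acute in ISO-8859-1
utf8::upgrade($text);    # force Perl's internal UTF-8 storage (4 bytes)

my ($inside, $via_sub);
{
    use bytes;                        # lexically scoped to this block only
    $inside  = length $text;          # byte semantics: counts internal bytes
    $via_sub = char_length($text);    # the called sub is unaffected
}
my $after = length $text;             # pragma no longer in effect

print "inside=$inside via_sub=$via_sub after=$after\n";
```

This suggests that wrapping the call to pullTextFromDb() in `use bytes` would not be expected to change what the modules behind it do.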

Alan J. Flavell · Sep 12, 2004

I am working on a problem, I have text in a database which includes
the word "más". The "á" is ASCII value 225/E1 .

ASCII is a 7-bit code. There is no such "value" in ASCII.

It is definitely this inside the database.

You need to learn a little more about character coding.

The code pulls the text out of the database and assigns it to a
variable, but when I print the variable it is now "mÃ¡s"

Your usenet posting claims:

Content-Type: text/plain; charset=ISO-8859-1

You don't seem to understand what that means.

, the "á" has been replaced by C3A1 .

Perhaps you confused the software into believing that you wanted
characters (for which Perl has an internal representation) rather than
bytes.

Digging around in various Perl docs, I found some references which say
that Perl will decide whether to use double-byte for chars > 127, it
looks like that is what's happening here.

Do you have utf8 in your locale?

Bëelphazoar · Sep 12, 2004

ASCII is a 7-bit code. There is no such "value" in ASCII.

You need to learn a little more about character coding.

Your usenet posting claims:

Content-Type: text/plain; charset=ISO-8859-1

You don't seem to understand what that means.

It means that I am telling any client which tries to read my post that
I am using ISO code page 8859-1

This is a table of character representations corresponding to 8-bit
values, with 256 members.

As you say, ASCII only defines the low 7 bits, which are the same
character representations in most English-based code pages.

In addition to ASCII there is unicode, which is 16-bit, and which,
somewhere in my application, is apparently being used when the "á" is
used because it is greater than 127.

Perl, from what I understand from the documentation I could find, will
sometimes decide to use Unicode values where text is in the 128-255
range. This appears to be the heart of the problem, because at one
point the application appears to be trying to represent the "á"
character in Unicode, but then anywhere subsequent the environment is
failing to translate the resulting 2-byte character back to the
appropriate representation.

I apologize if I was not sufficiently rigorous in my description of
what I know so far. I thought it was reasonably clear what I was
saying.

Perhaps you confused the software into believing that you wanted
characters (for which Perl has an internal representation) rather than
bytes.

Do you have utf8 in your locale?

I don't know. Can you tell me how I would check that? I don't know a
great deal about the Perl environment.

--
Joe Cosby
http://joecosby.com/
"I will be warned of the dangers of time travel!",
remembered Tilly, of the warning she was given in the
future, of the perils of the past, which she presently
thought had been both historic and foresighted, "though
knowing now what I will know then makes it somewhat
anachronistic".
-Dr. Hieronymous Zinn, from The Novel

Tassilo v. Parseval · Sep 12, 2004

Also sprach Bëelphazoar:

In addition to ASCII there is unicode, which is 16-bit,

No, that's not Unicode. Unicode is foremost just a mapping between
numbers and characters. Each character thusly has a unique number. When
you talk about bits, you are really talking about encodings. Unicode
defines three encodings: UTF-(8|16|32). Perl internally uses UTF-8 which
is a variable width encoding meaning a character can have anything
between one and four bytes.

The same is true for UTF-16 which you must have been thinking of. The
most common characters fit into two bytes. However, all the other
characters do exist as well in this encoding. They are encoded in two
byte-couples.

This distinction sounds like hairsplitting, but it's crucial if you ever
want to understand what Unicode is about and how to use it properly.
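
The variable-width point can be checked with the core Encode module. The sketch below is only an illustration: the four code points are arbitrary picks (a, a-acute, the euro sign, and a character outside the 16-bit range), and the `%octets` hash is just a place to keep the results.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Code points chosen for illustration: a, a-acute, euro sign, musical G clef.
my @code_points = (0x61, 0xE1, 0x20AC, 0x1D11E);
my %octets;

for my $cp (@code_points) {
    my $char = chr $cp;
    $octets{$cp} = [
        length(encode('UTF-8',    $char)),
        length(encode('UTF-16BE', $char)),   # BE variant: no byte-order mark
    ];
    printf "U+%04X  UTF-8: %d  UTF-16: %d\n", $cp, @{ $octets{$cp} };
}
```

The last code point needs four octets in both encodings; in UTF-16 those four octets are the pair of two-byte units mentioned above.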

I don't know. Can you tell me how I would check that? I don't know a
great deal about the Perl environment.

Perl uses the environment of your system, not its own. So check your
environment variables.

Tassilo

Bëelphazoar · Sep 12, 2004

Also sprach Bëelphazoar:

No, that's not Unicode. Unicode is foremost just a mapping between
numbers and characters. Each character thusly has a unique number. When
you talk about bits, you are really talking about encodings. Unicode
defines three encodings: UTF-(8|16|32). Perl internally uses UTF-8 which
is a variable width encoding meaning a character can have anything
between one and four bytes.

The same is true for UTF-16 which you must have been thinking of. The
most common characters fit into two bytes. However, all the other
characters do exist as well in this encoding. They are encoded in two
byte-couples.

This distinction sounds like hairsplitting, but it's crucial if you ever
want to understand what Unicode is about and how to use it properly.

Perl uses the environment of your system, not its own. So check your
environment variables.

Thanks.

At some point, Perl does seem to be making the decision to alter the
data which I am pulling from the database, changing the particular
character from an 8-bit value to a 16-bit value.

The job at hand for me is to make it stop doing this.

As you and the preceding person have pointed out, I don't know
everything there is to know about character encodings. I apologize if
I have caused any confusion in describing character encoding
incorrectly.

I would appreciate any pointers you might have on where would be a
good place to start looking at system variables to find the relevant
environment variables, but it does seem clear enough, assuming I am
understanding the code I am looking at, that Perl is changing a text
value for whatever reason and based on whatever system of character
encoding from an 8-bit value which works to a 16-bit value which
doesn't.

It seems, as far as I can tell, as if that is something I will need to
solve within Perl. Maybe I am mistaken, but I don't see how the
operating system is going to make a decision to force data inside a
Perl application to alter based on its active character encoding
setup.

If how Perl makes the decision to change the 8-bit value to a 16-bit
value is based on the active system character encoding setup, then any
pointers anybody could provide as to how it makes this decision, or
what exactly I should be looking at, would be most appreciated.

Again, as I say, my job at hand here is to convince Perl not to change
the existing 8-bit value I am pulling from the database into a 16-bit
value which no longer works for what I am doing.

Alan J. Flavell · Sep 12, 2004

On Sun, 12 Sep 2004, it was written:

[snip]

At some point, Perl does seem to be making the decision to alter the
data which I am pulling from the database, changing the particular
character

So write and instrument a small test case, small enough to be posted
here (minus the database itself, OK) with some sample printouts of the
data at the various points in the processing, preferably in
hexadecimal (any attempt to splatter 8-bit characters into a Usenet
posting usually turns into a failure to communicate, in my
experience).
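
One way such instrumentation might look is sketched below. The `describe` helper is hypothetical; it reports Perl's internal UTF-8 flag and each character's code in hex, so the output is plain ASCII. The two sample strings stand in for the value expected from the database and the value the original poster reports seeing.

```perl
use strict;
use warnings;

# Hypothetical helper: show the internal UTF-8 flag and every character's
# code in hexadecimal, so the printout survives any transport.
sub describe {
    my ($label, $string) = @_;
    my $flag  = utf8::is_utf8($string) ? 1 : 0;
    my $chars = join ' ', map { sprintf('%02X', ord($_)) } split(//, $string);
    return "$label: utf8=$flag chars=[$chars]";
}

my $expected = "m\xE1s";         # what the database is said to hold
my $observed = "m\xC3\xA1s";     # what the poster reports seeing

print describe('expected', $expected), "\n";
print describe('observed', $observed), "\n";
```

Calling `describe` at each hand-off between modules should show where the string stops being three characters and becomes four.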

from an 8-bit value to a 16-bit value.

This may seem like hair splitting, but what you exhibited so far
appeared to be a utf-8 character. Which in this case consisted of two
octets (bytes), but that's not the same thing as "a 16-bit value".

The job at hand for me is to make it stop doing this.

Possibly. That depends on what range of characters you hope to be
able to handle in your system. But let's try to understand where
we're at, before discussing where to go from there.

As you and the preceding person have pointed out, I don't know
everything there is to know about character encodings. I apologize if
I have caused any confusion in describing character encoding
incorrectly.

Oh, it's quite normal... Naturally I'd urge you to take time to learn
a bit more about it, believing - as I do - that it'll save you effort
later; but as it's one of my specialist subjects, "I would say that,
wouldn't I?"...

I would appreciate any pointers you might have on where would be a
good place to start looking at system variables to find the relevant
environment variables,

man printenv
man locale

(assuming unix-family OS),

but it does seem clear enough, assuming I am
understanding the code I am looking at, that Perl is changing a text
value

Perl doesn't magically "change text values": it handles text in the
way that it thinks it's been asked to handle it.

My feeling is that, sooner rather than later, you're going to need
this stuff anyway, so I'd start on perldoc perluniintro and then
perldoc perlunicode (or the links near the foot of the index page
http://www.perldoc.com/perl5.8.0/pod.html or whichever version you are
using).

But if you're determined that you just want to get utf8 out of the way
for the moment, and you're sure you'll never be showing Perl a
character outside of the iso-8859-1 range, then look for discussions
on apparent incompatibilities between RedHat 9 and Perl 5.8, which
discuss how RedHat's introduction of utf8 into the locale caused Perl
to switch into its Unicode mode, and how to take it out again (I don't
have the details at my fingertips right now, sorry).

It seems, as far as I can tell, as if that is something I will need to
solve within Perl. Maybe I am mistaken, but I don't see how the
operating system is going to make a decision to force data inside a
Perl application to alter based on its active character encoding
setup.

Oh, but it does. At least in 5.8.0. Google for "redhat perl 5.8.0
utf8 locale" (without the quotes) and read the first few links, I
think they'll help.
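
The locale can also be inspected from inside Perl. The sketch below assumes the usual POSIX precedence (LC_ALL, then LC_CTYPE, then LANG) and a simple case-insensitive match on "UTF-8" or "UTF8", which is how the 5.8.0 documentation appears to describe the check; the `locale_mentions_utf8` function is hypothetical and takes the environment as a hash reference so it can be tried with made-up values.

```perl
use strict;
use warnings;

# Hypothetical check: does the effective locale setting mention UTF-8?
sub locale_mentions_utf8 {
    my ($env) = @_;
    for my $name (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env->{$name};
        next unless defined $value && length $value;
        # The first variable that is set decides the answer.
        return $value =~ /utf-?8/i ? 1 : 0;
    }
    return 0;
}

print "UTF-8 locale in this environment: ", locale_mentions_utf8(\%ENV), "\n";
```

A result of 1 on a Perl 5.8.0 system would be consistent with the locale explanation given here.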

If how Perl makes the decision to change the 8-bit value to a 16-bit
value

Please stop saying "16 bit value"; it's sure to cause confusion
somewhere down the line. What you're talking about here is a
character stored in Perl's native unicode format, which is utf-8: this
particular character happens to occupy two bytes in storage, but it's
not useful to talk about it as a "16-bit value", and it risks
confusing it with utf-16 format (which is the OS's native storage
format on Windows NT-based systems, by the way, and commonly used also
for storing unicode characters in databases).
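
The difference between "two bytes of utf-8" and "a 16-bit value" shows up when the same character is written out in each encoding. A small sketch using the core Encode module (variable names are illustrative only):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = chr 0xE1;    # one character: small a-acute, code point 225

my $as_latin1 = encode('ISO-8859-1', $char);
my $as_utf8   = encode('UTF-8',      $char);
my $as_utf16  = encode('UTF-16BE',   $char);

printf "characters: %d, code point: %d\n", length($char), ord($char);
printf "iso-8859-1: %s\n", unpack('H*', $as_latin1);
printf "utf-8:      %s\n", unpack('H*', $as_utf8);
printf "utf-16:     %s\n", unpack('H*', $as_utf16);
```

Both the utf-8 and the utf-16 forms occupy two bytes here, but the bytes differ (c3 a1 versus 00 e1), which is why calling the former a "16-bit value" invites confusion.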

good luck

Shawn Corey · Sep 12, 2004

Hi,

I got caught on this one too. See perldoc perluniintro and perldoc
perlunicode. Perl v5.8+ has a feature that automatically and silently
converts its standard (pre-v5.8) strings into UTF-8 strings if it
encounters a Unicode character. I haven't figured out a reliable way
around this yet but you could try:

$s = pack( 'C*', unpack( 'U*', $s ));

[full quote of the previous posting snipped]

Alan J. Flavell · Sep 12, 2004

A: No!

I got caught on this one too.

Are you sure it was the same?

See perldoc perluniintro and perldoc perlunicode.

Yup, good advice, already offered.

Perl v5.8+ has a feature that automatically and silently converts
its standard (pre-v5.8) strings into UTF-8 strings if it encounters
a Unicode character.

If by "a Unicode character" you mean one whose code value is greater
than 255, then you're right; but we've been given no evidence here
that such a character has been involved. The only "interesting"
character under discussion has been one which fell into the range
occupied by printable characters in iso-8859-1, namely 160-255
decimal.
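
The upgrade described here can be observed with `utf8::is_utf8`, which reports the internal storage flag. A minimal sketch, with U+263A chosen arbitrarily as a character above 255:

```perl
use strict;
use warnings;

my $latin1 = "m\xE1s";                   # all characters below 256
my $before = utf8::is_utf8($latin1) ? 1 : 0;

my $joined = $latin1 . chr(0x263A);      # U+263A cannot fit in one byte
my $after  = utf8::is_utf8($joined) ? 1 : 0;

# The upgrade changes the storage, not the characters themselves.
my $same_char = ord(substr($joined, 1, 1)) == 0xE1 ? 1 : 0;

print "before=$before after=$after same_char=$same_char length=",
    length($joined), "\n";
```

The a-acute is still character 225 after the upgrade; only its internal storage has changed.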

Perl 5.8 would only have "upgraded" that to utf8 if it had been
given cause to do so. In 5.8.0, one such cause is the presence
of utf-8 in the locale. See also the discussion in
http://use.perl.org/articles/03/09/26/2231256.shtml?tid=6 , or
http://twiki.org/cgi-bin/view/Codev/UsingPerl58OnRedHat8 , or
the various other articles that pop up when one tries the search that
I had suggested.

My hunch is that's what happened. Maybe I'll be proved wrong; we'll
see.

I haven't figured out a reliable way around this yet

(which suggests you haven't read the relevant perldocs closely enough)

There are various approaches, depending on what your problem field is
and what you're trying to achieve.

If you force the old behaviour, then you can get what you'd have been
accustomed to before, and you won't suffer the overhead of Perl
processing Unicode; but you'll cut yourself off from the ability to
process a fuller range of characters, writing systems etc.

If you learn how to work with Unicode - and your database /also/ knows
how to work with it - then you can write software that can handle
writing systems which are way outside of mere Latin 1; but you may
incur some processing overhead due to the extra work of Perl handling
Unicode characters.

With care, code can be written such that the overhead only cuts in
when characters outside of the iso-8859-1 repertoire are used. Thus
getting the best of both worlds - without having to write messy
dual-path code, because Perl takes care of it for you (if you're
asking it right).

In general I'd say (except perhaps for diagnostic purposes), if you're
messing around with packing and unpacking characters, then you're
doing it wrong. The key is to grasp Perl's character representation
model, and to work *with* it, not to fight it with hand-packed and
-unpacked representations.
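
Working with the model usually means decoding octets into characters at the point of input and encoding characters back into octets at the point of output. The sketch below assumes the database returns ISO-8859-1 octets, as the original poster describes; the literal stands in for that value and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the database hands back ISO-8859-1 octets.
my $from_db = "m\xE1s";

# Input boundary: octets in, characters out.
my $text = decode('ISO-8859-1', $from_db);

# ... all processing in between works on characters ...

# Output boundary: choose the encoding each destination expects.
my $for_latin1_sink = encode('ISO-8859-1', $text);
my $for_utf8_sink   = encode('UTF-8',      $text);

printf "characters=%d latin1=%s utf8=%s\n",
    length($text), unpack('H*', $for_latin1_sink), unpack('H*', $for_utf8_sink);
```

For filehandles the same effect can be had with an I/O layer, for example `binmode(STDOUT, ':encoding(ISO-8859-1)')`, so that no explicit `encode` call is needed at each print.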

This assumes that your code only needs to run on >= 5.8.0. If you're
writing code meant to be runnable on older Perls, then you have to put
quite a lot more care into the task of producing something compatible.

ttfn

Q: Should I put my Usenet response on the top of a quote of the entire
previous posting?

http://www.faqs.org/docs/jargon/T/top-post.html

Bëelphazoar · Sep 12, 2004

Thanks for your help Alan and Shawn, I think you have given me enough
to work with, I will post back if the leads you've presented don't
resolve the issue I'm getting.

--
Bëelphazoar
International Satanic Conspiracy
Customer Support Specialist
http://joecosby.com/
You mystics are a sorry lot, always whimping about so-and-so's "ego"
getting in the way of their "detachment." Take it to alt.zen.ego-death,
for the love of pete! This is alt.MAGICK.

J. Romano · Sep 12, 2004

Bëelphazoar said:
I am working on a problem, I have text in a database which
includes the word "más". The "á" is ASCII value 225/E1 .

Dear Joe,

It will help a lot if you give us the output of "perl -v". I'm
sure Unicode has something to do with your problem, but Unicode
support has been changing (updating) in recent versions of Perl.
Without knowing the version of Perl you're using and the platform
you're using it on, we can only guess as to what the problem is.

By the way, are you SURE that "á" is the extended ASCII value 225?
According to one source I have, it is extended ASCII value 160. Maybe
we're using different code pages, but it's worth checking.

ASCII only defines the low 7 bits, which are the same
character representations in most English-based code
pages.

In addition to ASCII there is unicode, which is 16-bit,
and which, somewhere in my application, is apparently
being used when the "á" is used because it is greater
than 127.

You're wrong about Unicode being 16-bit. That's a myth. It CAN be
encoded in two bytes (16 bits), but it can also be encoded using a
different method called UTF-8 (which is what Perl normally uses
internally). The UTF-8 encoding uses variable-length character
encoding, which means that a character can be encoded in one to six
bytes. In your case, the character whose value is greater than 127 is
being encoded in two bytes, whereas the other characters (< 128) are
being encoded in one byte.

Understand? If you don't, here's a great link to an FAQ I used to
understand more about how Unicode is encoded:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

You may also want to check the following perldocs (which, depending on
your version of Perl, you may or may not have all of):

perldoc Encode
perldoc perluniintro
perldoc Unicode::String

The code pulls the text out of the database and
assigns it to a variable, but when I print the
variable it is now "mÃ¡s", the "á" has been
replaced by C3A1 .

This certainly looks to me like UTF-8 Unicode encoding, but let's
check just to make sure:

According to the FAQ (whose link I mentioned above), a Unicode
character value can be UTF-8 encoded using one to six bytes:

1: 0xxxxxxx
2: 110xxxxx 10xxxxxx
3: 1110xxxx 10xxxxxx 10xxxxxx
4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

where "x" is a bit that stands for the Unicode value.

0xC3A1 is two bytes long. Its bit representation is:

11000011 10100001

So when you apply the 2-byte bit pattern to it:

110xxxxx 10xxxxxx

the "x"s represent the bits: 00011 100001

Put them together and you get 11100001 which is the binary
representation of 225. Therefore, we now know that character number
225, when encoded into UTF-8 encoding, results in the two bytes 0xc3
and 0xa1, which is exactly what you're seeing.
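
The same bit arithmetic can be written out in a few lines of Perl. This is a sketch of the two-byte case only; the `decode_two_byte` function is hypothetical and handles nothing but the 110xxxxx 10xxxxxx pattern.

```perl
use strict;
use warnings;

# Decode one two-byte UTF-8 sequence (pattern 110xxxxx 10xxxxxx) by hand.
sub decode_two_byte {
    my ($lead, $cont) = @_;
    die "not a two-byte lead\n"     unless ($lead & 0xE0) == 0xC0;
    die "not a continuation byte\n" unless ($cont & 0xC0) == 0x80;

    # Keep the five payload bits of the lead byte and the six of the
    # continuation byte, then join them into one number.
    my $value = (($lead & 0x1F) << 6) | ($cont & 0x3F);
    return $value;
}

my $code_point = decode_two_byte(0xC3, 0xA1);
printf "%d (0x%02X)\n", $code_point, $code_point;
```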

I am PRETTY sure that this is not happening
within the code I am working on, if I am following
the code flow correctly it looks like it does
nothing but pull the text from the database and
pass it back.

SOMEWHERE in the code the characters greater than 127 are being
converted from extended-ASCII to UTF-8 encoding, but it's hard to say
exactly where unless I have access to the code. Therefore, I'll leave
it up to you to figure out where it's happening.

But even if you do find where this is happening, you will still
have to deal with the problem of converting the two-byte UTF-8
representation (of characters greater than 127) to their one-byte
extended-ASCII equivalent. ¿Comprende?

I'm not sure how to do this, but here are three things you can try.
Whether or not each one works may depend on the version of Perl you
are using, so letting me know your "perl -v" output may help me out.

----------------------------------------
# Method 1: Convince Perl that your string
# is UTF-8 encoded:
use Encode;
$string = pullTextFromDb();
# Convince Perl that $string is in UTF-8 format:
$string = decode_utf8($string);
# Convert UTF-8 string to extended-ASCII:
$string = encode("iso-8859-1", $string);
----------------------------------------
# Method 2: Tell Perl that $string is UTF-8
# encoded and that you want its
# latin1 equivalent:
use Unicode::String qw(utf8 latin1);
$string = pullTextFromDb();
$string = utf8($string)->latin1();
----------------------------------------
# Method 3: Tell Perl to pack each character's
# Unicode value into just one byte
# of a larger string:
$string = pullTextFromDb();
$string = pack "C*", map ord, split //, $string;
----------------------------------------

Try all these and see if any of them work. Again, what works and
what doesn't work might very well depend on the version of Perl that
you're using. Also, even if one of them does work, some other part of
your code might be converting it back to UTF-8 encoding, undo-ing the
conversion you just made.

But it's still worth a shot to try them out. Hopefully one of the
above three methods will work for you, and your problem will be "no
más."

I hope this helps, Joe.

-- Jean-Luc

Alan J. Flavell · Sep 12, 2004

Also sprach Bëelphazoar:

[...nothing that was quoted here...]

No, that's not Unicode. Unicode is foremost just a mapping between
numbers and characters. Each character thusly has a unique number.

Agreed. And those numbers no longer fit into 16 bits, in general.
As you indeed imply later.

When you talk about bits, you are really talking about encodings.

Right; and in fact Unicode has now specialised the terminology even
further. See chapter 2,
http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

The abstract code point values are embodied in an "Encoding Form"
consisting of "code units" of a particular size (may be 8, 16 or 32
bits), and then the "Encoding form" is transmitted by an "Encoding
Scheme" which represents those units as a sequence of octets (bytes)
on a transmission channel.

Fortunately, for utf-8 that final step is one-to-one. But the
distinction becomes important for utf-16-based and utf-32-based
encoding schemes.

Unicode defines three encodings: UTF-(8|16|32).

Right. Those are "Encoding Forms" in the new terminology, and they
become the "Seven Encoding Schemes" (one utf-8, three utf-16 and
three utf-32), as shown in Table 2-3 in chapter 2.

Perl internally uses UTF-8 which is a variable width encoding
meaning a character can have anything between one and four bytes.

Indeed. The original algorithm which defined utf-8 could have
represented code point values up to 7fff ffff (which needs 6 octets in
encoded form); but Unicode has stated that no characters will be
defined beyond 0010 ffff, and thus 4 octets are now sufficient.
rfc3629 obsoletes 2279 ("film at 11").
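
The relation between the original six-octet scheme and the current four-octet limit can be sketched as a range lookup. The thresholds follow the bit patterns of the original algorithm; the function names `utf8_octets` and `within_unicode` are made up for this illustration.

```perl
use strict;
use warnings;

# Octets needed for a code point under the original (RFC 2279) algorithm.
sub utf8_octets {
    my ($cp) = @_;
    return 1 if $cp < 0x80;
    return 2 if $cp < 0x800;
    return 3 if $cp < 0x10000;
    return 4 if $cp < 0x200000;
    return 5 if $cp < 0x4000000;
    return 6 if $cp <= 0x7FFFFFFF;
    die "code point out of range\n";
}

# Unicode assigns no characters beyond U+10FFFF.
sub within_unicode { return $_[0] <= 0x10FFFF ? 1 : 0 }

printf "U+%X: %d octets, within Unicode: %s\n",
    $_, utf8_octets($_), within_unicode($_) ? 'yes' : 'no'
    for (0xE1, 0x10FFFF, 0x7FFFFFFF);
```

Every code point that passes `within_unicode` needs at most four octets, which is the limit stated above.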

This distinction sounds like hairsplitting, but it's crucial if you ever
want to understand what Unicode is about and how to use it properly.

Agreed. The hardest part is un-learning things which used to seem
obvious!

all the best

(No offence meant - just trying to build on what you had already
posted.)

Alan J. Flavell · Sep 12, 2004

By the way, are you SURE that "á" is the extended ASCII value 225?

There is no such thing as "extended ASCII", so the question is moot.

There are large numbers of 8-bit character codings which have
ASCII as their low half. The one that's used in polite Latin-1
circles is iso-8859-1, in which 225 decimal is small a-acute.

According to one source I have, it is extended ASCII value 160.

That would be the old MS-DOS encodings, such as CP-437 (for US
residents) or CP-850 (for the Latin-1 locale). Dinosaurs.
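
The disagreement between 225 and 160 can be reproduced with the core Encode module, assuming its cp437 and cp850 tables are available (they ship with Encode as far as I know). The sketch decodes single octets under each coding and prints the resulting Unicode code points.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $from_cp437  = decode('cp437',      "\xA0");
my $from_cp850  = decode('cp850',      "\xA0");
my $from_latin1 = decode('ISO-8859-1', "\xE1");
my $latin1_a0   = decode('ISO-8859-1', "\xA0");

printf "cp437      0xA0 -> U+%04X\n", ord($from_cp437);
printf "cp850      0xA0 -> U+%04X\n", ord($from_cp850);
printf "iso-8859-1 0xE1 -> U+%04X\n", ord($from_latin1);
printf "iso-8859-1 0xA0 -> U+%04X\n", ord($latin1_a0);
```

The first three lines all arrive at U+00E1, small a-acute, while octet 160 in iso-8859-1 is a different character altogether (the no-break space).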

so letting me know your "perl -v" output may help me out.

Good advice, indeed!

[some useful diagnostic suggestions snipped]

[but please, let's hear no more of this mythical "extended ASCII"
character code.]
