Perl opting for double-byte chars?

Discussion in 'Perl Misc' started by Bëelphazoar, Sep 12, 2004.

  1. Bëelphazoar

    Bëelphazoar Guest

    I am working on a problem, I have text in a database which includes
    the word "más". The "á" is ASCII value 225/E1 .

    It is definitely this inside the database.

    The code pulls the text out of the database and assigns it to a
    variable, but when I print the variable it is now "más", the "á" has
    been replaced by C3A1 .

    I am PRETTY sure that this is not happening within the code I am
    working on, if I am following the code flow correctly it looks like it
    does nothing but pull the text from the database and pass it back.

    Digging around in various Perl docs, I found some references which say
    that Perl will decide whether to use double-byte for chars > 127, it
    looks like that is what's happening here.

    I tried doing this:

    use bytes;
    $myVar = pullTextFromDb();
    no bytes;

    but I still got the double-byte translation.

    Does anybody have any pointers about how to proceed further debugging
    this?

    Should the use bytes pragma affect code that is not in the current
    module? That is, the pullTextFromDb() function call goes through
    several modules object-oriented style, should the pragma still be in
    effect for that code, or is it only useful in the current module?

    Thanks for any help.



    --
    Joe Cosby
    http://joecosby.com/
    IT'S THE DARK ANGEL OF MACARONI! COMING TO GET ME! COMING TO
    FEED ME MACARONI!
    Bëelphazoar, Sep 12, 2004
    #1

  2. On Sat, 11 Sep 2004, it was written:

    > I am working on a problem, I have text in a database which includes
    > the word "más". The "á" is ASCII value 225/E1 .


    ASCII is a 7-bit code. There is no such "value" in ASCII.

    > It is definitely this inside the database.


    You need to learn a little more about character coding.

    > The code pulls the text out of the database and assigns it to a
    > variable, but when I print the variable it is now "más"


    Your usenet posting claims:

    Content-Type: text/plain; charset=ISO-8859-1

    You don't seem to understand what that means.

    >, the "á" has been replaced by C3A1 .


    Perhaps you confused the software into believing that you wanted
    characters (for which Perl has an internal representation) rather than
    bytes.

    > Digging around in various Perl docs, I found some references which say
    > that Perl will decide whether to use double-byte for chars > 127, it
    > looks like that is what's happening here.


    Do you have utf8 in your locale?
    Alan J. Flavell, Sep 12, 2004
    #2

  3. Bëelphazoar

    Bëelphazoar Guest

    On Sun, 12 Sep 2004 02:12:26 +0100, "Alan J. Flavell"
    <> wrote:

    >On Sat, 11 Sep 2004, it was written:
    >
    >> I am working on a problem, I have text in a database which includes
    >> the word "más". The "á" is ASCII value 225/E1 .

    >
    >ASCII is a 7-bit code. There is no such "value" in ASCII.
    >
    >> It is definitely this inside the database.

    >
    >You need to learn a little more about character coding.
    >
    >> The code pulls the text out of the database and assigns it to a
    >> variable, but when I print the variable it is now "más"

    >
    >Your usenet posting claims:
    >
    >Content-Type: text/plain; charset=ISO-8859-1
    >
    >You don't seem to understand what that means.
    >


    It means that I am telling any client which tries to read my post that
    I am using ISO code page 8859-1

    This is a table of character representations corresponding to 8-bit
    values, with 256 members.

    As you say, ASCII only defines the low 7 bits, which are the same
    character representations in most English-based code pages.

    In addition to ASCII there is unicode, which is 16-bit, and which,
    somewhere in my application, is apparently being used when the "á" is
    used because it is greater than 127.

    Perl, from what I understand from the documentation I could find, will
    sometimes decide to use Unicode values where text is in the 128-255
    range. This appears to be the heart of the problem, because at one
    point the application appears to be trying to represent the "á"
    character in Unicode, but then anywhere subsequent the environment is
    failing to translate the resulting 2-byte character back to the
    appropriate representation.

    I apologize if I was not sufficiently rigorous in my description of
    what I know so far. I thought it was reasonably clear what I was
    saying.

    >>, the "á" has been replaced by C3A1 .

    >
    >Perhaps you confused the software into believing that you wanted
    >characters (for which Perl has an internal representation) rather than
    >bytes.
    >
    >> Digging around in various Perl docs, I found some references which say
    >> that Perl will decide whether to use double-byte for chars > 127, it
    >> looks like that is what's happening here.

    >
    >Do you have utf8 in your locale?


    I don't know. Can you tell me how I would check that? I don't know a
    great deal about the Perl environment.


    --
    Joe Cosby
    http://joecosby.com/
    "I will be warned of the dangers of time travel!",
    remembered Tilly, of the warning she was given in the
    future, of the perils of the past, which she presently
    thought had been both historic and foresighted, "though
    knowing now what I will know then makes it somewhat
    anachronistic".
    -Dr. Hieronymous Zinn, from The Novel
    Bëelphazoar, Sep 12, 2004
    #3
  4. Also sprach Bëelphazoar:

    > On Sun, 12 Sep 2004 02:12:26 +0100, "Alan J. Flavell"
    ><> wrote:


    > In addition to ASCII there is unicode, which is 16-bit,


    No, that's not Unicode. Unicode is foremost just a mapping between
    numbers and characters. Each character thusly has a unique number. When
    you talk about bits, you are really talking about encodings. Unicode
    defines three encodings: UTF-(8|16|32). Perl internally uses UTF-8 which
    is a variable width encoding meaning a character can have anything
    between one and four bytes.

    The same is true for UTF-16 which you must have been thinking of. The
    most common characters fit into two bytes. However, all the other
    characters do exist as well in this encoding. They are encoded in two
    byte-couples.
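
    To make the variable widths concrete, here is a minimal sketch using the
    core Encode module (assuming Perl 5.8 or later). The three sample code
    points are chosen only for illustration; the first is the a-acute from
    this thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Sample code points (illustrative choices): Latin small a-acute,
# the euro sign, and a character outside the Basic Multilingual Plane.
my %sample = (
    'U+00E1'  => "\x{E1}",
    'U+20AC'  => "\x{20AC}",
    'U+1D11E' => "\x{1D11E}",
);

for my $name (sort keys %sample) {
    my $utf8  = encode('UTF-8',    $sample{$name});
    my $utf16 = encode('UTF-16BE', $sample{$name});
    printf "%-8s UTF-8: %d byte(s) [%s]  UTF-16: %d byte(s) [%s]\n",
        $name,
        length($utf8),  uc(unpack('H*', $utf8)),
        length($utf16), uc(unpack('H*', $utf16));
}
```

    The a-acute takes two bytes in either encoding, while the last sample
    takes four bytes in both: in UTF-16 that is one surrogate pair, i.e.
    two 16-bit units.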

    This distinction sounds like hairsplitting, but it's crucial if you ever
    want to understand what Unicode is about and how to use it properly.

    >>Do you have utf8 in your locale?

    >
    > I don't know. Can you tell me how I would check that? I don't know a
    > great deal about the Perl environment.


    Perl uses the environment of your system, not its own. So check your
    environment variables.
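
    One way to do that check from inside Perl itself is sketched below. It
    assumes a Unix-like system where the locale is set through the usual
    LC_ALL, LC_CTYPE and LANG variables; the helper name utf8_locale_vars
    is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: given an environment hash, return the names of the
# locale variables whose value mentions UTF-8 (e.g. "en_US.UTF-8").
sub utf8_locale_vars {
    my (%env) = @_;
    return grep { defined $env{$_} && $env{$_} =~ /utf-?8/i }
           qw(LC_ALL LC_CTYPE LANG);
}

# Show what the running Perl process actually inherited.
for my $var (qw(LC_ALL LC_CTYPE LANG)) {
    printf "%-8s = %s\n", $var, defined $ENV{$var} ? $ENV{$var} : '(unset)';
}

my @hits = utf8_locale_vars(%ENV);
print @hits ? "UTF-8 mentioned in: @hits\n"
            : "No UTF-8 in the locale variables\n";
```

    Run it the same way the application is run (same user, same web server
    or cron environment), since each of those can carry a different locale.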

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Sep 12, 2004
    #4
  5. Bëelphazoar

    Bëelphazoar Guest

    On Sun, 12 Sep 2004 09:29:43 +0200, "Tassilo v. Parseval"
    <> wrote:

    >Also sprach Bëelphazoar:
    >
    >> On Sun, 12 Sep 2004 02:12:26 +0100, "Alan J. Flavell"
    >><> wrote:

    >
    >> In addition to ASCII there is unicode, which is 16-bit,

    >
    >No, that's not Unicode. Unicode is foremost just a mapping between
    >numbers and characters. Each character thusly has a unique number. When
    >you talk about bits, you are really talking about encodings. Unicode
    >defines three encodings: UTF-(8|16|32). Perl internally uses UTF-8 which
    >is a variable width encoding meaning a character can have anything
    >between one and four bytes.
    >
    >The same is true for UTF-16 which you must have been thinking of. The
    >most common characters fit into two bytes. However, all the other
    >characters do exist as well in this encoding. They are encoded in two
    >byte-couples.
    >
    >This distinction sounds like hairsplitting, but it's crucial if you ever
    >want to understand what Unicode is about and how to use it properly.
    >
    >>>Do you have utf8 in your locale?

    >>
    >> I don't know. Can you tell me how I would check that? I don't know a
    >> great deal about the Perl environment.

    >
    >Perl uses the environment of your system, not its own. So check your
    >environment variables.
    >


    Thanks.

    At some point, Perl does seem to be making the decision to alter the
    data which I am pulling from the database, changing the particular
    character from an 8-bit value to a 16-bit value.

    The job at hand for me is to make it stop doing this.

    As you and the preceding person have pointed out, I don't know
    everything there is to know about character encodings. I apologize if
    I have caused any confusion in describing character encoding
    incorrectly.

    I would appreciate any pointers you might have on where would be a
    good place to start looking at system variables to find the relevant
    environment variables, but it does seem clear enough, assuming I am
    understanding the code I am looking at, that Perl is changing a text
    value for whatever reason and based on whatever system of character
    encoding from an 8-bit value which works to a 16-bit value which
    doesn't.

    It seems, as far as I can tell, as if that is something I will need to
    solve within Perl. Maybe I am mistaken, but I don't see how the
    operating system is going to make a decision to force data inside a
    Perl application to alter based on its active character encoding
    setup.

    If how Perl makes the decision to change the 8-bit value to a 16-bit
    value is based on the active system character encoding setup, then any
    pointers anybody could provide as to how it makes this decision, or
    what exactly I should be looking at, would be most appreciated.

    Again, as I say, my job at hand here is to convince Perl not to change
    the existing 8-bit value I am pulling from the database into a 16-bit
    value which no longer works for what I am doing.

    >Tassilo


    --
    Joe Cosby
    http://joecosby.com/
    "Typhoon Rips Through Cemetery, Hundreds Dead."
        - Newspaper Headline
    Bëelphazoar, Sep 12, 2004
    #5
  6. On Sun, 12 Sep 2004, it was written:

    [snip]

    > At some point, Perl does seem to be making the decision to alter the
    > data which I am pulling from the database, changing the particular
    > character


    So write and instrument a small test case, small enough to be posted
    here (minus the database itself, OK) with some sample printouts of the
    data at the various points in the processing, preferably in
    hexadecimal (any attempt to splatter 8-bit characters into a Usenet
    posting usually turns into a failure to communicate, in my
    experience).
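
    A minimal sketch of such instrumentation follows. The helper name
    describe and the two sample strings are stand-ins made up for this
    example, not taken from the poster's code; utf8::is_utf8 is assumed to
    be available (Perl 5.8.1 or later).

```perl
use strict;
use warnings;

# Hypothetical helper: summarise a string using only ASCII output, so the
# report survives being pasted into a posting.
sub describe {
    my ($label, $s) = @_;
    my $flag = utf8::is_utf8($s) ? 'on' : 'off';
    my $hex  = join ' ', map { sprintf '%02X', ord } split //, $s;
    return sprintf '%s: length %d, utf8 flag %s, chars [%s]',
        $label, length($s), $flag, $hex;
}

# Stand-ins for the value at two points in the processing.
my $as_stored  = "m\xE1s";        # one character E1, as claimed for the database
my $as_printed = "m\xC3\xA1s";    # what the printout showed: C3 A1 in place of E1

print describe('from database', $as_stored),  "\n";
print describe('before print',  $as_printed), "\n";
```

    Calling the helper at each hand-off (straight after the database fetch,
    after each module boundary, just before the print) narrows down where
    the length or the flag changes.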

    > from an 8-bit value to a 16-bit value.


    This may seem like hair splitting, but what you exhibited so far
    appeared to be a utf-8 character. Which in this case consisted of two
    octets (bytes), but that's not the same thing as "a 16-bit value".

    > The job at hand for me is to make it stop doing this.


    Possibly. That depends on what range of characters you hope to be
    able to handle in your system. But let's try to understand where
    we're at, before discussing where to go from there.

    > As you and the preceding person have pointed out, I don't know
    > everything there is to know about character encodings. I apologize if
    > I have caused any confusion in describing character encoding
    > incorrectly.


    Oh, it's quite normal... Naturally I'd urge you to take time to learn
    a bit more about it, believing - as I do - that it'll save you effort
    later; but as it's one of my specialist subjects, "I would say that,
    wouldn't I?"...

    > I would appreciate any pointers you might have on where would be a
    > good place to start looking at system variables to find the relevant
    > environment variables,


    man printenv
    man locale

    (assuming unix-family OS),

    > but it does seem clear enough, assuming I am
    > understanding the code I am looking at, that Perl is changing a text
    > value


    Perl doesn't magically "change text values": it handles text in the
    way that it thinks it's been asked to handle it.

    My feeling is that, sooner rather than later, you're going to need
    this stuff anyway, so I'd start on perldoc perluniintro and then
    perldoc perlunicode (or the links near the foot of the index page
    http://www.perldoc.com/perl5.8.0/pod.html or whichever version you are
    using).

    But if you're determined that you just want to get utf8 out of the way
    for the moment, and you're sure you'll never be showing Perl a
    character outside of the iso-8859-1 range, then look for discussions
    on apparent incompatibilities between RedHat 9 and Perl 5.8, which
    discuss how RedHat's introduction of utf8 into the locale caused Perl
    to switch into its Unicode mode, and how to take it out again (I don't
    have the details at my fingertips right now, sorry).

    > It seems, as far as I can tell, as if that is something I will need to
    > solve within Perl. Maybe I am mistaken, but I don't see how the
    > operating system is going to make a decision to force data inside a
    > Perl application to alter based on its active character encoding
    > setup.


    Oh, but it does. At least in 5.8.0. Google for "redhat perl 5.8.0
    utf8 locale" (without the quotes) and read the first few links, I
    think they'll help.

    > If how Perl makes the decision to change the 8-bit value to a 16-bit
    > value


    Please stop saying "16 bit value"; it's sure to cause confusion
    somewhere down the line. What you're talking about here is a
    character stored in Perl's native unicode format, which is utf-8: this
    particular character happens to occupy two bytes in storage, but it's
    not useful to talk about it as a "16-bit value", and it risks
    confusing it with utf-16 format (which is the OS's native storage
    format on Windows NT-based systems, by the way, and commonly used also
    for storing unicode characters in databases).

    good luck
    Alan J. Flavell, Sep 12, 2004
    #6
  7. Shawn Corey

    Shawn Corey Guest

    Hi,

    I got caught on this one too. See perldoc perluniintro and perldoc
    perlunicode. Perl v5.8+ has a feature that automatically and silently
    converts its standard (pre-v5.8) strings into UTF-8 strings if it
    encounters a Unicode character. I haven't figured out a reliable way around
    this yet but you could try:

    $s = pack( 'C*', unpack( 'U*', $s ));

    Alan J. Flavell wrote:
    > On Sun, 12 Sep 2004, it was written:
    >
    > [snip]
    >
    >
    >>At some point, Perl does seem to be making the decision to alter the
    >>data which I am pulling from the database, changing the particular
    >>character

    >
    >
    > So write and instrument a small test case, small enough to be posted
    > here (minus the database itself, OK) with some sample printouts of the
    > data at the various points in the processing, preferably in
    > hexadecimal (any attempt to splatter 8-bit characters into a Usenet
    > posting usually turns into a failure to communicate, in my
    > experience).
    >
    >
    >>from an 8-bit value to a 16-bit value.

    >
    >
    > This may seem like hair splitting, but what you exhibited so far
    > appeared to be a utf-8 character. Which in this case consisted of two
    > octets (bytes), but that's not the same thing as "a 16-bit value".
    >
    >
    >>The job at hand for me is to make it stop doing this.

    >
    >
    > Possibly. That depends on what range of characters you hope to be
    > able to handle in your system. But let's try to understand where
    > we're at, before discussing where to go from there.
    >
    >
    >>As you and the preceding person have pointed out, I don't know
    >>everything there is to know about character encodings. I apologize if
    >>I have caused any confusion in describing character encoding
    >>incorrectly.

    >
    >
    > Oh, it's quite normal... Naturally I'd urge you to take time to learn
    > a bit more about it, believing - as I do - that it'll save you effort
    > later; but as it's one of my specialist subjects, "I would say that,
    > wouldn't I?"...
    >
    >
    >>I would appreciate any pointers you might have on where would be a
    >>good place to start looking at system variables to find the relevant
    >>environment variables,

    >
    >
    > man printenv
    > man locale
    >
    > (assuming unix-family OS),
    >
    >
    >>but it does seem clear enough, assuming I am
    >>understanding the code I am looking at, that Perl is changing a text
    >>value

    >
    >
    > Perl doesn't magically "change text values": it handles text in the
    > way that it thinks it's been asked to handle it.
    >
    > My feeling is that, sooner rather than later, you're going to need
    > this stuff anyway, so I'd start on perldoc perluniintro and then
    > perldoc perlunicode (or the links near the foot of the index page
    > http://www.perldoc.com/perl5.8.0/pod.html or whichever version you are
    > using).
    >
    > But if you're determined that you just want to get utf8 out of the way
    > for the moment, and you're sure you'll never be showing Perl a
    > character outside of the iso-8859-1 range, then look for discussions
    > on apparent incompatibilities between RedHat 9 and Perl 5.8, which
    > discuss how RedHat's introduction of utf8 into the locale caused Perl
    > to switch into its Unicode mode, and how to take it out again (I don't
    > have the details at my fingertips right now, sorry).
    >
    >
    >>It seems, as far as I can tell, as if that is something I will need to
    >>solve within Perl. Maybe I am mistaken, but I don't see how the
    >>operating system is going to make a decision to force data inside a
    >>Perl application to alter based on its active character encoding
    >>setup.

    >
    >
    > Oh, but it does. At least in 5.8.0. Google for "redhat perl 5.8.0
    > utf8 locale" (without the quotes) and read the first few links, I
    > think they'll help.
    >
    >
    >>If how Perl makes the decision to change the 8-bit value to a 16-bit
    >>value

    >
    >
    > Please stop saying "16 bit value"; it's sure to cause confusion
    > somewhere down the line. What you're talking about here is a
    > character stored in Perl's native unicode format, which is utf-8: this
    > particular character happens to occupy two bytes in storage, but it's
    > not useful to talk about it as a "16-bit value", and it risks
    > confusing it with utf-16 format (which is the OS's native storage
    > format on Windows NT-based systems, by the way, and commonly used also
    > for storing unicode characters in databases).
    >
    > good luck
    Shawn Corey, Sep 12, 2004
    #7
  8. A: No!

    On Sun, 12 Sep 2004, Shawn Corey blurted out atop a fullquote:

    > I got caught on this one too.


    Are you sure it was the same?

    > See perldoc perluniintro and perldoc perlunicode.


    Yup, good advice, already offered.

    > Perl v5.8+ has a feature that automatically and silently converts
    > its standard (pre-v5.8) strings into UTF-8 strings if it encounters
    > a Unicode character.


    If by "a Unicode character" you mean one whose code value is greater
    than 255, then you're right; but we've been given no evidence here
    that such a character has been involved. The only "interesting"
    character under discussion has been one which fell into the range
    occupied by printable characters in iso-8859-1, namely 160-255
    decimal.

    Perl 5.8 would only have "upgraded" that to utf8 if it had been
    given cause to do so. In 5.8.0, one such cause is the presence
    of utf-8 in the locale. See also the discussion in
    http://use.perl.org/articles/03/09/26/2231256.shtml?tid=6 , or
    http://twiki.org/cgi-bin/view/Codev/UsingPerl58OnRedHat8 , or
    the various other articles that pop up when one tries the search that
    I had suggested.

    My hunch is that's what happened. Maybe I'll be proved wrong; we'll
    see.

    > I haven't figured out a reliable way around this yet


    (which suggests you haven't read the relevant perldocs closely enough)

    There are various approaches, depending on what your problem field is
    and what you're trying to achieve.

    If you force the old behaviour, then you can get what you'd have been
    accustomed to before, and you won't suffer the overhead of Perl
    processing Unicode; but you'll cut yourself off from the ability to
    process a fuller range of characters, writing systems etc.

    If you learn how to work with Unicode - and your database /also/ knows
    how to work with it - then you can write software that can handle
    writing systems which are way outside of mere Latin 1; but you may
    incur some processing overhead due to the extra work of Perl handling
    Unicode characters.

    With care, code can be written such that the overhead only cuts in
    when characters outside of the iso-8859-1 repertoire are used. Thus
    getting the best of both worlds - without having to write messy
    dual-path code, because Perl takes care of it for you (if you're
    asking it right).

    In general I'd say (except perhaps for diagnostic purposes), if you're
    messing around with packing and unpacking characters, then you're
    doing it wrong. The key is to grasp Perl's character representation
    model, and to work *with* it, not to fight it with hand-packed and
    -unpacked representations.

    This assumes that your code only needs to run on >= 5.8.0. If you're
    writing code meant to be runnable on older Perls, then you have to put
    quite a lot more care into the task of producing something compatible.

    ttfn

    Q: Should I put my Usenet response on the top of a quote of the entire
    previous posting?

    http://www.faqs.org/docs/jargon/T/top-post.html
    Alan J. Flavell, Sep 12, 2004
    #8
  9. Bëelphazoar

    Bëelphazoar Guest

    Thanks for your help Alan and Shawn, I think you have given me enough
    to work with, I will post back if the leads you've presented don't
    resolve the issue I'm getting.

    --
    Bëelphazoar
    International Satanic Conspiracy
    Customer Support Specialist
    http://joecosby.com/
    You mystics are a sorry lot, always whimping about so-and-so's "ego"
    getting in the way of their "detachment." Take it to alt.zen.ego-death,
    for the love of pete! This is alt.MAGICK.
    Bëelphazoar, Sep 12, 2004
    #9
  10. J. Romano

    J. Romano Guest

    Bëelphazoar <http://joecosby.com/code/mail.pl> wrote in message news:<>...
    >
    > I am working on a problem, I have text in a database which
    > includes the word "más". The "á" is ASCII value 225/E1 .



    Dear Joe,

    It will help a lot if you give us the output of "perl -v". I'm
    sure Unicode has something to do with your problem, but Unicode
    support has been changing (updating) in recent versions of Perl.
    Without knowing the version of Perl you're using and the platform
    you're using it on, we can only guess as to what the problem is.

    By the way, are you SURE that "á" is the extended ASCII value 225?
    According to one source I have, it is extended ASCII value 160. Maybe
    we're using different code pages, but it's worth checking.

    > ASCII only defines the low 7 bits, which are the same
    > character representations in most English-based code
    > pages.
    >
    > In addition to ASCII there is unicode, which is 16-bit,
    > and which, somewhere in my application, is apparently
    > being used when the "á" is used because it is greater
    > than 127.


    You're wrong about Unicode being 16-bit. That's a myth. It CAN be
    encoded in two bytes (16 bits), but it can also be encoded using a
    different method called UTF-8 (which is what Perl normally uses
    internally). The UTF-8 encoding uses variable-length character
    encoding, which means that a character can be encoded in one to six
    bytes. In your case, the character whose value is greater than 127 is
    being encoded in two bytes, whereas the other characters (< 128) are
    being encoded in one byte.

    Understand? If you don't, here's a great link to an FAQ I used to
    understand more about how Unicode is encoded:

    http://www.cl.cam.ac.uk/~mgk25/unicode.html

    You may also want to check the following perldocs (which, depending on
    your version of Perl, you may or may not have all of):

    perldoc Encode
    perldoc perluniintro
    perldoc Unicode::String

    > The code pulls the text out of the database and
    > assigns it to a variable, but when I print the
    > variable it is now "más", the "á" has been
    > replaced by C3A1 .


    This certainly looks to me like UTF-8 Unicode encoding, but let's
    check just to make sure:

    According to the FAQ (whose link I mentioned above), a Unicode
    character value can be UTF-8 encoded using one to six bytes:

    1: 0xxxxxxx
    2: 110xxxxx 10xxxxxx
    3: 1110xxxx 10xxxxxx 10xxxxxx
    4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    5: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    6: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    where "x" is a bit that stands for the Unicode value.

    0xC3A1 is two bytes long. Its bit representation is:

    11000011 10100001

    So when you apply the 2-byte bit pattern to it:

    110xxxxx 10xxxxxx

    the "x"s represent the bits: 00011 100001

    Put them together and you get 11100001 which is the binary
    representation of 225. Therefore, we now know that character number
    225, when encoded into UTF-8 encoding, results in the two bytes 0xc3
    and 0xa1, which is exactly what you're seeing.
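
    The same arithmetic can be checked in a few lines of Perl. This is only
    a sketch of the hand calculation above; the function name
    decode_two_octets is made up for this example and handles nothing but
    the two-byte pattern.

```perl
use strict;
use warnings;

# Decode one two-octet UTF-8 sequence (pattern 110xxxxx 10xxxxxx) by hand.
sub decode_two_octets {
    my ($lead, $trail) = @_;
    die "not a two-octet lead byte\n" unless ($lead  & 0xE0) == 0xC0;
    die "not a continuation byte\n"   unless ($trail & 0xC0) == 0x80;
    # Keep the low 5 bits of the lead byte and the low 6 bits of the
    # continuation byte, then join them into one 11-bit value.
    return (($lead & 0x1F) << 6) | ($trail & 0x3F);
}

my $code_point = decode_two_octets(0xC3, 0xA1);
printf "0xC3 0xA1 -> %d (0x%02X)\n", $code_point, $code_point;
# prints: 0xC3 0xA1 -> 225 (0xE1)
```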

    > I am PRETTY sure that this is not happening
    > within the code I am working on, if I am following
    > the code flow correctly it looks like it does
    > nothing but pull the text from the database and
    > pass it back.


    SOMEWHERE in the code the characters greater than 127 are being
    converted from extended-ASCII to UTF-8 encoding, but it's hard to say
    exactly where unless I have access to the code. Therefore, I'll leave
    it up to you to figure out where it's happening.

    But even if you do find where this is happening, you will still
    have to deal with the problem of converting the two-byte UTF-8
    representation (of characters greater than 127) to their one-byte
    extended-ASCII equivalent. ¿Comprende?

    I'm not sure how to do this, but here are three things you can try.
    Whether or not each one works may depend on the version of Perl you
    are using, so letting me know your "perl -v" output may help me out.

    ----------------------------------------
    # Method 1: Convince Perl that your string
    # is UTF-8 encoded:
    use Encode;
    $string = pullTextFromDb();
    # Convince Perl that $string is in UTF-8 format:
    $string = decode_utf8($string);
    # Convert UTF-8 string to extended-ASCII:
    $string = encode("iso-8859-1", $string);
    ----------------------------------------
    # Method 2: Tell Perl that $string is UTF-8
    # encoded and that you want its
    # latin1 equivalent:
    use Unicode::String qw(utf8 latin1);
    $string = pullTextFromDb();
    $string = utf8($string)->latin1();
    ----------------------------------------
    # Method 3: Tell Perl to pack each character's
    # Unicode value into just one byte
    # of a larger string:
    $string = pullTextFromDb();
    $string = pack "C*", map ord, split //, $string;
    ----------------------------------------

    Try all these and see if any of them work. Again, what works and
    what doesn't work might very well depend on the version of Perl that
    you're using. Also, even if one of them does work, some other part of
    your code might be converting it back to UTF-8 encoding, undo-ing the
    conversion you just made.
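
    To see what Method 1 does without a database, here is a self-contained
    sketch. The literal string stands in for the return value of
    pullTextFromDb() and assumes that the variable really holds the raw
    UTF-8 octets C3 A1; if it instead holds a character string, decoding it
    a second time would be the wrong step.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for pullTextFromDb(): "mas" with a-acute as UTF-8 octets.
my $from_db = "m\xC3\xA1s";

# Step 1: interpret the octets as UTF-8, giving 3 characters.
my $chars = decode('UTF-8', $from_db);

# Step 2: encode those characters as iso-8859-1, giving 3 bytes.
my $latin1 = encode('iso-8859-1', $chars);

printf "before: %d octets, after: %d octets [%s]\n",
    length($from_db), length($latin1),
    join ' ', map { sprintf '%02X', ord } split //, $latin1;
# prints: before: 4 octets, after: 3 octets [6D E1 73]
```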

    But it's still worth a shot to try them out. Hopefully one of the
    above three methods will work for you, and your problem will be "no
    más."

    I hope this helps, Joe.

    -- Jean-Luc
    J. Romano, Sep 12, 2004
    #10
  11. On Sun, 12 Sep 2004, Tassilo v. Parseval wrote:

    > Also sprach Bëelphazoar:
    >
    > > On Sun, 12 Sep 2004 02:12:26 +0100, "Alan J. Flavell"
    > ><> wrote:


    [...nothing that was quoted here...]

    > > In addition to ASCII there is unicode, which is 16-bit,

    >
    > No, that's not Unicode. Unicode is foremost just a mapping between
    > numbers and characters. Each character thusly has a unique number.


    Agreed. And those numbers no longer fit into 16 bits, in general.
    As you indeed imply later.

    > When you talk about bits, you are really talking about encodings.


    Right; and in fact Unicode has now specialised the terminology even
    further. See chapter 2,
    http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

    The abstract code point values are embodied in an "Encoding Form"
    consisting of "code units" of a particular size (may be 8, 16 or 32
    bits), and then the "Encoding form" is transmitted by an "Encoding
    Scheme" which represents those units as a sequence of octets (bytes)
    on a transmission channel.

    Fortunately, for utf-8 that final step is one-to-one. But the
    distinction becomes important for utf-16-based and utf-32-based
    encoding schemes.

    > Unicode defines three encodings: UTF-(8|16|32).


    Right. Those are "Encoding Forms" in the new terminology, and they
    become the "Seven Encoding Schemes" (one utf-8, three utf-16 and
    three utf-32), as shown in Table 2-3 in chapter 2.

    > Perl internally uses UTF-8 which is a variable width encoding
    > meaning a character can have anything between one and four bytes.


    Indeed. The original algorithm which defined utf-8 could have
    represented code point values up to 7fff ffff (which needs 6 octets in
    encoded form); but Unicode has stated that no characters will be
    defined beyond 0010 ffff, and thus 4 octets are now sufficient.
    rfc3629 obsoletes 2279 ("film at 11").

    > This distinction sounds like hairsplitting, but it's crucial if you ever
    > want to understand what Unicode is about and how to use it properly.


    Agreed. The hardest part is un-learning things which used to seem
    obvious!

    all the best

    (No offence meant - just trying to build on what you had already
    posted.)
    Alan J. Flavell, Sep 12, 2004
    #11
  12. On Sun, 12 Sep 2004, J. Romano wrote:

    > By the way, are you SURE that "á" is the extended ASCII value 225?


    There is no such thing as "extended ASCII", so the question is moot.

    There are large numbers of 8-bit character codings which have
    ASCII as their low half. The one that's used in polite Latin-1
    circles is iso-8859-1, in which 225 decimal is small a-acute.

    > According to one source I have, it is extended ASCII value 160.


    That would be the old MS-DOS encodings, such as CP-437 (for US
    residents) or CP-850 (for the Latin-1 locale). Dinosaurs.

    > so letting me know your "perl -v" output may help me out.


    Good advice, indeed!

    [some useful diagnostic suggestions snipped]

    [but please, let's hear no more of this mythical "extended ASCII"
    character code.]
    Alan J. Flavell, Sep 12, 2004
    #12
