how is the string encoded

Discussion in 'Perl Misc' started by dn.perl@gmail.com, Jan 3, 2012.

  1. Guest

    I know the question must have been asked many times, there are many
    web-pages which are supposed to help, but after going through many of
    them, I still need help.
    I am running a simple program on linux, perl 5.8.8 ;

    use strict ;
    use warnings ;
    ## use utf8 ;

    my $str ;
    $str = "ä" ;
    print "str is $str\n" ;
    ---
    Works well. But my question is: how do I know which encoding is being
    used to read/write $str?

    If I uncomment the 'use utf8' line, I get a warning: Malformed UTF-8
    character, and the string no longer prints correctly. Why, and how do
    I remove this warning and print the string correctly? I would have
    guessed that 'use utf8' adds more power to the code and would not
    break code which was otherwise running correctly.
     
    , Jan 3, 2012
    #1

  2. "" <> writes:
    > I know the question must have been asked many times, there are many
    > web-pages which are supposed to help, but after going through many of
    > them, I still need help.
    > I am running a simple program on linux, perl 5.8.8 ;
    >
    > use strict ;
    > use warnings ;
    > ## use utf8 ;
    >
    > my $str ;
    > $str = "ä" ;
    > print "str is $str\n" ;
    > ---
    > Works well. But my question is: how do I know which encoding is being
    > used to read/write $str?


    According to the people who dabble in this area, you are not supposed
    to know that. You are supposed to convert any data flowing into perl
    from the encoding known to you into 'the super-secret, proprietary
    internal Perl encoding' (patent pending) and any data flowing out of
    perl from said 'super-secret internal Perl encoding' into whatever
    encoding you'd like to have. Should the encoding you want to use (for
    whatever reason) not be among the ones Perl supports natively, you're
    fucked and advised to take your petty problems elsewhere. That's the
    theory. Practically, Perl uses utf8 (which presumably causes a lot of
    people sour bumpers because Microsoft [reportedly] uses UCS-2).

    Another practical piece of advice: Stick to ASCII. That's the only
    thing no American committee is going to uninvent tomorrow and thus, a
    safe choice for all communication needs among educated people. Let all
    those club-bearing natives draw their weird krikel-krakels to their
    heart's content and ignore them.
     
    Rainer Weikusat, Jan 3, 2012
    #2

  3. On Tue, 3 Jan 2012, Rainer Weikusat wrote:

    > "" <> writes:
    > > I know the question must have been asked many times, there are many
    > > web-pages which are supposed to help, but after going through many of
    > > them, I still need help.
    > > I am running a simple program on linux, perl 5.8.8 ;
    > >
    > > use strict ;
    > > use warnings ;
    > > ## use utf8 ;
    > >
    > > my $str ;
    > > $str = "ä" ;
    > > print "str is $str\n" ;
    > > ---
    > > Works well. But my question is: how do I know which encoding is being
    > > used to read/write $str?

    >
    > According to the people who dabble in this area, you are not supposed
    > to know that. You are supposed to convert any data flowing into perl
    > from the encoding known to you into 'the super-secret, proprietary
    > internal Perl encoding' (patent pending) and any data flowing out of
    > perl from said 'super-secret internal Perl encoding' into whatever
    > encoding you'd like to have.


    You have to distinguish between the encoding used by *you* while you
    are writing your perl program, and the encoding used by *perl* while it is
    running your program.

    You have to know what *you* are using. The answer has nothing to do with
    perl. If you can look at your program in an environment where UTF-8 is
    expected and you read it correctly there, then the program is in UTF-8.
    Use "use utf8" to tell perl about it. Beyond how the literals in the
    source are read, it has no effect on what the program does with the
    strings in it.
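
    A minimal sketch of why the "Malformed UTF-8 character" warning can
    appear, assuming the script file was saved as ISO-8859-1 (an assumption;
    the thread does not say how it was saved). With 'use utf8' perl reads the
    bytes of the literal as UTF-8, and a lone byte 0xE4 is not well-formed
    UTF-8. The variable names are illustrative, and the literals use \x
    escapes so the example does not depend on how this text itself is stored:

```perl
use strict;
use warnings;

# "ä" as stored in an ISO-8859-1 file: the single byte 0xE4.
my $latin1_bytes = "\xE4";
# "ä" as stored in a UTF-8 file: the two bytes 0xC3 0xA4.
my $utf8_bytes   = "\xC3\xA4";

# utf8::decode() converts its argument in place and returns false
# when the bytes are not well-formed UTF-8.
my $copy1 = $latin1_bytes;
my $latin1_valid = utf8::decode($copy1) ? 1 : 0;

my $copy2 = $utf8_bytes;
my $utf8_valid = utf8::decode($copy2) ? 1 : 0;

print "ISO-8859-1 bytes are valid UTF-8: $latin1_valid\n";
print "UTF-8 bytes are valid UTF-8:      $utf8_valid\n";
printf "decoded length: %d, code point: U+%04X\n",
    length($copy2), ord($copy2);
```

    If the check fails for the bytes in your file, the file is not UTF-8 and
    'use utf8' is the wrong declaration for it.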

    For the encoding used by *perl* while it is running your program, Rainer
    Weikusat's comment applies. You should not try to know. As long as all
    characters in a string are in the ISO-8859-1 character set, it is probable
    that ISO-8859-1 is internally used; there is an additional flag in the
    internal representation to indicate how the string is internally stored.
    Don't mess around with the internal encoding. Rather, *you* have to know
    how you meant the string: either as a sequence of bytes whose character
    meaning only you know, or as a sequence of characters whose encoding as
    bytes only perl knows. Do not try to share such knowledge between perl and
    you. This is fairly well explained in perlunitut (e.g.
    http://search.cpan.org/~flora/perl-5.14.2/pod/perlunitut.pod).
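
    The bytes-versus-characters split described above can be sketched roughly
    as follows, assuming the input arrives as UTF-8 and using the core Encode
    module. The variable names and the choice of encodings are illustrative
    assumptions, not something the original question specified:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a UTF-8 file: "ä" is 0xC3 0xA4.
my $input_bytes = "\xC3\xA4";

# Input boundary: bytes in a known encoding become characters.
my $chars = decode('UTF-8', $input_bytes);

# Inside the program, work with characters only.
my $char_count = length $chars;   # counts characters, not bytes
my $code_point = ord $chars;      # LATIN SMALL LETTER A WITH DIAERESIS

# Output boundary: characters become bytes in the wanted encoding.
my $latin1_out = encode('ISO-8859-1', $chars);
my $utf8_out   = encode('UTF-8', $chars);

printf "characters: %d, code point: U+%04X\n", $char_count, $code_point;
printf "ISO-8859-1 bytes: %s\n", unpack('H*', $latin1_out);
printf "UTF-8 bytes:      %s\n", unpack('H*', $utf8_out);
```

    At no point does the program need to know how perl stores $chars
    internally; only the encodings at the two boundaries are named.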

    --
    Helmut Richter
     
    Helmut Richter, Jan 3, 2012
    #3
  4. Helmut Richter <> writes:
    > On Tue, 3 Jan 2012, Rainer Weikusat wrote:
    >> "" <> writes:
    >> > I know the question must have been asked many times, there are many
    >> > web-pages which are supposed to help, but after going through many of
    >> > them, I still need help.
    >> > I am running a simple program on linux, perl 5.8.8 ;
    >> >
    >> > use strict ;
    >> > use warnings ;
    >> > ## use utf8 ;
    >> >
    >> > my $str ;
    >> > $str = "ä" ;
    >> > print "str is $str\n" ;
    >> > ---
    >> > Works well. But my question is: how do I know which encoding is being
    >> > used to read/write $str?

    >>
    >> According to the people who dabble in this area, you are not supposed
    >> to know that. You are supposed to convert any data flowing into perl
    >> from the encoding known to you into 'the super-secret, proprietary
    >> internal Perl encoding' (patent pending) and any data flowing out of
    >> perl from said 'super-secret internal Perl encoding' into whatever
    >> encoding you'd like to have.

    >
    > You have to make a difference between the encoding used by *you* while you
    > are writing your perl program, and the encoding used by *perl* while it is
    > running your program.


    No. The people who *presently* work on Perl unicode support *want*
    users of the language to pretend that 'the internal perl encoding' is
    some magic secret beyond the realm of Perl code, *even though* this is
    obviously at odds with the original design of 'unicode support for
    Perl' and doesn't make much sense: At the very least, it requires one
    additional copy of all data flowing into Perl and one additional copy
    of all data going out of Perl. Given that one of the main uses of Perl
    is as a so-called 'glue language' interconnecting other pieces of
    software into a complex whole, this is a major pain in the ass, and
    solely for the hypothetical benefit of the people working on the code.
    It is hypothetical because there is no way in heaven or hell that all
    of the existing Perl code which wasn't written based on the assumption
    that Perl strings are magic beasts with intransigent properties is
    ever going to be changed just because this would appeal to someone's
    completely impractical idea of theoretical purity. The worst possible
    case is that - someday - a Perl 5 fork is created which does break all
    this code, and this will then simply become Perl 6 rev 0.5: something
    which exists for the private joy of its developers and which nobody
    uses for anything.
     
    Rainer Weikusat, Jan 3, 2012
    #4
  5. Ben Morrow <> writes:
    > Quoth "" <>:
    >>
    >> I know the question must have been asked many times, there are many
    >> web-pages which are supposed to help, but after going through many of
    >> them, I still need help.
    >> I am running a simple program on linux, perl 5.8.8 ;

    >
    > That perl is very nearly six years old. You should upgrade to at least
    > 5.12.
    >
    >> use strict ;
    >> use warnings ;
    >> ## use utf8 ;
    >>
    >> my $str ;
    >> $str = "ä" ;
    >> print "str is $str\n" ;
    >> ---
    >> Works well. But my question is: how do I know which encoding is being
    >> used to read/write $str?

    >
    > If you don't 'use utf8', perl assumes your source is ISO8859-1. If
    > you do, it assumes your source is in UTF-8. (In theory you can use other
    > encodings with the 'use encoding' pragma, but AIUI this doesn't work
    > reliably.)
    >
    > Output is completely unrelated. If you don't do anything special, perl
    > will give you output in ISO8859-1.


    This isn't quite correct: It will use 'the native 8 bit encoding' and
    this may well be something other than ASCII/ISO-8859-1, although
    that's a case which rarely occurs in practice because most people
    don't write code for IBM mainframes :->.

    [...]

    > If you attempt to print a character which can't be represented in
    > ISO8859-1 you get a warning and the raw UTF-8 bytes representing
    > that character: this is obviously something you need to avoid, since
    > the output doesn't make any sense at that point.


    An example I recently encountered where it did make sense was a web
    interface with a Japanese localization: Since there were no characters
    corresponding to code points in the range 128 to 255, the generated
    output was simply UTF-8 encoded Japanese, which was exactly what it
    was supposed to be.
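
    A small sketch of the output side, for a character below 256. It assumes
    no -C switch or PERL_UNICODE setting is changing the default layers, and
    it uses in-memory filehandles in place of STDOUT so the written bytes can
    be inspected. The helper bytes_written() is hypothetical, made up for
    this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: print $text through a handle opened with $mode
# and return the bytes that were actually written.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buffer = '';
    open my $fh, $mode, \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "cannot close in-memory handle: $!";
    return $buffer;
}

my $a_umlaut = chr(0xE4);   # the character "ä", code point below 256

# No layer: the character goes out as the single native byte.
my $raw = bytes_written('>', $a_umlaut);

# Explicit layer: perl encodes characters to UTF-8 on the way out.
my $encoded = bytes_written('>:encoding(UTF-8)', $a_umlaut);

printf "no layer:         %s\n", unpack('H*', $raw);
printf ":encoding(UTF-8): %s\n", unpack('H*', $encoded);
```

    For real output the same layer can be set on an existing handle, e.g.
    binmode(STDOUT, ':encoding(UTF-8)'), so that the output encoding is
    stated rather than left to the default.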
     
    Rainer Weikusat, Jan 3, 2012
    #5
  6. Guest

    On Jan 3, 10:25 am, Ben Morrow wrote:
    >
    > That perl (5.8.8) is very nearly six years old. You should
    > upgrade to at least 5.12.
    >


    I wonder whether you realize how difficult (ranging to impossible) it
    may be to achieve it. Say, I am on a 3-month contract. The employer
    has been managing for years with 5.8.8 and is unlikely to upgrade in
    such a case. Once I was stuck with a MySQL server which was many years
    old, but my boss was more concerned with preserving his own job than
    asking his BOSS to spend time and money on upgrading. Not that the
    suggestion to upgrade is wrong or anything.

    >
    > If you don't 'use utf8', perl assumes your source is ISO8859-1. If
    > you do, it assumes your source is in UTF-8. (In theory you can use other
    > encodings with the 'use encoding' pragma, but AIUI this doesn't work
    > reliably.)
    > ...
    > What did you expect to happen? perldoc utf8 quite clearly says
    >     Do not use this pragma for anything else than telling Perl that your
    >     script is written in UTF-8.
    > so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
    > must expect warnings and misbehaviour.
    >


    It is very useful to know that perl assumes the source to be
    ISO8859-1. That 'use utf8' arguably works counter-intuitively. Since
    my code is ASCII and all ASCII is automatically utf8, I tend to wonder
    why I would ever write non-ascii code. It may not be a logical thing
    to do but I daresay it is an instinctive thing to do. Now if I want to
    dabble in utf8 or databases, what do I do? I think of 'use utf8' or
    'use DataBaseInterface DBI'.

    What I needed was 'use Encode' which is what I am doing now.
    Thanks for all the responses.
     
    , Jan 4, 2012
    #6
  7. On 2012-01-04 07:38, <> wrote:
    > On Jan 3, 10:25 am, Ben Morrow wrote:
    >> That perl (5.8.8) is very nearly six years old. You should
    >> upgrade to at least 5.12.

    [...]
    >> If you don't 'use utf8', perl assumes your source is ISO8859-1. If
    >> you do, it assumes your source is in UTF-8. (In theory you can use other
    >> encodings with the 'use encoding' pragma, but AIUI this doesn't work
    >> reliably.)
    >> ...
    >> What did you expect to happen? perldoc utf8 quite clearly says
    >>     Do not use this pragma for anything else than telling Perl that your
    >>     script is written in UTF-8.
    >> so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
    >> must expect warnings and misbehaviour.
    >>

    >
    > It is very useful to know that perl assumes the source to be
    > ISO8859-1.


    This is not quite correct. Without 'use utf8', perl assumes your source
    is an unspecified superset of ASCII, not ISO-8859-1. The character codes
    are the same, but the semantics are different. For example, if your
    script was encoded in ISO-8859-1, "ä" would result in a string consisting
    of a single byte with the value 0xE4, but that byte is not equivalent to
    the character "ä" - it doesn't match \w, [:lower:] or any of the other
    classes "LATIN SMALL LETTER A WITH DIAERESIS" should match. It cannot be
    uppercased. It is just a meaningless byte, not a character.
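
    A short sketch of the behaviour described above, assuming a perl where
    neither 'use locale' nor 'use feature "unicode_strings"' is in effect.
    chr(0xE4) stands in for the literal so the example does not depend on the
    encoding of the file it is saved in; the variable names are illustrative:

```perl
use strict;
use warnings;

# The byte that a literal "ä" produces when the script is saved as
# ISO-8859-1 and there is no "use utf8".
my $str = chr(0xE4);

# Under the default rules a non-upgraded byte above 127 has no
# character properties and no case mapping.
my $is_word_char = ($str =~ /\w/)          ? 1 : 0;
my $is_lower     = ($str =~ /[[:lower:]]/) ? 1 : 0;
my $uc_changed   = (uc($str) ne $str)      ? 1 : 0;

print "matches \\w:          $is_word_char\n";
print "matches [[:lower:]]: $is_lower\n";
print "changed by uc():     $uc_changed\n";
```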


    > That 'use utf8' arguably works counter-intuitively. Since
    > my code is ASCII


    No, your code isn't ASCII. It contained the line

    | $str = "ä" ;

    "ä" is not an ASCII character.

    > and all ASCII is automatically utf8, I tend to wonder
    > why I would ever write non-ascii code.


    Well, why did you?


    > What I needed was 'use Encode' which is what I am doing now.


    Please don't unless you really understand what it does. Encode does a
    couple of different things and it isn't entirely consistent. It seemed
    like a good idea at the time and it may have been useful for converting
    pre-5.8-code, but I really wouldn't use it for new code.

    hp

    --
    _ | Peter J. Holzer | Deprecating human carelessness and
    |_|_) | Sysadmin WSR | ignorance has no successful track record.
    | | | |
    __/ | http://www.hjp.at/ | -- Bill Code on
     
    Peter J. Holzer, Jan 4, 2012
    #7
  8. "Peter J. Holzer" <> writes:
    > On 2012-01-04 07:38, <> wrote:
    >> On Jan 3, 10:25 am, Ben Morrow wrote:
    >>> That perl (5.8.8) is very nearly six years old. You should
    >>> upgrade to at least 5.12.

    > [...]
    >>> If you don't 'use utf8', perl assumes your source is ISO8859-1. If
    >>> you do, it assumes your source is in UTF-8. (In theory you can use other
    >>> encodings with the 'use encoding' pragma, but AIUI this doesn't work
    >>> reliably.)
    >>> ...
    >>> What did you expect to happen? perldoc utf8 quite clearly says
    >>>     Do not use this pragma for anything else than telling Perl that your
    >>>     script is written in UTF-8.
    >>> so if you 'use utf8' and your source isn't, in fact, *in* UTF-8, you
    >>> must expect warnings and misbehaviour.
    >>>

    >>
    >> It is very useful to know that perl assumes the source to be
    >> ISO8859-1.

    >
    > This is not quite correct. Without 'use utf8', perl assumes your source
    > is an unspecified superset of ASCII, not ISO-8859-1.
    > The character codes are the same, but the semantics are different.


    This is also not quite correct: When 'use locale' is in effect, Perl
    assumes that anything beyond ASCII is supposed to have a meaning in
    the locale which happens to be in effect when the script is
    executed. Otherwise, the default is equivalent to the default POSIX
    locale (corresponding to LANG=C), which means bytes with values in the
    range 0 to 127 will be interpreted as ASCII characters belonging to
    some of the different character classes, and bytes with values from
    128 to 255 are just 'bytes with certain values' and no further
    properties.

    E.g., assuming the text included below

    ----------------
    $a = chr(0xe4);

    {
    use locale;
    print 'locale: ', $a =~ /\w/, "\n";
    }

    print 'no locale: ', $a =~ /\w/, "\n";
    ----------------

    is saved to a file on a system where locale-information for ISO-8859-1
    based German is available, the command (a.pl being the name of the
    file)

    LANG=de_DE perl a.pl

    will print

    locale: 1
    no locale:

    and

    LANG=C perl a.pl

    will print

    locale:
    no locale:
     
    Rainer Weikusat, Jan 4, 2012
    #8
  9. Ben Morrow <> writes:
    > Quoth "Peter J. Holzer" <>:
    >> On 2012-01-04 07:38, <> wrote:


    [...]


    >> > What I needed was 'use Encode' which is what I am doing now.

    >>
    >> Please don't unless you really understand what it does. Encode does a
    >> couple of different things and it isn't entirely consistent. It seemed
    >> like a good idea at the time and it may have been useful for converting
    >> pre-5.8-code, but I really wouldn't use it for new code.

    >
    > Are you (either of you, in fact) thinking of 'use encoding'? That pragma
    > is, as I said originally, a Bad Idea.


    This would then be another documented Perl feature which managed to run
    afoul of someone's opinions. Is there actually any other reason than "it's a
    convenient way to do what shalt not be done"?
     
    Rainer Weikusat, Jan 5, 2012
    #9
  10. Shmuel (Seymour J.) Metz <> writes:
    > at 01:23 AM, Rainer Weikusat <> said:
    >
    >>This would then be another documented Perl feature which managed to run
    >>afoul of someone's opinions.

    >
    > No.


    It's documented:

    [rw@sapphire]/tmp $whatis encoding
    encoding (3perl) - allows you to write your script in non-ascii or non-utf8

    But according to the opinion of someone, it shouldn't be used.

    >>Is there actually any other reason than "it's a
    >>convenient way to do what shalt not be done"?

    >
    > Yes.


    And - as usual - no reasons beyond 'thou shalt do as I bid you and not
    ask silly questions' are given.
     
    Rainer Weikusat, Jan 5, 2012
    #10
  11. On 2012-01-04 23:51, Ben Morrow <> wrote:
    >
    > Quoth "Peter J. Holzer" <>:
    >> On 2012-01-04 07:38, <> wrote:
    >> >
    >> > It is very useful to know that perl assumes the source to be
    >> > ISO8859-1.

    >>
    >> This is not quite correct. Without 'use utf8', perl assumes your source
    >> is an unspecified superset of ASCII, not ISO-8859-1. The character codes
    >> are the same, but the semantics are different. For example, if your
    >> script was encoded in ISO-8859-1, "ä" would result in a string consisting
    >> of a single byte with the value 0xE4, but that byte is not equivalent to
    >> the character "ä" - it doesn't match \w, [:lower:] or any of the other
    >> classes "LATIN SMALL LETTER A WITH DIAERESIS" should match. It cannot be
    >> uppercased. It is just a meaningless byte, not a character.

    >
    > Aieiee, this is where we run bang smack into The Unicode Bug. Yes, that
    > string doesn't match \w, but as soon as you do anything that causes it
    > to be upgraded, it will.


    Yup, but unless you do something which causes it to be upgraded, it
    won't. So if you care about it being ISO-8859-1, you have to either
    force an upgrade or decode it. So I prefer to think of it as an
    "unspecified superset of ASCII" and not as "almost but not quite
    ISO-8859-1".
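
    The two options mentioned above (force an upgrade, or decode) might look
    roughly like this. It is a sketch under the same assumptions as before:
    no 'use locale' and no 'unicode_strings' feature, with illustrative
    variable names:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $byte = chr(0xE4);   # non-upgraded string holding one byte

# Option 1: force an upgrade. Same code point, different internal
# storage, and Unicode rules now apply to the string.
my $upgraded = $byte;
utf8::upgrade($upgraded);
my $upgraded_is_word = ($upgraded =~ /\w/) ? 1 : 0;

# Option 2: state explicitly that the byte is ISO-8859-1 and decode it.
my $decoded         = decode('ISO-8859-1', $byte);
my $decoded_is_word = ($decoded =~ /\w/) ? 1 : 0;
my $uppercased      = uc $decoded;

printf "upgraded matches \\w: %d\n", $upgraded_is_word;
printf "decoded matches \\w:  %d\n", $decoded_is_word;
printf "uc() gives U+%04X\n", ord($uppercased);
```

    Decoding has the advantage of naming the encoding in the code, instead of
    relying on the code points of ISO-8859-1 and Unicode happening to agree.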

    >> > What I needed was 'use Encode' which is what I am doing now.

    >>
    >> Please don't unless you really understand what it does. Encode does a
    >> couple of different things and it isn't entirely consistent. It seemed
    >> like a good idea at the time and it may have been useful for converting
    >> pre-5.8-code, but I really wouldn't use it for new code.

    >
    > Are you (either of you, in fact) thinking of 'use encoding'?


    Yes, sorry. I misread what was written.

    > That pragma is, as I said originally, a Bad Idea. Encode, OTOH, is
    > perfectly reliable, and cannot be avoided if you want to use data in
    > any encoding other than UTF-8.


    Right. I use that quite frequently, actually.

    hp


     
    Peter J. Holzer, Jan 5, 2012
    #11
  12. "Peter J. Holzer" <> writes:
    > On 2012-01-04 23:51, Ben Morrow <> wrote:
    >>
    >> Quoth "Peter J. Holzer" <>:
    >>> On 2012-01-04 07:38, <> wrote:
    >>> >
    >>> > It is very useful to know that perl assumes the source to be
    >>> > ISO8859-1.
    >>>
    >>> This is not quite correct. Without 'use utf8', perl assumes your source
    >>> is an unspecified superset of ASCII, not ISO-8859-1. The character codes
    >>> are the same, but the semantics are different. For example, if your
    >>> script was encoded in ISO-8859-1, "ä" would result in a string consisting
    >>> of a single byte with the value 0xE4, but that byte is not equivalent to
    >>> the character "ä" - it doesn't match \w, [:lower:] or any of the other
    >>> classes "LATIN SMALL LETTER A WITH DIAERESIS" should match. It cannot be
    >>> uppercased. It is just a meaningless byte, not a character.

    >>
    >> Aieiee, this is where we run bang smack into The Unicode Bug. Yes, that
    >> string doesn't match \w, but as soon as you do anything that causes it
    >> to be upgraded, it will.

    >
    > Yup, but unless you do something which causes it to be upgraded, it
    > won't. So if you care about it being ISO-8859-1, you have to either
    > force an upgrade or decode it. So I prefer to think of it as an
    > "unspecified superset of ASCII"


    Assuming that locale information isn't being used, it is ASCII and not
    'a superset of ASCII' since no byte value outside the subset of
    possible byte values used by the ASCII encoding has any 'character
    properties' (except being considered 'a non-word character', that is).
     
    Rainer Weikusat, Jan 5, 2012
    #12
  13. On 2012-01-05 18:43, Rainer Weikusat <> wrote:
    > "Peter J. Holzer" <> writes:
    >> On 2012-01-04 23:51, Ben Morrow <> wrote:
    >>> Quoth "Peter J. Holzer" <>:
    >>>> On 2012-01-04 07:38, <> wrote:
    >>>> >
    >>>> > It is very useful to know that perl assumes the source to be
    >>>> > ISO8859-1.
    >>>>
    >>>> This is not quite correct. Without 'use utf8', perl assumes your source
    >>>> is an unspecified superset of ASCII, not ISO-8859-1. The character codes
    >>>> are the same, but the semantics are different.

    [...]
    >>> Aieiee, this is where we run bang smack into The Unicode Bug. Yes, that
    >>> string doesn't match \w, but as soon as you do anything that causes it
    >>> to be upgraded, it will.

    >>
    >> Yup, but unless you do something which causes it to be upgraded, it
    >> won't. So if you care about it being ISO-8859-1, you have to either
    >> force an upgrade or decode it. So I prefer to think of it as an
    >> "unspecified superset of ASCII"

    >
    > Assuming that locale information isn't being used, it is ASCII and not
    > 'a superset of ASCII' since no byte value outside the subset of
    > possible byte values used by the ASCII encoding has any 'character
    > properties' (except being considered 'a non-word character', that is).


    ASCII is a 7 bit code. As soon as you have a byte with value >= 0x80 it
    isn't ASCII any more.

    hp


     
    Peter J. Holzer, Jan 6, 2012
    #13
  14. "Peter J. Holzer" <> writes:
    > On 2012-01-05 18:43, Rainer Weikusat <> wrote:
    >> "Peter J. Holzer" <> writes:
    >>> On 2012-01-04 23:51, Ben Morrow <> wrote:
    >>>> Quoth "Peter J. Holzer" <>:
    >>>>> On 2012-01-04 07:38, <> wrote:
    >>>>> >
    >>>>> > It is very useful to know that perl assumes the source to be
    >>>>> > ISO8859-1.
    >>>>>
    >>>>> This is not quite correct. Without 'use utf8', perl assumes your source
    >>>>> is an unspecified superset of ASCII, not ISO-8859-1. The character codes
    >>>>> are the same, but the semantics are different.

    > [...]
    >>>> Aieiee, this is where we run bang smack into The Unicode Bug. Yes, that
    >>>> string doesn't match \w, but as soon as you do anything that causes it
    >>>> to be upgraded, it will.
    >>>
    >>> Yup, but unless you do something which causes it to be upgraded, it
    >>> won't. So if you care about it being ISO-8859-1, you have to either
    >>> force an upgrade or decode it. So I prefer to think of it as an
    >>> "unspecified superset of ASCII"

    >>
    >> Assuming that locale information isn't being used, it is ASCII and not
    >> 'a superset of ASCII' since no byte value outside the subset of
    >> possible byte values used by the ASCII encoding has any 'character
    >> properties' (except being considered 'a non-word character', that is).

    >
    > ASCII is a 7 bit code. As soon as you have a byte with value >= 0x80 it
    > isn't ASCII any more.


    As soon as a byte with value > 127 is considered to be some character,
    it isn't ASCII anymore but a superset of ASCII.
     
    Rainer Weikusat, Jan 6, 2012
    #14
  15. Shmuel (Seymour J.) Metz <> writes:
    > In <>, on 01/05/2012
    > at 04:06 PM, Rainer Weikusat <> said:
    >
    >>It's documented:

    >
    > What's documented: the feature, or the claim that the only reason to
    > not use it is because it ran afoul of someone's opinion?


    Try an educated guess based on the content of the text I wrote. I'm
    giving you a gratis hint: I quoted the 'name' section of the encoding
    manual page.

    [...]

    >>But according to the opinion of someone, it shouldn't be used.

    >
    > Well, the opinions of the people who tried to fix the bugs.


    The mere fact that somebody failed at doing something ('tried to fix a
    bug') doesn't really make that someone authoritative on anything.

    >>And - as usual - no reasons beyond 'thou shalt do as I bid you and
    >>not ask silly questions' are given.

    >
    > And, as usual, you are inventing claims that nobody actually made.


    That claim was implicit in your refusal to give a reason.
     
    Rainer Weikusat, Jan 6, 2012
    #15
  16. On 2012-01-06 20:53, Rainer Weikusat <> wrote:
    > "Peter J. Holzer" <> writes:
    >> On 2012-01-05 18:43, Rainer Weikusat <> wrote:
    >>> "Peter J. Holzer" <> writes:
    >>>> So I prefer to think of it as an "unspecified superset of ASCII"
    >>>
    >>> Assuming that locale information isn't being used, it is ASCII and not
    >>> 'a superset of ASCII' since no byte value outside the subset of
    >>> possible byte values used by the ASCII encoding has any 'character
    >>> properties' (except being considered 'a non-word character', that is).

    >>
    >> ASCII is a 7 bit code. As soon as you have a byte with value >= 0x80 it
    >> isn't ASCII any more.

    >
    > As soon as a byte with value > 127 is considered to be some character,
    > it isn't ASCII anymore but a superset of ASCII.


    Glad you agree.

    hp


     
    Peter J. Holzer, Jan 6, 2012
    #16
  17. "Peter J. Holzer" <> writes:
    > On 2012-01-06 20:53, Rainer Weikusat <> wrote:
    >> "Peter J. Holzer" <> writes:
    >>> On 2012-01-05 18:43, Rainer Weikusat <> wrote:
    >>>> "Peter J. Holzer" <> writes:
    >>>>> So I prefer to think of it as an "unspecified superset of ASCII"
    >>>>
    >>>> Assuming that locale information isn't being used, it is ASCII and not
    >>>> 'a superset of ASCII' since no byte value outside the subset of
    >>>> possible byte values used by the ASCII encoding has any 'character
    >>>> properties' (except being considered 'a non-word character', that is).
    >>>
    >>> ASCII is a 7 bit code. As soon as you have a byte with value >= 0x80 it
    >>> isn't ASCII any more.

    >>
    >> As soon as a byte with value > 127 is considered to be some character,
    >> it isn't ASCII anymore but a superset of ASCII.

    >
    > Glad you agree.


    I don't: Since bytes with values > 127 are not considered to be
    characters, it doesn't make sense to refer to this as 'superset of
    ASCII': It would need some character outside of the ASCII range, not
    just numbers which can also be stored in bytes because of the
    hardware.
     
    Rainer Weikusat, Jan 7, 2012
    #17
  18. Kaz Kylheku Guest

    On 2012-01-07, Rainer Weikusat <> wrote:
    > "Peter J. Holzer" <> writes:
    >> On 2012-01-06 20:53, Rainer Weikusat <> wrote:
    >>> "Peter J. Holzer" <> writes:
    >>>> On 2012-01-05 18:43, Rainer Weikusat <> wrote:
    >>>>> "Peter J. Holzer" <> writes:
    >>>>>> So I prefer to think of it as an "unspecified superset of ASCII"
    >>>>>
    >>>>> Assuming that locale information isn't being used, it is ASCII and not
    >>>>> 'a superset of ASCII' since no byte value outside the subset of
    >>>>> possible byte values used by the ASCII encoding has any 'character
    >>>>> properties' (except being considered 'a non-word character', that is).
    >>>>
    >>>> ASCII is a 7 bit code. As soon as you have a byte with value >= 0x80 it
    >>>> isn't ASCII any more.
    >>>
    >>> As soon as a byte with value > 127 is considered to be some character,
    >>> it isn't ASCII anymore but a superset of ASCII.

    >>
    >> Glad you agree.

    >
    > I don't: Since bytes with values > 127 are not considered to be
    > characters


    That's not grounds to disagree. The union of the set of all bicycles and the
    set of ASCII characters is a superset of the set of ASCII characters.

    Also, the set of ASCII characters is a superset of the set of ASCII
    characters (albeit not a proper superset).

    :)
     
    Kaz Kylheku, Jan 7, 2012
    #18
  19. Ted Zlatanov Guest

    On Sat, 7 Jan 2012 00:58:04 +0000 Ben Morrow <> wrote:

    BM> Quoth Rainer Weikusat <>:
    >>
    >> The mere fact that somebody failed at doing something ('tried to fix a
    >> bug') doesn't really make that someone authoritative on anything.


    BM> I am not authoritative on anything. I have never claimed to be. I am
    BM> attempting to convey what I believe was the consensus on p5p, in the
    BM> hope that people here might find the information useful.

    Your information was useful and practical, thanks.

    BM> You are free to ignore me. I, and probably everyone else, would very
    BM> much rather you did.

    Yeah, killfiling Rainer is not enough. Get into his killfile like I did.

    Ted
     
    Ted Zlatanov, Jan 7, 2012
    #19
  20. Shmuel (Seymour J.) Metz <> writes:
    > In <>, on 01/06/2012
    > at 09:00 PM, Rainer Weikusat <> said:
    >
    >>Try an educated guess based on the content of the text I wrote.

    >
    > I did; you don't seem to like my educated guess.


    I think you came up with a rather idiotic supposition based on your
    somewhat lacking 'reading comprehension' skills, your desire to attack
    me in any case and your complete inability (or unwillingness) to come
    up with something like a rational counterargument.

    I suggest that you consider discussing the issue with an interested
    parking meter.
     
    Rainer Weikusat, Jan 8, 2012
    #20
