Flexible string representation, unicode, typography, ...

Discussion in 'Python' started by wxjmfauth@gmail.com, Aug 23, 2012.

  1. Guest

    This is neither a complaint nor a question, just a comment.

    In the previous discussion related to the flexible
    string representation, Roy Smith added this comment:

    http://groups.google.com/group/comp...read/thread/2645504f459bab50/eda342573381ff42

    Not only do I agree with his sentence:
    "Clearly, the world has moved to a 32-bit character set."

    he also used a very interesting word in his comment: "punctuation".

    There is a point which is, in my mind, not very well understood,
    "digested", underestimated or neglected by many developers:
    the relation between the coding of characters and typography.

    Unicode (the consortium) does not only deal with the coding of
    characters; it has also worked on character *classification*.

    A deliberately simplistic picture: "letters" at the bottom
    of the table, with low code points/integers; "typographic characters"
    like punctuation, common symbols, ... high in the table, with high
    code points/integers.

    The conclusion is inescapable: if one wishes to work in a "unicode
    mode", one is forced to use the whole palette of Unicode
    code points; this is the *nature* of Unicode.

    Technically, believing that it is possible to optimize only a subrange
    of the Unicode code point range is simply an illusion. It is a lot of
    work, probably quite complicated, which finally solves nothing.

    Python, in my mind, fell in this trap.

    "Simple is better than complex."
    -> hard to maintained
    "Flat is better than nested."
    -> code points range
    "Special cases aren't special enough to break the rules."
    -> special unicode code points?
    "Although practicality beats purity."
    -> or the opposite?
    "In the face of ambiguity, refuse the temptation to guess."
    -> guessing a user will only work with the "optimmized" char subrange.
    ....

    Small illustration. Take an a4 page containing 50 lines of 80 ascii
    characters, add a single 'EM DASH' or an 'BULLET' (code points > 0x2000),
    and you will see all the optimization efforts destroyed.

    >>> sys.getsizeof('a' * 80 * 50)
    4025
    >>> sys.getsizeof('a' * 80 * 50 + '•')
    8040
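
    A rough way to see the per-character cost behind those numbers
    (bytes_per_char is just an illustrative helper, and the exact figures
    depend on the CPython version and platform):

    import sys

    def bytes_per_char(s):
        # Subtract the size of an empty str object (the header), then
        # divide by the number of code points.  Header sizes differ a
        # little between storage kinds, so this is only an approximation.
        return (sys.getsizeof(s) - sys.getsizeof('')) / len(s)

    print(bytes_per_char('a' * 80 * 50))        # ~1.0 on CPython 3.3+
    print(bytes_per_char('a' * 80 * 50 + '•'))  # ~2.0 -- the whole string is widened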

    Just my 2 € (code point 0x20ac) cents.

    jmf
    , Aug 23, 2012
    #1

  2. On 23/08/2012 13:47, wrote:
    > This is neither a complaint nor a question, just a comment.
    > [snip]
    > Just my 2 € (code point 0x20ac) cents.
    >
    > jmf
    >

    I'm looking forward to all the patches you are going to provide to
    correct all these (presumably) CPython deficiencies. When do they start
    arriving on the bug tracker?

    --
    Cheers.

    Mark Lawrence.
    Mark Lawrence, Aug 23, 2012
    #2

  3. MRAB Guest

    On 23/08/2012 14:57, Neil Hodgson wrote:
    >> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
    >> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
    >> and you will see all the optimization efforts destroyed.
    >>
    >> >>> sys.getsizeof('a' * 80 * 50)
    >> 4025
    >> >>> sys.getsizeof('a' * 80 * 50 + '•')
    >> 8040
    >
    > This example is still benefiting from shrinking the number of bytes
    > in half over using 32 bits per character as was the case with Python 3.2:
    >
    > >>> sys.getsizeof('a' * 80 * 50)
    > 16032
    > >>> sys.getsizeof('a' * 80 * 50 + '•')
    > 16036

    Perhaps the solution should've been to just switch between 2/4 bytes
    instead of 1/2/4 bytes. :)
    MRAB, Aug 23, 2012
    #3
  4. Ian Kelly Guest

    On Thu, Aug 23, 2012 at 9:11 AM, MRAB <> wrote:
    > Perhaps the solution should've been to just switch between 2/4 bytes instead
    > of 1/2/4 bytes. :)


    Why? You don't lose any complexity by doing that. I can see
    arguments for 1/2/4 or for just 4, but I can't see any advantage of
    2/4 over either of those.
    Ian Kelly, Aug 23, 2012
    #4
  5. Guest

    On Thursday 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
    >> [snip]
    >
    > This example is still benefiting from shrinking the number of bytes
    > in half over using 32 bits per character as was the case with Python 3.2:
    >
    > >>> sys.getsizeof('a' * 80 * 50)
    > 16032
    > >>> sys.getsizeof('a' * 80 * 50 + '•')
    > 16036
    Correct, but how many times does it happen?
    Practically never.

    In this unicode stuff, I'm fascinated by the obsession
    with solving a problem which is, due to the nature of
    Unicode, unsolvable.

    For every optimization algorithm, for every code
    point range you can optimize, it is always possible
    to find a case breaking that optimization.

    This quasi follows mathematical logic. To prove a
    law is valid, you have to prove that all cases
    are valid. To prove a law is invalid, you just have
    to find one case that breaks it.

    Sure, it is possible to optimize the unicode usage
    by not using French characters, punctuation, mathematical
    symbols, currency symbols, CJK characters...
    (select undesired characters here: http://www.unicode.org/charts/).

    In that case, why use unicode?
    (A problem not specific to Python)

    jmf
    , Aug 23, 2012
    #5
  6. Ian Kelly Guest

    On Thu, Aug 23, 2012 at 12:33 PM, <> wrote:
    >> [snip: the sys.getsizeof comparison with Python 3.2]
    > Correct, but how many times does it happen?
    > Practically never.


    What are you talking about? Surely it happens the same number of
    times that your example happens, since it's the same example. By
    dismissing this example as being too infrequent to be of any
    importance, you dismiss the validity of your own example as well.

    > In this unicode stuff, I'm fascinated by the obsession
    > to solve a problem which is, due to the nature of
    > Unicode, unsolvable.
    >
    > For every optimization algorithm, for every code
    > point range you can optimize, it is always possible
    > to find a case breaking that optimization.


    So what? Similarly, for any generalized data compression algorithm,
    it is possible to engineer inputs for which the "compressed" output is
    as large as or larger than the original input (this is easy to prove).
    Does this mean that compression algorithms are useless? I hardly
    think so, as evidenced by the widespread popularity of tools like gzip
    and WinZip.
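
    A quick way to see both sides of that trade-off (the exact byte counts
    below are indicative only and will vary with the zlib version):

    import os
    import zlib

    # Regular, repetitive text compresses extremely well...
    text = ('a' * 80 * 50).encode('ascii')
    print(len(text), '->', len(zlib.compress(text)))    # 4000 -> a few dozen bytes

    # ...while already-random input comes out slightly larger, because the
    # compressed stream still carries its own framing overhead.
    noise = os.urandom(4000)
    print(len(noise), '->', len(zlib.compress(noise)))  # 4000 -> slightly more than 4000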

    You seem to be saying that because we cannot pack all Unicode strings
    into 1-byte or 2-byte per character representations, we should just
    give up and force everybody to use maximum-width representations for
    all strings. That is absurd.

    > Sure, it is possible to optimize the unicode usage
    > by not using French characters, punctuation, mathematical
    > symbols, currency symbols, CJK characters...
    > (select undesired characters here: http://www.unicode.org/charts/).
    >
    > In that case, why using unicode?
    > (A problematic not specific to Python)


    Obviously, it is because I want to have the *ability* to represent all
    those characters in my strings, even if I am not necessarily going to
    take advantage of that ability in every single string that I produce.
    Not all of the strings I use are going to fit into the 1-byte or
    2-byte per character representation. Fine, whatever -- that's part of
    the cost of internationalization. However, *most* of the strings that
    I work with (this entire email message, for instance) -- and, I think,
    most of the strings that any developer works with (identifiers in the
    standard library, for instance) -- will fit into at least the 2-byte
    per character representation. Why shackle every string everywhere to
    4 bytes per character when for a majority of them we can do much
    better than that?
    Ian Kelly, Aug 23, 2012
    #6
  7. On 23/08/2012 19:33, wrote:
    > [snip]
    > In that case, why use unicode?
    > (A problem not specific to Python)
    >
    > jmf
    >


    What do you propose should be used instead, as you appear to be the
    resident expert in the field?

    --
    Cheers.

    Mark Lawrence.
    Mark Lawrence, Aug 23, 2012
    #7
  8. On Thursday, 23 August 2012 18:17:29 UTC+5:30, (unknown) wrote:
    > This is neither a complaint nor a question, just a comment.
    > [snip]
    > Just my 2 € (code point 0x20ac) cents.
    >
    > jmf


    The Zen of Python is simply a guideline
    Ramchandra Apte, Aug 24, 2012
    #8
  9. rusi Guest

    On Aug 24, 12:22 am, Ian Kelly <> wrote:
    > On Thu, Aug 23, 2012 at 12:33 PM,  <> wrote:
    > > Correct, but how many times does it happen?
    > > Practically never.
    > [snip]
    > Why shackle every string everywhere to
    > 4 bytes per character when for a majority of them we can do much
    > better than that?


    Actually, what exactly are you (jmf) asking for?
    It's not clear to anybody, as best as we can see...
    rusi, Aug 24, 2012
    #9
  10. On 24/08/2012 17:06, rusi wrote:

    >
    > Actually, what exactly are you (jmf) asking for?
    > It's not clear to anybody, as best as we can see...
    >


    A knee in the temple and a dagger up the <censored> ? :) From another
    Monty Python sketch for those who don't know.

    --
    Cheers.

    Mark Lawrence.
    Mark Lawrence, Aug 24, 2012
    #10
  11. On Fri, 24 Aug 2012 17:47:42 +0100, Mark Lawrence
    <> declaimed the following in
    gmane.comp.python.general:

    >
    > A knee in the temple and a dagger up the <censored> ? :) From another
    > Monty Python sketch for those who don't know.


    A poignard in the codpiece...
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
    Dennis Lee Bieber, Aug 24, 2012
    #11
  12. Ramchandra Apte <maniandram01 <at> gmail.com> writes:
    >
    > The Zen of Python is simply a guideline


    What's more, the Zen guides the language's design, not its implementation.
    People who think CPython is a complicated implementation can take a look at PyPy
    :)

    Regards

    Antoine.


    --
    Software development and contracting: http://pro.pitrou.net
    Antoine Pitrou, Aug 25, 2012
    #12
  13. Guest

    On Saturday 25 August 2012 02:24:35 UTC+2, Antoine Pitrou wrote:
    > Ramchandra Apte <maniandram01 <at> gmail.com> writes:
    > > The Zen of Python is simply a guideline
    >
    > What's more, the Zen guides the language's design, not its implementation.
    > People who think CPython is a complicated implementation can take a look
    > at PyPy :)

    Unicode design: a flat table of code points, where all code
    points are "equal".
    As soon as one attempts to escape from this rule, one has to
    "pay" for it.
    The creator of this machinery (the flexible string representation)
    cannot even benefit from it in his native language (I think
    I'm correctly informed).

    Hint: Google -> "Das grosse Eszett"
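
    For reference, a quick interpreter check of what that hint presumably
    points at (output shown as comments):

    print(hex(ord('ß')))        # 0xdf   -- U+00DF, inside the Latin-1 range
    print(hex(ord('\u1e9e')))   # 0x1e9e -- LATIN CAPITAL LETTER SHARP S, outside it
    print('ß'.upper())          # 'SS'   -- the traditional uppercase mapping in Python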

    jmf
    , Aug 25, 2012
    #13
  15. On 25/08/2012 08:27, wrote:
    > Unicode design: a flat table of code points, where all code
    > points are "equal".
    > As soon as one attempts to escape from this rule, one has to
    > "pay" for it.
    > The creator of this machinery (the flexible string representation)
    > cannot even benefit from it in his native language (I think
    > I'm correctly informed).
    >
    > Hint: Google -> "Das grosse Eszett"
    >
    > jmf
    >


    It's Saturday morning, I'm stone cold sober, had a good sleep, and I'm
    still baffled as to the point, if any. Could someone please enlighten me?

    --
    Cheers.

    Mark Lawrence.
    Mark Lawrence, Aug 25, 2012
    #15
  16. On 25/08/2012 10:58, Mark Lawrence wrote:
    > On 25/08/2012 08:27, wrote:
    >>
    >> Unicode design: a flat table of code points, where all code
    >> points are "equal".
    >> As soon as one attempts to escape from this rule, one has to
    >> "pay" for it.
    >> The creator of this machinery (the flexible string representation)
    >> cannot even benefit from it in his native language (I think
    >> I'm correctly informed).
    >>
    >> Hint: Google -> "Das grosse Eszett"
    >>
    >> jmf
    >>

    >
    >> It's Saturday morning, I'm stone cold sober, had a good sleep, and I'm
    >> still baffled as to the point, if any. Could someone please enlighten me?
    >


    Here's what I think he is saying. I am posting this to test the water. I
    am also confused, and if I have got it wrong hopefully someone will
    correct me.

    In Python 3.3, unicode strings are now stored as follows -
    if all characters can be represented by 1 byte, the entire string is
    composed of 1-byte characters
    else if all characters can be represented by 1 or 2 bytes, the entire
    string is composed of 2-byte characters
    else the entire string is composed of 4-byte characters

    There is an overhead in making this choice, to detect the lowest number
    of bytes required.

    jmfauth believes that this only benefits 'English-speaking' users, as
    the rest of the world will tend to have strings where at least one
    character requires 2 or 4 bytes. So they incur the overhead, without
    getting any benefit.

    Therefore, I think he is saying that he would have preferred that Python
    standardise on 4-byte characters, on the grounds that the saving in
    memory does not justify the performance overhead.
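
    A small sketch of those three cases (the exact byte counts depend on the
    CPython build; what matters is the jump in size when a single wider
    character is appended):

    import sys

    ascii_s  = 'a' * 1000                 # all code points <= U+00FF -> 1 byte each
    bmp_s    = 'a' * 1000 + '\u20ac'      # one code point <= U+FFFF  -> 2 bytes each
    astral_s = 'a' * 1000 + '\U0001F600'  # one code point > U+FFFF   -> 4 bytes each

    for s in (ascii_s, bmp_s, astral_s):
        print(len(s), sys.getsizeof(s))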

    Frank Millman
    Frank Millman, Aug 25, 2012
    #16
  17. On 25/08/2012 10:46, Frank Millman wrote:
    > [snip]
    > Therefore, I think he is saying that he would have preferred that Python
    > standardise on 4-byte characters, on the grounds that the saving in
    > memory does not justify the performance overhead.
    >
    > Frank Millman


    I thought Terry Reedy had shot down any claims about performance
    overhead, and that the memory savings in many cases must be substantial
    and therefore worthwhile. Or have I misread something? Or what?

    --
    Cheers.

    Mark Lawrence.
    Mark Lawrence, Aug 25, 2012
    #17
  18. On Sat, Aug 25, 2012 at 9:05 PM, Mark Lawrence <> wrote:
    > I thought Terry Reedy had shot down any claims about performance overhead,
    > and that the memory savings in many cases must be substantial and therefore
    > worthwhile. Or have I misread something? Or what?


    My reading of the thread(s) is that there are two reasons for the
    debate to continue to rage:

    1) Comparisons with a "narrow build" in which most characters take two
    bytes but there are one or two characters that get encoded with
    surrogates. The new system will allocate four bytes per character for
    the whole string.

    2) Arguments on the basis of huge strings that represent _all the
    data_ that your program's working with, forgetting that there are
    numerous strings all through everything that are ASCII-only.
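
    A quick way to see which of those worlds an interpreter lives in (the
    comments describe the expected output; on a pre-3.3 "narrow" build the
    astral character is stored as a surrogate pair):

    import sys

    print(hex(sys.maxunicode))   # 0xffff on a narrow build, 0x10ffff on 3.3+ / wide builds
    s = '\U0001F600'             # one astral code point
    print(len(s))                # 2 on a narrow build (surrogate pair), 1 on 3.3+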

    ChrisA
    Chris Angelico, Aug 25, 2012
    #18
  19. Terry Reedy Guest

    On 8/25/2012 7:05 AM, Mark Lawrence wrote:

    > I thought Terry Reedy had shot down any claims about performance
    > overhead, and that the memory savings in many cases must be substantial
    > and therefore worthwhile. Or have I misread something?


    No, you have correctly read what I and others have said. Jim appears to
    not be interested in dialog. Let's leave it at that.


    --
    Terry Jan Reedy
    Terry Reedy, Aug 25, 2012
    #19
  20. Guest

    On Saturday 25 August 2012 11:46:34 UTC+2, Frank Millman wrote:
    > [snip]
    > Therefore, I think he is saying that he would have preferred that Python
    > standardise on 4-byte characters, on the grounds that the saving in
    > memory does not justify the performance overhead.
    >
    > Frank Millman

    Very well explained. Thanks.

    More precisely, affected are not only the 'English-speaking'
    users, but all the users who are using non-Latin-1 characters.
    (See the title of this topic, ... typography.)

    Being at the same time Latin-1 and Unicode compliant is
    a plain absurdity in the mathematical sense.

    ---

    For those who do not know, the Go language has introduced
    the rune type. As far as I know, nobody is complaining; I
    have not even seen a discussion related to this subject.


    100% Unicode compliant from day 0. Congratulations.

    jmf
    , Aug 25, 2012
    #20
