How to get a "screen" length of a multibyte string?

Discussion in 'Python' started by kobayashi, Nov 25, 2012.

  1. kobayashi

    kobayashi Guest

    Hello,

    On a platform that has a fixed-pitch font,
    I want to get the "screen" length of a multibyte string.

    --- sample ---
    s1 = u"abcdef"
    s2 = u"あいう" # It has the same "screen" length as s1.
    print len(s1) # Got 6
    print len(s2) # Got 3, but I want to get 6.
    --------------

    The above gets the "character" length of a multibyte string.
    Is there a way to get the "screen" length of a multibyte string?
     
    kobayashi, Nov 25, 2012
    #1

  2. On Sun, Nov 25, 2012 at 9:19 PM, kobayashi <> wrote:
    > Hello,
    >
    > On a platform that has a fixed-pitch font,
    > I want to get the "screen" length of a multibyte string.
    >
    > --- sample ---
    > s1 = u"abcdef"
    > s2 = u"あいう" # It has the same "screen" length as s1.
    > print len(s1) # Got 6
    > print len(s2) # Got 3, but I want to get 6.
    > --------------
    >
    > The above gets the "character" length of a multibyte string.
    > Is there a way to get the "screen" length of a multibyte string?


    What do you mean by screen length? Do you mean the length in bytes?
    That depends on your encoding. Do you mean width of the displayed
    version? That depends on your font.
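
To illustrate the first point, here is a minimal sketch (assuming Python 3.3+, where the u prefix is accepted) showing that the byte length of the same three-character string changes with the encoding. The three encodings are picked only for illustration:

```python
s2 = u"\u3042\u3044\u3046"  # the three hiragana from the sample: あいう

# The same three code points encode to different byte counts.
utf8_len = len(s2.encode("utf-8"))       # 3 bytes per character
utf16_len = len(s2.encode("utf-16-le"))  # 2 bytes per character, no BOM
sjis_len = len(s2.encode("shift_jis"))   # 2 bytes per character

print(len(s2), utf8_len, utf16_len, sjis_len)
```

None of these byte counts is a display width; they only happen to coincide with it for some encodings.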

    ChrisA
     
    Chris Angelico, Nov 25, 2012
    #2

  3. Hans Mulder

    Hans Mulder Guest

    On 25/11/12 11:19:18, kobayashi wrote:
    > Hello,
    >
    > On a platform that has a fixed-pitch font,
    > I want to get the "screen" length of a multibyte string.
    >
    > --- sample ---
    > s1 = u"abcdef"
    > s2 = u"あいう" # It has the same "screen" length as s1.
    > print len(s1) # Got 6
    > print len(s2) # Got 3, but I want to get 6.
    > --------------
    >
    > The above gets the "character" length of a multibyte string.
    > Is there a way to get the "screen" length of a multibyte string?


    How about:

    from unicodedata import east_asian_width

    def screen_length(s):
        return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)
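
A quick check of this function against the two sample strings, as a sketch assuming Python 3.3+ (the function is repeated so the snippet runs on its own):

```python
from unicodedata import east_asian_width

def screen_length(s):
    # Count 2 columns for East Asian Wide ('W') characters, 1 for everything else.
    return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)

s1 = u"abcdef"
s2 = u"\u3042\u3044\u3046"  # あいう

# Character count next to column count for each string.
print(len(s1), screen_length(s1))
print(len(s2), screen_length(s2))
```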


    Hope this helps,

    -- HansM
     
    Hans Mulder, Nov 25, 2012
    #3
  4. On Sun, 25 Nov 2012 22:12:33 +1100, Chris Angelico wrote:

    > On Sun, Nov 25, 2012 at 9:19 PM, kobayashi <> wrote:
    >> Hello,
    >>
    >> On a platform that has a fixed-pitch font,
    >> I want to get the "screen" length of a multibyte string.
    >>
    >> --- sample ---
    >> s1 = u"abcdef"
    >> s2 = u"あいう" # It has the same "screen" length as s1.
    >> print len(s1) # Got 6
    >> print len(s2) # Got 3, but I want to get 6.
    >> --------------
    >>
    >> The above gets the "character" length of a multibyte string.
    >> Is there a way to get the "screen" length of a multibyte string?

    >
    > What do you mean by screen length? Do you mean the length in bytes? That
    > depends on your encoding. Do you mean width of the displayed version?
    > That depends on your font.


    That's what I thought, but on doing some experimentation in my terminal,
    and doing some googling, I have come to the understanding that so-called
    monospaced (fixed-width) fonts may support *double column* characters as
    well as single column.

    So the OP's example has:

    s1 = u"abcdef"
    s2 = u"あいう"

    s1 has six single-column ("narrow") characters, while s2 has three double-
    column ("wide") characters, and both strings should take up the same
    horizontal space on screen.

    If you are reading this in a non-monospaced font, the width of each
    character is not fixed, the idea of columns doesn't really work, and the
    strings may not be the same width.

    See http://www.unicode.org/reports/tr11/tr11-19.html for more detail.


    Interestingly, Unicode supports wide versions of many non-EastAsian
    characters (presumably because pre-Unicode EastAsian encodings supported
    them). For example, run this code in Python:

    print u'\N{FULLWIDTH LATIN CAPITAL LETTER A}'; print u'AA'

    which should output:

    Ａ
    AA

    If your font supports this, you should see a single "A" as wide as the
    double "AA" beneath it.
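
The unicodedata module reports these width classes directly. A small sketch, assuming Python 3.3+: screen_length_wf is a hypothetical name, and counting both 'W' (Wide) and 'F' (Fullwidth) as two columns is my reading of the classes described in the Unicode report linked above:

```python
from unicodedata import east_asian_width

# East Asian Width property of a narrow, a wide, and a fullwidth character.
narrow = east_asian_width(u"a")          # 'Na': narrow
wide = east_asian_width(u"\u3042")       # 'W': HIRAGANA LETTER A
fullwidth = east_asian_width(u"\uff21")  # 'F': FULLWIDTH LATIN CAPITAL LETTER A

def screen_length_wf(s):
    # Hypothetical helper: 2 columns for 'W' or 'F', 1 otherwise.
    return sum(2 if east_asian_width(c) in ('W', 'F') else 1 for c in s)

print(narrow, wide, fullwidth)
print(screen_length_wf(u"\uff21"), screen_length_wf(u"AA"))
```

Whether a terminal actually draws the fullwidth letter two columns wide still depends on the font.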

    Curiously, in the monospaced font I am using to type this, the
    "fullwidth" (wide, two-column) A is actually 2/3rds the width of the
    standard ("halfwidth", narrow, one-column) A. Font designers -- can't
    live with them, can't take them out and shoot them.


    Hans Mulder's suggestion:

    from unicodedata import east_asian_width

    def screen_length(s):
        return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)


    is almost right. The Unicode document above states:

     
    Steven D'Aprano, Nov 25, 2012
    #4
  5. kobayashi

    kobayashi Guest

    Encoding is utf-8.
    By "screen length" I mean this: an ASCII character has width 1, and a character twice as wide as an ASCII character has width 2.
    It's not bytes, but drawing width.
    As you say, it depends on the font. I'll consider it carefully.
     
    kobayashi, Nov 25, 2012
    #5
  7. kobayashi

    kobayashi Guest

    Great, it's just the solution I needed. I'll use it.

    Thanks,
     
    kobayashi, Nov 25, 2012
    #7
  8. kobayashi

    kobayashi Guest

    I'm grateful for the more detailed information and better code.
    I learned a lot and I'll use it.

    Thanks,
     
    kobayashi, Nov 25, 2012
    #8
  9. On 25.11.12 12:19, kobayashi wrote:
    > On a platform that has a fixed-pitch font,
    > I want to get the "screen" length of a multibyte string.
    >
    > --- sample ---
    > s1 = u"abcdef"
    > s2 = u"あいう" # It has the same "screen" length as s1.
    > print len(s1) # Got 6
    > print len(s2) # Got 3, but I want to get 6.
    > --------------
    >
    > The above gets the "character" length of a multibyte string.
    > Is there a way to get the "screen" length of a multibyte string?


    http://bugs.python.org/issue12568
     
    Serhiy Storchaka, Nov 25, 2012
    #9
  10. Re: Re: How to get a "screen" length of a multibyte string?

    On 11/25/2012 07:48 AM, kobayashi wrote:
    > Encoding is utf-8.
    > By "screen length" I mean this: an ASCII character has width 1, and a character twice as wide as an ASCII character has width 2.
    > It's not bytes, but drawing width.
    > As you say, it depends on the font. I'll consider it carefully.
    >


    Don't forget also that there are combining characters. To wit:

    >>> "\u00e1"
    'á'
    >>> "\u0061\u0301"
    'á'

    (U+00e1 is an 'a' with acute accent; U+0061 is an unaccented 'a'; U+0301
    is a combining acute accent.)


    So far the discussion has been on single Unicode code points which
    appear as a double-wide glyph (I did not know about those!); depending
    on how you want to look at it, combining characters result in sequences
    of Unicode code points which result in a single glyph, or combining
    characters are zero-width code points.
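
Under the second view (combining characters as zero-width code points), a width count could simply skip them. A minimal sketch, assuming Python 3.3+: screen_length_combining is a hypothetical name, and a real terminal may treat other zero-width or ambiguous-width characters differently:

```python
import unicodedata

def screen_length_combining(s):
    # Hypothetical helper: characters with a nonzero combining class take no
    # column of their own, Wide/Fullwidth characters take two, the rest one.
    width = 0
    for c in s:
        if unicodedata.combining(c):
            continue  # e.g. U+0301 COMBINING ACUTE ACCENT
        width += 2 if unicodedata.east_asian_width(c) in ('W', 'F') else 1
    return width

precomposed = u"\u00e1"       # single code point
decomposed = u"\u0061\u0301"  # 'a' followed by a combining acute accent

# Code point counts differ, column counts agree.
print(len(precomposed), len(decomposed))
print(screen_length_combining(precomposed), screen_length_combining(decomposed))
```

Normalizing with unicodedata.normalize('NFC', s) first would merge this particular pair into one code point, but not every combining sequence has a precomposed form, so skipping the marks is the more general approach.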

    Evan
     
    Evan Driscoll, Nov 26, 2012
    #10