How to get a "screen" length of a multibyte string?

kobayashi

Hello,

On a platform with a fixed-pitch font,
I want to get the "screen" length of a multibyte string.

--- sample ---
s1 = u"abcdef"
s2 = u"あいう" # It has the same "screen" length as s1.
print len(s1) # Got 6
print len(s2) # Got 3, but I want to get 6.
 
Chris Angelico

What do you mean by screen length? Do you mean the length in bytes?
That depends on your encoding. Do you mean width of the displayed
version? That depends on your font.

ChrisA
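
To make the two readings of "length" concrete, here is a minimal sketch. It is written for Python 3, where every str is Unicode, unlike the Python 2 code elsewhere in this thread. It compares the number of code points with the number of bytes under a few encodings; the choice of encodings is only for illustration.

```python
s2 = "\u3042\u3044\u3046"  # HIRAGANA LETTER A, I, U

code_points = len(s2)                      # number of Unicode code points
utf8_bytes = len(s2.encode("utf-8"))       # 3 bytes per character here
utf16_bytes = len(s2.encode("utf-16-le"))  # "-le" variant adds no byte order mark
sjis_bytes = len(s2.encode("shift_jis"))   # 2 bytes per character here

print(code_points, utf8_bytes, utf16_bytes, sjis_bytes)  # 3 9 6 6
```

None of these numbers is the display width. Some byte counts happen to equal it for this string, but that is a coincidence of the encoding.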
 
Hans Mulder

How about:

from unicodedata import east_asian_width

def screen_length(s):
    return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)


Hope this helps,

-- HansM
 
Steven D'Aprano

Chris Angelico wrote:

What do you mean by screen length? Do you mean the length in bytes? That
depends on your encoding. Do you mean width of the displayed version?
That depends on your font.
[end quote]

That's what I thought, but on doing some experimentation in my terminal,
and doing some googling, I have come to the understanding that so-called
monospaced (fixed-width) fonts may support *double column* characters as
well as single column.

So the OP's example has:

s1 = u"abcdef"
s2 = u"あいう"

s1 has six single-column ("narrow") characters, while s2 has three double-
column ("wide") characters, and both strings should take up the same
horizontal space on screen.

If you are reading this in a non-monospaced font, the width of each
character is not fixed, the idea of columns doesn't really work, and the
strings may not be the same width.

See http://www.unicode.org/reports/tr11/tr11-19.html for more detail.


Interestingly, Unicode supports wide versions of many non-EastAsian
characters (presumably because pre-Unicode EastAsian encodings supported
them). For example, run this code in Python:

print u'\N{FULLWIDTH LATIN CAPITAL LETTER A}'; print u'AA'

which should output:

Ａ
AA

If your font supports this, you should see a single "A" as wide as the
double "AA" beneath it.

Curiously, in the monospaced font I am using to type this, the
"fullwidth" (wide, two-column) A is actually 2/3rds the width of the
standard ("halfwidth", narrow, one-column) A. Font designers -- can't
live with them, can't take them out and shoot them.


Hans Mulder's suggestion:

from unicodedata import east_asian_width

def screen_length(s):
    return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)


is almost right. The Unicode document above states:

In a broad sense, wide characters include W, F, and A (when in East Asian
context), and narrow characters include N, Na, H, and A (when not in East
Asian context).
[end quote]

from unicodedata import east_asian_width

def columns(s, eastasian_context=True):
    # 'A' (ambiguous) characters count as wide only in an East Asian context.
    if eastasian_context:
        wide = 'WFA'
    else:
        wide = 'WF'
    return sum(2 if east_asian_width(c) in wide else 1 for c in s)


ought to do it for all but the most sophisticated text layout
applications. For those needing much more sophistication, see here:


http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
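
As a quick check of the approach above, here is a self-contained sketch in Python 3 syntax (print as a function, plain str literals). It restates columns() and tries it on the strings from this thread. The sample characters and variable names are chosen for illustration, and the exact classes depend on the Unicode data shipped with your Python.

```python
from unicodedata import east_asian_width

def columns(s, eastasian_context=True):
    """Estimate terminal columns: 2 for wide characters, 1 for all others."""
    # 'A' (ambiguous) characters count as wide only in an East Asian context.
    wide = 'WFA' if eastasian_context else 'WF'
    return sum(2 if east_asian_width(c) in wide else 1 for c in s)

s1 = "abcdef"
s2 = "\u3042\u3044\u3046"  # HIRAGANA LETTER A, I, U (class 'W')
fullwidth_a = "\uff21"     # FULLWIDTH LATIN CAPITAL LETTER A (class 'F')
alpha = "\u03b1"           # GREEK SMALL LETTER ALPHA (class 'A', ambiguous)

print(columns(s1), columns(s2))  # 6 6
print(columns(fullwidth_a))      # 2; a test for 'W' alone would give 1
print(columns(alpha), columns(alpha, eastasian_context=False))  # 2 1
```

The fullwidth A is the case that a check for 'W' alone misses, since its class is 'F'.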
 
kobayashi

The encoding is UTF-8.
By "screen length" I mean this: an ASCII character counts as 1, and a character that is twice as wide as an ASCII character counts as 2.
It's not bytes, but drawing width.
As you say, it depends on the font. I'll consider this carefully.
 
kobayashi

I'm grateful for the more detailed information and the better code.
I learned a lot, and I'll use it.

Thanks,
 
Evan Driscoll

Don't forget also that there are combining characters. To wit:
'á'

(U+00e1 is an 'a' with acute accent; U+0061 is an unaccented 'a'; U+0301
is a combining acute accent.)


So far the discussion has been on single Unicode code points which
appear as a double-wide glyph (I did not know about those!); depending
on how you want to look at it, combining characters result in sequences
of Unicode code points which result in a single glyph, or combining
characters are zero-width code points.

Evan
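
One way to fold the "zero-width code point" view into a column count is sketched below, in Python 3. The function name screen_columns is hypothetical. It skips any code point whose canonical combining class (unicodedata.combining) is nonzero, and counts only classes 'W' and 'F' as wide, leaving the ambiguous class 'A' narrow. This is an assumption-laden approximation: some zero-width characters have combining class 0 and would still be counted as one column.

```python
import unicodedata

def screen_columns(s):
    """Rough column count: combining marks 0, wide characters 2, others 1."""
    total = 0
    for c in s:
        if unicodedata.combining(c):
            continue  # nonzero combining class: drawn on the previous character
        total += 2 if unicodedata.east_asian_width(c) in 'WF' else 1
    return total

precomposed = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # 'a' followed by COMBINING ACUTE ACCENT

print(len(precomposed), len(decomposed))                        # 1 2
print(screen_columns(precomposed), screen_columns(decomposed))  # 1 1
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
```

Normalizing to NFC first merges many such sequences into one code point, but not every combination has a precomposed form, so the combining check is still useful.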
 
