What do you mean by screen length? Do you mean the length in bytes? That
depends on your encoding. Do you mean width of the displayed version?
That depends on your font.
That's what I thought, but on doing some experimentation in my terminal,
and doing some googling, I have come to the understanding that so-called
monospaced (fixed-width) fonts may support *double column* characters as
well as single column.
So the OP's example has:
s1 = u"abcdef"
s2 = u"あいう"
s1 has six single-column ("narrow") characters, while s2 has three double-
column ("wide") characters, and both strings should take up the same
horizontal space on screen.
If you are reading this in a non-monospaced font, the width of each
character is not fixed, the idea of columns doesn't really work, and the
strings may not be the same width.
See
http://www.unicode.org/reports/tr11/tr11-19.html for more detail.
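
The narrow/wide distinction for the two example strings can be checked
directly against the Unicode database, independently of any font. The
following is only a small sketch; the helper name width_classes is made
up for illustration, and s2 is spelled with escapes so the example does
not depend on the encoding of this message.

```python
import unicodedata

s1 = u"abcdef"
s2 = u"\u3042\u3044\u3046"  # HIRAGANA LETTER A, I, U

def width_classes(s):
    """Return the East Asian Width class of each character in s.

    Hypothetical helper, for illustration only.
    """
    return [unicodedata.east_asian_width(c) for c in s]

print(width_classes(s1))  # six 'Na' (narrow) entries
print(width_classes(s2))  # three 'W' (wide) entries
```

Six narrow characters at one column each and three wide characters at
two columns each both come to six columns.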
Interestingly, Unicode supports wide versions of many non-East Asian
characters (presumably because pre-Unicode East Asian encodings supported
them). For example, run this code in Python:
print(u'\N{FULLWIDTH LATIN CAPITAL LETTER A}'); print(u'AA')
which should output:
Ａ
AA
If your font supports this, you should see a single "A" as wide as the
double "AA" beneath it.
Curiously, in the monospaced font I am using to type this, the
"fullwidth" (wide, two-column) A is actually 2/3rds the width of the
standard ("halfwidth", narrow, one-column) A. Font designers -- can't
live with them, can't take them out and shoot them.
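
Whatever a particular font does with it, the fullwidth A's width class
and its relationship to the ordinary A can be inspected without
rendering anything. A small sketch using only the standard unicodedata
module (the variable names are mine):

```python
import unicodedata

fullwidth_a = u"\N{FULLWIDTH LATIN CAPITAL LETTER A}"  # U+FF21
plain_a = u"A"

# East Asian Width class: 'F' (fullwidth) versus 'Na' (narrow).
print(unicodedata.east_asian_width(fullwidth_a))
print(unicodedata.east_asian_width(plain_a))

# NFKC compatibility normalization folds the fullwidth form back to
# the ordinary letter.
folded = unicodedata.normalize("NFKC", fullwidth_a)
print(folded == plain_a)
```

So the two are distinct code points whose only difference is the
intended display width.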
Hans Mulder's suggestion:
from unicodedata import east_asian_width
def screen_length(s):
    return sum(2 if east_asian_width(c) == 'W' else 1 for c in s)
is almost right. The Unicode document above states:
In a broad sense, wide characters include W, F, and A (when in East Asian
context), and narrow characters include N, Na, H, and A (when not in East
Asian context).
[end quote]
So something like this:
from unicodedata import east_asian_width
def columns(s, eastasian_context=True):
    # 'A' (ambiguous) characters count as wide only in an East Asian context.
    if eastasian_context:
        wide = 'WFA'
    else:
        wide = 'WF'
    return sum(2 if east_asian_width(c) in wide else 1 for c in s)
ought to do it for all but the most sophisticated text layout
applications. For those needing much more sophistication, see here:
http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
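
As one small step in that direction, here is a sketch that also gives
combining marks zero width, which is one of the things wcwidth.c takes
care of. It is not a port of that code: control characters and other
special cases are ignored, and the names display_columns and pad_to are
invented for this example.

```python
import unicodedata

def display_columns(s, eastasian_context=True):
    """Estimate the terminal columns used by s.

    Assumption: non-spacing and enclosing marks (categories Mn, Me)
    occupy no column of their own. Everything else follows the
    wide/narrow rule quoted from the Unicode report.
    """
    wide = ("W", "F", "A") if eastasian_context else ("W", "F")
    total = 0
    for c in s:
        if unicodedata.category(c) in ("Mn", "Me"):
            continue  # combining mark: drawn over the previous character
        total += 2 if unicodedata.east_asian_width(c) in wide else 1
    return total

def pad_to(s, width, eastasian_context=True):
    """Left-justify s in a field of `width` columns (hypothetical helper)."""
    missing = width - display_columns(s, eastasian_context)
    return s + u" " * max(0, missing)

for label in (u"abcdef", u"\u3042\u3044\u3046", u"e\u0301"):
    print(pad_to(label, 8) + u"|")
```

In a terminal whose font honours the width classes, the three "|"
characters should line up in the same column.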