textwrap and combining diacritical marks

Discussion in 'Python' started by Berteun Damman, Jun 28, 2007.

  1. Hello,

    When using the textwrap module, the wrapper always uses len() to
    measure the string being wrapped. That is a sensible thing to do in
    many circumstances, but I think there are circumstances where it does
    not lead to the desired result.

    I assume this module is mostly used where text is formatted for
    presentation to a user, e.g. in a console application. The number of
    characters in the string, as reported by len(), might not be the number
    of columns occupied. Some of the characters might be combining
    diacritical marks, which go on top of the previous character, e.g. the
    string de'ge'ne're' (where each ' stands for a combining acute accent)
    has a len() of 12 but displays with a width of only 8 columns.
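
    A minimal sketch of that mismatch, assuming Python 3 and only the
    standard library. The helper name visible_columns is hypothetical, and
    counting every non-combining character as one column is a
    simplification:

```python
import unicodedata

# "degenere" with a combining acute accent (U+0301) after each "e".
word = "de\u0301ge\u0301ne\u0301re\u0301"

def visible_columns(text):
    """Count characters that are not combining marks (a rough column estimate)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(len(word))              # 12 code points
print(visible_columns(word))  # 8 base characters
```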

    The string might also include escape sequences that switch the console
    to bold or underline mode, which have zero display width. If this
    happens a lot, the resulting text can look very badly formatted because
    of all these zero-width character sequences.
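
    As an illustration, assuming the bold/underline switches are ANSI SGR
    sequences of the form ESC [ ... m (other escape sequences are not
    handled here), their contribution to len() can be seen like this. The
    names SGR_RE and printable_len are hypothetical:

```python
import re

# Matches SGR sequences such as "\x1b[1m" (bold) or "\x1b[0m" (reset).
# Assumption: only CSI ... "m" sequences occur in the text.
SGR_RE = re.compile(r"\x1b\[[0-9;]*m")

def printable_len(text):
    """Length of text after removing SGR escape sequences."""
    return len(SGR_RE.sub("", text))

styled = "\x1b[1mbold\x1b[0m and \x1b[4munderlined\x1b[0m"
print(len(styled))            # 35, counting the escape sequences
print(printable_len(styled))  # 19, the length of "bold and underlined"
```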

    It is of course impossible to handle every scenario in which some
    characters influence the width of the displayed string, but
    wouldn't it be convenient to have a 'chunk_width' method or something
    similar that can be overridden in a derived class, so that a user can
    supply a custom implementation? The default chunk_width could simply be
    'len()'.
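
    To make the idea concrete, here is a sketch of a greedy word wrapper
    that takes the width function as a parameter. It does not subclass or
    modify textwrap.TextWrapper; wrap_by_width and columns are hypothetical
    names, and chunk_width is borrowed from the proposal above:

```python
import unicodedata

def wrap_by_width(text, width=70, chunk_width=len):
    """Greedy word wrap; chunk_width(word) gives the columns a word occupies."""
    lines, current, used = [], [], 0
    for word in text.split():
        w = chunk_width(word)
        # One separating space is needed if the line already holds a word.
        needed = w + (1 if current else 0)
        if current and used + needed > width:
            lines.append(" ".join(current))
            current, used = [word], w
        else:
            current.append(word)
            used += needed
    if current:
        lines.append(" ".join(current))
    return lines

def columns(word):
    """Custom width: ignore combining marks."""
    return sum(1 for ch in word if not unicodedata.combining(ch))

text = "de\u0301ge\u0301ne\u0301re\u0301 de\u0301ge\u0301ne\u0301re\u0301"
print(len(wrap_by_width(text, width=17)))                       # 2 lines with len()
print(len(wrap_by_width(text, width=17, chunk_width=columns)))  # 1 line with columns()
```

    With the default len(), each word counts as 12 and the two words do not
    fit in 17 columns; with the custom function they count as 8 + 1 + 8 = 17
    and stay on one line.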

    And that leads to another question: does Python have a function akin to
    wcwidth(), which gives the number of column positions a Unicode character
    needs?

    Berteun
     
    Berteun Damman, Jun 28, 2007
    #1

  2. On Thu, 28 Jun 2007 09:19:20 +0000 (UTC), Berteun Damman
    <berteun@NO_SPAMdds.nl> wrote:
    > And that leads to another question: does Python have a function akin to
    > wcwidth(), which gives the number of column positions a Unicode character
    > needs?


    After playing around a bit with unicodedata.normalize, and seeing that
    it does not help when a character has no precomposed form, I decided to
    take Markus Kuhn's implementation [1] and make a Python version [2].
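
    The limitation of normalization can be seen in a short sketch (Python
    3). It assumes, to the best of my knowledge, that "x" plus a combining
    acute accent has no precomposed code point, whereas "e" plus the same
    accent composes to U+00E9:

```python
import unicodedata

e_acute = "e\u0301"  # composes to the single code point U+00E9
x_acute = "x\u0301"  # assumed to have no precomposed form

nfc_e = unicodedata.normalize("NFC", e_acute)
nfc_x = unicodedata.normalize("NFC", x_acute)

print(len(nfc_e))  # 1: len() now matches the display width
print(len(nfc_x))  # 2: the combining mark is still a separate code point
```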

    This tries to guess the column width of a character. Non-printable
    characters report a width of -1 (this includes '\n' and '\t', for
    example), except for '\0', which has width 0. Combining characters
    report 0, normal Latin characters 1, and full-width forms, for
    example, 2.
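
    These rules can be approximated with the standard unicodedata module.
    The following is a simplified sketch, not a port of the linked
    wcwidth.py or of the C original. The names char_width and string_width
    are hypothetical, and treating only category Cc as non-printable and
    only East Asian width classes 'W' and 'F' as double-width are
    assumptions:

```python
import unicodedata

def char_width(ch):
    """Rough column width of one character, following the rules above."""
    if ch == "\0":
        return 0
    if unicodedata.category(ch) == "Cc":   # control characters, e.g. "\n", "\t"
        return -1
    if unicodedata.combining(ch):          # combining marks sit on the previous char
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):  # wide and full-width forms
        return 2
    return 1

def string_width(text):
    """Sum of character widths, or -1 if any character is non-printable."""
    total = 0
    for ch in text:
        w = char_width(ch)
        if w < 0:
            return -1
        total += w
    return total

print(string_width("de\u0301ge\u0301ne\u0301re\u0301"))  # 8
print(string_width("\uff21\uff22"))                      # 4: two full-width letters
print(string_width("a\tb"))                              # -1: contains a tab
```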

    Of course, the real output depends on the capabilities of the display
    device. xterm can handle combining characters, whereas OS X's
    Terminal.app cannot do so for Greek or Russian characters, for
    example.

    All in all, I think it is a reasonable start. There is one issue though,
    namely involving Plane 1 characters. On 64-bit systems, so it seems,
    these are stored as one character, and on 32-bit systems as a surrogate
    pair. I don't know how this works exactly, but the code should basically
    ignore Plane 1 characters on 32-bit systems (i.e. always report a display
    width of 1, even when they are combining or full-width).
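
    For readers on current interpreters: in Python 2 this behaviour
    depended on whether the interpreter was a narrow or a wide Unicode
    build, as reported by sys.maxunicode, which is probably what appeared
    here as a 32-bit versus 64-bit difference. From Python 3.3 on, strings
    are always indexed by code point. A small check, assuming Python 3.3 or
    later:

```python
import sys
import unicodedata

clef = "\U0001D11E"       # MUSICAL SYMBOL G CLEF, a Plane 1 character
ideograph = "\U00020000"  # a CJK ideograph from Plane 2

print(sys.maxunicode == 0x10FFFF)               # True: no narrow builds
print(len(clef))                                # 1: not split into a surrogate pair
print(unicodedata.east_asian_width(ideograph))  # W: wide, so two columns
```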

    Berteun

    [1] http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c
    [2] http://berteun.nl/tmp/wcwidth.py
     
    Berteun Damman, Jun 28, 2007
    #2