Text::Wrap and unicode

Discussion in 'Perl Misc' started by wing328hk@gmail.com, Jan 4, 2006.

  1. Guest

    Hi,

    I'm using Text::Wrap with Unicode text and found that the wrap
    function doesn't handle Unicode properly.

    A Unicode character is double-byte, and it seems that wrap relies on
    the length function, which returns the number of bytes stored in a
    variable, to decide where to wrap the input.

    For example, say the column limit is 10 and consider the string
    aXXXXX, where X is a double-byte character.

    wrap corrupts the last X: it splits that character, which occupies
    the 10th and 11th bytes of the string, in two.

    Does anyone know how to configure wrap so that it works properly with
    Unicode?

    Thanks,
    Wing
     
    , Jan 4, 2006
    #1

  2. Paul Lalli Guest

    wrote:
    > I'm using Text::Wrap and Unicode and found that the function wrap
    > doesn't handle unicode properly.
    >
    > Unicode character is double-byte and it seems that wrap basically uses
    > the function length, which basically return the number of bytes stored
    > in a variable,


    No, it doesn't. length() returns the number of characters in a string.
    See perldoc -f length.
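
    A minimal sketch of the difference, assuming the input arrives as raw
    UTF-8 octets and is decoded with the core Encode module. The sample
    string (an "a" followed by five copies of U+00E9) is made up to mirror
    the aXXXXX example; length() only matches the byte count while the
    string is still undecoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "a" followed by five copies of U+00E9, written as raw UTF-8 octets.
my $bytes = "a" . ("\xC3\xA9" x 5);

# Decoding turns the octets into a character string.
my $chars = decode('UTF-8', $bytes);

printf "undecoded: %d\n", length($bytes);   # counts octets: 11
printf "decoded:   %d\n", length($chars);   # counts characters: 6
```

    If wrap is handed the undecoded string, the 10th and 11th "characters"
    it sees are really the two octets of the last U+00E9.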

    > to decide where to wrap the input.
    >
    > For example, say column is 10 and consider the following
    > aXXXXX where X is a double-byte character
    >
    > The last unicode X will be corrupted by wrap, which will split the last
    > unicode character, the 10th and 11th byte of the string, into two.
    >
    > Does anyone know how to configure wrap such that it works properly with
    > unicode?


    Well, first, this should only be a problem for "words" that are longer
    than the wrapping limit, and only if $Text::Wrap::huge is set to 'wrap'
    (the default). You could consider setting $Text::Wrap::huge to
    'overflow' instead. (See perldoc Text::Wrap for examples.)
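
    A short sketch of that setting, using a made-up ASCII sample so the
    effect is easy to see. It assumes any non-ASCII input has already been
    decoded into a character string before it reaches wrap.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# wrap keeps each output line shorter than $columns characters.
$Text::Wrap::columns = 10;

# 'overflow' leaves over-long words intact instead of splitting them.
$Text::Wrap::huge = 'overflow';

my $text    = "short words and averyveryverylongword here";
my $wrapped = wrap('', '', $text);

print $wrapped, "\n";
```

    With 'overflow', the long word is allowed to run past the column limit
    rather than being cut in the middle.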

    However, perhaps a module that is meant to deal with Unicode
    specifically would suit you better?
    http://search.cpan.org/~nesting/Unicode-Wrap-0.03/Wrap.pm

    (Disclaimer: I've never used Unicode::Wrap. It's just one of the first
    results when I search CPAN for 'unicode wrap')

    Paul Lalli
     
    Paul Lalli, Jan 4, 2006
    #2

  3. Jürgen Exner

    wrote:
    > I'm using Text::Wrap and Unicode and found that the function wrap
    > doesn't handle unicode properly.
    >
    > Unicode character is double-byte


    Not necessarily. UTF-8 uses anything from 1 to 4(?) bytes.

    > and it seems that wrap basically uses
    > the function length, which basically return the number of bytes stored
    > in a variable,


    Wrong. length() returns the number of characters, not bytes.

    jue
     
    Jürgen Exner, Jan 4, 2006
    #3
  4. Dr.Ruud Guest

    wrote:

    > I'm using Text::Wrap and Unicode and found that the function wrap
    > doesn't handle unicode properly.


    Where is your code?

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jan 4, 2006
    #4
  5. Alan J. Flavell

    On Wed, 4 Jan 2006, Jürgen Exner wrote:

    > wrote:
    > >
    > > Unicode character is double-byte

    >
    > Not necessarily.


    "Unicode character" is an abstract concept, which associates the
    character with an integer value between 0 and 0x10FFFF.

    It's impossible to talk about that abstract concept in practical terms
    without considering a specific "Character Encoding Form", which
    specifies how to represent that integer value using different sized
    units. There exist definitions for how to use 8-bit units (utf-8),
    16-bit units (utf-16), and 32-bit units (utf-32).
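
    As an illustration, the following sketch encodes one code point in all
    three forms with the core Encode module. The choice of U+20AC (EURO
    SIGN) and the big-endian variants is arbitrary, made only for the
    example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+20AC EURO SIGN as a one-character string.
my $char = "\x{20AC}";

for my $form ('UTF-8', 'UTF-16BE', 'UTF-32BE') {
    my $octets = encode($form, $char);
    # Show the octet count and the octets themselves in hex.
    my $hex = uc join(' ', unpack('(H2)*', $octets));
    printf "%-8s %d octets: %s\n", $form, length($octets), $hex;
}
```

    The same abstract character comes out as 3 octets in utf-8, 2 in
    utf-16, and 4 in utf-32.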

    See Chapter 2 of the Unicode specification, in particular sections
    2.5 and 2.6 where the terms "Character Encoding Form" and "Character
    Encoding Scheme" are elucidated.

    For example, at http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

    > UTF-8 uses anything from 1 to 4(?) bytes.


    Indeed. The original utf-8 encoding scheme included definitions of
    how to represent integers up to 32 bits, using sequences of up to 6
    octets (8-bit bytes). But Unicode has now firmly set its upper
    limit at 0x10FFFF (for whatever reason they picked that rather odd
    endpoint), meaning that utf-8 sequences of more than 4 octets won't be
    needed in practice.
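
    A small sketch of those sequence lengths, again assuming the core
    Encode module. The sample code points are picked for illustration:
    they sit on either side of each length boundary and avoid
    noncharacters. The last lines check that 0x10FFFF is also the largest
    value a utf-16 surrogate pair can express.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Code points on either side of each utf-8 length boundary.
my @samples = (0x7F, 0x80, 0x7FF, 0x800, 0xFFFD, 0x10000, 0x10FFFD);

for my $cp (@samples) {
    my $octets = encode('UTF-8', chr($cp));
    printf "U+%06X -> %d octet(s)\n", $cp, length($octets);
}

# A surrogate pair carries 10 + 10 bits on top of the base 0x10000.
my $utf16_max = 0x10000 + (0x400 * 0x400) - 1;
printf "utf-16 maximum: 0x%X\n", $utf16_max;
```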

    h t h
     
    Alan J. Flavell, Jan 4, 2006
    #5