Text::Wrap and unicode

wing328hk · Jan 4, 2006

Hi,

I'm using Text::Wrap and Unicode and found that the function wrap
doesn't handle unicode properly.

Unicode character is double-byte and it seems that wrap basically uses
the function length, which basically returns the number of bytes stored
in a variable, to decide where to wrap the input.

For example, say column is 10 and consider the following
aXXXXX where X is a double-byte character

The last unicode X will be corrupted by wrap, which will split the last
unicode character, the 10th and 11th byte of the string, into two.

Does anyone know how to configure wrap such that it works properly with
unicode?

Thanks,
Wing

Paul Lalli · Jan 4, 2006

> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.
>
> Unicode character is double-byte and it seems that wrap basically uses
> the function length, which basically returns the number of bytes stored
> in a variable,

No, it doesn't. length() returns the number of characters in a string.
perldoc -f length
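
Whether length() ends up counting characters or bytes depends on whether the string has been decoded. A minimal sketch, assuming the input arrives as undecoded UTF-8 bytes and using U+00E9 as a stand-in for the double-byte X in the original example (the variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "a" followed by five copies of U+00E9, each of which takes two bytes in UTF-8.
my $chars = "a" . ("\x{e9}" x 5);

# Simulate what you get when reading UTF-8 data without a decoding layer.
my $bytes = encode('UTF-8', $chars);

# Turn the raw bytes back into a Perl character string.
my $decoded = decode('UTF-8', $bytes);

printf "undecoded: length = %d\n", length $bytes;
printf "decoded:   length = %d\n", length $decoded;
```

If the text handed to wrap is still undecoded, decoding it first and encoding the result on output should let Text::Wrap work in characters rather than bytes.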

> to decide where to wrap the input.
>
> For example, say column is 10 and consider the following
> aXXXXX where X is a double-byte character
>
> The last unicode X will be corrupted by wrap, which will split the last
> unicode character, the 10th and 11th byte of the string, into two.
>
> Does anyone know how to configure wrap such that it works properly with
> unicode?

Well, first, this should only be a problem for "words" that are longer
than the wrapping limit, and only if you have $huge set to 'wrap' (the
default). You could consider setting $huge to 'overflow' instead. (See
perldoc Text::Wrap for examples)
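
A minimal sketch of the 'overflow' setting, assuming the text is already a decoded character string; the sample text and variable names are made up for illustration:

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

binmode(STDOUT, ':encoding(UTF-8)');   # encode characters on output

$Text::Wrap::columns = 10;
$Text::Wrap::huge    = 'overflow';     # leave over-long words intact

# One 12-character "word", longer than the wrapping limit.
my $long = "a" . ("\x{e9}" x 11);

my $out = wrap('', '', "short $long words");
print $out, "\n";
```

With 'overflow', the long word is allowed to run past the column limit on a line of its own instead of being broken in the middle.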

However, perhaps a module that is meant to deal with Unicode
specifically would suit you better?
http://search.cpan.org/~nesting/Unicode-Wrap-0.03/Wrap.pm

(Disclaimer: I've never used Unicode::Wrap. It's just one of the first
results when I search CPAN for 'unicode wrap')

Paul Lalli

Jürgen Exner · Jan 4, 2006

> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.
>
> Unicode character is double-byte

Not necessarily. UTF-8 uses anything from 1 to 4(?) bytes.

> and it seems that wrap basically uses
> the function length, which basically returns the number of bytes stored
> in a variable,

Wrong. length() returns the number of characters, not bytes.

jue

Dr.Ruud · Jan 4, 2006

(e-mail address removed) wrote:

> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.

Where is your code?

Alan J. Flavell · Jan 4, 2006

> Not necessarily.

"Unicode character" is an abstract concept, which associates the
character with an integer value between 0 and 0x10FFFF.

It's impossible to talk about that abstract concept in practical terms
without considering a specific "Character Encoding Form", which
specifies how to represent that integer value using different sized
units. There exist definitions for how to use 8-bit units (utf-8),
16-bit units (utf-16), and 32-bit units (utf-32).
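
As a rough illustration of the three encoding forms, the sketch below encodes a single code point with the core Encode module and counts the code units each form needs. The choice of U+1D11E and the %units hash are only for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# MUSICAL SYMBOL G CLEF: a single code point above U+FFFF.
my $char = "\x{1D11E}";

# Number of code units needed in each encoding form.
my %units = (
    'UTF-8'  => length(encode('UTF-8',    $char)),       # 8-bit units
    'UTF-16' => length(encode('UTF-16BE', $char)) / 2,   # 16-bit units
    'UTF-32' => length(encode('UTF-32BE', $char)) / 4,   # 32-bit units
);

printf "%-6s %d code unit(s)\n", $_, $units{$_} for sort keys %units;
```

The same abstract character comes out as four 8-bit units, two 16-bit units (a surrogate pair), or one 32-bit unit.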

See Chapter 2 of the Unicode specification, in particular sections
2.5 and 2.6 where the terms "Character Encoding Form" and "Character
Encoding Scheme" are elucidated.

e.g at http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

> UTF-8 uses anything from 1 to 4(?) bytes.

Indeed. The original utf-8 encoding scheme included definitions of
how to represent integers up to 31 bits, using sequences of up to 6
octets (8-bit bytes). But Unicode has now firmly set its upper
limit at 0x10FFFF (for whatever reason they picked that rather odd
endpoint), meaning that utf-8 sequences of more than 4 octets won't be
needed in practice.
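
To make the 1-to-4-octet range concrete, here is a small sketch that encodes one code point from each length class and reports the number of octets. The sample code points are arbitrary picks; U+10FFFD is a private-use code point just below the 0x10FFFF limit:

```perl
use strict;
use warnings;
use Encode qw(encode);

# One sample code point for each utf-8 sequence length.
my @code_points = (0x41, 0xE9, 0x4E2D, 0x10FFFD);

# Map each code point to the number of octets in its utf-8 encoding.
my %octets = map { $_ => length(encode('UTF-8', chr($_))) } @code_points;

printf "U+%04X -> %d octet(s)\n", $_, $octets{$_} for @code_points;
```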

h t h
