Text::Wrap and unicode

wing328hk

Hi,

I'm using Text::Wrap and Unicode and found that the function wrap
doesn't handle unicode properly.

A Unicode character is double-byte, and it seems that wrap basically uses
the function length, which returns the number of bytes stored
in a variable, to decide where to wrap the input.

For example, say column is 10 and consider the following:
aXXXXX, where X is a double-byte character

The last unicode X will be corrupted by wrap, which will split the last
unicode character, the 10th and 11th byte of the string, into two.

Does anyone know how to configure wrap such that it works properly with
unicode?

Thanks,
Wing
 
Paul Lalli

> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.
>
> Unicode character is double-byte and it seems that wrap basically uses
> the function length, which basically returns the number of bytes stored
> in a variable,

No, it doesn't. length() returns the number of characters in a string.
See: perldoc -f length

> to decide where to wrap the input.
>
> For example, say column is 10 and consider the following
> aXXXXX where X is a double-byte character
>
> The last unicode X will be corrupted by wrap, which will split the last
> unicode character, the 10th and 11th byte of the string, into two.
>
> Does anyone know how to configure wrap such that it works properly with
> unicode?

Well, first, this should only be a problem for "words" that are greater
in length than the wrapping limit and if you have $huge set to 'wrap'.
You could consider setting $huge to 'overflow' instead. (See perldoc
Text::Wrap for examples)
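
The $huge suggestion can be sketched as follows. This is only an illustration under assumptions: the sample word (an "a" followed by five copies of U+00E9 as raw UTF-8 octets) and the column value of 10 are taken from the example in the question, not from any real code in this thread.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap $columns $huge);

$columns = 10;
$huge    = 'overflow';   # leave over-long words intact instead of cutting them

# Hypothetical sample: "a" plus five 2-octet sequences, 11 octets, no spaces.
my $word = "a" . ("\xC3\xA9" x 5);

my $out = wrap('', '', $word);
print $out, "\n";
```

With 'overflow' the word is allowed to run past the column limit, so no multi-byte sequence is cut in the middle. It does not make the width calculation Unicode-aware.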

However, perhaps a module that is meant to deal with Unicode
specifically would suit you better?
http://search.cpan.org/~nesting/Unicode-Wrap-0.03/Wrap.pm

(Disclaimer: I've never used Unicode::Wrap. It's just one of the first
results when I search CPAN for 'unicode wrap')

Paul Lalli
 
Jürgen Exner

> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.
>
> Unicode character is double-byte

Not necessarily. UTF-8 uses anything from 1 to 4(?) bytes.

> and it seems that wrap basically uses
> the function length, which basically returns the number of bytes stored
> in a variable,

Wrong. length() returns the number of characters, not bytes.
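
One possible explanation for the symptom in the question, offered as a sketch rather than a diagnosis (no code was posted): if the input was read as raw UTF-8 octets and never decoded, Perl treats every octet as one character, so length() appears to count bytes. The sample string below is an assumption modelled on the "aXXXXX" example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "a" followed by five copies of U+00E9, written out as raw UTF-8 octets.
my $octets    = "a" . ("\xC3\xA9" x 5);
my $octet_len = length($octets);        # string was never decoded: counts octets

# Decoding turns the octets into a Perl character string.
my $chars    = decode('UTF-8', $octets);
my $char_len = length($chars);          # counts characters

print "undecoded: $octet_len\n";   # 11
print "decoded:   $char_len\n";    # 6
```

If that is what is happening, decoding the input before calling wrap and encoding the result on output may be enough.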

jue
 
Dr.Ruud

(e-mail address removed) wrote:

> I'm using Text::Wrap and Unicode and found that the function wrap
> doesn't handle unicode properly.

Where is your code?
 
Alan J. Flavell

Not necessarily.

"Unicode character" is an abstract concept, which associates the
character with an integer value between 0 and 0x10FFFF.

It's impossible to talk about that abstract concept in practical terms
without considering a specific "Character Encoding Form", which
specifies how to represent that integer value using different sized
units. There exist definitions for how to use 8-bit units (utf-8),
16-bit units (utf-16), and 32-bit units (utf-32).

See Chapter 2 of the Unicode specification, in particular sections
2.5 and 2.6 where the terms "Character Encoding Form" and "Character
Encoding Scheme" are elucidated.

e.g. at http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf

> UTF-8 uses anything from 1 to 4(?) bytes.

Indeed. The original utf-8 encoding scheme included definitions of
how to represent integers up to 32 bits, using sequences of up to 6
octets (8-bit bytes). But Unicode has now firmly set their upper
limit at 0x10FFFF (for whatever reason they picked that rather odd
endpoint), meaning that utf-8 sequences of more than 4 octets won't be
needed in practice.
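
The 1-to-4 octet range is easy to check with the Encode module. A small sketch; the four code points are arbitrary examples chosen to hit each sequence length, not anything from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Arbitrary sample code points, one for each UTF-8 sequence length.
my @code_points = (0x41, 0xE9, 0x20AC, 0x1F600);
my @lengths;

for my $cp (@code_points) {
    my $octets = encode('UTF-8', chr($cp));   # character -> UTF-8 octets
    push @lengths, length($octets);           # length of an octet string
    printf "U+%04X -> %d octet(s)\n", $cp, length($octets);
}
```

This prints 1, 2, 3 and 4 octets respectively.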

h t h
 
