python 3 and Unicode line breaking

leoboiko · Jan 13, 2011

Hi,

Is there an equivalent to the textwrap module that knows about the
Unicode line breaking algorithm (UAX #14,
http://unicode.org/reports/tr14/)?

Steven D'Aprano · Jan 14, 2011

Hi,

Is there an equivalent to the textwrap module that knows about the
Unicode line breaking algorithm (UAX #14,
http://unicode.org/reports/tr14/ )?

Is access to Google blocked where you are, or would you just like us to
do your searches for you?

If you have tried searching, please say so, otherwise most people will
conclude you haven't bothered, and most likely will not bother to reply.

leoboiko · Jan 14, 2011

Of course I searched for one and couldn’t find; that goes without
saying. Otherwise I wouldn’t even bother writing a message, isn’t
it? I disagree people should cruft their messages with details about
how they failed to find information, as that is unrelated to the
question at hand and has no point other than polluting people’s
mailboxes.

I also see no reason to reply to a simple question with such
discourtesy, and cannot understand why someone would be so aggressive
to a stranger.

Stefan Behnel · Jan 14, 2011

leoboiko, 14.01.2011 14:06:

Of course I searched for one and couldn’t find; that goes without
saying. Otherwise I wouldn’t even bother writing a message, isn’t
it? I disagree people should cruft their messages with details about
how they failed to find information, as that is unrelated to the
question at hand and has no point other than polluting people’s
mailboxes.

http://www.catb.org/~esr/faqs/smart-questions.html#beprecise
http://www.catb.org/~esr/faqs/smart-questions.html#volume

Stefan

Stefan Behnel · Jan 14, 2011

Steven D'Aprano, 14.01.2011 01:15:

Is access to Google blocked where you are, or would you just like us to
do your searches for you?

If you have tried searching, please say so, otherwise most people will
conclude you haven't bothered, and most likely will not bother to reply.

I think the OP was asking for something like the "textwrap" module (which
the OP apparently knows about), but based on a special line break algorithm
which, as suggested by the way the OP asks, is not supported by textwrap.

Sadly, the OP did not clearly state that the required feature is really not
supported by "textwrap" and in what way textwrap behaves differently. That
would have helped in answering.

Stefan

leoboiko · Jan 14, 2011

Sadly, the OP did not clearly state that the required feature
is really not supported by "textwrap" and in what way textwrap
behaves differently. That would have helped in answering.

Oh, textwrap doesn’t work for arbitrary Unicode text at all. For
example, it separates combining sequences:
tiê
ng
Viê
t

It also doesn’t know about double-width characters:
1234567
8901234

It doesn’t know about non-ascii punctuation:
abc-
def abc‐d
ef

It doesn’t know East Asian filling rules (though this is
perhaps pushing it a bit beyond textwrap’s goals):
日本語
、中国 # should avoid linebreak before CJK punctuation
語

And it generally doesn’t try to pick good places to break lines
at all, just making the assumption that 1 character = 1 column
and that breaking on ASCII whitespaces/hyphens is enough. We
can’t really blame textwrap for that, it is a very simple module
and Unicode line breaking gets complex fast (that’s why the
consortium provides a ready-made algorithm). It’s just that,
with python3’s emphasis on Unicode support, I was surprised not
to be able to find an UAX #14 implementation. I thought someone
would surely have written one and I simply couldn’t find, so I
asked precisely that.
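
The combining-sequence and CJK-punctuation cases above can be reproduced with the standard library alone. This is only a minimal sketch: the input strings and widths are assumptions chosen to mirror the output shown in the post, and the helper name `splits_combining_sequence` is hypothetical.

```python
import textwrap
import unicodedata

# "tiếng Việt" spelled with combining marks (NFD form), an assumed input
# that mirrors the first example in the post.
vietnamese = unicodedata.normalize("NFD", "tiếng Việt")

# textwrap counts code points, so a base letter and one of its combining
# marks can land on different lines.
vietnamese_lines = textwrap.wrap(vietnamese, width=4)


def splits_combining_sequence(lines):
    """Return True if any wrapped line begins with a combining mark."""
    return any(line and unicodedata.combining(line[0]) for line in lines)


# Unspaced CJK text is treated as one long word and cut every `width`
# code points, even right before the ideographic comma.
cjk_lines = textwrap.wrap("日本語、中国語", width=3)

print(splits_combining_sequence(vietnamese_lines))
print(cjk_lines)
```

A UAX #14 implementation would refuse both breaks: combining marks attach to their base character, and closing punctuation such as 、 may not start a line.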

Steven D'Aprano · Jan 14, 2011

Of course I searched for one and couldn’t find; that goes without
saying. Otherwise I wouldn’t even bother writing a message, isn’t it?

You wouldn't say that if you had the slightest idea about how many people
write to newsgroups and web forums asking for help without making the
tiniest effort to solve the problem themselves. So, no, it *doesn't* go
without saying -- unless, of course, you want the answer to also go
without saying.

I disagree people should cruft their messages with details about how
they failed to find information, as that is unrelated to the question at
hand and has no point other than polluting people’s mailboxes.

This is total nonsense -- how on earth can you say that it is unrelated
to the question you are asking? It tells others what they should not
waste their time trying, because you've already tried it. You don't need
to write detailed step-by-step instructions of everything you've tried,
but you can point us in the directions you've already traveled.

Think of it this way... if you were paying money for professional advice,
would you be happy to receive a bill for time spent doing the exact same
things you have already tried? I'm sure you wouldn't be. So why do you
think it is okay to waste the time of unpaid volunteers? That's just
thoughtless and selfish.

If you think so little of other people's time that you won't even write a
few words to save them from going down the same dead-ends that you've
already tried, then don't be surprised if they think so little of your
time that they don't bother replying even when they know the answer.

I also see no reason to reply to a simple question with such
discourtesy, and cannot understand why someone would be so aggressive to
a stranger.

If you think my reply was aggressive and discourteous, you've got a lot
to learn about public forums.

Antoine Pitrou · Jan 14, 2011

Hey,

If you think my reply was aggressive and discourteous, you've got a lot
to learn about public forums.

Perhaps you've got to learn about politeness yourself! Just because
some people are jerks on internet forums (or in real life) doesn't mean
everyone should; this is quite a stupid and antisocial excuse actually.

You would never have reacted this way if the same question had been
phrased by a regular poster here (let alone on python-dev). Taking
cheap shots at newcomers is certainly not the best way to welcome
them.

Thank you

Antoine.

Colin J. Williams · Jan 14, 2011

Hey,

Perhaps you've got to learn about politeness yourself! Just because
some people are jerks on internet forums (or in real life) doesn't mean
everyone should; this is quite a stupid and antisocial excuse actually.

You would never have reacted this way if the same question had been
phrased by a regular poster here (let alone on python-dev). Taking
cheap shots at newcomers is certainly not the best way to welcome
them.

Thank you

Antoine.

+1

Steven D'Aprano · Jan 14, 2011

You would never have reacted this way if the same question had been
phrased by a regular poster here (let alone on python-dev). Taking cheap
shots at newcomers is certainly not the best way to welcome them.

You're absolutely correct. Regular posters have demonstrated their
ability to perform the basics -- if you had asked the question, I could
assume that you would have done a google search, because I know you're
not a lazy n00b who expects others to do their work for them. But the
Original Poster has not, as far as I can see, ever posted here before. He
has no prior reputation and gives no detail in his post.

You have focused on my first blunt remark, and ignored the second:

"If you have tried searching, please say so, otherwise most people will
conclude you haven't bothered, and most likely will not bother to reply."

This is good, helpful advice, and far more useful to the OP than just
ignoring his post. You have jumped to his defense (or rather, you have
jumped to criticise me) but I see that you haven't replied to his
question or given him any advice in how to solve his problem. Instead of
encouraging him to ask smarter questions, you encourage the behaviour
that hinders his ability to get help from others.

The only other person I can see who has attempted to actually help the OP
is Stefan Behnel, who tried to get more information about the problem
being solved in order to better answer the question. The OP has, so far
as I can see, not responded, although he has taken the time to write to
me in private to argue further.

leoboiko · Jan 14, 2011

The only other person I can see who has attempted to actually help the OP
is Stefan Behnel, who tried to get more information about the problem
being solved in order to better answer the question. The OP has, so far
as I can see, not responded, although he has taken the time to write to
me in private to argue further.

I have written in private because I really feel this discussion is
out of place here. This thread is already in the first page of google
results for “python unicode line breaking”, “python uax #14” etc. I
feel it would be good to use this place to discuss Unicode line
breaking, not best practices on asking questions, or how
disappointingly impolite the Internet has become. (Briefly: As a tech
support professional myself, I prefer direct, concise questions to
crufty ones; and I try to ask questions in the most direct manner
precisely _because_ I don’t want to waste the time of kind volunteers
with my problems.)

As for taking the time to provide information, I wonder if there was
any technical problem that prevented you from seeing my reply to
Stefan, sent Jan 14, 12:29PM? He asked how exactly the stdlib module
“textwrap” differs from the Unicode algorithm, so I provided some
commented examples.

Antoine Pitrou · Jan 14, 2011

This is good, helpful advice, and far more useful to the OP than just
ignoring his post. You have jumped to his defense (or rather, you have
jumped to criticise me) but I see that you haven't replied to his
question or given him any advice in how to solve his problem.

Simply because I have no elaborate answer to give, even in the light of
his/her recent precisions on the topic (and, actually, neither do you).
Asking for precisions is certainly fine; doing it in an aggressive way
is not, especially when the original message doesn't look like the
usual blunt, impolite and typo-ridden "can you do my homework" message.

Also, I would expect someone familiar with the textwrap module's (lack
of) unicode capabilities would have been able to answer the first
message without even asking for precisions.

Regards

Antoine.

leoboiko · Jan 14, 2011

  >>> s1 = "日本語のテキト"

(In case any Japanese speaker is curious, this was a typo; it was
supposed to be 「日本語のテキスト」。 That’s unrelated to the problem being
discussed, though.)

Antoine Pitrou · Jan 14, 2011

And it generally doesn’t try to pick good places to break lines
at all, just making the assumption that 1 character = 1 column
and that breaking on ASCII whitespaces/hyphens is enough. We
can’t really blame textwrap for that, it is a very simple module
and Unicode line breaking gets complex fast (that’s why the
consortium provides a ready-made algorithm). It’s just that,
with python3’s emphasis on Unicode support, I was surprised not
to be able to find an UAX #14 implementation. I thought someone
would surely have written one and I simply couldn’t find, so I
asked precisely that.

If you're willing to help on that matter (or some aspects of them,
textwrap-specific or not), you can open an issue on
http://bugs.python.org and propose a patch.

See also http://docs.python.org/devguide/#contributing if you need more
info on how to contribute.

Regards

Antoine.

Steven D'Aprano · Jan 15, 2011

As for taking the time to provide information, I wonder if there was any
technical problem that prevented you from seeing my reply to Stefan,
sent Jan 14, 12:29PM?

Presumably, since I haven't got it in my news client. This is not the
first time.

He asked how exactly the stdlib module “textwrap”
differs from the Unicode algorithm, so I provided some commented
examples.

Does this help?

http://packages.python.org/kitchen/api-text-display.html

kitchen.text.display.wrap(text, width=70, initial_indent=u'',
subsequent_indent=u'', encoding='utf-8', errors='replace')

Works like we want textwrap.wrap() to work
[...]
textwrap.wrap() from the python standard library has two drawbacks
that this attempts to fix:

1. It does not handle textual width. It only operates on bytes or
characters which are both inadequate (due to multi-byte and
double width characters).
2. It malforms lists and blocks.
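
The idea behind point 1, measuring text in display columns rather than bytes or characters, can be sketched with the standard library. This is not kitchen's implementation and not UAX #14; `display_width` and `wrap_by_width` are hypothetical names, and the width rule (0 columns for combining marks, 2 for East Asian Wide/Fullwidth, 1 otherwise) is a rough assumption that ignores many scripts.

```python
import unicodedata


def display_width(text):
    """Approximate terminal columns occupied by `text`."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks sit on the previous character
        # Wide (W) and Fullwidth (F) characters take two columns.
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width


def wrap_by_width(text, width=70):
    """Greedy wrap on whitespace, measuring words in columns."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and display_width(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


print(display_width("日本語"))
print(wrap_by_width("日本語 日本語 abc", width=8))
```

Because it only breaks at whitespace, this sketch never splits a combining sequence, but it also cannot break inside unspaced CJK text; choosing those break opportunities is what the UAX #14 rules are for.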

leoboiko · Jan 17, 2011

Does this help?

http://packages.python.org/kitchen/api-text-display.html

Ooh, it doesn’t appear to be a full line-breaking
implementation but it certainly helps for what I want to do
in my project! Thanks much!

(There’s also the alternative of using something like PyICU
to access a C library, something I had forgotten about
entirely.)

If you're willing to help on that matter (or some aspects of them,
textwrap-specific or not), you can open an issue on
http://bugs.python.org and propose a patch.

I’m not sure my poor coding is good enough to contribute but I’ll
keep this in mind if I find myself implementing the algorithm or
wanting to patch textwrap. Thanks.
