Python 3 and Unicode line breaking

Steven D'Aprano

Hi,

Is there an equivalent to the textwrap module that knows about the
Unicode line breaking algorithm (UAX #14,
http://unicode.org/reports/tr14/ )?


Is access to Google blocked where you are, or would you just like us to
do your searches for you?

If you have tried searching, please say so, otherwise most people will
conclude you haven't bothered, and most likely will not bother to reply.
 
leoboiko

Of course I searched for one and couldn’t find it; that goes without
saying. Otherwise I wouldn’t even bother writing a message, would
I? I disagree that people should cruft their messages with details about
how they failed to find information, as that is unrelated to the
question at hand and has no point other than polluting people’s
mailboxes.

I also see no reason to reply to a simple question with such
discourtesy, and cannot understand why someone would be so aggressive
to a stranger.
 
Stefan Behnel

leoboiko, 14.01.2011 14:06:
Of course I searched for one and couldn’t find it; that goes without
saying. Otherwise I wouldn’t even bother writing a message, would
I? I disagree that people should cruft their messages with details about
how they failed to find information, as that is unrelated to the
question at hand and has no point other than polluting people’s
mailboxes.

http://www.catb.org/~esr/faqs/smart-questions.html#beprecise
http://www.catb.org/~esr/faqs/smart-questions.html#volume

Stefan
 
Stefan Behnel

Steven D'Aprano, 14.01.2011 01:15:
Is access to Google blocked where you are, or would you just like us to
do your searches for you?

If you have tried searching, please say so, otherwise most people will
conclude you haven't bothered, and most likely will not bother to reply.

I think the OP was asking for something like the "textwrap" module (which
the OP apparently knows about), but based on a special line break algorithm
which, as suggested by the way the OP asks, is not supported by textwrap.

Sadly, the OP did not clearly state that the required feature is really not
supported by "textwrap" and in what way textwrap behaves differently. That
would have helped in answering.

Stefan
 
leoboiko

Sadly, the OP did not clearly state that the required feature
is really not supported by "textwrap" and in what way textwrap
behaves differently. That would have helped in answering.

Oh, textwrap doesn’t work for arbitrary Unicode text at all. For
example, it separates combining sequences:
tiê
ng
Viê
t

It also doesn’t know about double-width characters:
1234567
8901234
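A minimal sketch that reproduces the two problems above with the standard library. The commands that produced the output in this message were not preserved in the thread, so the sample strings and widths below are assumptions chosen for illustration.

```python
import textwrap
import unicodedata

# Assumed sample: "tiếng" in decomposed (NFD) form, i.e. the base letter "e"
# followed by two separate combining marks.
word = unicodedata.normalize("NFD", "tiếng")

# textwrap counts code points, so an over-long word is cut at a fixed
# code point offset, even when that lands inside a combining sequence.
pieces = textwrap.wrap(word, width=4)
starts_with_mark = unicodedata.combining(pieces[1][0]) != 0
print(pieces, starts_with_mark)

# Assumed sample: eight East Asian Wide characters.
wide = "日本語のテキスト"
wide_lines = textwrap.wrap(wide, width=4)


def columns(line):
    # Rough column count: Wide/Fullwidth characters take two terminal cells.
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in line)


# Each line has 4 code points but occupies 8 terminal columns.
print([(line, len(line), columns(line)) for line in wide_lines])
```

The second piece of the wrapped word begins with a stray combining accent, and the "width 4" lines of wide text are twice as wide on screen as requested.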

It doesn’t know about non-ASCII punctuation:
abc-
def
abc‐d
ef

It doesn’t know East Asian filling rules (though this is
perhaps pushing it a bit beyond textwrap’s goals):
日本語
、中国 # should avoid linebreak before CJK punctuation
語
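To make the "no break before CJK punctuation" point concrete, here is a rough sketch of that single rule. It is not an implementation of UAX #14: the function name `wrap_cjk` and the small `NO_BREAK_BEFORE` set are hypothetical, and the width is counted in characters rather than columns.

```python
# Hypothetical, deliberately incomplete set of characters that should not
# start a line (closing punctuation and similar).
NO_BREAK_BEFORE = set("、。，．）」』】！？")


def wrap_cjk(text, width):
    """Hard-wrap text at `width` characters, but never start a line
    with a character from NO_BREAK_BEFORE."""
    lines = []
    start = 0
    while len(text) - start > width:
        end = start + width
        # Move the break earlier while the next line would start with
        # forbidden punctuation; always keep at least one character.
        while end > start + 1 and text[end] in NO_BREAK_BEFORE:
            end -= 1
        lines.append(text[start:end])
        start = end
    lines.append(text[start:])
    return lines


result = wrap_cjk("日本語、中国語", 3)
print("\n".join(result))
```

With this rule the comma stays attached to the preceding character instead of opening the second line, at the cost of a shorter first line.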


And it generally doesn’t try to pick good places to break lines
at all, just making the assumption that 1 character = 1 column
and that breaking on ASCII whitespace/hyphens is enough. We
can’t really blame textwrap for that; it is a very simple module
and Unicode line breaking gets complex fast (that’s why the
consortium provides a ready-made algorithm). It’s just that,
with Python 3’s emphasis on Unicode support, I was surprised not
to be able to find a UAX #14 implementation. I thought someone
would surely have written one and I simply couldn’t find it, so I
asked precisely that.
 
Steven D'Aprano

Of course I searched for one and couldn’t find it; that goes without
saying. Otherwise I wouldn’t even bother writing a message, would I?

You wouldn't say that if you had the slightest idea about how many people
write to newsgroups and web forums asking for help without making the
tiniest effort to solve the problem themselves. So, no, it *doesn't* go
without saying -- unless, of course, you want the answer to also go
without saying.

I disagree that people should cruft their messages with details about how
they failed to find information, as that is unrelated to the question at
hand and has no point other than polluting people’s mailboxes.

This is total nonsense -- how on earth can you say that it is unrelated
to the question you are asking? It tells others what they should not
waste their time trying, because you've already tried it. You don't need
to write detailed step-by-step instructions of everything you've tried,
but you can point us in the directions you've already traveled.

Think of it this way... if you were paying money for professional advice,
would you be happy to receive a bill for time spent doing the exact same
things you have already tried? I'm sure you wouldn't be. So why do you
think it is okay to waste the time of unpaid volunteers? That's just
thoughtless and selfish.

If you think so little of other people's time that you won't even write a
few words to save them from going down the same dead-ends that you've
already tried, then don't be surprised if they think so little of your
time that they don't bother replying even when they know the answer.

I also see no reason to reply to a simple question with such
discourtesy, and cannot understand why someone would be so aggressive to
a stranger.

If you think my reply was aggressive and discourteous, you've got a lot
to learn about public forums.
 
Antoine Pitrou

Hey,

If you think my reply was aggressive and discourteous, you've got a lot
to learn about public forums.

Perhaps you've got to learn about politeness yourself! Just because
some people are jerks on internet forums (or in real life) doesn't mean
everyone should be; this is quite a stupid and antisocial excuse actually.

You would never have reacted this way if the same question had been
phrased by a regular poster here (let alone on python-dev). Taking
cheap shots at newcomers is certainly not the best way to welcome
them.

Thank you

Antoine.
 
Colin J. Williams

Hey,



Perhaps you've got to learn about politeness yourself! Just because
some people are jerks on internet forums (or in real life) doesn't mean
everyone should be; this is quite a stupid and antisocial excuse actually.

You would never have reacted this way if the same question had been
phrased by a regular poster here (let alone on python-dev). Taking
cheap shots at newcomers is certainly not the best way to welcome
them.

Thank you

Antoine.

+1
 
Steven D'Aprano

You would never have reacted this way if the same question had been
phrased by a regular poster here (let alone on python-dev). Taking cheap
shots at newcomers is certainly not the best way to welcome them.

You're absolutely correct. Regular posters have demonstrated their
ability to perform the basics -- if you had asked the question, I could
assume that you would have done a google search, because I know you're
not a lazy n00b who expects others to do their work for them. But the
Original Poster has not, as far as I can see, ever posted here before. He
has no prior reputation and gives no detail in his post.

You have focused on my first blunt remark, and ignored the second:

"If you have tried searching, please say so, otherwise most people will
conclude you haven't bothered, and most likely will not bother to reply."

This is good, helpful advice, and far more useful to the OP than just
ignoring his post. You have jumped to his defense (or rather, you have
jumped to criticise me) but I see that you haven't replied to his
question or given him any advice on how to solve his problem. Instead of
encouraging him to ask smarter questions, you encourage the behaviour
that hinders his ability to get help from others.

The only other person I can see who has attempted to actually help the OP
is Stefan Behnel, who tried to get more information about the problem
being solved in order to better answer the question. The OP has, so far
as I can see, not responded, although he has taken the time to write to
me in private to argue further.
 
leoboiko

The only other person I can see who has attempted to actually help the OP
is Stefan Behnel, who tried to get more information about the problem
being solved in order to better answer the question. The OP has, so far
as I can see, not responded, although he has taken the time to write to
me in private to argue further.

I have written in private because I really feel this discussion is
out of place here. This thread is already on the first page of Google
results for “python unicode line breaking”, “python uax #14” etc. I
feel it would be good to use this place to discuss Unicode line
breaking, not best practices for asking questions, or how
disappointingly impolite the Internet has become. (Briefly: As a tech
support professional myself, I prefer direct, concise questions to
crufty ones; and I try to ask questions in the most direct manner
precisely _because_ I don’t want to waste the time of kind volunteers
with my problems.)


As for taking the time to provide information, I wonder if there was
any technical problem that prevented you from seeing my reply to
Stefan, sent Jan 14, 12:29PM? He asked how exactly the stdlib module
“textwrap” differs from the Unicode algorithm, so I provided some
commented examples.
 
Antoine Pitrou

This is good, helpful advice, and far more useful to the OP than just
ignoring his post. You have jumped to his defense (or rather, you have
jumped to criticise me) but I see that you haven't replied to his
question or given him any advice on how to solve his problem.

Simply because I have no elaborate answer to give, even in the light of
his/her recent clarifications on the topic (and, actually, neither do you).
Asking for clarification is certainly fine; doing it in an aggressive way
is not, especially when the original message doesn't look like the
usual blunt, impolite and typo-ridden "can you do my homework" message.

Also, I would expect that someone familiar with the textwrap module's (lack
of) Unicode capabilities would have been able to answer the first
message without even asking for clarification.

Regards

Antoine.
 
leoboiko

    >>> s1 = "日本語のテキト"

(In case any Japanese speaker is curious, this was a typo; it was
supposed to be 「日本語のテキスト」。 That’s unrelated to the problem being
discussed, though.)
 
Antoine Pitrou

And it generally doesn’t try to pick good places to break lines
at all, just making the assumption that 1 character = 1 column
and that breaking on ASCII whitespace/hyphens is enough. We
can’t really blame textwrap for that; it is a very simple module
and Unicode line breaking gets complex fast (that’s why the
consortium provides a ready-made algorithm). It’s just that,
with Python 3’s emphasis on Unicode support, I was surprised not
to be able to find a UAX #14 implementation. I thought someone
would surely have written one and I simply couldn’t find it, so I
asked precisely that.

If you're willing to help on that matter (or some aspects of it,
textwrap-specific or not), you can open an issue on
http://bugs.python.org and propose a patch.

See also http://docs.python.org/devguide/#contributing if you need more
info on how to contribute.

Regards

Antoine.
 
Steven D'Aprano

As for taking the time to provide information, I wonder if there was any
technical problem that prevented you from seeing my reply to Stefan,
sent Jan 14, 12:29PM?

Presumably, since I haven't got it in my news client. This is not the
first time.

He asked how exactly the stdlib module “textwrap”
differs from the Unicode algorithm, so I provided some commented
examples.

Does this help?


http://packages.python.org/kitchen/api-text-display.html

kitchen.text.display.wrap(text, width=70, initial_indent=u'',
subsequent_indent=u'', encoding='utf-8', errors='replace')

Works like we want textwrap.wrap() to work
[...]
textwrap.wrap() from the python standard library has two drawbacks
that this attempts to fix:

1. It does not handle textual width. It only operates on bytes or
characters which are both inadequate (due to multi-byte and
double width characters).
2. It malforms lists and blocks.
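To illustrate what "textual width" means in point 1, here is a rough standard-library sketch. It is not kitchen's implementation; the helper names `char_width`, `display_width` and `hard_wrap` are hypothetical, and the width rules are a simplification (non-spacing marks count as 0 columns, East Asian Wide/Fullwidth characters as 2, everything else as 1).

```python
import unicodedata


def char_width(ch):
    """Approximate number of terminal columns taken by one code point."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # non-spacing and enclosing marks sit on the previous character
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # Wide and Fullwidth characters take two cells
    return 1


def display_width(text):
    return sum(char_width(ch) for ch in text)


def hard_wrap(text, columns):
    """Cut text into lines of at most `columns` columns (not code points).
    Zero-width marks never start a new line, so they stay with their base."""
    lines, current, used = [], "", 0
    for ch in text:
        w = char_width(ch)
        if current and used + w > columns:
            lines.append(current)
            current, used = "", 0
        current += ch
        used += w
    if current:
        lines.append(current)
    return lines


wide_lines = hard_wrap("日本語のテキスト", 8)
marked_lines = hard_wrap(unicodedata.normalize("NFD", "tiếng"), 3)
print(wide_lines, marked_lines)
```

This only measures and cuts; it makes no attempt to choose good break points between words, which is the part UAX #14 addresses.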
 
leoboiko


Ooh, it doesn’t appear to be a full line-breaking
implementation but it certainly helps for what I want to do
in my project! Thanks much!

(There’s also the alternative of using something like PyICU
to access a C library, something I had forgotten about
entirely.)

If you're willing to help on that matter (or some aspects of it,
textwrap-specific or not), you can open an issue on
http://bugs.python.org and propose a patch.

I’m not sure my poor coding is good enough to contribute, but I’ll
keep this in mind if I find myself implementing the algorithm or
wanting to patch textwrap. Thanks.
 
