Could you verify this, Oh Great Unicode Experts of the Python-List?


Joshua Landau

Basically, I think Twitter's broken.

For my full discussion on the matter, see:
http://www.reddit.com/r/learnpython/comments/1k2yrn/help_with_len_and_input_function_33/cbku5e8

Here's the first post of mine, ineffectually edited for this list:

"""
<strikethrough>The obvious solution [to getting the length of a tweet]
is wrong. Like, slightly wrong¹.</strikethrough>

Given tweet = b"caf\x65\xCC\x81".decode():
'café'

But:
5

So the solution is:
4

<strikethrough>Read twitter's commentary¹ for proof.</strikethrough>

<strikethrough>There are additional complications I'm trying to sort
out.</strikethrough>
________________________________

After further testing (I don't actually use Twitter) it seems the
whole thing was just smoke and mirrors. The linked article is a lie,
at least on the user's end.

p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
p.communicate(input=b"caf\x65\xCC\x81")
(None, None)

"cafeÌ" will be in your Copy-Paste buffer, and you can paste it in to
the tweet-box. It takes 5 characters. So much for testing ;).

________________________________
¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character
"""


I know this isn't *really* Python-related, but there's Python involved
and you're the sort of people who'll be able to tell me what I've done
wrong, if anything.
 

Steven D'Aprano

Basically, I think Twitter's broken.

Oh, in about a million ways, but apparently people like it :-(

For my full discussion on the matter, see:
http://www.reddit.com/r/learnpython/comments/1k2yrn/help_with_len_and_input_function_33/cbku5e8

Here's the first post of mine, ineffectually edited for this list:

"""
<strikethrough>The obvious solution [to getting the length of a tweet]
is wrong. Like, slightly wrong¹.</strikethrough>

Given tweet = b"caf\x65\xCC\x81".decode():

I assume you're using Python 3, where UTF-8 is the default encoding.


Yes, that's correct. Unicode doesn't promise to have a single unique
representation for all human-readable strings. In this case, the string
"cafe" with an accent on the "e" can be generated by two sequences of
code points:

LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E
COMBINING ACUTE ACCENT

or

LATIN SMALL LETTER C
LATIN SMALL LETTER A
LATIN SMALL LETTER F
LATIN SMALL LETTER E WITH ACUTE
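
To make the two sequences concrete, here is a small Python 3 sketch using the standard unicodedata module. The variable names are only illustrative and do not come from the original post.

```python
import unicodedata

decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # precomposed LATIN SMALL LETTER E WITH ACUTE

for label, text in (("decomposed", decomposed), ("composed", composed)):
    print(label, len(text))
    for ch in text:
        # Show each code point with its official Unicode name.
        print("    U+%04X %s" % (ord(ch), unicodedata.name(ch)))

# The strings compare unequal even though they should display identically.
print(decomposed == composed)  # False
```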


The reason some accented letters have single code point forms is to
support legacy charsets; the reason some only exist as combining
characters is due to the combinational explosion. Some languages allow
you to add up to five or six different accents on any of dozens of
different letters. If each combination needed its own unique code point,
there wouldn't be enough code points. For bonus points, if there are five
accents that can be placed in any combination of zero or more on any of
four characters, how many code points would be needed?

Neither form is "right" or "wrong", they are both equally valid. They
encode differently, of course, since UTF-8 does guarantee that every
sequence of code points has a unique byte representation:

py> tweet.encode('utf-8')
'cafe\xcc\x81'
py> u'café'.encode('utf-8')
'caf\xc3\xa9'

Note that the form you used, b"caf\x65\xCC\x81", is the same as the first
except that you have shown "e" in hex for some reason:

py> b'\x65' == b'e'
True

So the solution is:

4

In this particular case, this will reduce the tweet to the normalised
form that Twitter uses.
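
A minimal sketch of that length check, assuming the step in question is NFC normalisation via the standard unicodedata.normalize. The helper name tweet_length is made up for illustration, and it counts code points only, which is what the dev.twitter.com page describes.

```python
import unicodedata

def tweet_length(text):
    """Count code points after NFC normalisation (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))

tweet = b"caf\x65\xCC\x81".decode("utf-8")

print(len(tweet))           # 5: 'e' and the combining accent are separate
print(tweet_length(tweet))  # 4: NFC composes them into one code point
```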


[...]
After further testing (I don't actually use Twitter) it seems the whole
thing was just smoke and mirrors. The linked article is a lie, at least
on the user's end.

Which linked article? The one on dev.twitter.com seems to be okay to me.
Of course, they might be lying when they say "Twitter counts the length
of a Tweet using the Normalization Form C (NFC) version of the text", I
have no idea. But they seem to have a good grasp of the issues involved,
and assuming they do what they say, at least Western European users
should be happy.

p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
p.communicate(input=b"caf\x65\xCC\x81")
(None, None)

"cafeÌ" will be in your Copy-Paste buffer, and you can paste it in to
the tweet-box. It takes 5 characters. So much for testing ;).

How do you know that it takes 5 characters? Is that some Javascript
widget? I'd blame buggy Javascript before Twitter.

If this shows up in your application as cafeÌ rather than café, it is a
bug in the text rendering engine. Some applications do not deal with
combining characters correctly.

(It's a hard problem to solve, and really needs support from the font. In
some languages, the same accent will appear in different places depending
on the character they are attached to, or the other accents there as
well. Or so I've been led to believe.)

¹ https://dev.twitter.com/docs/counting-characters#Definition_of_a_Character

Looks reasonable to me. No obvious errors to my eyes.
 

Joshua Landau

The reason some accented letters have single code point forms is to
support legacy charsets; the reason some only exist as combining
characters is due to the combinational explosion. Some languages allow
you to add up to five or six different accents on any of dozens of
different letters. If each combination needed its own unique code point,
there wouldn't be enough code points. For bonus points, if there are five
accents that can be placed in any combination of zero or more on any of
four characters, how many code points would be needed?

52?

Note that the form you used, b"caf\x65\xCC\x81", is the same as the first
except that you have shown "e" in hex for some reason:

py> b'\x65' == b'e'
True

Yeah.. I did that because the linked post did it. I'm not sure why either ;).

So the solution is:

4

In this particular case, this will reduce the tweet to the normalised
form that Twitter uses.

[...]
After further testing (I don't actually use Twitter) it seems the whole
thing was just smoke and mirrors. The linked article is a lie, at least
on the user's end.

Which linked article? The one on dev.twitter.com seems to be okay to me.

That's the one.

Of course, they might be lying when they say "Twitter counts the length
of a Tweet using the Normalization Form C (NFC) version of the text", I
have no idea. But they seem to have a good grasp of the issues involved,
and assuming they do what they say, at least Western European users
should be happy.

They *don't* seem to be doing what they say.

p = subprocess.Popen(['xsel', '-bi'], stdin=subprocess.PIPE)
p.communicate(input=b"caf\x65\xCC\x81")
(None, None)

"cafeÌ" will be in your Copy-Paste buffer, and you can paste it in to
the tweet-box. It takes 5 characters. So much for testing ;).

How do you know that it takes 5 characters? Is that some Javascript
widget? I'd blame buggy Javascript before Twitter.

I go to twitter.com, log in and press that odd blue compose button in
the top-right. After pasting, it says I have 135 (down from 140)
characters left.

My only question here is, since you can't post after 140
non-normalised characters, who cares if the server counts it as less?

If this shows up in your application as cafeÌ rather than café, it is a
bug in the text rendering engine. Some applications do not deal with
combining characters correctly.

Why the rendering engine?

(It's a hard problem to solve, and really needs support from the font. In
some languages, the same accent will appear in different places depending
on the character they are attached to, or the other accents there as
well. Or so I've been led to believe.)



Looks reasonable to me. No obvious errors to my eyes.

*Not sure whether talking about the link or my post*
 

Steven D'Aprano


More than double that.

Consider a single character. It can have 0 to 5 accents, in any
combination. Order doesn't matter, and there are no duplicates, so there
are:

0 accent: take 0 from 5 = 1 combination;
1 accent: take 1 from 5 = 5 combinations;
2 accents: take 2 from 5 = 5!/(2!*3!) = 10 combinations;
3 accents: take 3 from 5 = 5!/(3!*2!) = 10 combinations;
4 accents: take 4 from 5 = 5 combinations;
5 accents: take 5 from 5 = 1 combination

giving a total of 32 combinations for a single character. Since there are
four characters in this hypothetical language that take accents, that
gives a total of 4*32 = 128 distinct code points needed.
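
The same count can be checked mechanically. This is only a sketch of the arithmetic above; it assumes Python 3.8 or later for math.comb, and the variable names are illustrative.

```python
from math import comb

accents = 5  # accents available in the hypothetical language
letters = 4  # letters that can carry them

# "take k from 5" for k = 0..5, summed over every possible number of accents
per_letter = sum(comb(accents, k) for k in range(accents + 1))

print(per_letter)            # 32
print(letters * per_letter)  # 128
```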

In reality, Unicode currently dedicates code points U+0300 to U+036F (112 code
points) to combining characters. It's not really meaningful to combine
all 112 of them, or even most of 112 of them, but let's assume that we
can legitimately combine up to three of them on average (some languages
will allow more, some less) on just six different letters. That gives us:

0 accent: 1 combination
1 accent: 112 combinations
2 accents: 112!/(2!*110!) = 6216 combinations
3 accents: 112!/(3!*109!) = 227920 combinations

giving 234249 combinations, by six base characters, = 1405494 code
points. Which is comfortably more than the 1114112 code points Unicode
has in total :)

This calculation is horribly inaccurate, since you can't arbitrarily
combine (say) accents from Greek with accents from IPA, but I reckon that
the combinational explosion of accented letters is still real.
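
For anyone who wants to redo the back-of-the-envelope figures, here is a short sketch under the same deliberately crude assumptions (up to three marks from the U+0300 to U+036F block, on six base letters). It needs Python 3.8 or later for math.comb, and the names are illustrative.

```python
from math import comb

combining_marks = 0x036F - 0x0300 + 1  # size of the U+0300..U+036F block
base_letters = 6

# Zero, one, two or three distinct marks on one letter, order ignored.
per_letter = sum(comb(combining_marks, k) for k in range(4))
total = base_letters * per_letter

print(combining_marks, per_letter, total)
# Compare against the whole Unicode code space, U+0000..U+10FFFF.
print(total > 0x110000)
```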


[...]
Of course, they might be lying when they say "Twitter counts the length
of a Tweet using the Normalization Form C (NFC) version of the text", I
have no idea. But they seem to have a good grasp of the issues involved,
and assuming they do what they say, at least Western European users
should be happy.

They *don't* seem to be doing what they say. [...]
How do you know that it takes 5 characters? Is that some Javascript
widget? I'd blame buggy Javascript before Twitter.

I go to twitter.com, log in and press that odd blue compose button in
the top-right. After pasting, it says I have 135 (down from 140)
characters left.

I'm pretty sure that will be a piece of Javascript running in your
browser that reports the number of characters in the text box. So, I
would expect that either:

- Javascript doesn't provide a way to normalize text;

- Twitter's Javascript developer(s) don't know how to normalize text, or
can't be bothered to follow company policy (shame on them);

- the Javascript just asks the browser, and the browser doesn't know how
to count characters the Twitter way;

etc. But of course posting to Twitter via your browser isn't the only way
to post. Twitter provide an API to twit, and *that* is the ultimate test
of whether Twitter's dev guide is lying or not.

My only question here is, since you can't post after 140 non-normalised
characters, who cares if the server counts it as less?

People who bypass the browser and write their own Twitter client.

Why the rendering engine?

If the text renderer assumes it can draw one code point at a time, it
will draw the "e", then reach the combining accent. It could, in
principle, backspace and draw it over the "e", but more likely it will
just draw it next to it.

What the renderer should do is walk the string, collecting characters
until it reaches one which is not a combining character, then draw them
all at once one on top of each other. A good font may have special
glyphs, or at least hints, for combining accents. For instance, if you
have a dot accent and a comma accent drawn one on top of the other, it
looks like a comma; what you are supposed to do is move them side by
side, so you have separate dot and comma glyphs.
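
The "walk the string" step can be sketched in a few lines of Python. This is a simplification rather than how any real renderer works: it treats a code point as combining when unicodedata.combining reports a non-zero canonical combining class, which misses some marks, and the function name clusters is made up for illustration.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch   # attach the mark to the preceding base
        else:
            groups.append(ch)  # start a new cluster to be drawn as a unit
    return groups

drawn = clusters("cafe\u0301")
print(len(drawn))  # 4 units to draw, although the string has 5 code points
```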

*Not sure whether talking about the link or my post*

The dev.twitter.com post.
 

Chris Angelico

Consider a single character. It can have 0 to 5 accents, in any
combination. Order doesn't matter, and there are no duplicates, so there
are:

0 accent: take 0 from 5 = 1 combination;
1 accent: take 1 from 5 = 5 combinations;
2 accents: take 2 from 5 = 5!/(2!*3!) = 10 combinations;
3 accents: take 3 from 5 = 5!/(3!*2!) = 10 combinations;
4 accents: take 4 from 5 = 5 combinations;
5 accents: take 5 from 5 = 1 combination

giving a total of 32 combinations for a single character. Since there are
four characters in this hypothetical language that take accents, that
gives a total of 4*32 = 128 distinct code points needed.

There's an easy way to calculate it. Instead of the "take N from 5"
notation, simply look at it as a set of independent bits - each of
your accents may be either present or absent. So it's 1<<5
combinations for a single character, which is the same 32 figure you
came up with, but easier to work with in the ridiculous case.

In reality, Unicode currently dedicates code points U+0300 to U+036F (112 code
points) to combining characters. It's not really meaningful to combine
all 112 of them, or even most of 112 of them...

If you *were* to use literally ANY combination, that would be 1<<112
which is... uhh... five billion yottacombinations. Don't bother
working that one out by the "take N" method, it'll take you too long
:)

Oh, and that's 1<<112 possible combining character combinations, so
you then need to multiply that by the number of base characters you
could use....
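
A quick sketch confirming that the bit-shift view agrees with the "take N" sums, and showing the size of the ridiculous case. It assumes Python 3.8 or later for math.comb; the names are illustrative.

```python
from math import comb

# Each accent is an independent present/absent bit, so n accents give 2**n subsets.
five_accents = 1 << 5
take_n_total = sum(comb(5, k) for k in range(6))
print(five_accents, take_n_total)  # 32 32

# Every subset of the 112 combining marks, before multiplying by base characters.
all_subsets = 1 << 112
print("%.2e" % all_subsets)
```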

ChrisA
 

Joshua Landau

More than double that.

Consider a single character. It can have 0 to 5 accents, in any
combination. Order doesn't matter, and there are no duplicates, so there
are:

0 accent: take 0 from 5 = 1 combination;
1 accent: take 1 from 5 = 5 combinations;
2 accents: take 2 from 5 = 5!/(2!*3!) = 10 combinations;
3 accents: take 3 from 5 = 5!/(3!*2!) = 10 combinations;
4 accents: take 4 from 5 = 5 combinations;
5 accents: take 5 from 5 = 1 combination

giving a total of 32 combinations for a single character. Since there are
four characters in this hypothetical language that take accents, that
gives a total of 4*32 = 128 distinct code points needed.

I didn't see "four characters", and I did (1 + 5 + 10) * 2 and came up
with 52...
Maybe I should get more sleep.
 

wxjmfauth

On Sunday, 11 August 2013 11:09:44 UTC+2, Steven D'Aprano wrote:
On Sun, 11 Aug 2013 07:17:42 +0100, Joshua Landau wrote:

The reason some accented letters have single code point forms is to
support legacy charsets; ...

No.

jmf

PS Unicode normalization is failing expectedly very well
with the FSR.
 

Joshua Landau

On Sunday, 11 August 2013 11:09:44 UTC+2, Steven D'Aprano wrote:

No.

jmf

PS Unicode normalization is failing expectedly very well
with the FSR.

No.

Joshua Landau

PS Proper arguments are falling expectedly very well with the internet
 

Joshua Landau

I'm pretty sure that will be a piece of Javascript running in your
browser that reports the number of characters in the text box. So, I
would expect that either:

- Javascript doesn't provide a way to normalize text;

- Twitter's Javascript developer(s) don't know how to normalize text, or
can't be bothered to follow company policy (shame on them);

- the Javascript just asks the browser, and the browser doesn't know how
to count characters the Twitter way;

etc. But of course posting to Twitter via your browser isn't the only way
to post. Twitter provide an API to twit, and *that* is the ultimate test
of whether Twitter's dev guide is lying or not.

Well, I've done some further testing and it seems you're right. It's
just the javascript that's wrong. I guess they did it for better
load-times.
 
