Counting utf-8 characters -special characters

majna · Sep 19, 2007

I have a character counter for a textarea which counts the characters.
A special character needs the same space as two normal characters because of
16-bit encoding.
The counter goes down by 2 when a special character, such as a
language-specific one, is added.

How can I count a special character as 1 character?
Thanks

Thomas 'PointedEars' Lahn · Sep 19, 2007

majna said:
I have a character counter for a textarea which counts the characters.
A special character needs the same space as two normal characters because of
16-bit encoding.

It doesn't.

The counter goes down by 2 when a special character, such as a
language-specific one, is added.

"€".length === 1

How can I count a special character as 1 character?

The same way. ECMAScript 3 implementations use UTF-16 encoded strings. RTFM.
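
As a minimal sketch of a counter built on that, assuming the text comes from the textarea's `value' property (the helper name `charsLeft` and the limit of 10 are made up for illustration):

```javascript
// Remaining characters for a limited textarea, counted in UTF-16 code
// units, which is what String.prototype.length reports.
// charsLeft is a hypothetical helper name, not part of any library.
function charsLeft(text, limit) {
  return limit - text.length;
}

// Escapes are used so the encoding of the source file cannot interfere.
var plain = "abc";
var withEuro = "ab\u20AC"; // "ab" followed by the euro sign

console.log(charsLeft(plain, 10));    // 7
console.log(charsLeft(withEuro, 10)); // 7, the euro sign counts once
```

In a page, the same function would be called with the textarea's current value on each input event.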

PointedEars

Thomas 'PointedEars' Lahn · Sep 19, 2007

Thomas said:
It doesn't.

Should have been -1. But even if most implementations were not
UTF-16 safe, that would not have sufficed. UTF-16 does not mean that
the representation of a glyph in that encoding always requires only
16 bits:

http://www.unicode.org/faq/utf_bom.html#6
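
A sketch of what that means for counting, assuming one wants code points rather than code units (the helper name `countCodePoints` is hypothetical, and only ES3-level features are used):

```javascript
// Count Unicode code points rather than UTF-16 code units.
// countCodePoints is a hypothetical helper name used for illustration.
function countCodePoints(text) {
  var count = 0;
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    // A high surrogate (0xD800-0xDBFF) followed by a low surrogate
    // (0xDC00-0xDFFF) encodes one code point above U+FFFF.
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < text.length) {
      var next = text.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate, it belongs to the same code point
      }
    }
    count++;
  }
  return count;
}

var clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF as a surrogate pair

console.log(clef.length);               // 2
console.log(countCodePoints(clef));     // 1
console.log(countCodePoints("\u20AC")); // 1
```

For characters in the Basic Multilingual Plane, such as the euro sign, both counts agree; they differ only once surrogate pairs are involved.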

"€".length === 1

Windows(-1252). Hmpf. Make that "€" any Unicode glyph (such as "₭")
and it is still true.

PointedEars

Thomas 'PointedEars' Lahn · Sep 19, 2007

Johannes said:
Thomas 'PointedEars' Lahn :

Should be, since '€' (U+20AC) is represented as a single UTF-16 code
point,

You mean code *unit*, _not_ code point. The latter is a completely
different thing, the *position* of a Unicode character in the definition tables.

Et non sequitur, as I have encoded my first followup accidentally with
Windows-1252, that is not the real code point of that character (it is
0x80). With UTF-16, you are correct, except that characters beyond
code point 63k, which would require more code units, are seldom used.

but it is not, e.g., in spidermonkey, which obviously uses UTF-8:

js> e = "€"
€
js> e.length
3
js> for (i = 0; i < e.length; i++) {print(e.charCodeAt(i).toString(16))}
e2
82
ac

Probably due to your SpiderMonkey build. It works just fine since Mozilla/4.0.

But then, OP mentions UTF-8 in the subject line.

Doesn't matter. The document encoding in use is transparent to the
application. The `value' property of an HTMLTextAreaElement object is of
type DOMString, which is fully compatible with ECMAScript (UTF-16) strings.
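
To make the difference between string length and UTF-8 size concrete, here is a rough sketch. It assumes `TextEncoder` is available as a global (current browsers and Node.js provide it; it did not exist in 2007), and the helper names `utf8ByteLength` and `utf8Hex` are made up for illustration:

```javascript
// Size of a string when encoded as UTF-8, as opposed to its length in
// UTF-16 code units. Both helpers are hypothetical names.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// The UTF-8 bytes of a string as space-separated hexadecimal values.
function utf8Hex(text) {
  var bytes = new TextEncoder().encode(text);
  var hex = [];
  for (var i = 0; i < bytes.length; i++) {
    hex.push(bytes[i].toString(16));
  }
  return hex.join(" ");
}

var euro = "\u20AC";

console.log(euro.length);          // 1
console.log(utf8ByteLength(euro)); // 3
console.log(utf8Hex(euro));        // e2 82 ac
```

The three bytes are the same values the quoted SpiderMonkey session printed, which suggests that session was seeing the UTF-8 bytes of the source text rather than one UTF-16 code unit.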

Hmmm. Is there *any* implementation that actually respects the requirement
of UTF-16?

Most would nowadays. Even Netscape 4.78 yields 1 for "€".length.

Besides, even assuming UTF-16, some "language specific" characters (whatever
that means...) take up more than one code point. Some characters may even
use one or more code points according to whether one uses decomposition
or not, e.g., 'é' is either U+00E9 or U+0065 U+0301.

No unique Unicode glyph has more than one code point; that would be a major
flaw in the standard (one that does not exist). However, a glyph may be
represented by more than one code unit, either due to its higher code point
(position), which requires surrogates, or due to composition (in the latter
case it consists of several glyphs, each with its own code point, and their
code units are concatenated according to the encoding used).

However, that does not matter for implementations of ECMAScript 3.
Especially, glyph composition is transparent to the application, if it
supports it.

http://www.unicode.org/faq/char_combmark.html#2
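
For the 'é' example quoted above, a small sketch of what `length` reports for each form. Note that `String.prototype.normalize` comes from ECMAScript 2015 and is not available in ES3 implementations; the variable names are made up for illustration:

```javascript
var precomposed = "\u00E9"; // é as the single code point U+00E9
var decomposed = "e\u0301"; // e followed by U+0301 COMBINING ACUTE ACCENT

console.log(precomposed.length);         // 1
console.log(decomposed.length);          // 2
console.log(precomposed === decomposed); // false

// NFC normalization composes the pair into U+00E9 where a precomposed
// form exists (requires an ES2015 or later engine).
var composed = decomposed.normalize("NFC");
console.log(composed.length);            // 1
console.log(composed === precomposed);   // true
```

Whether a counter should normalize input first is a design decision; this only shows that the two forms are counted differently by `length`.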

PointedEars

Thomas 'PointedEars' Lahn · Sep 19, 2007

Thomas said:
Most would nowadays. Even Netscape 4.78 yields 1 for "€".length.

One might argue then that Netscape 4.78 evaluates the Windows-1252 encoded
version of the respective currency mark which is one byte, and that it does
not support Unicode. However, "€".charCodeAt() yields 8364 (not 128),
String.fromCharCode(8365) yields "₭", and both "\u20AC".length and
String.fromCharCode(8365).length yield 1.
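
The checks described above can be written out as follows. This is a sketch for any current engine, not a record of the Netscape 4.78 session, and the variable names are made up:

```javascript
var euro = "\u20AC";                 // euro sign, U+20AC
var kip = String.fromCharCode(8365); // 8365 is 0x20AD, the kip sign

console.log(euro.charCodeAt(0)); // 8364, the Unicode value, not 128
console.log(kip === "\u20AD");   // true
console.log(euro.length);        // 1
console.log(kip.length);         // 1
```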

PointedEars
