Counting UTF-8 characters - special characters

Discussion in 'Javascript' started by majna, Sep 19, 2007.

  1. majna

    majna Guest

    I have a character counter for a textarea which counts the characters.
    A special character takes the same space as two normal characters because of
    16-bit encoding.
    The counter counts -2 when a special character, such as a
    language-specific character, is added.

    How can I count special characters as 1 character?
    Thanks
    majna, Sep 19, 2007
    #1

  2. majna wrote:
    > I have a character counter for a textarea which counts the characters.
    > A special character takes the same space as two normal characters because of
    > 16-bit encoding.


    It doesn't.

    > The counter counts -2 when a special character, such as a
    > language-specific character, is added.


    "€".length === 1

    > How to count specials like 1 char?


    The same way. ECMAScript 3 implementations use UTF-16 encoded strings. RTFM.


    PointedEars
    --
    Anyone who slaps a 'this page is best viewed with Browser X' label on
    a Web page appears to be yearning for the bad old days, before the Web,
    when you had very little chance of reading a document written on another
    computer, another word processor, or another network. -- Tim Berners-Lee
    Thomas 'PointedEars' Lahn, Sep 19, 2007
    #2

  3. Thomas 'PointedEars' Lahn wrote:
    > majna wrote:
    >> I have a character counter for a textarea which counts the characters.
    >> A special character takes the same space as two normal characters because of
    >> 16-bit encoding.

    >
    > It doesn't.
    >
    >> The counter counts -2 when a special character, such as a
    >> language-specific character, is added.


    That should have been -1. But even if most implementations were not
    UTF-16 safe, that would not have sufficed. UTF-16 does not mean that
    the representation of a glyph in that encoding always requires only
    16 bits:

    http://www.unicode.org/faq/utf_bom.html#6

    > "€".length === 1


    Windows(-1252). Hmpf. Make that "€" any Unicode glyph (such as "₭")
    and it is still true.
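
    To illustrate the point about UTF-16 not always needing only 16 bits, here
    is a minimal sketch that counts code points instead of code units by
    treating a surrogate pair as one character. The function name
    countCodePoints is hypothetical, and it uses only ES3-era features:

```javascript
// Count Unicode code points in a UTF-16 string. A high surrogate
// (0xD800-0xDBFF) followed by a low surrogate (0xDC00-0xDFFF) counts once.
// countCodePoints is a hypothetical helper name, not a built-in.
function countCodePoints(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate of the pair
      }
    }
    count++; // an unpaired surrogate is still counted as one
  }
  return count;
}

var euro = "\u20AC";        // EURO SIGN: one code unit
var clef = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF: two code units

var euroUnits = euro.length;            // 1
var clefUnits = clef.length;            // 2
var clefPoints = countCodePoints(clef); // 1
```

    For characters such as "€" or "₭" both counts agree; they differ only for
    characters beyond U+FFFF.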


    PointedEars
    --
    var bugRiddenCrashPronePieceOfJunk = (
    navigator.userAgent.indexOf('MSIE 5') != -1
    && navigator.userAgent.indexOf('Mac') != -1
    ) // Plone, register_function.js:16
    Thomas 'PointedEars' Lahn, Sep 19, 2007
    #3
  4. Johannes Baagoe wrote:
    > Thomas 'PointedEars' Lahn :
    >> "€".length === 1

    >
    > Should be, since '€' (U+20AC) is represented as a single UTF-16 code
    > point,


    You mean code *unit*, _not_ code point. The latter is a completely
    different thing, the *position* of a Unicode character in the definition tables.

    Et non sequitur: as I accidentally encoded my first followup with
    Windows-1252, that is not the real code point of that character (it is
    0x80). With UTF-16, you are correct, except that characters beyond
    code point U+FFFF (64K), which would require more code units, are seldom used.

    > but it is not, e.g., in spidermonkey, which obviously uses UTF-8:
    >
    > js> e = "€"
    > €
    > js> e.length
    > 3
    > js> for (i = 0; i < e.length; i++) {print(e.charCodeAt(i).toString(16))}
    > e2
    > 82
    > ac


    Probably due to your SpiderMonkey build. It works just fine since Mozilla/4.0.

    > But then, OP mentions UTF-8 in the subject line.


    Doesn't matter. The document encoding used is transparent to the
    application. The `value' property of an HTMLTextAreaElement object is of
    type DOMString, which is fully compatible with ECMAScript (UTF-16) strings.
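
    The length of 3 in the quoted session matches the number of UTF-8 bytes
    of "€" (e2 82 ac). If what is wanted is the size of a string once it is
    encoded as UTF-8, as opposed to its number of UTF-16 code units, a rough
    sketch could look like this. The name utf8ByteLength is hypothetical, and
    counting an unpaired surrogate as 3 bytes is an assumption of this sketch:

```javascript
// Estimate how many bytes a UTF-16 string occupies when encoded as UTF-8.
// utf8ByteLength is a hypothetical helper name, not a built-in.
function utf8ByteLength(str) {
  var bytes = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit < 0x80) {
      bytes += 1; // U+0000..U+007F
    } else if (unit < 0x800) {
      bytes += 2; // U+0080..U+07FF
    } else if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < str.length &&
               str.charCodeAt(i + 1) >= 0xDC00 &&
               str.charCodeAt(i + 1) <= 0xDFFF) {
      bytes += 4; // surrogate pair: a code point beyond U+FFFF
      i++;        // skip the low surrogate
    } else {
      bytes += 3; // rest of U+0800..U+FFFF (assumption: lone surrogates too)
    }
  }
  return bytes;
}

var euroBytes = utf8ByteLength("\u20AC"); // 3, while "\u20AC".length is 1
```

    The string itself stays UTF-16 inside the script; only the computed byte
    count refers to UTF-8.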

    >>> How to count specials like 1 char?

    >> The same way. ECMAScript 3 implementations use UTF-16 encoded strings.
    >> RTFM.

    >
    > Hmmm. Is there *any* implementation that actually respects the requirement
    > of UTF-16?


    Most would nowadays. Even Netscape 4.78 yields 1 for "€".length.

    > Besides, even assuming UTF-16, some "language specific" characters (whatever
    > that means...) take up more than one code point. Some characters may even
    > use one or more code points according to whether one uses decomposition
    > or not, e.g., 'é' is either U+00E9 or U+0065 U+0301.


    No unique Unicode glyph has more than one code point; that would be a major
    flaw in the standard (one that does not exist). However, a glyph may be
    represented by more than one code unit, either because its higher code
    point (position) requires it (surrogates) or because of composition
    (in the latter case it consists of several glyphs, each with its own code
    point, and their code units concatenated according to the encoding used).

    However, that does not matter for implementations of ECMAScript 3.
    In particular, glyph composition is transparent to the application, if it
    supports it.

    http://www.unicode.org/faq/char_combmark.html#2
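
    As a concrete check of the two forms of 'é' mentioned in the quoted text,
    the sketch below compares their length values and shows one rough way a
    counter could skip combining marks. The name countIgnoringCombiningMarks
    is hypothetical, and treating only U+0300 to U+036F as combining is a
    simplifying assumption; Unicode defines further combining ranges:

```javascript
var precomposed = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE
var decomposed = "e\u0301"; // "e" followed by COMBINING ACUTE ACCENT

var precomposedLength = precomposed.length; // 1
var decomposedLength = decomposed.length;   // 2

// Count code units, skipping combining marks so that a base letter plus
// its accents counts once. Hypothetical helper, not a built-in.
function countIgnoringCombiningMarks(str) {
  var count = 0;
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // Assumption: only the Combining Diacritical Marks block is skipped.
    if (unit >= 0x0300 && unit <= 0x036F) {
      continue;
    }
    count++;
  }
  return count;
}

var decomposedCount = countIgnoringCombiningMarks(decomposed); // 1
```

    In engines newer than the ones discussed in this thread (ES2015 and
    later), String.prototype.normalize("NFC") can compose such sequences
    before counting.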


    PointedEars
    --
    Prototype.js was written by people who don't know javascript for people
    who don't know javascript. People who don't know javascript are not
    the best source of advice on designing systems that use javascript.
    -- Richard Cornford, cljs, <f806at$ail$1$>
    Thomas 'PointedEars' Lahn, Sep 19, 2007
    #4
  5. Thomas 'PointedEars' Lahn wrote:
    > Johannes Baagoe wrote:
    >> Thomas 'PointedEars' Lahn :
    >>>> How to count specials like 1 char?
    >>> The same way. ECMAScript 3 implementations use UTF-16 encoded strings.
    >>> RTFM.

    >> Hmmm. Is there *any* implementation that actually respects the requirement
    >> of UTF-16?

    >
    > Most would nowadays. Even Netscape 4.78 yields 1 for "€".length.


    One might argue then that Netscape 4.78 evaluates the Windows-1252 encoded
    version of the respective currency mark which is one byte, and that it does
    not support Unicode. However, "€".charCodeAt() yields 8364 (not 128),
    String.fromCharCode(8365) yields "₭", and both "\u20AC".length and
    String.fromCharCode(8365).length yield 1.
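
    The checks above can be reproduced with escape sequences, so that the
    result does not depend on how the script source itself is encoded. The
    helper codeUnitsHex is a hypothetical name; it mirrors the charCodeAt
    loop from the quoted SpiderMonkey session:

```javascript
// List the UTF-16 code units of a string as hexadecimal strings.
// codeUnitsHex is a hypothetical helper name, not a built-in.
function codeUnitsHex(str) {
  var units = [];
  for (var i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16));
  }
  return units;
}

var euro = "\u20AC";                 // EURO SIGN
var kip = String.fromCharCode(8365); // U+20AD KIP SIGN

var euroCode = euro.charCodeAt(0);   // 8364
var euroHex = codeUnitsHex(euro);    // ["20ac"]
var kipLength = kip.length;          // 1
```

    A result of ["e2", "82", "ac"] instead would suggest that the source was
    read as individual UTF-8 bytes rather than decoded.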


    PointedEars
    --
    Anyone who slaps a 'this page is best viewed with Browser X' label on
    a Web page appears to be yearning for the bad old days, before the Web,
    when you had very little chance of reading a document written on another
    computer, another word processor, or another network. -- Tim Berners-Lee
    Thomas 'PointedEars' Lahn, Sep 19, 2007
    #5
