Regular Expressions

  • Thread starter Markos Charatzas

skeptic

Thomas Schodt said:
My point is that charAt() is *still* a simple index lookup.

Any Unicode 4.0 supplementary codepoints units in Strings are stored as
two char values (surrogates).

This means that Strings can potentially display as few as half as many
codepoint units as String.length() reports.

What do you mean by "codepoint units"? The javadocs speak of "code
points" (ranging U+0000 to U+10FFFF) as opposed to "code units"
(U+0000 to U+FFFF) as a means to represent "characters".
For Strings containing Unicode 4.0 supplementary codepoints the index
you must pass to charAt() no longer corresponds to the offset of the
codepoint unit in the visual representation of the String.

Visual representation is absolutely irrelevant here. Codepoints may
split (one codepoint may show as two glyphs), may combine, or may not
show at all.
Let's talk about characters instead (as listed in the big table of
Unicode characters at unicode.org).
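
To make that concrete, here is a small sketch (assuming Java 5+, where codePointCount() is available; the comments are what I would expect it to print):

public class CodePointDemo {
    public static void main(String[] args) {
        String accented = "e\u0301";      // 'e' + COMBINING ACUTE ACCENT: one glyph on screen
        String clef = "\uD834\uDD1E";     // U+1D11E MUSICAL SYMBOL G CLEF: one supplementary codepoint

        System.out.println(accented.length());                              // 2 chars
        System.out.println(accented.codePointCount(0, accented.length()));  // 2 codepoints, yet 1 glyph

        System.out.println(clef.length());                                  // 2 chars (a surrogate pair)
        System.out.println(clef.codePointCount(0, clef.length()));          // 1 codepoint
    }
}

Chars, codepoints and visible glyphs simply do not have to line up.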
You can use codePointAt() to get the 21-bit int value of a codepoint
in a String. When codePointAt() is called with the index of the first
surrogate of a Unicode 4.0 supplementary codepoint it returns the
21-bit int value of the entire codepoint (occupying the char values at
index and index+1 in the String). When codePointAt() is called with
the index of a "regular" Unicode codepoint it returns the 16-bit int
value of that codepoint, numerically equal to the value charAt()
would return.

You again missed the point. The really interesting thing is the
meaning of the argument to codePointAt(i). Just returning the i-th
member of the internal char[] array converted to int (no matter how)
is either wrong or contrary to common expectation.
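
To illustrate (a sketch, Java 5+; the class name is just for the example): the index you pass to codePointAt() counts chars, i.e. UTF-16 code units, not codepoints.

public class CodePointAtDemo {
    public static void main(String[] args) {
        String s = "\uD834\uDD1EX";   // U+1D11E followed by 'X'; s.length() is 3, not 2

        System.out.println(Integer.toHexString(s.codePointAt(0))); // 1d11e - the full supplementary codepoint
        System.out.println(Integer.toHexString(s.codePointAt(1))); // dd1e  - just the trailing (low) surrogate
        System.out.println(Integer.toHexString(s.codePointAt(2))); // 78    - 'X'
    }
}

So passing the index of a trailing surrogate gives back the surrogate value itself, not the codepoint it belongs to.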

I (and probably most of us) am used to thinking of a String as a
*vector* of *characters*, where the n-th element is the n-th
character. Anything else renders String methods like charAt(),
substring() and indexOf() quite useless.
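
The same mismatch shows up with those familiar methods (again just a sketch, Java 5+; the comments are what I would expect):

public class CharIndexDemo {
    public static void main(String[] args) {
        String s = "\uD834\uDD1EX";   // one supplementary "character" plus 'X'

        System.out.println(s.length());      // 3, not 2
        System.out.println(s.indexOf('X'));  // 2, not 1
        System.out.println(s.substring(1));  // starts with a lone low surrogate - half a "character"
    }
}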

You'd say that the String now holds UTF-16-encoded data rather than
characters. OK, agreed, no problem. But then what is the point of
codePointAt()?
[not so logical stuff skipped]
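
As far as I can tell, the usual answer is that codePointAt() is meant for walking a String codepoint by codepoint instead of char by char, roughly like this (a sketch, Java 5+):

public class CodePointIteration {
    public static void main(String[] args) {
        String s = "A\uD834\uDD1EB";   // 'A', U+1D11E, 'B'

        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.println("U+" + Integer.toHexString(cp).toUpperCase());
            i += Character.charCount(cp);   // advance by 1 or 2 chars
        }
    }
}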

Best Regards
P.S. Just trying to put some logic into the mess. May be wrong all
around.
 

Thomas Schodt

skeptic said:
What do you mean by "codepoint units"? The javadocs speak of "code
points" (ranging U+0000 to U+10FFFF) as opposed to "code units"
(U+0000 to U+FFFF) as a means to represent "characters".

My mistake: I edited "code units" to "codepoints" but forgot to delete
"units". Hence the double plural "codepoints units".

Visual representation is absolutely irrelevant here. Codepoints may
split (one codepoint may show as two glyphs), may combine, or may not
show at all.
Let's talk about characters instead (as listed in the big table of
Unicode characters at unicode.org).

I (and probably most of us) tend to think of Strings as sequences of
characters, where the n-th character is the n-th glyph in the visual
representation (except for a few control characters).

This is the case for US-ASCII Strings (disregarding control characters).

Except that in some contexts you can also put HTML markup in a String
and it will be interpreted.

You again missed the point.

o_O

The original question was
how do they implement the charAt(i)

I believe I answered that question.
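
Roughly this - a sketch, not the actual JDK source (the field name and bounds-check details are my guess, and newer JDKs store the data differently):

public final class MyString {
    private final char[] value;   // hypothetical backing array, as in the classic char[]-based String

    public MyString(char[] chars) {
        this.value = chars.clone();
    }

    public char charAt(int index) {
        if (index < 0 || index >= value.length) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return value[index];   // a plain array lookup - no surrogate handling at all
    }
}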

The really interesting thing is the
meaning of the argument to codePointAt(i). Just returning the i-th
member of the internal char[] array converted to int (no matter how)
is either wrong or contrary to common expectation.

I only intended to talk about "what is" not "what should be".
 
