encodeURI and unicode

Csaba Gabor · Mar 17, 2006

If I do alert(encodeURI(String.fromCharCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,
or am I guaranteed that all % encodings (from encodeURI) will have
exactly two hex digits following?

Perhaps someone could shed some light on this or point me to a quality
site. Be gentle, I know almost nothing about unicode.

Thanks,
Csaba Gabor from Vienna
alert(encodeURI(String.fromCharCode(2500))) => %E0%A7%84
alert(encodeURI(String.fromCharCode(25000))) => %E6%86%A8

Csaba Gabor · Mar 17, 2006

Csaba said:
If I do alert(encodeURI(String.fromCharCode(250)));
(in FF 1.5+ or IE6 on my winXP Pro) then I get: %C3%BA

Now I was sort of expecting something like %u... (and a single (4
digit?) unicode hex character num). Is that something for the future,

OK, I think I have most of it now. I was confusing encodeURI with what
I had earlier read at this site:
http://html.megalink.com/programmer/jstut/jsTabChars.html

but that is covering how to specify javascript (1.3) strings and not
what happens with encodeURI. I presume this is a reflection of the
spec that browsers must follow in transmitting information to servers.
Still, I was a little surprised.

Here is another interesting point:
var a=String.fromCharCode(131071);
alert(a.charCodeAt(0)+"\n"+a);

That code shows a char code of 65535, and if I use 131072 then the char
code goes to 0. In other words, it wraps.
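
The wrapping matches how ECMAScript defines String.fromCharCode: each argument is reduced to an unsigned 16-bit value, i.e. taken modulo 2^16. A small sketch of that behaviour follows. Note that String.fromCodePoint is a later (ES2015) addition, so it is an assumption here and is not available in the browsers mentioned above.

```javascript
// String.fromCharCode keeps only the low 16 bits of each argument.
const wrapped = String.fromCharCode(131071);      // 131071 = 0x1FFFF
const wrappedCode = wrapped.charCodeAt(0);        // low 16 bits: 0xFFFF
const zeroCode = String.fromCharCode(131072).charCodeAt(0); // 0x20000 -> 0

// String.fromCodePoint (ES2015) accepts the full range instead and
// stores the character as a surrogate pair of two 16-bit code units.
const astral = String.fromCodePoint(131071);
const astralInfo = {
  length: astral.length,             // number of 16-bit code units
  codePoint: astral.codePointAt(0),  // the original number
};

console.log(wrappedCode, zeroCode, astralInfo);
// 65535 0 { length: 2, codePoint: 131071 }
```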

I just have one question at this point. As I mentioned in my original
post,
String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?

Csaba

Thomas 'PointedEars' Lahn · Mar 18, 2006

Csaba said:
I just have one question at this point. As I mentioned in my original
post, String.fromCharCode(2500) == "\u09C4" => %E0%A7%84
The first equivalence is easy since 9C4 is the hex representation of
(decimal) 2500. But how do we get to the encodeURI output on the
right?

Those are percent-escaped representations of the three UTF-8 code
units that are required to encode the Unicode character at code
point U+09C4. See also ECMAScript 3 Final, subsection 15.1.3, and
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.
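
To make that concrete, here is a small sketch that puts encodeURI next to the legacy escape function, which is the one that produces the %uXXXX form. It assumes an environment that still provides the legacy escape global (current browsers and Node.js do).

```javascript
// encodeURI percent-escapes each UTF-8 code unit (byte) separately,
// so every "%" is followed by exactly two hex digits.
const samples = [250, 2500, 25000].map(function (code) {
  const ch = String.fromCharCode(code);
  return {
    code: code,
    encodeURI: encodeURI(ch),
    escape: escape(ch), // legacy function: %XX up to 0xFF, %uXXXX above
  };
});

console.log(samples);
// encodeURI: %C3%BA, %E0%A7%84, %E6%86%A8
// escape:    %FA,    %u09C4,    %u61A8

// decodeURI reverses the UTF-8 percent-escapes.
const roundTrip = decodeURI("%E0%A7%84") === "\u09C4";
console.log(roundTrip); // true
```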

PointedEars

Csaba Gabor · Mar 18, 2006

Thomas said:
Those are percent-escaped representations of the three UTF-8 code ...
<URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8 (spaceSeparatedHexValues) that does
essentially:

n = ...unicodeValue...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);
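
A runnable sketch of the same branching, for anyone who wants to try it. The names codePointToUtf8Hex and hex2 are made up for this illustration and are not the names used on the linked page; rejecting surrogate code points is left out, and an exception replaces the error string.

```javascript
// Two-digit uppercase hex for one byte (hypothetical helper).
function hex2(byte) {
  return byte.toString(16).toUpperCase().padStart(2, "0");
}

// Hypothetical name; returns space-separated hex bytes for code point n.
function codePointToUtf8Hex(n) {
  if (n < 0 || n > 0x10FFFF) {
    throw new RangeError("not a Unicode code point: " + n);
  }
  if (n <= 0x7F) {
    return hex2(n);
  }
  if (n <= 0x7FF) {
    return hex2(0xC0 | (n >> 6)) + " " +
           hex2(0x80 | (n & 0x3F));
  }
  if (n <= 0xFFFF) {
    return hex2(0xE0 | (n >> 12)) + " " +
           hex2(0x80 | ((n >> 6) & 0x3F)) + " " +
           hex2(0x80 | (n & 0x3F));
  }
  return hex2(0xF0 | (n >> 18)) + " " +
         hex2(0x80 | ((n >> 12) & 0x3F)) + " " +
         hex2(0x80 | ((n >> 6) & 0x3F)) + " " +
         hex2(0x80 | (n & 0x3F));
}

console.log(codePointToUtf8Hex(2500)); // E0 A7 84
```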

In words: If your positive integer (the char code) is not less than
17*16^4, report an error,
and if it is 7 bits or less (in the range [0,2^7), that is), just
return the two digit hex representation.

Otherwise, let k be the number of bits in your number. That is to say,
k is the smallest integer such that 2^k is greater than your number
(e.g. [2^(k-1),2^k)->k; [128,256)->8; [8,16)->4; [4,8)->3; [2,4)->2;
1->1; 0->0). Now, starting at the low end, section the number into
m=ceiling((k-1)/5) groups of 6 bits, with any leftovers in the final
(high) group, padded with 0 bits on the left as needed. Prefix all but
the high group with (bits) 10 (that is to say, OR them with (hex) 80).
Prefix the high group with the m+1 bits corresponding to 2^(m+1)-2.
That is to say, prefix the high group with (bits) 110 when there are
2 groups, with 1110 when there are 3, or with 11110 when there are 4.

Thus, if your number has 7 bits or fewer, it takes two hex digits to
represent. From 8 to 11 bits (inclusive) it takes four hex digits, from
12 to 16 bits (inclusive) it takes six, and from 17 to 21 bits
(inclusive) it takes eight hex digits.

Example: 2500 -> 0x9C4 ->
1001 1100 0100 so k=12 and m=3 ->
(0000) 100111 000100 (that first group got no bits so it is implied) ->
(1110)0000 (10)100111 (10)000100 ->
E0 A7 84
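
The bit-grouping description can also be followed literally. The sketch below does so: it counts the bits, cuts the number into 6-bit groups from the low end, and adds the prefixes. The function name utf8ByGrouping is an assumption for illustration, and the input is assumed to be an integer in the range 0 to 0x10FFFF.

```javascript
// Follows the description above: count bits, cut into 6-bit groups from
// the low end, then add the 10 / 110 / 1110 / 11110 prefixes.
function utf8ByGrouping(n) {
  if (n < 0x80) return [n];                 // 7 bits or fewer: one byte
  const k = n.toString(2).length;           // number of bits in n
  const m = Math.ceil((k - 1) / 5);         // number of groups (bytes)
  const bytes = [];
  let rest = n;
  for (let i = 0; i < m - 1; i++) {
    bytes.unshift(0x80 | (rest & 0x3F));    // low groups get prefix 10
    rest >>= 6;
  }
  // High group: m one-bits followed by a zero, i.e. (2^(m+1) - 2)
  // shifted to the top of the byte, then the leftover bits.
  const prefix = (Math.pow(2, m + 1) - 2) << (7 - m);
  bytes.unshift(prefix | rest);
  return bytes;
}

const example = utf8ByGrouping(2500).map(function (b) {
  return b.toString(16).toUpperCase();
}).join(" ");
console.log(example); // E0 A7 84
```

Comparing the result against encodeURI for a handful of values is a quick way to check the grouping rule at the boundaries (0x80, 0x800, 0x10000).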

With this it's also easy to see how to work from UTF-8 to unicode.
Given a byte, scan from the high (left) side for the first 0 bit.
If the high bit is 0, you are done and you have a "normal" character.
Otherwise, the character is specified by the next m bytes (including
the one the scan started with), where m is the number of 1s
encountered before finding that first 0 bit. Knock out all the bits
up to the first 0 bit, and the top 2 bits of all the rest, and
concatenate the remaining bits to get the char code.
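
The reverse direction can be sketched the same way: count the leading 1 bits of the first byte, strip the prefixes, and concatenate what is left. decodeUtf8Bytes is a hypothetical name, and the sketch does no validation of malformed or overlong input.

```javascript
// Decodes one UTF-8 sequence (array of byte values) to its code point.
function decodeUtf8Bytes(bytes) {
  const first = bytes[0];
  if ((first & 0x80) === 0) return first;   // high bit 0: plain ASCII

  // Count the 1 bits before the first 0 bit; for a lead byte this is
  // the total number of bytes in the sequence (2, 3 or 4).
  let count = 0;
  while (first & (0x80 >> count)) count++;

  // Keep only the bits after that first 0 bit.
  let codePoint = first & (0x7F >> count);
  for (let i = 1; i < count; i++) {
    // Drop the top two bits (10) of each following byte, keep the 6 below.
    codePoint = (codePoint << 6) | (bytes[i] & 0x3F);
  }
  return codePoint;
}

console.log(decodeUtf8Bytes([0xE0, 0xA7, 0x84])); // 2500
```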

Thus, we see the correspondence between UTF8 and unicode
Csaba
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/

Thomas 'PointedEars' Lahn · Mar 18, 2006

Would you please at least try to retain context in quotations?

Thanks Thomas, I like links.
It let me figure out the unicode / UTF8 mapping.
He's got a function, convertCP2UTF8 (spaceSeparatedHexValues) that does
essentially:

n = ...unicodeValue...
if (n <= 0x7F) return dec2hex2(n);
else if (n <= 0x7FF) return
dec2hex2(0xC0 | ((n>>6) & 0x1F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0xFFFF) return
dec2hex2(0xE0 | ((n>>12) & 0x0F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else if (n <= 0x10FFFF) return
dec2hex2(0xF0 | ((n>>18) & 0x07)) + ' ' +
dec2hex2(0x80 | ((n>>12) & 0x3F)) + ' ' +
dec2hex2(0x80 | ((n>>6) & 0x3F)) + ' ' +
dec2hex2(0x80 | (n & 0x3F));
else return '!erreur ' + dec2hex(n);

In words: If your positive integer (the char code) is not less
than 17*16^4, report an error,

Yes. The error is reported if the value is greater than or equal to
0x110000, because The Unicode Standard, version 4.0, does not provide
for more than 1114112 code points, starting with code point U+0000.
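
For reference, the limit can be checked directly in a modern engine. This sketch assumes String.fromCodePoint (ES2015), which did not exist when this thread was written.

```javascript
// 17 planes of 16^4 code points each: U+0000 through U+10FFFF.
const limit = 17 * Math.pow(16, 4);
console.log(limit === 0x110000, limit === 1114112); // true true

// String.fromCodePoint rejects anything at or above that limit.
let rejected = false;
try {
  String.fromCodePoint(limit);
} catch (e) {
  rejected = e instanceof RangeError;
}
console.log(rejected); // true

// The last valid code point needs four UTF-8 bytes.
const lastEncoded = encodeURI(String.fromCodePoint(0x10FFFF));
console.log(lastEncoded); // %F4%8F%BF%BF
```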

(BTW: You have mis-wrapped your abstraction of the original source
code; a trailing `return' statement would only return `undefined',
not the evaluated value of the following lines.)
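
The effect is easy to reproduce: automatic semicolon insertion ends a `return' statement at the line break. A minimal sketch, with made-up function names:

```javascript
// A line break directly after `return` ends the statement, so the
// expression on the next line is never used as the return value.
function brokenReturn(n) {
  return
    n + 1;
}

// Starting the expression on the same line as `return` fixes it.
function workingReturn(n) {
  return n + 1;
}

console.log(brokenReturn(1));  // undefined
console.log(workingReturn(1)); // 2
```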

and if it is 7 bits or less (in the range [0,2^7), that is), just
return the two digit hex representation.

Yes. One (8-bit) UTF-8 code unit suffices to encode Unicode characters
at these code points.

[...]
With this it's also easy to see how to work from UTF-8 to unicode.
[...]
Thus, we see the correspondence between UTF8 and unicode

You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

[...]
I found the following sites useful for seeing mappings and glyphs:
http://www.unicode.org/charts/About.html and
http://www.macchiato.com/unicode/chart/

But obviously you have not found <URL:http://unicode.org/faq/> yet.
Please make it so.

PointedEars

Csaba Gabor · Mar 19, 2006

Thomas said:
Would you please at least try to retain context in quotations?

I did.

You are not making any sense. `n' is assigned the code point (CP)
number, which is then converted into UTF-8 code units, according to
the algorithms specified in The Unicode Standard, version 4.0.

Sorry you didn't get it. It seems I was spot on in showing how to go
from CP number to the UTF-8 code units and back, as can be verified at
the nice
http://en.wikipedia.org/wiki/UTF-8

Csaba

Thomas 'PointedEars' Lahn · Mar 19, 2006

Csaba said:
I did.

You did not. I wrote (at least):

| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the "units" word.

Thank you for destroying the context again.

Sorry you didn't get it.
YMMD.

It seems I was spot on in showing how to go from CP number to
the UTF-8 code units and back, as can be verified at the nice
http://en.wikipedia.org/wiki/UTF-8

What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not. You
said this shows the relation between Unicode and UTF-8, which is nonsense,
because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandings.

Score adjusted

PointedEars

Csaba Gabor · Apr 20, 2006

Thomas said:
You did not. I wrote (at least):

In fact, I did try. You are not an authority on me so I will
appreciate it if you will refrain from making assertions on
things you can not know.

| Those are percent-escaped representations of the three UTF-8 code
| units that are required to encode the Unicode character at code
| point U+09C4. [...]
| <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You quoted me:

| > Those are percent-escaped representations of the three UTF-8 code ...
| > <URL:http://people.w3.org/rishida/scripts/uniview/conversion>.

You call that /retaining/ context? You even removed the

....

Yes.
Upon review, I find that I quoted exactly what I wanted to quote.

Thank you for destroying the context again.

What you think it seemed, and what you actually meant, is not relevant
regarding the question whether you have been making sense or not.

In fact it is, since making sense is always subjective.

You said this shows the relation between Unicode and UTF-8, which is nonsense,

Really? Care to offer a quote for your assertion about what I said?
I never even used the word relationship in this thread.

because the relation has always been there. UTF-8 is one possible encoding
to encode Unicode characters.

Better express yourself next time, this way you can avoid misunderstandings.

Now that I have expressed myself, you might consider
expressing yourself better next time.
In particular, ordering and making demands on people is neither polite,
nor very effective on newsgroups where there is no means of enforcement.
If there is something that you would like to see done differently, then
it might be more expedient to point out what bothers you about it, and
suggest what would make you happier. Just saying "Don't" or "That was
nonsense" is not very constructive in forestalling future occurrences.

Csaba Gabor from Vienna
