jscript + charCodeAt, how to set js encoding?

czechboy

Hi, I would like to use the charCodeAt function, but it returns the
wrong decimal values. The question is: how do I set the character
encoding of the .js file being executed? I cannot use any kind of tags
(<script>, <meta>, etc.) since the .js file is executed directly
through the client server, so there is no HTML. Is there any solution?
I need to display UTF-8 text in RTF1, so I need to convert the text to
Unicode values with charCodeAt.

Here is the function I would like to use:

function toCharRef(str){
  var charRefs = [], codePoint, i;
  for(i = 0; i < str.length; ++i){
    codePoint = str.charCodeAt(i);
    if(0xD800 <= codePoint && codePoint <= 0xDBFF){
      i++;
      codePoint = 0x2400 + ((codePoint - 0xD800) << 10) +
        str.charCodeAt(i);
    }
    charRefs.push('\\u' + codePoint);
  }
  return charRefs.join('');
};
 
VK

Hi, I would like to use charCodeAt function but it returns wrong dec.
numbers. The question is how to set character of the js. file
executed?

Javascript operates only and exclusively with Unicode (not UTF-8) but
Unicode itself. Even on, say, an ISO-8859-1 page, the text is seen as
Unicode from within Javascript. If the user typed some text into a
form field on this page and you read it into a Javascript function, it
is no longer ISO-8859-1 but Unicode. My guess, possibly wrong, is that
you are encoding twice: a string that is already Unicode is being
encoded again to make a Unicode string, and that is where the wrong
results come from.
 
Joost Diepenmaat

czechboy said:
Hi, I would like to use charCodeAt function but it returns wrong dec.
numbers.

Really? Please demonstrate that.
The question is how to set character of the js. file
executed? I can not use any kind of tags (<script> <meta> etc) since
the js. file is executed directly through the client server. So no
html etc... is there any solution?

I don't understand what you're saying.
I need to display utf-8 text in
RTF1 so I need to transfer the text to unicode by charCodeAt so that
is why I needed it.

charCodeAt already returns the unicode codepoint and strings in
javascript are unicode. The conversion to and from other encodings is
presumably handled by the scripting host (i.e. the browser).

function toCharRef(str){
var charRefs = [], codePoint, i;
for(i = 0; i < str.length; ++i){
codePoint = str.charCodeAt(i);
if(0xD800 <= codePoint && codePoint <= 0xDBFF){
i++;
codePoint = 0x2400 + ((codePoint - 0xD800) << 10) +
str.charCodeAt(i);
}
charRefs.push('\\u' + codePoint);
}
return charRefs.join('');
};

codePoint here is not a hexadecimal 4-character string.

Is there any reason you're doing this at all?

Joost.
 
Thomas 'PointedEars' Lahn

czechboy said:
Hi, I would like to use charCodeAt function but it returns wrong dec.
numbers.

No, it does not.
The question is how to set character of the js. file executed?

You don't. You can and SHOULD declare the character encoding of a resource
but that has to match the actual encoding or the parsing result is garbage.
I can not use any kind of tags (<script> <meta> etc) since
the js. file is executed directly through the client server.

What is a "client server"?
So no html etc... is there any solution? I need to display utf-8 text

UTF-8 is an encoding for Unicode characters. ECMAScript strings are Unicode
strings encoded with UTF-16, but that matters little in practice.
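
To illustrate the UTF-16 point, here is a minimal sketch. It assumes a modern engine: codePointAt and String.fromCodePoint are ES2015 additions and would not be available in the JScript host discussed in this thread. It shows that charCodeAt returns one 16-bit code unit, which equals the code point only for characters in the Basic Multilingual Plane.

```javascript
// Escapes are used so the example does not depend on the file's encoding.
var cHacek = "\u010D";                      // č, U+010D
var emoji = String.fromCodePoint(0x1F600);  // a code point outside the BMP

// For a BMP character, the code unit is the code point.
var bmpUnit = cHacek.charCodeAt(0);         // 269

// Outside the BMP, the string holds two code units (a surrogate pair).
var units = [emoji.charCodeAt(0), emoji.charCodeAt(1)]; // [0xD83D, 0xDE00]
var whole = emoji.codePointAt(0);           // 0x1F600

console.log(bmpUnit, emoji.length, units, whole);
```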
in RTF1 [...]

What is RTF1?
Here is the function I would like to use:

function toCharRef(str){
var charRefs = [], codePoint, i;
for(i = 0; i < str.length; ++i){
codePoint = str.charCodeAt(i);
if(0xD800 <= codePoint && codePoint <= 0xDBFF){
i++;
codePoint = 0x2400 + ((codePoint - 0xD800) << 10) +
str.charCodeAt(i);
}

What are you trying to accomplish here? You don't have to re-implement UTF-8.
charRefs.push('\\u' + codePoint);

charRefs.push('\\u' + codePoint.toString(16));
}
return charRefs.join('');
};

You don't have to do any of this if your target resource has its encoding
properly declared. That would first include a Content-Type HTTP header,
and a BOM, an XML declaration, or a `meta' element as fallback.


PointedEars
 
Thomas 'PointedEars' Lahn

VK said:
Javascript operates only and exclusively with Unicode (not UTF-8) but
Unicode itself.

That is incoherent gibberish, and qualifies as nonsense. Unsurprisingly.


PointedEars
 
Thomas 'PointedEars' Lahn

Thomas said:
charRefs.push('\\u' + codePoint.toString(16));

JFTR: That won't work (won't result in something interpretable as Unicode
literal) with code points less than 0x1000. You will need

charRefs.push('\\u' + leadingZero(codePoint.toString(16), 4));

where leadingZero() is a user-defined function that prepends leading
zeroes to the value of the first argument until its length equals the
value of the second one, and then returns the resulting string.

http://PointedEars.de/scripts/string.js
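
One possible leadingZero(), as a minimal sketch. This is not necessarily the implementation in the string.js linked above; the parameter names are assumptions made for this example.

```javascript
// Hypothetical helper: left-pad `value` with "0" until it is `width` long.
function leadingZero(value, width) {
  var s = String(value);
  while (s.length < width) {
    s = "0" + s;
  }
  return s;
}

// U+010D (269 decimal) needs one leading zero to make four hex digits.
var escaped = "\\u" + leadingZero((269).toString(16), 4);
console.log(escaped); // \u010d
```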


PointedEars
 
czechboy

To explain it in more detail: there is a JavaScript SDK plug-in in
FARR ( http://www.donationcoder.com/Forums/bb/index.php?topic=11804.0
). It uses the Microsoft scripting host to interpret JavaScript. I
would like to display Unicode results (Russian, Greek, etc.) as RTF1,
which means that I have to convert ěščřž etc. to their decimal values
with charCodeAt. When I call charCodeAt for the letter "č", the SDK
displays 356, but it should be 269. Do you think there might be an
error in the JavaScript SDK plug-in?

As for the function: it is what I found on the internet.
I am a JavaScript newbie ;)
 
VK

Javascript operates only and exclusively with Unicode (not UTF-8) but
That is incoherent gibberish, and qualifies as nonsense.

Oh come on, these are really the ground basics. Don't make yourself
look foolish.
 
VK

To explain it in more detail. There is a javascript SDK plug-in in
FARR (http://www.donationcoder.com/Forums/bb/index.php?topic=11804.0
). It uses Microsoft scripting host to interpret javascript. I would
like to display unicode result (russian, greek etc) as RTF1 which
means that I have to convert ěščřž etc. to its decimal interpretation
by charCodeAt. So when I call the charCodeAt for the letter "č" the
SDK displays 356 but it should be 269. Do you think there might be an
error in the javascript SDK plug-in?

Definitely. Or there is an encoding conflict: the source is in one
encoding but is sent with a Content-Type header claiming another, so
the engine cannot read the source correctly. But in the latter case
the page would look garbled as well.

Try:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<title>Demo</title>
<script type="text/javascript">
function f() {
  var str = document.forms[0].elements[0].value;
  for (var i = 0; i < str.length; i++) {
    window.alert(str.charCodeAt(i));
  }
}
</script>
</head>
<body>
<form action="" onsubmit="return f();">
<textarea cols="5" rows="1">&egrave;&nbsp;&eacute;</textarea>
<input type="submit" value="Demo">
</form>
<h1>Demo</h1>

</body>
</html>


It correctly reports 232, 160, 233 for me, which are

00E8 LATIN SMALL LETTER E WITH GRAVE
00A0 NO-BREAK SPACE
00E9 LATIN SMALL LETTER E WITH ACUTE

from the Latin-1 Unicode table
(http://www.unicode.org/charts/PDF/U0080.pdf)
 
Thomas 'PointedEars' Lahn

VK said:
Oh come on, these are really the ground basics.

*These* are not, because what you said is nonsense at best. It would appear
you are too stupid to realize how stupid you are.
Don't make yourself look foolish.

ROTFL.


PointedEars
 
Thomas 'PointedEars' Lahn

VK said:
What exactly you did not understand in my explanations?

There is nothing to be understood where there is no meaning.
Someone should force you to read the nonsense that you post.


PointedEars
 
Joost Diepenmaat

czechboy said:
To explain it in more detail. There is a javascript SDK plug-in in
FARR ( http://www.donationcoder.com/Forums/bb/index.php?topic=11804.0
). It uses Microsoft scripting host to interpret javascript. I would
like to display unicode result (russian, greek etc) as RTF1 which
means that I have to convert ěščřž etc. to its decimal interpretation
by charCodeAt. So when I call the charCodeAt for the letter "č" the
SDK displays 356 but it should be 269.

I would expect MS JScript to do something as basic as charCodeAt
correctly. My implementation (Firefox) correctly gives 268 (0x10C) for
"Č". In any case it's more likely that the text is wrongly converted
somewhere before it reaches the script (i.e. decoded from an encoding
that it is not in fact in).
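
Here is one way such a wrong conversion could produce exactly 356, as a sketch under assumptions: suppose the script file is saved as UTF-8 but the host decodes it as Windows-1250. The thread does not confirm that this is what happened in FARR. The lookup table is a hypothetical two-entry excerpt of Windows-1250, just enough for this one character.

```javascript
// Encode a code point in U+0080..U+07FF as its two UTF-8 bytes.
function utf8TwoByte(codePoint) {
  return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
}

// Assumption: tiny excerpt of the Windows-1250 table (byte -> code point).
// 0xC4 is "Ä" (U+00C4) and 0x8D is "Ť" (U+0164) in that code page.
var WINDOWS_1250_EXCERPT = { 0xC4: 0x00C4, 0x8D: 0x0164 };

// Decode each byte separately, as a single-byte code page would.
function misdecode(bytes) {
  var s = "";
  for (var i = 0; i < bytes.length; i++) {
    s += String.fromCharCode(WINDOWS_1250_EXCERPT[bytes[i]]);
  }
  return s;
}

var bytes = utf8TwoByte(0x010D);  // UTF-8 bytes of "č": [0xC4, 0x8D]
var garbled = misdecode(bytes);   // two characters instead of one
var codes = [garbled.charCodeAt(0), garbled.charCodeAt(1)];
console.log(codes);               // [196, 356]
```

Under that assumption charCodeAt itself is working correctly; it is simply being asked about a string that was already decoded wrongly.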
Do you think there might be an
error in the javascript SDK plug-in?

Could be. From your URL it appears that the host isn't unicode
aware.
And concerning the function. It is what I have found on the internet.
I an javascript newbie ;)

Don't use it. It's incorrect.
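
For reference, a sketch of what is wrong in the surrogate handling of the quoted function: the offset should be 0x10000 rather than 0x2400, and 0xDC00 has to be subtracted from the second code unit. The name toCodePoints is made up for this example, and leaving unpaired surrogates as they are is only one possible choice.

```javascript
// Hypothetical helper: return the Unicode code points of a UTF-16 string.
function toCodePoints(str) {
  var codePoints = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    // A high surrogate followed by a low surrogate forms one code point.
    if (0xD800 <= unit && unit <= 0xDBFF && i + 1 < str.length) {
      var next = str.charCodeAt(i + 1);
      if (0xDC00 <= next && next <= 0xDFFF) {
        unit = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        i++; // skip the low surrogate that was just consumed
      }
    }
    codePoints.push(unit);
  }
  return codePoints;
}

console.log(toCodePoints("\u010D\uD83D\uDE00")); // [269, 128512]
```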

Joost.
 
VK

There is nothing to be understood where there is no meaning.

There is no meaning only if someone lacks, or for some weird reason
pretends to lack, the most basic knowledge of the matter. Let's try
the baby-steps approach. At each step you can stop me by saying "I
don't get that". Ready?

[1]
This page is valid HTML 4.01 Strict

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN">
<html><head>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
<title>Demo</title>
<script type="text/javascript">
function f() {
  var str = document.forms[0].elements[0].value;
  for (var i = 0; i < str.length; i++) {
    window.alert(str.charCodeAt(i));
  }
}
</script>
</head>
<body>
<form action="" onsubmit="return f();">
<fieldset>
<legend>Demo</legend>
<textarea cols="5" rows="1">&divide;&egrave;&nbsp;&eacute;</textarea>
<input type="submit" value="Demo">
</fieldset>
</form>
</body>
</html>

[2]
The page above is valid HTML 4.01 Strict in ISO-8859-1.
This is true if it is opened as a local file, if the server does not
specify another charset in Content-Type, or if the server specifies
the same ISO-8859-1 charset in Content-Type. The statement "The page
above is valid HTML 4.01 Strict in ISO-8859-1" assumes one of these
conditions.

[3]
After the page loads, one may click the "Demo" button to execute the
inline script. After execution of
var str = document.forms[0].elements[0].value;
the variable str contains the form field value in Unicode. The actual
page encoding is irrelevant: iso-8859-1, x-sjis, big5 or anything
else.
 
czechboy

I would expect MS JScript to do something as basic as charCodeAt
correctly. My implementation (Firefox) correctly gives 268 (0x10C) for
"Č". In any case it's more likely that the text is wrongly converted
somewhere before it reaches the script (i.e. decoded from an encoding
that it is not in fact in).


Could be. From your URL it appears that the host isn't unicode
aware.


Don't use it. it's incorrect.

Joost.

Thank you for your help. Now it is working. Could you please post the
correct function? I am not skilled enough to write one myself ;)
Thank you.
 
Bart Van der Donck

Joost said:
charCodeAt already returns the unicode codepoint and strings in
javascript are unicode. The conversion to and from other encodings is
presumably handled by the scripting host (i.e. the browser).

Yes, and more specifically, by the character set of the web page.

<textarea>éè</textarea>

returns under

Western European, charset=iso-8859-1: 233 and 232
Central European, charset=iso-8859-2: 233 and 269
Eastern European, charset=iso-8859-5: 1097 and 1096
Russian, charset=KOI8-R: 1048 and 1061

But the real fun starts with multibyte-sequences (saved under ANSI,
not UTF-8):

Japanese, charset=shift-jis: 40167 (one character)
Trad. Chinese, charset=big5: 27654 (one character)
 
Joost Diepenmaat

czechboy said:
Thank you for your help. Now it is working. Could you please post the
correct function? I am not skilled enough to write one myself ;)
Thank you.

I think Thomas already posted a correction to the function in this
thread. Look it up.

Joost.
 
czechboy

I think Thomas already posted a correction to the function in this
thread. Look it up.

Joost.

Thanks. Meanwhile I have found another function which seems to work
fine. Is it the correct function?

(function(){

  var unicode = {

    /**
     * Convert a number to an uppercase hexadecimal string (no padding).
     */
    'dec2hex' : function(ts)
    {
      return (ts+0).toString(16).toUpperCase();
    },


    /**
     * Convert the low 8 bits of a number to two uppercase hex digits.
     */
    'dec2hex2' : function(ts)
    {
      var hexequiv = new Array ("0", "1", "2", "3", "4", "5", "6", "7",
        "8", "9", "A", "B", "C", "D", "E", "F");
      return hexequiv[(ts >> 4) & 0xF] + hexequiv[ts & 0xF];
    },


    /**
     * Convert the low 16 bits of a number to four uppercase hex digits.
     */
    'dec2hex4' : function(ts)
    {
      var hexequiv = new Array ("0", "1", "2", "3", "4", "5", "6", "7",
        "8", "9", "A", "B", "C", "D", "E", "F");
      return hexequiv[(ts >> 12) & 0xF] + hexequiv[(ts >> 8) & 0xF] +
        hexequiv[(ts >> 4) & 0xF] + hexequiv[ts & 0xF];
    },


    /**
     * Convert a space-separated list of hexadecimal code points to a
     * string, using a surrogate pair for code points above 0xFFFF.
     */
    'convertCP2Char' : function(ts)
    {
      var outputString = '';
      ts = ts.replace(/^\s+/, '');
      if(ts.length == 0)
        return "";
      ts = ts.replace(/\s+/g, ' ');
      var listArray = ts.split(' ');
      for(var i = 0; i < listArray.length; i++)
      {
        var n = parseInt(listArray[i], 16);
        if(n <= 0xFFFF)
          outputString += String.fromCharCode(n);
        else if (n <= 0x10FFFF)
        {
          n -= 0x10000;
          outputString += String.fromCharCode(0xD800 | (n >> 10)) +
            String.fromCharCode(0xDC00 | (n & 0x3FF));
        }
        else
          outputString += '!erreur ' + unicode.dec2hex(n) +'!';
      }
      return( outputString );
    },


    /**
     * Convert a space-separated list of hexadecimal code points to a
     * sequence of {\uN} groups, with N in decimal.
     */
    'convertCP2DecNCR' : function(ts)
    {
      var outputString = "";
      ts = ts.replace(/^\s+/, '');
      if(ts.length == 0)
        return "";
      ts = ts.replace(/\s+/g, ' ');
      var listArray = ts.split(' ');
      for(var i = 0; i < listArray.length; i++)
      {
        var n = parseInt(listArray[i], 16);
        outputString += ('{\\u' + n + '}');
      }
      return(outputString);
    },


    /**
     * Convert a string to a space-separated list of hexadecimal code
     * points, combining surrogate pairs. 'haut' (French for "high")
     * holds a pending high surrogate.
     */
    'convertChar2CP' : function(ts)
    {
      var outputString = "", haut = 0, n = 0;
      for(var i = 0; i < ts.length; i++)
      {
        var b = ts.charCodeAt(i);
        if(b < 0 || b > 0xFFFF)
          outputString += '!erreur ' + unicode.dec2hex(b) + '!';

        if(haut != 0)
        {
          if(0xDC00 <= b && b <= 0xDFFF)
          {
            outputString += unicode.dec2hex(0x10000 +
              ((haut - 0xD800) << 10) + (b - 0xDC00)) + ' ';
            haut = 0;
            continue;
          }
          else
          {
            outputString += '!erreur ' + unicode.dec2hex(haut) + '!';
            haut = 0;
          }
        }

        if(0xD800 <= b && b <= 0xDBFF)
          haut = b;
        else
          outputString += unicode.dec2hex(b) + ' ';
      }
      return( outputString.replace(/ $/, '') );
    },


    /**
     * Convert decimal numeric character references (&#N;) to a
     * space-separated list of hexadecimal code points.
     */
    'convertDecNCR2CP' : function(ts)
    {
      var outputString = '';
      ts = ts.replace(/\s/g, '');
      var listArray = ts.split(';');
      for (var i = 0; i < listArray.length-1; i++)
      {
        if(i > 0)
          outputString += ' ';
        var n = parseInt(listArray[i].substring(2, listArray[i].length),
          10);
        outputString += unicode.dec2hex(n);
      }
      return( outputString );
    }

  };


  /**
   * Convert Character to Decimal.
   *
   * @example "JavaScript".char2dec();
   * @result "{\u74}{\u97}{\u118}{\u97}{\u83}{\u99}{\u114}{\u105}{\u112}{\u116}"
   *
   * @name char2dec
   * @return String
   */
  if(!String.prototype.char2dec)
    String.prototype.char2dec = function()
    {
      return unicode.convertCP2DecNCR(unicode.convertChar2CP(this));
    };


  /**
   * Convert Decimal to Character.
   *
   * @example "&#74;&#97;&#118;&#97;&#83;&#99;&#114;&#105;&#112;&#116;".dec2char();
   * @result "JavaScript"
   *
   * @name dec2char
   * @return String
   */
  if(!String.prototype.dec2char)
    String.prototype.dec2char = function()
    {
      return unicode.convertCP2Char(unicode.convertDecNCR2CP(this));
    };

})();
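
For the original goal (Unicode text in RTF), a much shorter function may be enough. This is a minimal sketch, assuming the RTF \uN control word as commonly described: N is a signed 16-bit decimal number, followed by one fallback character for readers that do not understand \u. The function name toRtfUnicode and the choice of "?" as the fallback are assumptions, and line breaks are not handled.

```javascript
// Hypothetical helper: escape a string for use inside RTF text.
function toRtfUnicode(str) {
  var out = [];
  for (var i = 0; i < str.length; i++) {
    var unit = str.charCodeAt(i);
    if (unit < 0x80) {
      var ch = str.charAt(i);
      // Backslash and braces are control characters in RTF.
      out.push(ch === "\\" || ch === "{" || ch === "}" ? "\\" + ch : ch);
    } else {
      // Assumption: \uN takes a signed 16-bit value, so code units
      // above 0x7FFF are written as negative numbers.
      var signed = unit > 0x7FFF ? unit - 0x10000 : unit;
      out.push("\\u" + signed + "?");
    }
  }
  return out.join("");
}

console.log(toRtfUnicode("a\u010D")); // a\u269?
```

Because each UTF-16 code unit is written separately, no surrogate arithmetic is needed in this variant; whether a given RTF reader renders surrogate pairs written this way is something to test against the target application.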
 
