Matthias said:
That's a one-liner:
"&#49548;&#44060;".replace(/&#(\d+);/g, function (search, match) { return
"\\u" + parseInt(match, 10).toString(16); });
To be precise, it is at least a two-liner, for legibility:
"&#49548;&#44060;".replace(/&#(\d+);/g, function (search, match) {
  return "\\u" + parseInt(match, 10).toString(16).toUpperCase(); });
It also matters that the `return' keyword and return value expression start
on the same line, else `undefined' is returned due to automatic semicolon
insertion.
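A minimal sketch of that pitfall, with two hypothetical functions that differ only in where the line break falls:

```javascript
// Hypothetical demo functions; the names are illustrative only.
function brokenReturn() {
  return
    "value"; // never evaluated as the return value: `return` is terminated first
}

function workingReturn() {
  return "value";
}

console.log(brokenReturn());  // undefined
console.log(workingReturn()); // value
```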
However, I would write it as a general-purpose function:
function charRefToUnicodeEscape(s)
{
  return String(s).replace(
    /&#(\d+);/g,
    function (m, p1) {
      /* \uXXXX needs exactly four hex digits, so pad with zeros */
      var hex = parseInt(p1, 10).toString(16).toUpperCase();
      return "\\u" + "0000".slice(hex.length) + hex;
    });
}
var s = ...;
/* ... */
s = charRefToUnicodeEscape(s);
(Or make it a method of String.prototype.)
The issue remains that the HTML Document Character Set is UCS, which
supports code points beyond the Basic Multilingual Plane (U+10000 and
greater) with UCS-4, while a single ECMAScript Unicode escape sequence does
not: \uFFFF is the specified maximum. So those characters cannot be
represented by one escape sequence in ECMAScript; they need a surrogate pair
of two.
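One way around that limit is to emit a UTF-16 surrogate pair as two \uXXXX escapes. The following is only a sketch under that assumption; the names charRefToEscapedUnits and hex4 are hypothetical, and references above U+10FFFF are simply left untouched:

```javascript
// Hypothetical helper: format one 16-bit code unit as \uXXXX (four hex digits).
function hex4(unit) {
  var h = unit.toString(16).toUpperCase();
  return "\\u" + "0000".slice(h.length) + h;
}

// Hypothetical variant of the conversion that splits code points
// above U+FFFF into a high and a low surrogate.
function charRefToEscapedUnits(s) {
  return String(s).replace(/&#(\d+);/g, function (m, p1) {
    var cp = parseInt(p1, 10);
    if (cp > 0x10FFFF) {
      return m; // not a valid code point: keep the reference as it is
    }
    if (cp <= 0xFFFF) {
      return hex4(cp);
    }
    var offset = cp - 0x10000;
    var high = 0xD800 + (offset >> 10);   // upper 10 bits
    var low = 0xDC00 + (offset & 0x3FF);  // lower 10 bits
    return hex4(high) + hex4(low);
  });
}

console.log(charRefToEscapedUnits("&#65;"));     // \u0041
console.log(charRefToEscapedUnits("&#128512;")); // \uD83D\uDE00
```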
However, the solution to that problem would be simple (and oft-mentioned
before):
Do not output or store character references, but output raw code units and
declare the proper character encoding (e.g. UTF-7, -8, -16 or -32).
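If existing data already contains decimal character references, they could be turned back into the characters themselves before output. This is a sketch only: decodeCharRefs is a hypothetical name, and it assumes an implementation that provides String.fromCodePoint (ECMAScript 2015 or later):

```javascript
// Hypothetical helper: replace decimal character references
// with the characters they denote.
function decodeCharRefs(s) {
  return String(s).replace(/&#(\d+);/g, function (m, p1) {
    var cp = parseInt(p1, 10);
    // String.fromCodePoint throws a RangeError above U+10FFFF,
    // so such references are left unchanged here.
    return cp <= 0x10FFFF ? String.fromCodePoint(cp) : m;
  });
}

var decoded = decodeCharRefs("&#49548;&#44060;");
console.log(decoded === "\uC18C\uAC1C"); // true
```

The resulting string can then be written out as-is, provided the document or response declares the encoding that is actually used, for example UTF-8.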
PointedEars