convert NCR to \u?

Ken Williams · Apr 6, 2009

Hi, I'm trying to convert text in "numerical character reference" format
to this JavaScript escape (\u) format. For example, &#49548;&#44060;
should become \uC18C\uAC1C.

I need to do this conversion in a Javascript routine or PHP.

Any ideas? I'm lost

Ben Crowell · Apr 6, 2009

Ken said:
Hi, I'm trying to convert text in "numerical character reference" format
to this JavaScript escape (\u) format. For example, &#49548;&#44060;
should become \uC18C\uAC1C.

I need to do this conversion in a Javascript routine or PHP.

The type of answer you need would probably depend on your programming
background. If you have some general knowledge of programming, then this
might do it for you. You can use a regular expression to pick out the
strings of the form &#...;. Then you need to convert decimal to
hexadecimal. As far as I know, js has no built-in function
like sprintf in C or perl that will do conversion to hex for you,
so you just need to hand-code a function to do it. Hope that helps.
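
The steps described above (pick out the references with a regular expression, then convert decimal to hexadecimal by hand) could look roughly like the sketch below. The names toHex and ncrToEscape are made up for illustration, and the sketch assumes decimal references only and code points that need no zero padding.

```javascript
// Hypothetical helper: convert a non-negative integer to uppercase hex
// by repeated division, without using Number.prototype.toString.
function toHex(n) {
  var digits = "0123456789ABCDEF";
  var out = "";
  do {
    out = digits.charAt(n % 16) + out;
    n = Math.floor(n / 16);
  } while (n > 0);
  return out;
}

// Hypothetical helper: replace each decimal reference such as &#49548;
// with a backslash, "u", and the hex digits. No zero padding is done
// here, so this only yields valid escapes for code points >= 0x1000.
function ncrToEscape(text) {
  return text.replace(/&#(\d+);/g, function (whole, dec) {
    return "\\u" + toHex(parseInt(dec, 10));
  });
}

console.log(ncrToEscape("&#49548;&#44060;"));
```

The callback receives the whole match first and the captured digits second, so only the digits are passed to parseInt.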

Jon Gómez · Apr 7, 2009

Ken said:
Hi, I'm trying to convert text in "numerical character reference" format
to this JavaScript escape (\u) format. For example, &#49548;&#44060;
should become \uC18C\uAC1C.

Are we talking about strings you have access to that contain, for
example, the character sequences

'&' '#' '4' '9' '5' '4' '8'

and

'\\' 'u' 'C' '1' '8' 'C'

or are we talking about a situation in which you've used the numerical
character references that have become individual characters in your
text, and you want to somehow produce the second string example from a
given singleton character?

In the first case, my thought is that I'd write some decimal to
hexadecimal converter for the digits, and use string functions to
extract, concatenate, etc.

In the second case, you might be able to use a string function to get
the char code.

Either way, here are some starting references:

https://developer.mozilla.org/En/Core_JavaScript_1.5_Reference/Global_Functions
https://developer.mozilla.org/en/Core_JavaScript_1.5_Reference/Global_Objects/String

I'm thinking things like
parseInt
String charCodeAt
String charAt
String fromCharCode
Number toString

You should also check out their definitions in ECMA-262 Edition 3, which
is the standard for ECMAScript. For example, Number.prototype.toString
is implementation-dependent for radix in range 2-36 but not 10, so using
16, for example, could be unsafe. So you might be able to get
Javascript to convert to hex strings for you, but you might have a
problem there...
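
For the second case (the references have already become real characters), a rough sketch using charCodeAt might look like this. The name charsToEscape is hypothetical, and the sketch assumes that toString(16) produces the conventional hex digits, which is exactly the point in question above.

```javascript
// Hypothetical helper: walk the string one UTF-16 code unit at a time
// and write every non-ASCII unit as \uXXXX; ASCII is copied unchanged.
function charsToEscape(text) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var code = text.charCodeAt(i);
    if (code < 128) {
      out += text.charAt(i);
    } else {
      // Assumes toString(16) yields ordinary hex digits.
      var hex = code.toString(16).toUpperCase();
      while (hex.length < 4) {
        hex = "0" + hex;
      }
      out += "\\u" + hex;
    }
  }
  return out;
}

console.log(charsToEscape("abc \uC18C\uAC1C"));
```

Going the other way, String.fromCharCode(parseInt("49548", 10)) turns the digits of a reference back into the character itself.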

Ken Williams · Apr 7, 2009

Ben said:
The type of answer you need would probably depend on your programming
background. If you have some general knowledge of programming, then this
might do it for you. You can use a regular expression to pick out the
strings of the form &#...;. Then you need to convert decimal to
hexadecimal. As far as I know, js has no built-in function
like sprintf in C or perl that will do conversion to hex for you,
so you just need to hand-code a function to do it. Hope that helps.

I didn't notice that. I think it's as simple as doing a hex conversion
on the decimal number part using a regex.

Thanks.

Jorge · Apr 7, 2009

(...) As far as I know, js has no built-in function
like sprintf in C or perl that will do conversion to hex for you,
so you just need to hand-code a function to do it.

javascript:alert((65535).toString(16))...

Matthias Reuter · Apr 7, 2009

Ken said:
Hi, I'm trying to convert text in "numerical character reference" format
to this JavaScript escape (\u) format. For example, &#49548;&#44060;
should become \uC18C\uAC1C.

That's a one-liner:

"&#49548;&#44060;".replace(/&#(\d+);/g, function (search, match) { return
"\\u" + parseInt(match, 10).toString(16); });

Explained in detail:

// This is your original string
var original = "&#49548;&#44060;";

// This is a regexp pattern. It says:
// look for "&#" followed by at least one digit, followed by ";",
// and remember the digits.
// Do this for every occurrence.
var pattern = /&#(\d+);/g;

// This is a function that takes the whole match (search) and the
// remembered digits (match)
var matchFunction = function (search, match) {
    // convert match to a number
    var n = parseInt(match, 10);
    // convert that number to a hex string
    var hex = n.toString(16);

    // return that hex string, but prepend \u
    return "\\u" + hex;
};

// now take the original string, and replace what the pattern finds
// using the function defined above
var modified = original.replace(pattern, matchFunction);
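
One detail worth keeping in mind: a \u escape needs exactly four hex digits, so small code points such as &#233; have to be zero-padded. A possible variant is sketched below; the name ncrToPaddedEscape is hypothetical, and leaving references above 65535 untouched is an assumption of this sketch, not something the thread settles.

```javascript
// Hypothetical variant of the replacement above that always emits
// exactly four uppercase hex digits for code points up to 65535.
function ncrToPaddedEscape(text) {
  return text.replace(/&#(\d+);/g, function (whole, digits) {
    var n = parseInt(digits, 10);
    if (n > 0xFFFF) {
      // Assumption: leave references beyond \uFFFF as they are.
      return whole;
    }
    var hex = n.toString(16).toUpperCase();
    // Left-pad with zeros, then keep the last four characters.
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

console.log(ncrToPaddedEscape("caf&#233; &#49548;&#44060;"));
```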

Matt

Ben Crowell · Apr 7, 2009

Jorge said:
javascript:alert((65535).toString(16))...

Aha! Cool, thanks for pointing that out!

For those who need something more full-featured, googling on "javascript
sprintf" gives quite a few hits. Anyone know if there's a high-quality
implementation out there that's open source?

Jon Gómez · Apr 7, 2009

Jorge said:
javascript:alert((65535).toString(16))...

Isn't that very dangerous?

From what I understand, this is what happens when you do

(65535).toString(16)

First, the 65535 is treated as a numeric literal, of type Number (but not
the object type), and when a property is being accessed on it, ToObject
gets called on it (11.2.1), which results in it becoming a Number object
(9.9). Therefore, you are calling toString() on a Number object, which
normally would call the prototype function toString() (15.7.4.2). As I
said in my previous post, this function is undefined for the value 16:

"If radix is an integer from 2 to 36, but not 10, the result is a
string, the choice of which is implementation-dependent."

Therefore, the result could be "hello world, I love you", and that would
be an acceptable result in ECMAScript.

Admittedly, Javascript is probably safe, as the Mozilla MDC specifies
that it constructs the appropriate hexadecimal string from the number.

Also, in the expression

javascript:alert((65535).toString(16))

Isn't "javascript:" treated as a label, and so ignored in this case?

Jon.

Jon Gómez · Apr 7, 2009

Jon said:
"If radix is an integer from 2 to 36, but not 10, the result is a
string, the choice of which is implementation-dependent."

That quote is from ECMA-262, 15.7.4.2.

Jon.

Ben Crowell · Apr 7, 2009

Jon said:
Isn't that very dangerous?

From what I understand, this is what happens when you do

(65535).toString(16)

First, the 65535 is treated as a numeric literal, of type Number (but not
the object type), and when a property is being accessed on it, ToObject
gets called on it (11.2.1), which results in it becoming a Number object
(9.9). Therefore, you are calling toString() on a Number object, which
normally would call the prototype function toString() (15.7.4.2). As I
said in my previous post, this function is undefined for the value 16:

"If radix is an integer from 2 to 36, but not 10, the result is a
string, the choice of which is implementation-dependent."

I suspect the reason it's implementation-dependent is that there's no
standard notation for bases >16. However, the choice of 36 is clearly
intended so that you can use digits 0-9 and letters a-z. Base 36 could
be a useful way of representing things like hash codes in a fairly
compact form that can be written down, put into urls, etc. In firefox,
(35).toString(36) does give 'z'. I find it hard to imagine that anyone
would produce an implementation of js that wouldn't have the expected
behavior for base 16.
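
As a small illustration of the base-36 idea, assuming an engine that behaves as Firefox does above, a number can be written compactly and read back with parseInt. The value of id is arbitrary example data.

```javascript
// Arbitrary example value, e.g. some numeric hash or record id.
var id = 1234567890;

// Compact form using the digits 0-9 and the letters a-z.
var compact = id.toString(36);

// parseInt with radix 36 reads the compact form back into a number.
var back = parseInt(compact, 36);

console.log(compact, back === id);
```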

Also, in the expression

javascript:alert((65535).toString(16))

Isn't "javascript:" treated as a label, and so ignored in this case?

I think Jorge wrote it that way for convenience so people could try it
in a browser. In firefox, if you paste this string into the url area,
it executes the javascript code. The "javascript:" part of the string
isn't js code, it's part of the url, analogous to "http:".

Thomas 'PointedEars' Lahn · Apr 7, 2009

Matthias said:
That's a one-liner:

"&#49548;&#44060;".replace(/&#(\d+);/g, function (search, match) { return
"\\u" + parseInt(match, 10).toString(16); });

To be precise, at least a two-liner, for legibility:

"&#49548;&#44060;".replace(/&#(\d+);/g, function (search, match) {
  return "\\u" + parseInt(match, 10).toString(16).toUpperCase(); });

It also matters that the `return' keyword and return value expression start
on the same line, else `undefined' is returned due to automatic semicolon
insertion.
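
A minimal demonstration of that pitfall, with two hypothetical functions that differ only in where the line break falls:

```javascript
// The line break directly after `return` triggers automatic semicolon
// insertion, so this is parsed as `return;` followed by an unused
// expression statement.
function broken() {
  return
  "\\u" + (65535).toString(16);
}

// Here the expression starts on the same line as `return`, so the
// line break inside the expression is harmless.
function fixed() {
  return "\\u" +
    (65535).toString(16);
}

console.log(typeof broken(), fixed());
```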

However, I would write it as a general-purpose function:

function charRefToUnicodeEscape(s)
{
  return String(s).replace(
    /&#(\d+);/g,
    function(m, p1) {
      return "\\u" + parseInt(p1, 10).toString(16).toUpperCase();
    });
}

var s = ...;
/* ... */
s = charRefToUnicodeEscape(s);

(Or make it a method of String.prototype.)

The issue remains that the HTML Document Character Set is UCS, which
supports code points beyond the Basic Multilingual Plane (U+10000 and
greater) with UCS-4, while ECMAScript Unicode escape sequences do not:
\uFFFF is the specified maximum. So those characters cannot be represented
equally in ECMAScript.

However, the solution to that problem would be simple (and oft-mentioned
before):

Do not output or store character references, but output raw code units and
declare the proper character encoding (e.g. UTF-7, -8, -16 or -32).
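
If escapes must be produced anyway, one possible workaround is to split a code point above U+FFFF into a UTF-16 surrogate pair and write two \u escapes. This is only a sketch with a hypothetical function name, and it assumes the consumer treats the string as UTF-16 code units.

```javascript
// Hypothetical helper: format one code point as one \uXXXX escape, or
// as a surrogate pair of two escapes when it lies above U+FFFF.
function codePointToEscape(cp) {
  function esc(unit) {
    var hex = unit.toString(16).toUpperCase();
    return "\\u" + ("0000" + hex).slice(-4);
  }
  if (cp <= 0xFFFF) {
    return esc(cp);
  }
  var offset = cp - 0x10000;              // 20 significant bits
  var high = 0xD800 + (offset >> 10);     // upper 10 bits
  var low = 0xDC00 + (offset & 0x3FF);    // lower 10 bits
  return esc(high) + esc(low);
}

// U+1D11E (decimal 119070), a code point beyond the BMP.
console.log(codePointToEscape(119070));
```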

PointedEars

Jorge · Apr 7, 2009

(...)
Therefore, the result could be "hello world, I love you", and that would
be an acceptable result in ECMAScript.
(...)

15.7.4.2 Number.prototype.toString (radix)
^^^^^^^
http://en.wikipedia.org/wiki/Radix

Jorge · Apr 7, 2009

I think Jorge wrote it that way for convenience so people could try it
in a browser. In firefox, if you paste this string into the url area,
it executes the javascript code. The "javascript:" part of the string
isn't js code, it's part of the url, analogous to "http:".

Yes, exactly, that's it. Thanks. A 'bookmarklet', the
'javascript:' ('pseudo') protocol.

Jon Gómez · Apr 7, 2009

Jorge said:
15.7.4.2 Number.prototype.toString (radix)
^^^^^^^
http://en.wikipedia.org/wiki/Radix

I know what a radix is, but Number.prototype.toString() isn't guaranteed to.

Jon.

Jon Gómez · Apr 7, 2009

Ben said:
I think Jorge wrote it that way for convenience so people could try it
in a browser. In firefox, if you paste this string into the url area,
it executes the javascript code. The "javascript:" part of the string
isn't js code, it's part of the url, analogous to "http:".

Makes sense.

Jon.

Jorge · Apr 7, 2009

I know what a radix is, but Number.prototype.toString() isn't guaranteed to.

"If radix is an integer from 2 to 36, but not 10, the result is a
string, the choice of which is implementation-dependent." means that
an implementation might have chosen to output ((65535).toString(16))
as '0xffff' or 'ffffH' or '$ffff' or 'hFFFF', etc. instead of 'ffff',
but of course not that it might output "hello world, I love you".

Jon Gómez · Apr 7, 2009

Jorge said:
"If radix is an integer from 2 to 36, but not 10, the result is a
string, the choice of which is implementation-dependent." means that
an implementation might have chosen to output ((65535).toString(16))
as '0xffff' or 'ffffH' or '$ffff' or 'hFFFF', etc. instead of 'ffff',
but of course not that it might output "hello world, I love you".

I disagree. That is certainly the most reasonable interpretation, but I
don't think the specification requires it. In fact, I hereby make a
base 16 with the encoding:

0 -> "hi"
1 -> "hello"
2 -> "greetings"
3 -> "whatup"
4 -> "hiya"
5 -> "sup"
6 -> "howdy"
7 -> "hearts"
8 -> "well-met"
9 -> "aloha"
a -> "ciao"
b -> "que-tal"
c -> "hola"
d -> "yo"
e -> "what-ho"
f -> "salutations"

10 -> "hi & hello"
11 -> "hello & hello"
2f -> "salutations & greetings"

[x,y,...,z] -> [z] & ... & [y] & [x]
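
Taken literally, the table above can be implemented in a few lines. This is just a playful sketch with made-up names; it emits the least significant digit first, as the last rule describes.

```javascript
// The sixteen "digits" from the table above, indexed by value.
var GREETINGS = [
  "hi", "hello", "greetings", "whatup", "hiya", "sup", "howdy", "hearts",
  "well-met", "aloha", "ciao", "que-tal", "hola", "yo", "what-ho",
  "salutations"
];

// Hypothetical function: write a non-negative integer in the friendly
// base 16, least significant digit first, joined with " & ".
function friendlyHex(n) {
  var parts = [];
  do {
    parts.push(GREETINGS[n % 16]);
    n = Math.floor(n / 16);
  } while (n > 0);
  return parts.join(" & ");
}

console.log(friendlyHex(0x2f));
```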

My base 16 representation is a lot friendlier than the normal one.

Jon.

Jon Gómez · Apr 7, 2009

I meant also to say that I still don't think there even needs to be a
bijection between the strings and the numbers. I don't think the
standard enforces in its language the idea that they need to be strings
that unambiguously represent the numbers, even if it is strongly implied.

Jon.
PS: If my other post didn't cancel (the one containing the string
"cardinality"), please ignore it. It was mis-stated.

Jorge · Apr 7, 2009

(...)

My base 16 representation is a lot friendlier than the normal one.

And you are a little crazier than normal, hehehe

Jon Gómez · Apr 7, 2009

Jorge said:
"If radix is an integer from 2 to 36, but not 10, the result is a
string, the choice of which is implementation-dependent." means that
an implementation might have chosen to output ((65535).toString(16))
as '0xffff' or 'ffffH' or '$ffff' or 'hFFFF', etc. instead of 'ffff',
but of course not that it might output "hello world, I love you".

Okay, I've been thinking about it some more, and I think, really, I'm
complaining that the specification could have been more explicit in
stating what we all intuitively grasp from it. But the result is that
there is a hole in the language. This made me think it might be better
just to write a simple implementation of one's own, since it really
isn't too hard, and that eliminates any unknown or unpredictable
variation. But, yeah, I'm probably being nit-picky. :'(.

Jon.
