decoding numeric HTML entities

Andreas Gohr · Jun 10, 2005

Hi all!

I need a way to decode numeric HTML entities (like &#220;) back to
their UTF-8 character to place them into a textarea. I tried the
following but it doesn't work in IE.

data = data.replace(/&#(\d+);/g,
function() {
return String.fromCharCode(RegExp.$1);
});

Has anyone a crossbrowser solution?

Regards

fox · Jun 11, 2005

Andreas said:
Hi all!

I need a way to decode numeric HTML entities (like &#220;) back to
their UTF-8 character to place them into a textarea. I tried the
following but it doesn't work in IE.

data = data.replace(/&#(\d+);/g,
function() {
return String.fromCharCode(RegExp.$1);
});

try:

function(wholematch, parenmatch1) {
return String.fromCharCode(+parenmatch1);
}

Andreas Gohr · Jun 11, 2005

It works! You saved my day

Just for my understanding: What does the plus sign do? Is it a typo and
just happens to work or does it do some magic?

Thanks again
Andi

fox · Jun 11, 2005

Andreas said:
It works! You saved my day

Just for my understanding: What does the plus sign do? Is it a typo and
just happens to work or does it do some magic?

the parenthetic match from the regex is *usually* interpreted as a
string value. In JavaScript, because data types are "flexible", using
the plus sign (a unary operator) removes any ambiguity as to the type of
the value passed -- if the characters are all digits, then a number will
be interpreted (otherwise, you'll receive a "NaN" result, so, in most
cases, care should be taken to make sure that the string will be all
digits). The behavior of this technique is different than using parseInt
which *will* return a numerical value if a string *starts* with digit
characters:

parseInt("123 Rue Morgue") => 123

[ comparatively:
+("123 Rue Morgue") => NaN
]
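
To make the difference concrete, here is a small sketch that runs both conversions over the same strings. The sample inputs and the variable names are made up for illustration; the empty string is included because it is a case where the two disagree in the other direction.

```javascript
// Compare unary plus with parseInt on the same inputs.
var inputs = ["123", "123 Rue Morgue", "", "abc"];
var results = [];

for (var i = 0; i < inputs.length; i++) {
  results.push({
    input: inputs[i],
    unaryPlus: +inputs[i],            // whole string must be numeric
    parsed: parseInt(inputs[i], 10)   // leading digits are enough
  });
}

// "123"            -> unaryPlus 123, parsed 123
// "123 Rue Morgue" -> unaryPlus NaN, parsed 123
// ""               -> unaryPlus 0,   parsed NaN
// "abc"            -> unaryPlus NaN, parsed NaN
```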

Logically, it would seem that using a unary operator (+) is a faster
conversion than parseInt (which examines every character to test whether
a digit or not) -- I've never benchmarked it.

Realize also, that in using the unary operator, it *MUST* appear
immediately adjacent to the value it is being applied to:

var aNumAsString = "23";
var asum = 15 + +aNumAsString;
                ^no space here

otherwise, it will be interpreted as a concatenation or addition
operator (or a syntax error, as in 1 + + 3)

'+' is one of the most "overloaded" operators in the language.

Michael Winter · Jun 11, 2005

On 11/06/2005 01:14, fox wrote:

[snip]

the parenthetic match from the regex is *usually* interpreted as a
string value.

It should always be a string value as regular expressions only operate
on strings (other types are converted, first).

[snip]

The behavior of this technique is different than using parseInt
which *will* return a numerical value if a string *starts* with digit
characters:

parseInt("123 Rue Morgue") => 123

Though in most cases, the parseInt function should be used with its
radix argument:

parseInt('123 Rue Morgue', 10)
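
A short sketch of why the radix matters: without it, some older engines could read a leading zero as octal, so passing the base explicitly keeps the result predictable. The variable names here are only for illustration.

```javascript
// parseInt with an explicit radix.
var asDecimal = parseInt("010", 10);               // 10, the leading zero is ignored
var asHex = parseInt("1A", 16);                    // 26
var asBinary = parseInt("101", 2);                 // 5
var stopsAtSpace = parseInt("123 Rue Morgue", 10); // 123, parsing stops at the space
```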

[snip]

Logically, it would seem that using a unary operator (+) is a faster
conversion than parseInt (which examines every character to test whether
a digit or not)

Both examine characters, but unary plus neither includes a function
call, nor does it have such a complicated algorithm.

Realize also, that in using the unary operator, it *MUST* appear
immediately adjacent to the value it is being applied to:

Nonsense.

[snip]

Mike

Michael Winter · Jun 11, 2005

Andreas Gohr wrote:
[snip]

data = data.replace(/&#(\d+);/g,
function() {
return String.fromCharCode(RegExp.$1);
});


try:

function(wholematch, parenmatch1) {
return String.fromCharCode(+parenmatch1);
}

Or

function() {
return String.fromCharCode(arguments[1]);
}
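
Put together, the callback form might look like the sketch below. The function name decodeNumericEntities and the sample string are assumptions for this example, and it only works where replace accepts a function as its second argument.

```javascript
// Decode decimal entities such as &#220; by passing a callback to replace().
// The callback receives the whole match first, then each parenthesised group.
function decodeNumericEntities(text) {
  return text.replace(/&#(\d+);/g, function (wholeMatch, digits) {
    // digits is a string; convert it explicitly before building the character
    return String.fromCharCode(parseInt(digits, 10));
  });
}

var decoded = decodeNumericEntities("&#220;ber &#38; more");
// decoded is "Über & more"
```

Note that String.fromCharCode works on 16-bit code units, so entities above 65535 need something else (String.fromCodePoint in newer engines).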

Has anyone a crossbrowser solution?

Despite what you might think from this thread, there isn't one really.
The String.prototype.replace method is broken or inadequate in some
browsers (including earlier IE versions). The behaviour of the replace
method can be examined and reimplemented in script, but I haven't done
that as yet.

Mike

Andreas Gohr · Jun 11, 2005

Thank you all for your help and clarification of the + operator

About the crossbrowser replace solution: I'm happy as it works in all
browsers needed. It's used with xmlhttprequest so it concerns modern
browsers only anyway (Firefox, Opera, IE and Konqueror tested so far).

For the curious: it now powers the Spellchecker in DokuWiki
http://wiki.splitbrain.org/wiki:spell_checker

Regards
Andi

Dr John Stockton · Jun 12, 2005

JRS: In article <[email protected]>, dated Fri, 10 Jun
2005 19:14:06, seen in fox

In JavaScript, because data types are "flexible", using
the plus sign (a unary operator) removes any ambiguity as to the type of
the value passed -- if the characters are all digits, then a number will
be interpreted (otherwise, you'll receive a "NaN" result, so, in most
cases, care should be taken to make sure that the string will be all
digits).

The characters can also be well-placed
space tab + - e E x X a A b B c C d D e E f F
though the last 12 are then digits too.
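
A few of those well-placed characters, as a quick sketch (the property names are invented for the example):

```javascript
// Strings that unary plus accepts even though they are not all digits.
var accepted = {
  padded: +" 42\t",    // 42, surrounding whitespace is ignored
  signed: +"-7",       // -7
  plusSign: +"+3",     // 3
  exponent: +"1e3",    // 1000
  hex: +"0x1F"         // 31
};

// A misplaced character still makes the whole conversion fail.
var rejected = +"12px"; // NaN
```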

Logically, it would seem that using a unary operator (+) is a faster
conversion than parseInt (which examines every character to test whether
a digit or not) -- I've never benchmarked it.

Surely the unary operator will fail on reaching the first unacceptable
character, whereas parseInt will succeed at the same point. However, it
is the mechanism for looking up what to do, rather than the doing of it,
which may take more time.

Realize also, that in using the unary operator, it *MUST* appear
immediately adjacent to the value it is being applied to:

var aNumAsString = "23";
var asum = 15 + +aNumAsString;
                ^no space here

otherwise, it will be interpreted as a concatenation or addition
operator (or a syntax error, as in 1 + + 3)

Untrue. 1 + + 3 is 4, as is 1 + - - + 3, at least for me.

However, ISTM that one cannot have two contiguous instances of the same
one of + - but 1+-+-+-+3 happily gives -2. Note that +"+3" is OK.
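
These can be checked directly. In the sketch below the helper throwsOnParse is a made-up name; it uses the Function constructor so that the invalid expression does not stop the rest of the script from loading.

```javascript
// Whitespace between a unary operator and its operand is allowed.
var a = 1 + + 3;         // 4
var b = 1 + - - + 3;     // 4
var c = 1+-+-+-+3;       // -2
var d = 15 + +"23";      // 38
var e = 15 + + "23";     // 38

// Two contiguous plus signs are read as the ++ operator instead.
function throwsOnParse(source) {
  try {
    new Function(source); // only compiles the source, never calls it
    return false;
  } catch (err) {
    return true;
  }
}

var contiguous = throwsOnParse("return 1 ++ 3;"); // true
var spaced = throwsOnParse("return 1 + + 3;");    // false
```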

fox · Jun 13, 2005

Dr John Stockton said:
JRS: In article <[email protected]>, dated Fri, 10 Jun
2005 19:14:06, seen in fox

The characters can also be well-placed
space tab + - e E x X a A b B c C d D e E f F
though the last 12 are then digits too.

Surely the unary operator will fail on reaching the first unacceptable
character, whereas parseInt will succeed at the same point. However, it
is the mechanism for looking up what to do, rather than the doing of it,
which may take more time.

Untrue. 1 + + 3 is 4, as is 1 + - - + 3, at least for me.

However, ISTM that one cannot have two contiguous instances of the same
one of + - but 1+-+-+-+3 happily gives -2. Note that +"+3" is OK.

It appears that I have carried my stricter "C upbringing" into
JavaScript... I apologize for the misstatement (w/r/t JavaScript, that
is). When you know more than a few languages, sometimes the lines are
blurred.

Andreas Gohr · Jun 14, 2005

Hi again!

I declared success too early :-(

This works everywhere - except in Safari:

data = data.replace(/&#(\d+);/g,
function(wholematch, parenmatch1) {
return String.fromCharCode(+parenmatch1);
});

Safari treats this as a simple string instead of executing the function
:-( It works in Konqueror (I thought they both use the same renderer!?)
Has anyone an idea how to get the above working in Safari? Or any other
solution for converting numerical entities back to UTF8?

Regards
Andi

fox · Jun 14, 2005

Andreas said:
Hi again!

I declared success too early :-(

This works everywhere - except in Safari:

data = data.replace(/&#(\d+);/g,
function(wholematch, parenmatch1) {
return String.fromCharCode(+parenmatch1);
});

Safari treats this as a simple string instead of executing the function
:-( It works in Konqueror (I thought they both use the same renderer!?)
Has anyone an idea how to get the above working in Safari? Or any other
solution for converting numerical entities back to UTF8?

I submitted this "bug" to apple -- but ECMA does not "require" the
variation (they just say the implementation MAY supply a function
argument...)

here is another "brute force" method of converting:

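// match() returns null when nothing matches, so fall back to an empty list
var matches = data.match(/&#\d+;?/g) || [];

for(var i = 0; i < matches.length; i++)
{
// strip everything but the digits from this match, then convert
var replacement = String.fromCharCode(matches[i].replace(/\D/g,""));

// no 'g' flag: replace only the first entity still left in data
data = data.replace(/&#\d+;?/,replacement);
}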

i used the '?' on the semi-colon because the semi-colon is optional in
HTML coding (in most browser implementations). you don't need the 'g' on
the replace regex because you're stepping through each match in order.
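
If replacing in place turns out to be fragile (a decoded "&" can end up in front of leftover text like "#65;" and look like a new entity), a different way to avoid the function argument is to walk the matches with exec() and rebuild the string piece by piece. This is a swapped-in technique, not the loop above, and the function name is made up for the sketch.

```javascript
// Decode decimal entities without passing a function to replace().
// Each match is located with exec(), and the output is assembled from the
// untouched text between matches plus the decoded characters.
function decodeEntitiesWithoutCallback(text) {
  var pattern = /&#(\d+);?/g; // the semicolon is optional, as in the loop above
  var pieces = [];
  var lastEnd = 0;
  var match;

  while ((match = pattern.exec(text)) !== null) {
    pieces.push(text.substring(lastEnd, match.index));        // text before the entity
    pieces.push(String.fromCharCode(parseInt(match[1], 10))); // decoded character
    lastEnd = match.index + match[0].length;
  }

  pieces.push(text.substring(lastEnd)); // whatever follows the last entity
  return pieces.join("");
}

var sample = decodeEntitiesWithoutCallback("&#72;i &#38;#65; there");
// sample is "Hi &#65; there": the decoded "&" is not scanned a second time
```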

i did this in a hurry -- hopefully you will not have any *more*
cross-browser issues.

