decoding numeric HTML entities

Discussion in 'Javascript' started by Andreas Gohr, Jun 10, 2005.

  1. Andreas Gohr

    Andreas Gohr Guest

    Hi all!

    I need a way to decode numeric HTML entities (like &#220;) back to
    their UTF-8 characters so I can place them in a textarea. I tried the
    following, but it doesn't work in IE.

    data = data.replace(/&#(\d+);/g,
    function() {
    return String.fromCharCode(RegExp.$1);
    });

    Has anyone a crossbrowser solution?

    Regards
     
    Andreas Gohr, Jun 10, 2005
    #1

  2. fox

    fox Guest

    Andreas Gohr wrote:
    > Hi all!
    >
    > I need a way to decode numeric HTML entities (like &#220;) back to
    > their UTF-8 character to place them into a textarea. I tried the
    > following but it doesn't work in IE.
    >
    > data = data.replace(/&#(\d+);/g,
    > function() {
    > return String.fromCharCode(RegExp.$1);
    > });

    try:

    function(wholematch, parenmatch1) {
    return String.fromCharCode(+parenmatch1);
    }

    >
    > Has anyone a crossbrowser solution?
    >
    > Regards
    >
     
    fox, Jun 11, 2005
    #2

  3. Andreas Gohr

    Andreas Gohr Guest

    It works! You saved my day :)

    Just for my understanding: What does the plus sign do? Is it a typo and
    just happens to work or does it do some magic?

    Thanks again
    Andi
     
    Andreas Gohr, Jun 11, 2005
    #3
  4. fox

    fox Guest

    Andreas Gohr wrote:
    > It works! You saved my day :)
    >
    > Just for my understanding: What does the plus sign do? Is it a typo and
    > just happens to work or does it do some magic?


    the parenthetic match from the regex is *usually* interpreted as a
    string value. In JavaScript, because data types are "flexible", using
    the plus sign (a unary operator) removes any ambiguity as to the type of
    the value passed -- if the characters are all digits, then a number will
    be interpreted (otherwise, you'll receive a "NaN" result, so, in most
    cases, care should be taken to make sure that the string will be all
    digits). The behavior of this technique is different than using parseInt
    which *will* return a numerical value if a string *starts* with digit
    characters:

    parseInt("123 Rue Morgue") => 123

    [ comparatively:
    +("123 Rue Morgue") => NaN
    ]
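
    A minimal sketch of the difference described above, for anyone who
    wants to try it. The sample strings and the helper name convert are
    made up for illustration; only unary plus, parseInt and Number are
    standard.

```javascript
// Compare three string-to-number conversions on the same inputs.
// "samples" and "convert" are illustrative names, not from the thread.
var samples = ["220", "123 Rue Morgue", "", "12.5"];

function convert(s) {
  return {
    unaryPlus: +s,                   // whole string must be numeric, else NaN
    parseIntBase10: parseInt(s, 10), // reads leading digits, ignores the rest
    numberCall: Number(s)            // follows the same rules as unary plus
  };
}

var results = {};
for (var i = 0; i < samples.length; i++) {
  results[samples[i]] = convert(samples[i]);
}

// "220"            -> unaryPlus 220,  parseIntBase10 220
// "123 Rue Morgue" -> unaryPlus NaN,  parseIntBase10 123
// ""               -> unaryPlus 0,    parseIntBase10 NaN
// "12.5"           -> unaryPlus 12.5, parseIntBase10 12
```

    Note the empty string: unary plus turns it into 0 while parseInt
    gives NaN, so neither conversion is a substitute for validating the
    input first.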

    Logically, it would seem that using a unary operator (+) is a faster
    conversion than parseInt (which examines every character to test whether
    a digit or not) -- I've never benchmarked it.

    Realize also, that in using the unary operator, it *MUST* appear
    immediately adjacent to the value it is being applied to:

    var aNumAsString = "23";
    var asum = 15 + +aNumAsString;
                     ^no space here

    otherwise, it will be interpreted as a concatenation or addition
    operator (or a syntax error, as in 1 + + 3)

    '+' is one of the most "overloaded" operators in the language.




    >
    > Thanks again
    > Andi
    >
     
    fox, Jun 11, 2005
    #4
  5. On 11/06/2005 01:14, fox wrote:

    [snip]

    > the parenthetic match from the regex is *usually* interpreted as a
    > string value.


    It should always be a string value, as regular expressions operate only
    on strings (other types are converted first).

    [snip]

    > The behavior of this technique is different than using parseInt
    > which *will* return a numerical value if a string *starts* with digit
    > characters:
    >
    > parseInt("123 Rue Morgue") => 123


    Though in most cases, the parseInt function should be used with its
    radix argument:

    parseInt('123 Rue Morgue', 10)
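
    A short sketch of why the radix matters, assuming nothing beyond the
    standard parseInt. The variable names are illustrative only.

```javascript
// Without a radix, parseInt guesses the base from the string's prefix.
var hexLooking = "0x1F";

var autoDetected = parseInt(hexLooking);      // "0x" prefix selects base 16: 31
var forcedDecimal = parseInt(hexLooking, 10); // stops at "x", reads only "0": 0

// With an explicit radix the result does not depend on the prefix.
var binary = parseInt("1010", 2);             // 10
var entityDigits = parseInt("220", 10);       // 220
```

    Some older engines also treated a leading "0" as octal when no radix
    was given, which is another reason to pass 10 explicitly.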

    [snip]

    > Logically, it would seem that using a unary operator (+) is a faster
    > conversion than parseInt (which examines every character to test whether
    > a digit or not)


    Both examine characters, but unary plus involves neither a function
    call nor such a complicated algorithm.

    > Realize also, that in using the unary operator, it *MUST* appear
    > immediately adjacent to the value it is being applied to:


    Nonsense.

    [snip]

    Mike

    --
    Michael Winter
    Replace ".invalid" with ".uk" to reply by e-mail.
     
    Michael Winter, Jun 11, 2005
    #5
  6. On 11/06/2005 00:04, fox wrote:

    > Andreas Gohr wrote:


    [snip]

    >> data = data.replace(/&#(\d+);/g,
    >> function() {
    >> return String.fromCharCode(RegExp.$1);
    >> });

    >
    > try:
    >
    > function(wholematch, parenmatch1) {
    > return String.fromCharCode(+parenmatch1);
    > }


    Or

    function() {
    return String.fromCharCode(arguments[1]);
    }

    >> Has anyone a crossbrowser solution?


    Despite what you might think from this thread, there isn't one really.
    The String.prototype.replace method is broken or inadequate in some
    browsers (including earlier IE versions). The behaviour of the replace
    method can be examined and reimplemented in script, but I haven't done
    that as yet.
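
    One way this could be approached, as a rough sketch only: test
    whether replace accepts a function, and otherwise walk the matches
    with exec. The names replaceAcceptsFunction, replaceWithCallback and
    decodeNumericEntities are hypothetical helpers, not existing APIs, and
    this has not been tested on the old browsers in question.

```javascript
// Feature test: does replace() call a function passed as its second argument?
function replaceAcceptsFunction() {
  return "a".replace(/a/, function () { return "b"; }) === "b";
}

// Fallback: walk the matches with exec() and build the result by hand.
// "pattern" is assumed to be a regular expression with the g flag set.
function replaceWithCallback(text, pattern, callback) {
  var out = "";
  var last = 0;
  var m;
  pattern.lastIndex = 0;
  while ((m = pattern.exec(text)) !== null) {
    // m[0] is the whole match, m[1]... are the parenthesised groups
    out += text.slice(last, m.index) + callback.apply(null, m);
    last = m.index + m[0].length;
    if (m[0].length === 0) {
      pattern.lastIndex++; // avoid looping forever on empty matches
    }
  }
  return out + text.slice(last);
}

function decodeNumericEntities(text) {
  var pattern = /&#(\d+);/g;
  var decode = function (whole, digits) {
    return String.fromCharCode(+digits);
  };
  return replaceAcceptsFunction()
    ? text.replace(pattern, decode)
    : replaceWithCallback(text, pattern, decode);
}
```

    For example, decodeNumericEntities("&#72;&#105;!") gives "Hi!".
    String.fromCharCode only covers 16-bit code units, so entities above
    65535 would need separate handling.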

    Mike

    --
    Michael Winter
    Replace ".invalid" with ".uk" to reply by e-mail.
     
    Michael Winter, Jun 11, 2005
    #6
  7. Andreas Gohr

    Andreas Gohr Guest

    Thank you all for your help and clarification of the + operator :)

    About the cross-browser replace solution: I'm happy, as it works in all
    the browsers I need. It's used with XMLHttpRequest, so it concerns modern
    browsers only anyway (Firefox, Opera, IE and Konqueror tested so far).

    For the curious: it now powers the Spellchecker in DokuWiki
    http://wiki.splitbrain.org/wiki:spell_checker

    Regards
    Andi
     
    Andreas Gohr, Jun 11, 2005
    #7
  8. JRS: In article <d8dacg$ldc$>, dated Fri, 10 Jun
    2005 19:14:06, seen in news:comp.lang.javascript, fox
    <> posted :

    > In JavaScript, because data types are "flexible", using
    >the plus sign (a unary operator) removes any ambiguity as to the type of
    >the value passed -- if the characters are all digits, then a number will
    >be interpreted (otherwise, you'll receive a "NaN" result, so, in most
    >cases, care should be taken to make sure that the string will be all
    >digits).


    The characters can also be well-placed
    space tab + - e E x X a A b B c C d D e E f F
    though the last 12 are there digits too.


    >Logically, it would seem that using a unary operator (+) is a faster
    >conversion than parseInt (which examines every character to test whether
    >a digit or not) -- I've never benchmarked it.


    Surely the unary operator will fail on reaching the first unacceptable
    character, whereas parseInt will succeed at the same point. However, it
    is the mechanism for looking up what to do, rather than the doing of it,
    which may take more time.


    >Realize also, that in using the unary operator, it *MUST* appear
    >immediately adjacent to the value it is being applied to:
    >
    > var aNumAsString = "23";
    > var asum = 15 + +aNumAsString;
    > ^no space here
    >
    >otherwise, it will be interpreted as a concatenation or addition
    >operator (or a syntax error, as in 1 + + 3)


    Untrue. 1 + + 3 is 4, as is 1 + - - + 3, at least for me.

    However, ISTM that one cannot have two contiguous instances of the same
    one of + - but 1+-+-+-+3 happily gives -2. Note that +"+3" is OK.
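
    The expressions above can be checked with a small sketch like the
    following. The variable names are illustrative, and the last part
    uses the Function constructor only to show that "++" written
    without a space is rejected at parse time.

```javascript
// Whitespace between a binary and a unary operator is allowed.
var spaced = 1 + + 3;      // binary plus, then unary plus: 4
var mixed = 1 + - - + 3;   // 4
var chain = 1+-+-+-+3;     // -2

// Strings that unary plus accepts besides plain digits.
var signedString = +"+3";  // 3
var padded = +" \t42 ";    // surrounding whitespace is ignored: 42
var exponent = +"1e3";     // 1000
var hex = +"0x1F";         // 31

// Two adjacent plus signs are read as the single token "++" (increment),
// so the expression below fails to parse.
var adjacentRejected = false;
try {
  new Function("return 1 ++ 3;");
} catch (e) {
  adjacentRejected = true;
}
```

    So the restriction is about tokenizing "++" and "--", not about
    spaces between a unary operator and its operand.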

    --
    © John Stockton, Surrey, UK. ?@merlyn.demon.co.uk Turnpike v4.00 IE 4 ©
    <URL:http://www.jibbering.com/faq/> JL/RC: FAQ of news:comp.lang.javascript
    <URL:http://www.merlyn.demon.co.uk/js-index.htm> jscr maths, dates, sources.
    <URL:http://www.merlyn.demon.co.uk/> TP/BP/Delphi/jscr/&c, FAQ items, links.
     
    Dr John Stockton, Jun 12, 2005
    #8
  9. fox

    fox Guest

    Dr John Stockton wrote:
    > JRS: In article <d8dacg$ldc$>, dated Fri, 10 Jun
    > 2005 19:14:06, seen in news:comp.lang.javascript, fox
    > <> posted :
    >
    >
    >>In JavaScript, because data types are "flexible", using
    >>the plus sign (a unary operator) removes any ambiguity as to the type of
    >>the value passed -- if the characters are all digits, then a number will
    >>be interpreted (otherwise, you'll receive a "NaN" result, so, in most
    >>cases, care should be taken to make sure that the string will be all
    >>digits).

    >
    >
    > The characters can also be well-placed
    > space tab + - e E x X a A b B c C d D e E f F
    > though the last 12 are there digits too.
    >
    >
    >
    >>Logically, it would seem that using a unary operator (+) is a faster
    >>conversion than parseInt (which examines every character to test whether
    >>a digit or not) -- I've never benchmarked it.

    >
    >
    > Surely the unary operator will fail on reaching the first unacceptable
    > character, whereas parseInt will succeed at the same point. However, it
    > is the mechanism for looking up what to do, rather than the doing of it,
    > which may take more time.
    >
    >
    >
    >>Realize also, that in using the unary operator, it *MUST* appear
    >>immediately adjacent to the value it is being applied to:
    >>
    >> var aNumAsString = "23";
    >> var asum = 15 + +aNumAsString;
    >> ^no space here
    >>
    >>otherwise, it will be interpreted as a concatenation or addition
    >>operator (or a syntax error, as in 1 + + 3)

    >
    >
    > Untrue. 1 + + 3 is 4, as is 1 + - - + 3, at least for me.
    >
    > However, ISTM that one cannot have two contiguous instances of the same
    > one of + - but 1+-+-+-+3 happily gives -2. Note that +"+3" is OK.


    It appears that I have carried my stricter "C upbringing" into
    JavaScript... I apologize for the misstatement (w/r/t JavaScript, that
    is). When you know more than a few languages, sometimes the lines are
    blurred.
     
    fox, Jun 13, 2005
    #9
  10. Andreas Gohr

    Andreas Gohr Guest

    Hi again!

    I declared success too early :-(

    This works everywhere - except in Safari:

    data = data.replace(/&#(\d+);/g,
    function(wholematch, parenmatch1) {
    return String.fromCharCode(+parenmatch1);
    });

    Safari treats this as a simple string instead of executing the function
    :-( It works in Konqueror (I thought they both use the same renderer!?)
    Does anyone have an idea how to get the above working in Safari? Or any
    other solution for converting numeric entities back to UTF-8?

    Regards
    Andi
     
    Andreas Gohr, Jun 14, 2005
    #10
  11. fox

    fox Guest

    Andreas Gohr wrote:
    > Hi again!
    >
    > I declared success too early :-(
    >
    > This works everywhere - except in Safari:
    >
    > data = data.replace(/&#(\d+);/g,
    > function(wholematch, parenmatch1) {
    > return String.fromCharCode(+parenmatch1);
    > });
    >
    > Safari treats this as a simple string instead of executing the function
    > :-( It works in Konqueror (I thought they both use the same renderer!?)
    > Has anyone an idea how to get the above workin in Safari? Or any other
    > solution for converting numerical entities back to UTF8?


    I submitted this "bug" to apple -- but ECMA does not "require" the
    variation (they just say the implementation MAY supply a function
    argument...)


    here is another "brute force" method of converting:

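    // match() returns null when there is nothing to decode
    var matches = data.match(/&#\d+;?/g) || [];

    for (var i = 0; i < matches.length; i++)
    {
        // strip everything but the digits, then convert the code to a character
        var replacement = String.fromCharCode(matches[i].replace(/\D/g, ""));

        // no 'g' flag: only the first remaining entity (matches[i]) is replaced
        data = data.replace(/&#\d+;?/, replacement);
    }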


    I used the '?' on the semicolon because the semicolon is optional in
    HTML (in most browser implementations). You don't need the 'g' flag on
    the replace regex because you're stepping through the matches in order.

    I did this in a hurry -- hopefully you will not have any *more*
    cross-browser issues.




    >
    > Regards
    > Andi
    >
     
    fox, Jun 14, 2005
    #11