replace unicode characters by &#number; representation

Discussion in 'Perl Misc' started by Alan J. Flavell, Feb 21, 2004.

  1. Suppose I've read-in a line of text which can contain a large
    repertoire of characters. Assume that I've done it using Perl's
    native unicode support, a la

    binmode IN, ':encoding(whatever)';

    specifying, of course, the correct external character encoding for
    the input file that I'm reading, whatever it might be.

    I've come up with this regex[1]

    s/([^\0-\177])/'&#'.ord($1).';'/eg;

    to replace all non-ASCII characters by their &#number; representation.

    Is this indeed the simplest approach, or am I missing some simpler
    code than writing ord($1) and using /e to evaluate it?

    (Use \377 if the requirement is to retain iso-8859-1 characters and
    only to convert the rest).

    [1] Yes, I'm aware that - since this appears to be an HTML/XHTML
    problem - then the proper place to do this would be in whatever
    HTML-processing package/module one is using, but please humour me for
    the low-level approach anyway, for the sake of this discussion.
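
    For illustration, a minimal self-contained sketch of the substitution
    described above. The helper name to_ncr and the sample string are
    assumptions made for this example, not part of the original post; the
    regex itself is the one quoted in the question.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for this sketch): replace every
# character above U+007F with a decimal numeric character reference.
sub to_ncr {
    my ($text) = @_;
    $text =~ s/([^\0-\177])/'&#' . ord($1) . ';'/eg;
    return $text;
}

# Sample input: e-acute (U+00E9) and a white smiling face (U+263A).
my $line = "caf\x{E9} \x{263A}";
print to_ncr($line), "\n";   # prints: caf&#233; &#9786;
```

    The result is pure ASCII, so it can be printed without any output
    encoding layer.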
     
    Alan J. Flavell, Feb 21, 2004
    #1

  2. Ben Morrow

    "Alan J. Flavell" wrote:
    >
    > Suppose I've read-in a line of text which can contain a large
    > repertoire of characters. Assume that I've done it using Perl's
    > native unicode support, a la
    >
    > binmode IN, ':encoding(whatever)';
    >
    > specifying, of course, the correct external character encoding for
    > the input file that I'm reading, whatever it might be.
    >
    > I've come up with this regex[1]
    >
    > s/([^\0-\177])/'&#'.ord($1).';'/eg;
    >
    > to replace all non-ASCII characters by their &#number; representation.


    I would have said [^[:ascii:]] was clearer :).

    > Is this indeed the simplest approach, or am I missing some simpler
    > code than writing ord($1) and using /e to evaluate it?
    >
    > (Use \377 if the requirement is to retain iso-8859-1 characters and
    > only to convert the rest).


    I usually use

    use Encode qw/:fallbacks/;

    $PerlIO::encoding::fallback = FB_HTMLCREF;
    binmode STDOUT, ':encoding(ascii)'; # or iso8859-1, or whatever

    which will leave the conversion until the data is output.
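
    As a hedged illustration of the same fallback applied to a single
    string rather than to an output layer: Encode's encode() accepts the
    FB_HTMLCREF constant as its CHECK argument. The helper name
    encode_with_ncr and the sample text are assumptions for this sketch.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Hypothetical helper: encode a character string into $charset and let
# Encode substitute &#NNN; for anything that charset cannot represent.
sub encode_with_ncr {
    my ($charset, $text) = @_;
    return encode($charset, $text, FB_HTMLCREF);
}

my $text = "caf\x{E9} \x{263A}";

# With ascii as the target, both non-ASCII characters become references.
print encode_with_ncr('ascii', $text), "\n";   # prints: caf&#233; &#9786;

# With iso-8859-1, the e-acute survives as the single byte 0xE9 and only
# the smiley is replaced.
my $latin1 = encode_with_ncr('iso-8859-1', $text);
```

    Choosing the target charset here plays the same role as switching
    between \177 and \377 in the regex version.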

    Ben

    [1] NMF

    --
    Heracles: Vulture! Here's a titbit for you / A few dried molecules of the gall
    From the liver of a friend of yours. / Excuse the arrow but I have no spoon.
    (Ted Hughes, [ Heracles shoots Vulture with arrow. Vulture bursts into ]
    /Alcestis/) [ flame, and falls out of sight. ]
     
    Ben Morrow, Feb 21, 2004
    #2

  3. Anno Siegel

    Alan J. Flavell wrote in comp.lang.perl.misc:
    >
    > Suppose I've read-in a line of text which can contain a large
    > repertoire of characters. Assume that I've done it using Perl's
    > native unicode support, a la
    >
    > binmode IN, ':encoding(whatever)';
    >
    > specifying, of course, the correct external character encoding for
    > the input file that I'm reading, whatever it might be.
    >
    > I've come up with this regex[1]
    >
    > s/([^\0-\177])/'&#'.ord($1).';'/eg;
    >
    > to replace all non-ASCII characters by their &#number; representation.


    Considerations aside whether this should be done on I/O level, if what
    you want is the recoded string, there's nothing wrong with it. In
    particular, /e has none of the bad smell of string eval (/ee does,
    a bit). Ben's suggestion about :ascii: is a good one, and I'd space
    out the perl code in s///e as usual, so

    s/([^[:ascii:]])/'&#' . ord( $1) . ';'/eg;

    but that's only stylistics.
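
    A small sketch that checks the two character classes really are
    interchangeable. The helper name count_disagreements and the upper
    bound of U+02FF are arbitrary choices for this example.

```perl
use strict;
use warnings;

# Hypothetical check: count the code points in 0..$max for which the
# octal-range class and the POSIX class give different answers.
sub count_disagreements {
    my ($max) = @_;
    my $diff = 0;
    for my $cp (0 .. $max) {
        my $ch    = chr $cp;
        my $range = $ch =~ /[^\0-\177]/   ? 1 : 0;
        my $posix = $ch =~ /[^[:ascii:]]/ ? 1 : 0;
        $diff++ if $range != $posix;
    }
    return $diff;
}

print count_disagreements(0x2FF), "\n";   # prints: 0
```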

    > Is this indeed the simplest approach, or am I missing some simpler
    > code than writing ord($1) and using /e to evaluate it?


    I can't think of anything more elementary than ord().

    Anno
     
    Anno Siegel, Feb 21, 2004
    #3
  4. On Sat, 21 Feb 2004, Anno Siegel wrote:

    > Alan J. Flavell wrote in comp.lang.perl.misc:
    > >
    > > s/([^\0-\177])/'&#'.ord($1).';'/eg;

    >

    [...]
    > Ben's suggestion about :ascii: is a good one,


    You (both) have a point, though I was comfortable with having the
    ability to switch the upper limit between \177 (ASCII) and \377 (for
    iso-8859-1) in an obvious way.
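
    One way the switchable upper limit could be sketched, assuming a
    lookup table of precompiled patterns. The names %outside and
    to_ncr_keeping are hypothetical and not from the thread.

```perl
use strict;
use warnings;

# Hypothetical table: the characters that fall outside each repertoire.
my %outside = (
    ascii  => qr/[^\0-\177]/,   # anything above U+007F
    latin1 => qr/[^\0-\377]/,   # anything above U+00FF
);

# Replace only the characters outside the chosen repertoire.
sub to_ncr_keeping {
    my ($text, $repertoire) = @_;
    my $re = $outside{$repertoire}
        or die "unknown repertoire '$repertoire'";
    $text =~ s/($re)/'&#' . ord($1) . ';'/eg;
    return $text;
}

my $line = "caf\x{E9} \x{263A}";
print to_ncr_keeping($line, 'ascii'), "\n";   # prints: caf&#233; &#9786;

# With 'latin1' the e-acute is kept and only the smiley is replaced.
my $kept = to_ncr_keeping($line, 'latin1');
```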

    Thanks for the other comments, too.

    all the best
     
    Alan J. Flavell, Feb 22, 2004
    #4
  5. On Sat, 21 Feb 2004, Ben Morrow wrote:

    > I usually use
    >
    > use Encode qw/:fallbacks/;


    Thanks for pointing that out. I wasn't properly aware of the feature.

    > $PerlIO::encoding::fallback = FB_HTMLCREF;
    > binmode STDOUT, ':encoding(ascii)'; # or iso8859-1, or whatever
    >
    > which will leave the conversion until the data is output.


    OK, it looks as if the relevant documentation is in e.g.
    http://www.perldoc.com/perl5.8.0/lib/Encode.html

    Thanks.
     
    Alan J. Flavell, Feb 22, 2004
    #5