replace unicode characters by &#number; representation

Discussion in 'Perl Misc' started by Alan J. Flavell, Feb 21, 2004.

  1. Suppose I've read-in a line of text which can contain a large
    repertoire of characters. Assume that I've done it using Perl's
    native unicode support, a la

    binmode IN, ':encoding(whatever)';

    specifying, of course, the correct external character encoding for
    the input file that I'm reading, whatever it might be.

    I've come up with this regex[1]

    s/([^\0-\177])/'&#'.ord($1).';'/eg;

    to replace all non-ASCII characters by their &#number; representation.

    Is this indeed the simplest approach, or am I missing some simpler
    code than writing ord($1) and using /e to evaluate it?

    (Use \377 if the requirement is to retain iso-8859-1 characters and
    only to convert the rest).

    [1] Yes, I'm aware that - since this appears to be an HTML/XHTML
    problem - then the proper place to do this would be in whatever
    HTML-processing package/module one is using, but please humour me for
    the low-level approach anyway, for the sake of this discussion.
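For reference, the substitution above can be tried in a small standalone script. This is only a sketch: the helper name `to_ncr` and the sample string are made up for illustration, and the input is written with `\x{...}` escapes rather than read through an `:encoding(...)` layer.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread); wraps the substitution from the post.
sub to_ncr {
    my ($text) = @_;
    # Replace every character above U+007F with its decimal &#number; form.
    $text =~ s/([^\0-\177])/'&#' . ord($1) . ';'/eg;
    return $text;
}

# Sample input: e-acute (U+00E9) and the euro sign (U+20AC).
my $line = "caf\x{e9} \x{20ac}5";
print to_ncr($line), "\n";    # prints: caf&#233; &#8364;5
```

Pure ASCII input passes through unchanged, since the character class never matches.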
     
    Alan J. Flavell, Feb 21, 2004
    #1

  2. Ben Morrow

    I would have said [^[:ascii:]] was clearer :).
    I usually use

    use Encode qw/:fallbacks/;

    $PerlIO::encoding::fallback = FB_HTMLCREF;
    binmode STDOUT, ':encoding(ascii)'; # or iso8859-1, or whatever

    which will leave the conversion until the data is output.
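The same fallback can also be applied to a string directly, without going through an output layer. The following is a rough sketch assuming the `encode` function and the `FB_HTMLCREF` constant from the Encode module; the sample string and variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Sample input: i-diaeresis (U+00EF) and a snowman (U+2603).
my $text = "na\x{ef}ve \x{2603}";

# FB_HTMLCREF substitutes &#NNN; for characters the target encoding lacks.
my $ascii  = encode('ascii',      $text, FB_HTMLCREF);
my $latin1 = encode('iso-8859-1', $text, FB_HTMLCREF);

print "$ascii\n";    # prints: na&#239;ve &#9731;
```

With iso-8859-1 as the target, the U+00EF character is kept as a single byte and only the snowman becomes a character reference.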

    Ben

    [1] NMF
     
    Ben Morrow, Feb 21, 2004
    #2

  3. Anno Siegel

    Leaving aside whether this should be done at the I/O level, if what
    you want is the recoded string, there's nothing wrong with it. In
    particular, /e has none of the bad smell of string eval (/ee does,
    a bit). Ben's suggestion about :ascii: is a good one, and I'd space
    out the perl code in s///e as usual, so

    s/([^[:ascii:]])/'&#' . ord( $1) . ';'/eg;

    but that's only stylistics.
    I can't think of anything more elementary than ord().

    Anno
     
    Anno Siegel, Feb 21, 2004
    #3
  4. You (both) have a point, though I was comfortable with having the
    ability to switch the upper limit between \177 (ASCII) and \377 (for
    iso-8859-1) in an obvious way.
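One way to keep that switch explicit is to pass the upper limit as a parameter. This is just a sketch, not anything proposed in the thread: the helper name `ncr_above` and its `$max` argument are hypothetical, and it tests every character rather than using a character class.

```perl
use strict;
use warnings;

# Hypothetical helper: $max is the highest code point left untouched
# (0x7F for ASCII, 0xFF for iso-8859-1).
sub ncr_above {
    my ($text, $max) = @_;
    # /s lets . match newlines too, so every character is examined.
    $text =~ s/(.)/ord($1) > $max ? '&#' . ord($1) . ';' : $1/egs;
    return $text;
}

my $sample = "\x{e9}\x{20ac}";          # e-acute, euro sign
print ncr_above($sample, 0x7F), "\n";   # prints: &#233;&#8364;
```

Calling it with `0xFF` instead leaves the e-acute alone and converts only the euro sign.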

    Thanks for the other comments, too.

    all the best
     
    Alan J. Flavell, Feb 22, 2004
    #4
  5. Thanks for pointing that out. I wasn't properly aware of the feature.
    OK, it looks as if the relevant documentation is in e.g.
    http://www.perldoc.com/perl5.8.0/lib/Encode.html

    Thanks.
     
    Alan J. Flavell, Feb 22, 2004
    #5