replace unicode characters by &#number; representation

Alan J. Flavell

Suppose I've read in a line of text which can contain a large
repertoire of characters. Assume that I've done it using Perl's
native Unicode support, a la

binmode IN, ':encoding(whatever)';

specifying, of course, the correct external character encoding for
the input file that I'm reading, whatever it might be.

I've come up with this regex[1]

s/([^\0-\177])/'&#'.ord($1).';'/eg;

to replace all non-ASCII characters by their &#number; representation.

Is this indeed the simplest approach, or am I missing some simpler
code than writing ord($1) and using /e to evaluate it?

(Use \377 if the requirement is to retain iso-8859-1 characters and
only to convert the rest).

[1] Yes, I'm aware that - since this appears to be an HTML/XHTML
problem - then the proper place to do this would be in whatever
HTML-processing package/module one is using, but please humour me for
the low-level approach anyway, for the sake of this discussion.
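
As a quick check of the substitution above, here is a minimal sketch that wraps it in a subroutine. The name to_ncr and the sample string are made up for illustration; the string is written with \x{...} escapes so that it is already a character string, as it would be after reading through an :encoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above U+007F by a
# decimal numeric character reference.
sub to_ncr {
    my ($text) = @_;
    $text =~ s/([^\0-\177])/'&#'.ord($1).';'/eg;
    return $text;
}

# U+00E9 (e with acute) is 233 and U+20AC (euro sign) is 8364 in decimal.
my $sample = "caf\x{e9} \x{20ac}5";
print to_ncr($sample), "\n";    # prints: caf&#233; &#8364;5
```

Pure ASCII input passes through unchanged, since nothing matches the negated class.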
 
Ben Morrow

Alan J. Flavell said:
Suppose I've read in a line of text which can contain a large
repertoire of characters. Assume that I've done it using Perl's
native Unicode support, a la

binmode IN, ':encoding(whatever)';

specifying, of course, the correct external character encoding for
the input file that I'm reading, whatever it might be.

I've come up with this regex[1]

s/([^\0-\177])/'&#'.ord($1).';'/eg;

to replace all non-ASCII characters by their &#number; representation.

I would have said [^[:ascii:]] was clearer :).

Alan J. Flavell said:
Is this indeed the simplest approach, or am I missing some simpler
code than writing ord($1) and using /e to evaluate it?

(Use \377 if the requirement is to retain iso-8859-1 characters and
only to convert the rest).

I usually use

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(ascii)'; # or iso8859-1, or whatever

which will leave the conversion until the data is output.
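
The same fallback can be tried on a string in memory by passing it as the third argument to Encode's encode(), which makes the effect easy to inspect. This is only a sketch with a made-up sample string. For the file handle variant, it seems a sensible precaution to say "use PerlIO::encoding;" before assigning to $PerlIO::encoding::fallback, on the assumption that the module sets its own default for that variable when it is first loaded.

```perl
use strict;
use warnings;
use Encode qw/encode :fallbacks/;

# Sample character string: U+00E9 (e with acute) and U+20AC (euro sign).
my $text = "\x{e9}\x{20ac}";

# Target ASCII: neither character is encodable, so both fall back
# to numeric character references.
my $ascii = encode('ascii', $text, FB_HTMLCREF);

# Target iso-8859-1: U+00E9 is encodable and becomes the byte 0xE9;
# only U+20AC falls back to a reference.
my $latin1 = encode('iso-8859-1', $text, FB_HTMLCREF);

print "$ascii\n";    # prints: &#233;&#8364;
```

FB_HTMLCREF includes LEAVE_SRC, so $text itself is left untouched and can be encoded twice as shown.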

Ben

[1] NMF
 
Anno Siegel

Alan J. Flavell said:
Suppose I've read in a line of text which can contain a large
repertoire of characters. Assume that I've done it using Perl's
native Unicode support, a la

binmode IN, ':encoding(whatever)';

specifying, of course, the correct external character encoding for
the input file that I'm reading, whatever it might be.

I've come up with this regex[1]

s/([^\0-\177])/'&#'.ord($1).';'/eg;

to replace all non-ASCII characters by their &#number; representation.

Leaving aside whether this should be done at the I/O level, if what
you want is the recoded string, there's nothing wrong with it. In
particular, /e has none of the bad smell of string eval (/ee does,
a bit). Ben's suggestion about :ascii: is a good one, and I'd space
out the Perl code in s///e as usual, so

s/([^[:ascii:]])/'&#' . ord( $1) . ';'/eg;

but that's only stylistics.
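
To back up the claim that the two spellings differ only in style, here is a small sketch that compares both character classes over the first 1024 code points. The range is an arbitrary choice for illustration, not an exhaustive proof.

```perl
use strict;
use warnings;

# Collect every code point on which the two classes disagree.
my @differ = grep {
    my $ch = chr $_;
    ($ch =~ /[^[:ascii:]]/) xor ($ch =~ /[^\0-\177]/);
} 0 .. 0x3FF;

print scalar(@differ), " code points differ\n";    # prints: 0 code points differ
```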

Alan J. Flavell said:
Is this indeed the simplest approach, or am I missing some simpler
code than writing ord($1) and using /e to evaluate it?

I can't think of anything more elementary than ord().

Anno
 
Alan J. Flavell

Alan J. Flavell said:
s/([^\0-\177])/'&#'.ord($1).';'/eg;
[...]

Anno Siegel said:
Ben's suggestion about :ascii: is a good one,

You (both) have a point, though I was comfortable with having the
ability to switch the upper limit between \177 (ASCII) and \377 (for
iso-8859-1) in an obvious way.
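
One way to keep that switch explicit is a small table of precompiled classes, sketched here. The names %beyond and to_ncr_for, and the target labels used as keys, are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical table: for each output target, the class of characters
# that cannot be kept and must become numeric character references.
my %beyond = (
    'ascii'      => qr/[^\0-\177]/,
    'iso-8859-1' => qr/[^\0-\377]/,
);

sub to_ncr_for {
    my ($text, $target) = @_;
    my $re = $beyond{$target} or die "unknown target: $target";
    $text =~ s/($re)/'&#'.ord($1).';'/eg;
    return $text;
}

# U+00E9 fits in iso-8859-1, U+20AC does not.
my $sample = "caf\x{e9} \x{20ac}5";
print to_ncr_for($sample, 'ascii'), "\n";    # prints: caf&#233; &#8364;5
my $latin1 = to_ncr_for($sample, 'iso-8859-1');
```

With the 'iso-8859-1' target, U+00E9 stays in the string as a character and only the euro sign is converted.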

Thanks for the other comments, too.

all the best
 
