Alan J. Flavell
Suppose I've read in a line of text which can contain a large
repertoire of characters. Assume that I've done it using Perl's
native Unicode support, a la
binmode IN, ':encoding(whatever)';
specifying, of course, the correct external character encoding for
the input file that I'm reading, whatever it might be.
I've come up with this regex[1]
s/([^\0-\177])/'&#'.ord($1).';'/eg;
to replace all non-ASCII characters by their decimal numeric
character reference (&#number;) representation.
Is this indeed the simplest approach, or am I missing some simpler
code than writing ord($1) and using /e to evaluate it?
(Use \377 in place of \177 if the requirement is to retain iso-8859-1
characters and to convert only the rest).
[1] Yes, I'm aware that - since this appears to be an HTML/XHTML
problem - then the proper place to do this would be in whatever
HTML-processing package/module one is using, but please humour me for
the low-level approach anyway, for the sake of this discussion.
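
For concreteness, here is a minimal self-contained sketch of the
approach described above. It is only an illustration under stated
assumptions: UTF-8 stands in for "whatever", an in-memory filehandle
stands in for IN, and the helper name to_ncr and its second argument
are hypothetical, not taken from any module.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the substitution from the post.
# A true second argument keeps iso-8859-1 characters (up to \377)
# and converts only the characters beyond that range.
sub to_ncr {
    my ($text, $keep_latin1) = @_;
    if ($keep_latin1) {
        $text =~ s/([^\0-\377])/'&#'.ord($1).';'/eg;
    }
    else {
        $text =~ s/([^\0-\177])/'&#'.ord($1).';'/eg;
    }
    return $text;
}

# Assumption: the external encoding is UTF-8. These are the raw bytes
# for "cafe" with e-acute, a space, a euro sign and "5".
my $bytes = "caf\xc3\xa9 \xe2\x82\xac5\n";

# An in-memory filehandle with an :encoding layer, so that the line
# arrives as decoded characters rather than bytes.
open(my $in, '<:encoding(UTF-8)', \$bytes) or die "open failed: $!";
my $line = <$in>;
close $in;
chomp $line;

print to_ncr($line), "\n";    # prints: caf&#233; &#8364;5
```

With a true second argument the e-acute would be left alone and only
the euro sign converted, which corresponds to the \377 variant.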