replace unicode characters by &#number; representation

Alan J. Flavell · Feb 21, 2004

Suppose I've read in a line of text which can contain a large
repertoire of characters. Assume that I've done it using Perl's
native Unicode support, a la

binmode IN, ':encoding(whatever)';

specifying, of course, the correct external character encoding for
the input file that I'm reading, whatever it might be.

I've come up with this regex[1]

s/([^\0-\177])/'&#'.ord($1).';'/eg;

to replace all non-ASCII characters by their &#number; representation.

Is this indeed the simplest approach, or am I missing some simpler
code than writing ord($1) and using /e to evaluate it?

(Use \377 if the requirement is to retain iso-8859-1 characters and
only to convert the rest).

[1] Yes, I'm aware that - since this appears to be an HTML/XHTML
problem - then the proper place to do this would be in whatever
HTML-processing package/module one is using, but please humour me for
the low-level approach anyway, for the sake of this discussion.
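
As a minimal sketch of the substitution described in this post, the snippet below applies it to a sample string with both upper limits. The sample text and the variable names are made up for illustration; the string stands in for a line already decoded through an :encoding(...) layer.

```perl
use strict;
use warnings;

# A decoded character string, as it would look after reading through
# an :encoding(...) layer: e-acute (U+00E9) and a smiley (U+263A).
my $line = "caf\x{E9} \x{263A}";

# Upper limit \177: everything outside ASCII becomes &#number;
(my $ascii_only = $line) =~ s/([^\0-\177])/'&#'.ord($1).';'/eg;

# Upper limit \377: iso-8859-1 characters are kept, the rest converted
(my $latin1_ok = $line) =~ s/([^\0-\377])/'&#'.ord($1).';'/eg;

print "$ascii_only\n";   # caf&#233; &#9786;
print $latin1_ok eq "caf\x{E9} &#9786;" ? "latin1 kept\n" : "unexpected\n";
```

The second result still contains a raw U+00E9, so it is compared rather than printed, to keep the output independent of the terminal's encoding.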

Ben Morrow · Feb 21, 2004

Alan J. Flavell said:
> Suppose I've read in a line of text which can contain a large
> repertoire of characters. Assume that I've done it using Perl's
> native Unicode support, a la
>
> binmode IN, ':encoding(whatever)';
>
> specifying, of course, the correct external character encoding for
> the input file that I'm reading, whatever it might be.
>
> I've come up with this regex[1]
>
> s/([^\0-\177])/'&#'.ord($1).';'/eg;
>
> to replace all non-ASCII characters by their &#number; representation.

I would have said [^[:ascii:]] was clearer.

> Is this indeed the simplest approach, or am I missing some simpler
> code than writing ord($1) and using /e to evaluate it?
>
> (Use \377 if the requirement is to retain iso-8859-1 characters and
> only to convert the rest).

I usually use

use Encode qw/:fallbacks/;

$PerlIO::encoding::fallback = FB_HTMLCREF;
binmode STDOUT, ':encoding(ascii)'; # or iso8859-1, or whatever

which will leave the conversion until the data is output.

Ben

[1] NMF
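
The same fallback constant can also be handed straight to Encode's encode() function, which makes the effect easy to see without touching any filehandle. This is only a sketch of that string-level variant, not the PerlIO setup from the post; the sample string and variable names are invented for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode :fallbacks);

# Decoded character string: e-acute (U+00E9) and a smiley (U+263A).
my $line = "caf\x{E9} \x{263A}";

# With FB_HTMLCREF, encode() writes &#number; for every character the
# target encoding cannot represent.
my $as_ascii  = encode('ascii',      $line, FB_HTMLCREF);
my $as_latin1 = encode('iso-8859-1', $line, FB_HTMLCREF);

print "$as_ascii\n";   # caf&#233; &#9786;
print $as_latin1 eq "caf\xE9 &#9786;" ? "latin1 kept\n" : "unexpected\n";
```

Choosing the target encoding here plays the role of choosing the upper limit in the regex: with ascii everything above \177 is converted, with iso-8859-1 only the characters beyond \377 are.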

Anno Siegel · Feb 21, 2004

Alan J. Flavell said:
> Suppose I've read in a line of text which can contain a large
> repertoire of characters. Assume that I've done it using Perl's
> native Unicode support, a la
>
> binmode IN, ':encoding(whatever)';
>
> specifying, of course, the correct external character encoding for
> the input file that I'm reading, whatever it might be.
>
> I've come up with this regex[1]
>
> s/([^\0-\177])/'&#'.ord($1).';'/eg;
>
> to replace all non-ASCII characters by their &#number; representation.

Leaving aside whether this should be done at the I/O level, if what
you want is the recoded string, there's nothing wrong with it. In
particular, /e has none of the bad smell of string eval (/ee does,
a bit). Ben's suggestion about [:ascii:] is a good one, and I'd space
out the Perl code in s///e as usual, so

s/([^[:ascii:]])/'&#' . ord( $1) . ';'/eg;

but that's only stylistics.

> Is this indeed the simplest approach, or am I missing some simpler
> code than writing ord($1) and using /e to evaluate it?

I can't think of anything more elementary than ord().

Anno

Alan J. Flavell · Feb 22, 2004

Anno Siegel said:
> Alan J. Flavell said:
> > s/([^\0-\177])/'&#'.ord($1).';'/eg;
>
> [...]
> Ben's suggestion about :ascii: is a good one,

You (both) have a point, though I was comfortable with having the
ability to switch the upper limit between \177 (ASCII) and \377 (for
iso-8859-1) in an obvious way.

Thanks for the other comments, too.

all the best
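
One way to keep the switchable upper limit while hiding the octal escapes is to build the negated range from a parameter. The helper below is a hypothetical sketch (the name to_ncr and its $max argument are not from the thread); it keeps the ord() and /e core discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: $max is the highest code point left untouched.
sub to_ncr {
    my ($text, $max) = @_;
    $max = 0x7F unless defined $max;    # default: keep ASCII only

    # Build the negated class from the limit, e.g. [^\x{0}-\x{7f}]
    my $outside = sprintf '[^\\x{0}-\\x{%x}]', $max;

    $text =~ s/($outside)/'&#' . ord($1) . ';'/eg;
    return $text;
}

print to_ncr("caf\x{E9} \x{263A}"), "\n";   # caf&#233; &#9786;
```

Calling to_ncr($text, 0xFF) corresponds to the \377 variant and leaves iso-8859-1 characters alone.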

Alan J. Flavell · Feb 22, 2004

Ben Morrow said:
> I usually use
>
> use Encode qw/:fallbacks/;

Thanks for pointing that out. I wasn't properly aware of the feature.

> $PerlIO::encoding::fallback = FB_HTMLCREF;
> binmode STDOUT, ':encoding(ascii)'; # or iso8859-1, or whatever
>
> which will leave the conversion until the data is output.

OK, it looks as if the relevant documentation is in, e.g.,
http://www.perldoc.com/perl5.8.0/lib/Encode.html

Thanks.
