help needed making unicode entities

Dan Jacobson · Aug 7, 2003

Why does
use HTML::Entities; use utf8; print HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D");
print
&#xE7;&#xA9;&#x8D;
i.e. three entities, instead of one?

Must I use locale;? In any particular way?

Am I to blame?

Those three bytes represent a Chinese character.

Must I explore pack()?

Not only do I wish to convert one unicode character (three bytes), but
also a whole string of them.

$ perl -v
This is perl, v5.8.0

perldoc Encode's "The UTF-8 flag" holds the answer? And that is what?

perldoc perluniintro isn't helping.

All I want to do is
$ echo '[unicode string]'|perl -plwe 'something;'
and get
&#22823;&#21407;&#38596;&#39340;...
Is that too much to ask?

Alan J. Flavell · Aug 7, 2003

> Why does
> use HTML::Entities; use utf8; print
> HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D"); print
> &#xE7;&#xA9;&#x8D;
> i.e. three entities, instead of one?

I think I'm going to have to leave the author to answer that; but my
question would be, did you have a reason for choosing that particular
solution? All you're trying to do is decode utf-8 and then represent
the answer in decimal.

> Those three bytes represent a Chinese character.

Yup, I could well believe that those three octets taken as utf-8
indeed represent a CJK unified character.

> Must I explore pack()?

Possibly. But why do you want to write out the nitty details of a
utf-8 coded octet stream? What's the _real_ starting point of this
exercise?

> Not only do I wish to convert one unicode character (three bytes), but
> also a whole string of them.

[ into HTML &bignumber; representations, apparently. ]

Starting from what? If you want to read them in, then read them in
(with :utf8 in effect, of course); and then use ord() to find out
what they are.
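A minimal sketch of the read-then-ord() approach described above, assuming the input is valid UTF-8. The in-memory filehandle is only a stand-in for a real file, and the stricter :encoding(UTF-8) layer is used here in place of :utf8; both are assumptions made for illustration, not something from the thread.

```perl
use strict;
use warnings;

# Stand-in for a real UTF-8 file: an in-memory handle over raw octets.
my $octets = "\xE7\xA9\x8D\xE4\xB8\xB9\n";
open my $in, '<:encoding(UTF-8)', \$octets or die "open: $!";

my @codes;
while (my $line = <$in>) {
    chomp $line;
    # Each element is now one decoded character, so ord() is its code point.
    push @codes, map { ord } split //, $line;
}
close $in;

print join(' ', map { sprintf '&#%d;', $_ } @codes), "\n";
```

With a real file, the same layer would go in the mode argument of open() in the same way.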

> Is that too much to ask?

Too much? I don't think so, but maybe the best way to reach a good
answer is to present the actual problem, rather than complaining about
an apparently non-working solution to an only incompletely stated
problem.

The easy way, btw, is to read your utf-8-encoded data into Mozilla,
edit it, and then save it as iso-8859-1-encoded. Mozilla will happily
then convert your CJK characters into &bignumber; representations.
But that's clearly off-topic for here.

Disclaimer: I don't read CJK, and at my time of life I'm probably
unlikely to start; but I'm still interested in the character coding
technology.

Alan J. Flavell · Aug 8, 2003

Let's try again:

> Why does
> use HTML::Entities; use utf8;
> print HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D");
> print
> &#xE7;&#xA9;&#x8D; i.e. three entities, instead of one?

I think the reason is that you've given it three characters, not one.

The effect of "use utf8;" is that when you write an 8-bit character
e.g. \xE7 in your source code, Perl upgrades it to utf-8 instead of
maintaining it as an 8-bit character. So internally it becomes the
pair of octets which represent the Unicode character U+00E7, although
its ord() value is still, of course, hex E7. This is not what you
want.
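The "three characters, not one" point can be checked directly. This is only a small sketch; the variable name is made up for illustration. Each \x escape below 0x100 yields one character whose ord() is that octet value, whether or not "use utf8;" is in effect, since the pragma concerns literal non-ASCII text in the source rather than escapes.

```perl
use strict;
use warnings;

# Three \x escapes give three separate characters,
# each with the ord() value of one octet.
my $s = "\xE7\xA9\x8D";

printf "length: %d\n", length $s;
printf "ords:   %s\n", join ' ', map { ord } split //, $s;
```

So an entity encoder handed this string has no way to know that the three characters were meant as one UTF-8 sequence.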

What it appears you're trying to do is to construct the internal utf-8
representation yourself. I don't know why you'd want to do that, but
as far as I understand it, the following kind of code (I'm doing it
"per pedes" rather than trying any clever shortcuts) could do it.

Disclaimer: I'm still a bit of a beginner at this, but nobody else
seems particularly keen to offer answers in this area, it seems, so
I'm doing my best.

use Encode;

[...]

my $octets;
{
use bytes;
$octets = "\xE7\xA9\x8D";
}

my $string = decode_utf8($octets);

Note that not all octet sequences represent valid utf-8: this call
should throw a warning if an invalid sequence is presented.
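Putting the decode step together with ord() gives a rough sketch of the conversion asked for at the start of the thread. The helper name octets_to_entities is hypothetical, and the choice to emit decimal references only for non-ASCII characters is an assumption; input is assumed to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): decode UTF-8 octets into
# characters, then replace each non-ASCII character with a decimal
# numeric character reference.
sub octets_to_entities {
    my ($octets) = @_;
    my $string = decode('UTF-8', $octets);
    $string =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $string;
}

print octets_to_entities("\xE7\xA9\x8D\xE4\xB8\xB9\xE5\xB0\xBC"), "\n";
```

ASCII characters pass through unchanged, so markup or plain English text mixed into the input is left alone.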

If you want to be quick and dirty, I _think_ you can just set the
internal utf8 flag on your octet-string, taking responsibility
yourself for its validity. Further reading on this is at:

http://www.perldoc.com/perl5.8.0/lib/Encode.html

If you're just trying to compose Unicode characters into your source
code, I suppose you'd be better off using the "wide character"
notation, \x{uuuu} to represent the Unicode character U+uuuu (which
you can look up at the unicode web site, see the URLs I posted on
another recent thread re Japanese), rather than hand-coding utf-8
octets in hex. But then, you didn't explain why or how it arose that
you wanted to start from the latter notation - maybe you have your
own good reasons for wanting that...
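For comparison, a short sketch of the wide-character notation. U+7A4D is the code point that the octets E7 A9 8D decode to; the round trip through Encode is included only to show that the two notations describe the same character.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One character, written by code point rather than by its UTF-8 octets.
my $char = "\x{7A4D}";

printf "length %d, entity &#%d;\n", length $char, ord $char;

# Encoding it back to UTF-8 yields the three octets from the original post.
printf "octets: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $char);
```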

cheers

Alan J. Flavell · Aug 8, 2003

> Works! That was pleasant.

nice to hear ;-)

> Never did figure out how to move the :utf8 inside the program whilst
> maintaining the -ple. perldoc -f open doesn't enlighten.

AIUI your standard input and output are already open; to apply :utf8
semantics to an already-open filehandle you use the extended form of
binmode(). I'm not sure if that's really the answer to your question,
though.

> as a batch job (no mozilla)?

My mention of Mozilla was very much an aside - but if you want to
convert an HTML document from any known coding, into one using a
specific coding - say utf-8 - or using numeric character reference notations, then it's
quite a handy tool, it seems to me, thanks to its syntax-awareness.

But of course something like HTMLtidy, or SP, can do that too. Or XML
tools if you're using XHTML.

> Certainly there is a ready made solution?

As I say, I'm also learning this stuff as I go along, so even if there
*is* one, there's no guarantee I have it at my fingertips. And you
can see for yourself how many other regular contributors here get
involved when the word Unicode is mentioned. Rather few,
unfortunately (which makes me worry a bit...).

cheers

Alan J. Flavell · Aug 10, 2003

> Alan> [In perl] to apply :utf8 semantics to an already-open filehandle
> Alan> you use the extended form of binmode().
>
> perldoc -f binmode has no eye grabbing example.

I'm looking at http://www.perldoc.com/perl5.8.0/pod/func/binmode.html

binmode FILEHANDLE, LAYER

[...]

If LAYER is present it is a single string, but may contain multiple
directives. The directives alter the behaviour of the file handle.
When LAYER is present using binmode on text file makes sense.

To mark FILEHANDLE as UTF-8, use :utf8.

Might not be an "eyegrabbing example", but it seems clear enough to
me, no?

Your "eyegrabbing example" seems to be here:
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Unicode-I-O

and on already open streams, use binmode():

binmode(STDOUT, ":utf8");

I would certainly recommend referring back to both perluniintro and
perlunicode while doing this sort of work - they've helped me, anyhow.

cheers

Dan Jacobson · Aug 11, 2003

Alan> binmode(STDOUT, ":utf8");

Bad news, only the first one works:
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
&#31309;&#20025;&#23612;
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDOUT,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDOUT,":utf8");binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi

Alan J. Flavell · Aug 11, 2003

> Alan> binmode(STDOUT, ":utf8");
>
> Bad news, only the first one works:
> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
> &#31309;&#20025;&#23612;

Seems to be one of the possibilities documented in perlrun, so that's
good.

> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
> &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;

I have to confess, I have no familiarity with the details of this part
of the -p option. I'm really not a great one-liner, I'm afraid.

> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> perl -wple 'binmode(STDOUT,":utf8");s/./"&#".ord($&).";"/eg'
> &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;

Since you're not trying to send any utf-8-encoded characters (other
than those which are trivially us-ascii) to STDOUT, I'm not sure why
you're suggesting binmode(STDOUT, ...) as being possibly relevant.

Well, it looks as if you have one option which works.

I plead lack of knowledge on the other one, but it's at least
plausible that setting binmode on STDIN ought to work. Maybe someone
reading this who understands the -p processing better than I do would
care to comment - maybe even try reporting a bug - or at least getting
it documented in perlrun?

cheers

Dave Weaver · Aug 12, 2003

> Bad news, only the first one works:
>
> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
> &#31309;&#20025;&#23612;
>
> echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
> perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
> &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;

Don't know much about utf8 etc, but try putting the binmode in a BEGIN{}
block, so that it is done immediately and only once (rather than once per
line) :

[davew]% echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'BEGIN{binmode(STDIN,":utf8")};s/./"&#".ord($&).";"/eg'
&#31309;&#20025;&#23612;
[davew]% perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi
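
A sketch of why the placement of binmode matters. perlrun documents -p as wrapping the code in a "while (<>) { ... }" loop, so a binmode() in the loop body runs only after the current line has already been read as raw octets, whereas a BEGIN block runs before any input is read. The example imitates both orders with an in-memory filehandle; that handle and the helper name line_lengths are assumptions for illustration, standing in for STDIN and the one-liner.

```perl
use strict;
use warnings;

# Two lines of UTF-8 octets, one CJK character per line.
my $octets = "\xE7\xA9\x8D\n\xE4\xB8\xB9\n";

# Hypothetical helper: returns the length in characters of each line.
#   $early true  -> binmode before the first read (like BEGIN { binmode ... })
#   $early false -> binmode inside the loop body (like plain -p code)
sub line_lengths {
    my ($early) = @_;
    open my $in, '<', \$octets or die "open: $!";
    binmode($in, ':utf8') if $early;
    my @lengths;
    while (my $line = <$in>) {
        binmode($in, ':utf8') unless $early;   # too late for the current line
        chomp $line;
        push @lengths, length $line;
    }
    close $in;
    return @lengths;
}

print "binmode in loop body: @{[ line_lengths(0) ]}\n";
print "binmode before loop:  @{[ line_lengths(1) ]}\n";
```

With the binmode in the loop body, the first line should come out as three characters and only later lines as one. The test input in this thread was a single line, so that one line was always read before the layer took effect.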
