help needed making unicode entities

Dan Jacobson

Why does
use HTML::Entities; use utf8; print HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D");
print
&#xE7;&#xA9;&#x8D;
i.e. three entities, instead of one?

Must I use locale;? In any particular way?

Am I to blame?

Those three bytes represent a Chinese character.

Must I explore pack()?

Not only do I wish to convert one unicode character (three bytes), but
also a whole string of them.

$ perl -v
This is perl, v5.8.0

perldoc Encode's "The UTF-8 flag" holds the answer? And that is what?

perldoc perluniintro isn't helping.

All I want to do is
$ echo '[unicode string]'|perl -plwe 'something;'
and get
&#22823;&#21407;&#38596;&#39340;...
Is that too much to ask?
 
Alan J. Flavell

Why does
use HTML::Entities; use utf8; print
HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D"); print
&#xE7;&#xA9;&#x8D;
i.e. three entities, instead of one?

I think I'm going to have to leave the author to answer that; but my
question would be, did you have a reason for choosing that particular
solution? All you're trying to do is decode utf-8 and then represent
the answer in decimal.
Those three bytes represent a Chinese character.

Yup, I could well believe that those three octets taken as utf-8
indeed represent a CJK unified character.
Must I explore pack()?

Possibly. But why do you want to write out the nitty details of a
utf-8 coded octet stream? What's the _real_ starting point of this
exercise?
Not only do I wish to convert one unicode character (three bytes), but
also a whole string of them.

[ into HTML &bignumber; representations, apparently. ]

Starting from what? If you want to read them in, then read them in
(with :utf8 in effect, of course); and then use ord() to find out
what they are.
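
A minimal sketch of that read-then-ord() approach, assuming the data really is utf-8. The in-memory filehandle and the variable names are stand-ins invented for this example; a real script would open a file or use STDIN.

```perl
use strict;
use warnings;

# Stand-in for a real input file: the utf-8 octets of three CJK characters.
my $octets = "\xE7\xA9\x8D\xE4\xB8\xB9\xE5\xB0\xBC\n";

# Read through the :utf8 layer, so that each character arrives whole.
open(my $in, '<:utf8', \$octets) or die "cannot open: $!";
my $line = <$in>;
close($in);
chomp($line);

# One decimal character reference per character.
my $entities = join '', map { sprintf('&#%d;', ord($_)) } split(//, $line);
print "$entities\n";
```

With the layer in place, ord() sees three characters rather than nine octets.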
Is that too much to ask?

Too much? I don't think so, but maybe the best way to reach a good
answer is to present the actual problem, rather than complaining about
an apparently non-working solution to an only incompletely stated
problem.

The easy way, btw, is to read your utf-8-encoded data into Mozilla,
edit it, and then save it as iso-8859-1-encoded. Mozilla will happily
then convert your CJK characters into &bignumber; representations.
But that's clearly off-topic for here.

Disclaimer: I don't read CJK, and at my time of life I'm probably
unlikely to start; but I'm still interested in the character coding
technology.
 
Alan J. Flavell

Let's try again:

Why does
use HTML::Entities; use utf8;
print HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D");
print
&#xE7;&#xA9;&#x8D; i.e. three entities, instead of one?

I think the reason is that you've given it three characters, not one.

The escape \xE7 denotes a single character whose ord() value is hex
E7, and "use utf8;" doesn't change that: the pragma only tells Perl
that the source text itself is utf-8-encoded. Internally Perl may
store that character as the pair of octets which represent the
Unicode character U+00E7, but its ord() value is still, of course,
hex E7. So "\xE7\xA9\x8D" is three characters, U+00E7, U+00A9 and
U+008D. This is not what you want.
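
A quick way to check this, as a small sketch (the variable names are made up for the example): ask Perl how long the literal is and what ord() reports for each character.

```perl
use strict;
use warnings;

# Each \xNN escape denotes one character.
my $string = "\xE7\xA9\x8D";

# Collect the ord() value of every character, in hex.
my @codes = map { sprintf("%02X", ord($_)) } split(//, $string);

print length($string), " characters: @codes\n";
```

The length comes out as 3, which is why the entity encoder emits three references.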

What it appears you're trying to do is to construct the internal utf-8
representation yourself. I don't know why you'd want to do that, but
as far as I understand it, the following kind of code (I'm doing it
"per pedes" rather than trying any clever shortcuts) could do it.

Disclaimer: I'm still a bit of a beginner at this, but nobody else
seems particularly keen to offer answers in this area, it seems, so
I'm doing my best.

use Encode;

[...]

my $octets;
{
    use bytes;
    $octets = "\xE7\xA9\x8D";
}

my $string = decode_utf8($octets);

Note that not all octet sequences represent valid utf-8: this call
should throw a warning if an invalid sequence is presented.
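
For a whole string of octets, the same decode step can be followed by ord() on each character. The sketch below assumes you would rather have a hard failure than a warning on bad input, so it passes Encode::FB_CROAK to decode(); the sub name octets_to_entities is invented for this example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Convert a string of utf-8 octets into decimal character references.
# The sub name is made up for this example.
sub octets_to_entities {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input instead of
    # quietly substituting U+FFFD.
    my $string = decode('UTF-8', $octets, Encode::FB_CROAK);
    return join '', map { sprintf('&#%d;', ord($_)) } split(//, $string);
}

print octets_to_entities("\xE7\xA9\x8D"), "\n";
```

Wrapping the call in eval { } lets a script skip or report invalid input rather than abort.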

If you want to be quick and dirty, I _think_ you can just set the
internal utf8 flag on your octet-string, taking responsibility
yourself for its validity. Further reading on this is at:

http://www.perldoc.com/perl5.8.0/lib/Encode.html


If you're just trying to compose Unicode characters into your source
code, I suppose you'd be better off using the "wide character"
notation, \x{uuuu} to represent the Unicode character U+uuuu (which
you can look up at the unicode web site, see the URLs I posted on
another recent thread re Japanese), rather than hand-coding utf-8
octets in hex. But then, you didn't explain why or how it arose that
you wanted to start from the latter notation - maybe you have your
own good reasons for wanting that...
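
For comparison, a short sketch of the wide character notation. The octets E7 A9 8D are the utf-8 form of U+7A4D, so the same character can be written directly as a code point:

```perl
use strict;
use warnings;

# U+7A4D written as a code point, not as its three utf-8 octets.
my $char = "\x{7A4D}";

printf "length %d, ord %d, entity &#%d;\n",
    length($char), ord($char), ord($char);
```

Here length() is 1, so any per-character encoder produces a single reference.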

cheers
 
Alan J. Flavell

Works! That was pleasant.

nice to hear ;-)
Never did figure out how to move the :utf8 inside the program whilst
maintaining the -ple. perldoc -f open doesn't enlighten.

AIUI your standard input and output are already open; to apply :utf8
semantics to an already-open filehandle you use the extended form of
binmode(). I'm not sure if that's really the answer to your question,
though.
as a batch job (no mozilla)?

My mention of Mozilla was very much an aside - but if you want to
convert an HTML document from any known coding, into one using a
specific coding - say utf-8 - or using &#number; notations, then it's
quite a handy tool, it seems to me, thanks to its syntax-awareness.

But of course something like HTMLtidy, or SP, can do that too. Or XML
tools if you're using XHTML.
Certainly there is a ready made solution?

As I say, I'm also learning this stuff as I go along, so even if there
*is* one, there's no guarantee I have it at my fingertips. And you
can see for yourself how many other regular contributors here get
involved when the word Unicode is mentioned. Rather few,
unfortunately (which makes me worry a bit...).

cheers
 
Alan J. Flavell

Alan> [In perl] to apply :utf8 semantics to an already-open filehandle
Alan> you use the extended form of binmode().

perldoc -f binmode has no eye grabbing example.

I'm looking at http://www.perldoc.com/perl5.8.0/pod/func/binmode.html

binmode FILEHANDLE, LAYER

[...]

If LAYER is present it is a single string, but may contain multiple
directives. The directives alter the behaviour of the file handle.
When LAYER is present using binmode on text file makes sense.

To mark FILEHANDLE as UTF-8, use :utf8.

Might not be an "eyegrabbing example", but it seems clear enough to
me, no?

Your "eyegrabbing example" seems to be here:
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Unicode-I-O

and on already open streams, use binmode():

binmode(STDOUT, ":utf8");

I would certainly recommend referring back to both perluniintro and
perlunicode while doing this sort of work - they've helped me, anyhow.

cheers
 
Dan Jacobson

Alan> binmode(STDOUT, ":utf8");

Bad news, only the first one works:
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
&#31309;&#20025;&#23612;
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDOUT,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDOUT,":utf8");binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi
 
Alan J. Flavell

Alan> binmode(STDOUT, ":utf8");

Bad news, only the first one works:
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
&#31309;&#20025;&#23612;

Seems to be one of the possibilities documented in perlrun, so that's
good.
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;

I have to confess, I have no familiarity with the details of this part
of the -p option. I'm really not a great one-liner, I'm afraid.
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDOUT,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;

Since you're not trying to send any utf-8-encoded characters (other
than those which are trivially us-ascii) to STDOUT, I'm not sure why
you're suggesting binmode(STDOUT, ...) as being possibly relevant.

Well, it looks as if you have one option which works.

I plead lack of knowledge on the other one, but it's at least
plausible that setting binmode on STDIN ought to work. Maybe someone
reading this who understands the -p processing better than I do would
care to comment - maybe even try reporting a bug - or at least getting
it documented in perlrun?

cheers
 
Dave Weaver

Bad news, only the first one works:
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
&#31309;&#20025;&#23612;
echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
&#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;

Don't know much about utf8 etc, but try putting the binmode in a BEGIN{}
block, so that it is done immediately and only once (rather than once per
line) :

[davew]% echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
perl -wple 'BEGIN{binmode(STDIN,":utf8")};s/./"&#".ord($&).";"/eg'
&#31309;&#20025;&#23612;
[davew]% perl -v
This is perl, v5.8.0 built for i386-linux-thread-multi
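
Why the BEGIN{} block matters, as far as I understand perlrun: -p wraps the code in a while (<>) loop, so a plain binmode() in the loop body only runs after the first line has already been read as octets. The sketch below imitates both orderings with in-memory filehandles standing in for STDIN; the sub and variable names are invented for the example.

```perl
use strict;
use warnings;

# Encode every character of a line as a decimal character reference.
sub to_entities {
    my ($line) = @_;
    chomp($line);
    return join '', map { sprintf('&#%d;', ord($_)) } split(//, $line);
}

# Stand-in for STDIN: the utf-8 octets of U+7A4D, then a newline.
my $input = "\xE7\xA9\x8D\n";

# Like binmode inside the -p loop body: the line is read first.
open(my $late, '<', \$input) or die "cannot open: $!";
my $late_line = <$late>;
binmode($late, ':utf8');
close($late);

# Like binmode in BEGIN{}: the layer is in place before any read.
open(my $early, '<', \$input) or die "cannot open: $!";
binmode($early, ':utf8');
my $early_line = <$early>;
close($early);

print to_entities($late_line), "\n";   # one reference per octet
print to_entities($early_line), "\n";  # one reference per character
```

With single-line input, as in the echo tests above, the late binmode() never gets a chance to affect anything.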
 