help needed making unicode entities

Discussion in 'Perl Misc' started by Dan Jacobson, Aug 7, 2003.

  1. Dan Jacobson

    Dan Jacobson Guest

    Why does
    use HTML::Entities; use utf8; print HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D");
    print
    &#xE7;&#xA9;&#x8D;
    i.e. three entities, instead of one?

    Must I use locale;? In any particular way?

    Am I to blame?

    Those three bytes represent a Chinese character.

    Must I explore pack()?

    Not only do I wish to convert one unicode character (three bytes), but
    also a whole string of them.

    $ perl -v
    This is perl, v5.8.0

    perldoc Encode's "The UTF-8 flag" holds the answer? And that is what?

    perldoc perluniintro isn't helping.

    All I want to do is
    $ echo '[unicode string]'|perl -plwe 'something;'
    and get
    &#22823;&#21407;&#38596;&#39340;...
    Is that too much to ask?
    Dan Jacobson, Aug 7, 2003
    #1

  2. On Fri, Aug 8, Dan Jacobson inscribed on the eternal scroll:

    > Why does
    > use HTML::Entities; use utf8; print
    > HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D"); print
    > &#xE7;&#xA9;&#x8D;
    > i.e. three entities, instead of one?


    I think I'm going to have to leave the author to answer that; but my
    question would be, did you have a reason for choosing that particular
    solution? All you're trying to do is decode utf-8 and then represent
    the answer in decimal.

    > Those three bytes represent a Chinese character.


    Yup, I could well believe that those three octets taken as utf-8
    indeed represent a CJK unified character.

    > Must I explore pack()?


    Possibly. But why do you want to write out the nitty details of a
    utf-8 coded octet stream? What's the _real_ starting point of this
    exercise?

    > Not only do I wish to convert one unicode character (three bytes), but
    > also a whole string of them.


    [ into HTML &bignumber; representations, apparently. ]

    Starting from what? If you want to read them in, then read them in
    (with :utf8 in effect, of course); and then use ord() to find out
    what they are.
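
A minimal sketch of that read-then-ord() approach, using core Perl only. The helper name to_decimal_entities and the in-memory "file" are assumptions made for illustration; a real program would open an actual file or use STDIN.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): turn every non-ASCII character
# of an already-decoded string into a decimal numeric character reference.
sub to_decimal_entities {
    my ($chars) = @_;
    $chars =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $chars;
}

# Stand-in for a real file: the utf-8 octets of three CJK characters.
my $octets = "\xE7\xA9\x8D\xE4\xB8\xB9\xE5\xB0\xBC\n";

# The :utf8 layer makes readline return characters rather than octets.
open(my $in, '<:utf8', \$octets) or die "cannot open in-memory file: $!";
my $line = <$in>;
close($in);
chomp $line;

my $entities = to_decimal_entities($line);
print "$entities\n";    # &#31309;&#20025;&#23612;
```

ASCII characters are left alone here, which is a design choice of this sketch rather than something the thread asks for.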

    > Is that too much to ask?


    Too much? I don't think so, but maybe the best way to reach a good
    answer is to present the actual problem, rather than complaining about
    an apparently non-working solution to an only incompletely stated
    problem.

    The easy way, btw, is to read your utf-8-encoded data into Mozilla,
    edit it, and then save it as iso-8859-1-encoded. Mozilla will happily
    then convert your CJK characters into &bignumber; representations.
    But that's clearly off-topic for here.

    Disclaimer: I don't read CJK, and at my time of life I'm probably
    unlikely to start; but I'm still interested in the character coding
    technology.
    Alan J. Flavell, Aug 7, 2003
    #2

  3. Let's try again:

    On Fri, Aug 8, Dan Jacobson inscribed on the eternal scroll:

    > Why does
    > use HTML::Entities; use utf8;
    > print HTML::Entities::encode_entities_numeric("\xE7\xA9\x8D");
    > print
    > &#xE7;&#xA9;&#x8D; i.e. three entities, instead of one?


    I think the reason is that you've given it three characters, not one.

    The effect of "use utf8;" is only to tell Perl that the source file
    itself is utf-8 encoded; it doesn't change the meaning of an escape
    such as \xE7. Each \xNN escape produces one character whose ord()
    value is hex NN, so "\xE7\xA9\x8D" is the three characters U+00E7,
    U+00A9 and U+008D, and each of them gets its own entity. This is not
    what you want.

    What it appears you're trying to do is to construct the internal utf-8
    representation yourself. I don't know why you'd want to do that, but
    as far as I understand it, the following kind of code (I'm doing it
    "per pedes" rather than trying any clever shortcuts) could do it.

    Disclaimer: I'm still a bit of a beginner at this, but nobody else
    seems particularly keen to offer answers in this area, it seems, so
    I'm doing my best.

    use Encode;

    [...]

    my $octets;
    {
    use bytes;
    $octets = "\xE7\xA9\x8D";
    }

    my $string = decode_utf8($octets);

    Note that not all octet sequences represent valid utf-8: this call
    should throw a warning if an invalid sequence is presented.
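
How an invalid sequence is reported depends on the CHECK argument given to Encode. As far as I know, current Encode releases silently substitute U+FFFD by default. The following is a sketch assuming you would rather have the program die on bad input, which is what Encode::FB_CROAK asks for. The variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $octets = "\xE7\xA9\x8D";

# decode() may modify its source argument when a CHECK value is given,
# so work on a copy.
my $copy   = $octets;
my $string = decode('UTF-8', $copy, Encode::FB_CROAK);

my $char_count = length($string);                 # 1
my $entity     = sprintf('&#%d;', ord($string));  # &#31309;

# Two continuation octets with no lead octet are not valid utf-8,
# so this decode dies and the eval returns false.
my $bad       = "\xA9\x8D";
my $decoded_ok = eval { decode('UTF-8', $bad, Encode::FB_CROAK); 1 };

print "$entity\n";
print $decoded_ok ? "accepted\n" : "rejected\n";
```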

    If you want to be quick and dirty, I _think_ you can just set the
    internal utf8 flag on your octet-string, taking responsibility
    yourself for its validity. Further reading on this is at:

    http://www.perldoc.com/perl5.8.0/lib/Encode.html


    If you're just trying to compose Unicode characters into your source
    code, I suppose you'd be better off using the "wide character"
    notation, \x{uuuu} to represent the Unicode character U+uuuu (which
    you can look up at the unicode web site, see the URLs I posted on
    another recent thread re Japanese), rather than hand-coding utf-8
    octets in hex. But then, you didn't explain why or how it arose that
    you wanted to start from the latter notation - maybe you have your
    own good reasons for wanting that...
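
To make the difference between the two notations concrete, here is a small sketch comparing them (variable names are illustrative only):

```perl
use strict;
use warnings;

my $wide   = "\x{7A4D}";        # one character, U+7A4D
my $octets = "\xE7\xA9\x8D";    # three characters: U+00E7, U+00A9, U+008D

my $wide_len  = length($wide);      # 1
my $octet_len = length($octets);    # 3

# One character gives one entity.
my $entity = sprintf('&#x%X;', ord($wide));    # &#x7A4D;

# Three characters give three entities, as in the original question.
my $three = join '', map { sprintf('&#x%X;', ord($_)) } split //, $octets;

print "$entity\n";    # &#x7A4D;
print "$three\n";     # &#xE7;&#xA9;&#x8D;
```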

    cheers
    Alan J. Flavell, Aug 8, 2003
    #3
  4. On Fri, Aug 8, Dan Jacobson inscribed on the eternal scroll:

    > Works! That was pleasant.


    nice to hear ;-)

    > Never did figure out how to move the :utf8 inside the program whilst
    > maintaining the -ple. perldoc -f open doesn't enlighten.


    AIUI your standard input and output are already open; to apply :utf8
    semantics to an already-open filehandle you use the extended form of
    binmode(). I'm not sure if that's really the answer to your question,
    though.

    > as a batch job (no mozilla)?


    My mention of Mozilla was very much an aside - but if you want to
    convert an HTML document from any known coding, into one using a
    specific coding - say utf-8 - or using &#number; notations, then it's
    quite a handy tool, it seems to me, thanks to its syntax-awareness.

    But of course something like HTMLtidy, or SP, can do that too. Or XML
    tools if you're using XHTML.

    > Certainly there is a ready made solution?


    As I say, I'm also learning this stuff as I go along, so even if there
    *is* one, there's no guarantee I have it at my fingertips. And you
    can see for yourself how many other regular contributors here get
    involved when the word Unicode is mentioned. Rather few,
    unfortunately (which makes me worry a bit...).

    cheers
    Alan J. Flavell, Aug 8, 2003
    #4
  5. On Sun, Aug 10, Dan Jacobson inscribed on the eternal scroll:

    > Alan> [In perl] to apply :utf8 semantics to an already-open filehandle
    > Alan> you use the extended form of binmode().
    >
    > perldoc -f binmode has no eye grabbing example.


    I'm looking at http://www.perldoc.com/perl5.8.0/pod/func/binmode.html

    binmode FILEHANDLE, LAYER

    [...]

    If LAYER is present it is a single string, but may contain multiple
    directives. The directives alter the behaviour of the file handle.
    When LAYER is present using binmode on text file makes sense.

    To mark FILEHANDLE as UTF-8, use :utf8.

    Might not be an "eyegrabbing example", but it seems clear enough to
    me, no?

    Your "eyegrabbing example" seems to be here:
    http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Unicode-I-O

    and on already open streams, use binmode():

    binmode(STDOUT, ":utf8");

    I would certainly recommend referring back to both perluniintro and
    perlunicode while doing this sort of work - they've helped me, anyhow.
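
The effect of the extended binmode() on an already-open output handle can be seen in a small sketch. An in-memory handle stands in for STDOUT here, purely so that the octets written can be inspected; that substitution is an assumption of the example, not something taken from the documentation quoted above.

```perl
use strict;
use warnings;

# Stand-in for STDOUT: an in-memory handle that is already open
# without any encoding layer.
my $buffer = '';
open(my $out, '>', \$buffer) or die "cannot open in-memory file: $!";

# Extended binmode(): add the :utf8 layer to the open handle.
binmode($out, ':utf8') or die "binmode failed: $!";

print {$out} "\x{7A4D}";    # one character, U+7A4D
close($out);

# The buffer now holds the utf-8 octets of that character.
my $hex = join ' ', map { sprintf('%02X', ord($_)) } split //, $buffer;
print "$hex\n";    # E7 A9 8D
```

Without the binmode() call, printing a character above U+00FF normally draws a "Wide character in print" warning.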

    cheers
    Alan J. Flavell, Aug 10, 2003
    #5
  6. Dan Jacobson

    Dan Jacobson Guest

    Alan> binmode(STDOUT, ":utf8");

    Bad news, only the first one works:
    echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
    &#31309;&#20025;&#23612;
    echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
    &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
    echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    perl -wple 'binmode(STDOUT,":utf8");s/./"&#".ord($&).";"/eg'
    &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
    echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    perl -wple 'binmode(STDOUT,":utf8");binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
    &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;
    perl -v
    This is perl, v5.8.0 built for i386-linux-thread-multi
    Dan Jacobson, Aug 11, 2003
    #6
  7. On Mon, Aug 11, Dan Jacobson inscribed on the eternal scroll:

    > Alan> binmode(STDOUT, ":utf8");
    >
    > Bad news, only the first one works:
    > echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    > PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
    > &#31309;&#20025;&#23612;


    Seems to be one of the possibilities documented in perlrun, so that's
    good.

    > echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    > perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
    > &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;


    I have to confess, I have no familiarity with the details of this part
    of the -p option. I'm really not a great one-liner, I'm afraid.

    > echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    > perl -wple 'binmode(STDOUT,":utf8");s/./"&#".ord($&).";"/eg'
    > &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;


    Since you're not trying to send any utf-8-encoded characters (other
    than those which are trivially us-ascii) to STDOUT, I'm not sure why
    you're suggesting binmode(STDOUT, ...) as being possibly relevant.

    Well, it looks as if you have one option which works.

    I plead lack of knowledge on the other one, but it's at least
    plausible that setting binmode on STDIN ought to work. Maybe someone
    reading this who understands the -p processing better than I do would
    care to comment - maybe even try reporting a bug - or at least getting
    it documented in perlrun?

    cheers
    Alan J. Flavell, Aug 11, 2003
    #7
  8. Dave Weaver

    Dave Weaver Guest

    On Mon, 11 Aug 2003 09:47:43 +0800, Dan Jacobson <> wrote:
    >
    > Bad news, only the first one works:


    > echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    > PERLIO=:utf8 perl -wple 's/./"&#".ord($&).";"/eg'
    > &#31309;&#20025;&#23612;


    > echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    > perl -wple 'binmode(STDIN,":utf8");s/./"&#".ord($&).";"/eg'
    > &#231;&#169;&#141;&#228;&#184;&#185;&#229;&#176;&#188;


    Don't know much about utf8 etc, but try putting the binmode in a BEGIN{}
    block, so that it is done immediately and only once (rather than once per
    line) :

    [davew]% echo =E7=A9=8D=E4=B8=B9=E5=B0=BC|mmencode -u -q|
    perl -wple 'BEGIN{binmode(STDIN,":utf8")};s/./"&#".ord($&).";"/eg'
    &#31309;&#20025;&#23612;
    [davew]% perl -v
    This is perl, v5.8.0 built for i386-linux-thread-multi
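
Why BEGIN{} matters: perlrun documents -p as wrapping the code in "LINE: while (<>) { ... } continue { print or die ... }", so the loop body first runs after a line has already been read. The sketch below imitates both orderings on an in-memory handle. The handle and variable names are assumptions for illustration and are not part of the one-liners above.

```perl
use strict;
use warnings;

my $octets = "\xE7\xA9\x8D\n";    # one line of utf-8 encoded input

# Case 1: binmode runs in the loop body, i.e. after the first read.
open(my $late, '<', \$octets) or die "open failed: $!";
my $late_line = <$late>;          # read as three octets
binmode($late, ':utf8');          # too late for the line already read
chomp $late_line;
close($late);

# Case 2: binmode runs before the first read, as BEGIN{} arranges.
open(my $early, '<', \$octets) or die "open failed: $!";
binmode($early, ':utf8');
my $early_line = <$early>;        # read as one character
chomp $early_line;
close($early);

(my $late_out  = $late_line)  =~ s/(.)/'&#' . ord($1) . ';'/eg;
(my $early_out = $early_line) =~ s/(.)/'&#' . ord($1) . ';'/eg;

print "$late_out\n";     # &#231;&#169;&#141;
print "$early_out\n";    # &#31309;
```

With input of more than one line, the in-loop binmode would take effect from the second line onwards, which makes the problem easy to miss.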


    --
    Cheers,
    Dave
    Dave Weaver, Aug 12, 2003
    #8