How to decode this unicode-hex string

Discussion in 'Perl Misc' started by * Tong *, Feb 25, 2005.

  1. * Tong *

    * Tong * Guest

    Hi,

    When I select from non-English web sites and paste into my emacs,
    sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
    "English" in Big5 encoding.

    I'm wondering how I can decode such strings and return the 8-bit character.

    So far I've been looking into the man pages of the following Perl
    modules and tried each one of them: Unicode::UTF8simple, Unicode::String,
    Unicode::Lite. None of them seems to be able to do that. They handle
    unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
    difference between the two representations is that \u82f1 represents
    one 8-bit character, while in Perl it is represented as two U+00xx values.

    I had also played with tcl decodings, but wasn't successful. Please help.

    Thanks a lot!

    tong

    --
    Tong (remove underscore(s) to reply)
    *niX Power Tools Project: http://xpt.sourceforge.net/
    - All free contribution & collection
    * Tong *, Feb 25, 2005
    #1

  2. phaylon

    phaylon Guest

    * Tong * wrote:

    > I'm wondering how I can decode such strings and return the 8-bit
    > character.


    Sometimes I think all some people read from this group before posting is
    the name. Look at the thread right before yours.

    --
    http://www.dunkelheit.at/

    The eternal mistake of mankind is to set up an attainable ideal.
    -- Aleister Crowley
    phaylon, Feb 25, 2005
    #2

  3. * Tong *

    * Tong * Guest

    On Fri, 25 Feb 2005 17:42:09 +0100, phaylon wrote:

    >> I'm wondering how I can decode such strings and return the 8-bit
    >> character.

    >
    > Sometimes I think all some people read from this group before posting is
    > the name. Look at the thread right before yours.


    Can you at least specify the thread subject if you want to help? Did you
    mean the thread "How to convert latin1 to utf8"? Did you see that I've tried the
    Unicode::String (and much more) before the posting? After all, have you
    read the two threads carefully and seen the giant difference between them?


    --
    Tong (remove underscore(s) to reply)
    *niX Power Tools Project: http://xpt.sourceforge.net/
    - All free contribution & collection
    * Tong *, Feb 25, 2005
    #3
  4. phaylon

    phaylon Guest

    * Tong * wrote:

    > Can you at least specify the thread subject if you want to help?


    No, that's your job. My job is to code. But sometimes I take breaks. And,
    I'm sorry if this is offensive to you, but I'm not willing to spend my
    breaks doing someone else's work.

    > Did you mean the thread "How to convert latin1 to utf8"?


    Bingo.

    > Did you see that I've tried the Unicode::String (and much more) before
    > the posting?


    Yeah. And I said there I would try out Encode, have you done that?

    > After all, have you read the two threads carefully and seen the giant
    > difference between them?


    Nope, clear me up.

    --
    http://www.dunkelheit.at/
    That is not dead, which can eternal lie,
    and with strange aeons even death may die.
    -- H.P. Lovecraft
    phaylon, Feb 25, 2005
    #4
  5. * Tong * wrote:
    > Hi,
    >
    > When I select from non-English web sites and paste into my emacs,
    > sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
    > "English" in Big5 encoding.


    I'm confused. Unicode and Big5 are completely different, aren't they?
    For one thing, Unicode is a character set for which there are several
    encodings, such as UTF-8.

    u82f1 and u6587 are Chinese characters in Unicode. They are within the
    CJK Unified Ideographs 4E00-9FAF.
    http://www.unicode.org/charts/PDF/U4E00.pdf
    Together they form the Chinese word whose English translation is the
    word "English".

    > I'm wondering how I can decode such strings and return the 8-bit character.


    An 8-bit character set would surely not be large enough to contain a
    usable subset of the Chinese ideographs. Big5 has about 13,000
    ideographs. An 8-bit character set has room for 256 at most.

    When you say "the 8 bit character" are you thinking of something like
    the ISO 8859-1 Latin-1 character set?

    Without a Chinese-English dictionary, there's no way to "decode" the two
    Chinese ideographs u82f1 u6587 into the seven English letters u0045 u006e
    u0067 u006c u0069 u0073 u0068.

    > So far I've been looking into the following Perl modules man pages and
    > tried each one of them: Unicode::UTF8simple, Unicode::String,
    > Unicode::Lite. None of them seems to be able to do that. They handle
    > unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
    > difference between the above representation is that,
    > the \u82f1 represent one 8-bit character,


    No it doesn't!

    > while in Perl it is represented in two U+00xx values.


    Two U+00xx values represent *TWO* Latin-1 characters.
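
    The point about Latin-1 can be illustrated with the four values quoted
    above. This is only a sketch: it ASSUMES, purely for illustration, that
    the bytes D6 D0 B9 FA were EUC-CN (GB2312) text, which the thread never
    states. The helper code_points is a made-up name; decode and the
    encoding names come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The four byte values from the quoted "U+00d6 U+00d0 U+00b9 U+00fa".
my $bytes = "\xd6\xd0\xb9\xfa";

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf('U+%04X', ord($_)) } split(//, $string);
}

# Read as ISO 8859-1, every byte becomes one character.
my $as_latin1 = decode('ISO-8859-1', $bytes);

# ASSUMPTION: the bytes are EUC-CN (GB2312); the thread does not say so.
my $as_euc_cn = decode('euc-cn', $bytes);

printf "Latin-1: %d characters: %s\n", length($as_latin1), code_points($as_latin1);
printf "EUC-CN:  %d characters: %s\n", length($as_euc_cn), code_points($as_euc_cn);
```

    Under that assumption the same four bytes are four Latin-1 characters
    but only two Chinese characters, which is why the byte-per-value
    output looked so different from \u82f1.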
    RedGrittyBrick, Feb 25, 2005
    #5
  6. On Fri, 25 Feb 2005, * Tong * wrote:

    > the \u82f1 represent one Chinese character,


    Yes

    > which is in two 8-bit characters


    No way. As written, it's six *characters*. Encoded, it might be
    two *bytes* (depends on the encoding).
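
    A minimal sketch of the characters-versus-bytes distinction, assuming
    the core Encode module and its 'big5-eten' encoding name (from
    Encode::TW). The helper byte_length is a hypothetical name used only
    for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The escape as typed: a backslash, 'u', and four hex digits.
my $escaped = '\u82f1';

# The single character that the escape denotes (U+82F1).
my $char = chr(0x82f1);

# Hypothetical helper: number of bytes a string occupies in an encoding.
sub byte_length {
    my ($encoding, $string) = @_;
    return length(encode($encoding, $string));
}

printf "escape text: %d characters\n", length($escaped);
printf "decoded:     %d character\n",  length($char);
printf "UTF-8:       %d bytes\n", byte_length('UTF-8',     $char);
printf "UTF-16BE:    %d bytes\n", byte_length('UTF-16BE',  $char);
printf "Big5:        %d bytes\n", byte_length('big5-eten', $char);
```

    The byte count only exists once an encoding has been chosen; the
    character count does not change.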

    > Anyway, I figured out a way to do it, without any of the
    > aforementioned unicode packages.


    But you're not going to tell us what it is?
    Alan J. Flavell, Feb 25, 2005
    #6
  7. * Tong *

    * Tong * Guest

    On Fri, 25 Feb 2005 21:42:38 +0000, Alan J. Flavell wrote:

    >> Anyway, I figured out a way to do it, without any of the
    >> aforementioned unicode packages.

    >
    > But you're not going to tell us what it is?


    Well, it actually has nothing to do with unicode. Here is what I did to
    decode such a string:

    perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;
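
    A slightly more defensive variant of the one-liner, offered as a
    sketch rather than a replacement. It assumes the escapes always carry
    exactly four hex digits (so "\u82f1abc" is not swallowed whole by the
    greedy [0-9a-f]+) and that surrogate pairs should be combined into one
    character. The name decode_u_escapes is hypothetical.

```perl
use strict;
use warnings;

# Replace each \uXXXX escape (exactly four hex digits) with the character
# it names. A high/low surrogate pair is combined into one code point.
sub decode_u_escapes {
    my ($text) = @_;

    # First pass: surrogate pairs such as \ud83d\ude00.
    $text =~ s{
        \\u([Dd][89ABab][0-9A-Fa-f]{2})
        \\u([Dd][C-Fc-f][0-9A-Fa-f]{2})
    }{
        chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))
    }gex;

    # Second pass: every remaining four-digit escape.
    $text =~ s{\\u([0-9A-Fa-f]{4})}{chr(hex($1))}ge;

    return $text;
}

# Encode on output so no "Wide character" warning is produced.
binmode STDOUT, ':encoding(UTF-8)';
print decode_u_escapes('\u82f1\u6587'), "\n";
```

    A lone surrogate escape is still passed to chr() unchanged by this
    sketch, so input that may contain one needs an extra check.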


    --
    Tong (remove underscore(s) to reply)
    *niX Power Tools Project: http://xpt.sourceforge.net/
    - All free contribution & collection
    * Tong *, Feb 27, 2005
    #7
  8. On Sun, 27 Feb 2005, * Tong * wrote:

    > > But you're not going to tell us what it is?

    >
    > Well, it actually has nothing to do with unicode.


    Actually, it has a great deal to do with Unicode...

    > Here is what I did to decode such a string:
    >
    > perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;


    Fine. chr(hex($1)) is the Unicode character in question - in Perl's
    native representation.
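
    To get from that native representation to actual bytes, the string has
    to be encoded. The sketch below assumes the core Encode module and its
    'big5-eten' encoding name; the 2>/dev/null in the one-liner presumably
    hides Perl's "Wide character in print" warning, which an output layer
    makes unnecessary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The character string that chr(hex($1)) builds from \u82f1\u6587.
my $word = join '', map { chr(hex($_)) } qw(82f1 6587);

# Turn the characters into bytes for a given output encoding.
# 'big5-eten' is the Big5 name registered by Encode::TW.
my $utf8_bytes = encode('UTF-8',     $word);
my $big5_bytes = encode('big5-eten', $word);

printf "characters: %d\n", length($word);
printf "UTF-8 hex:  %s\n", unpack('H*', $utf8_bytes);   # e88bb1e69687
printf "Big5 bytes: %d\n", length($big5_bytes);

# Decoding the Big5 bytes gives back the same two characters.
print "round trip ok\n" if decode('big5-eten', $big5_bytes) eq $word;

# For filter-style use, an output layer does the encoding on print.
binmode STDOUT, ':encoding(UTF-8)';
print "$word\n";
```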

    Thanks. It just goes to show how seamless Perl's Unicode
    implementation is, when one can use it without even believing in it
    ;-)

    Perhaps our questioner on another thread, who's determined to prevent
    Perl's Unicode from working for him, could take a lesson from this.

    all the best
    Alan J. Flavell, Feb 28, 2005
    #8
