How to decode this unicode-hex string

* Tong *

Hi,

When I select from non-English web sites and paste into my emacs,
sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
"English" in Big5 encoding.

I'm wondering how I can decode such strings and return the 8-bit character.

So far I've been looking into the man pages of the following Perl modules
and tried each one of them: Unicode::UTF8simple, Unicode::String,
Unicode::Lite. None of them seems to be able to do that. They handle
unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
difference between the two representations is that \u82f1 represents
one 8-bit character, while in Perl it is represented as two U+00xx values.

I also played with Tcl's encoding commands, but wasn't successful. Please help.

Thanks a lot!

tong
 
phaylon

* Tong * said:
I'm wondering how I can decode such strings and return the 8-bit
character.

Sometimes I think all some people read from this group before posting is
the name. Look at the thread right before yours.
 
* Tong *

phaylon said:
Sometimes I think all some people read from this group before posting is
the name. Look at the thread right before yours.

Can you at least specify the thread subject if you want to help? Did you
mean the thread "How to convert latin1 to utf8"? Did you see that I tried
Unicode::String (and much more) before posting? After all, have you
read the two threads carefully and seen the giant difference between them?
 
phaylon

* Tong * said:
Can you at least specify the thread subject if you want to help?

No, that's your job. My job is to code. But sometimes I take breaks. And,
I'm sorry if this is offensive to you, but I'm not willing to spend my
breaks doing someone else's work.

Did you mean the thread "How to convert latin1 to utf8"?

Bingo.

Did you see that I've tried the Unicode::String (and much more) before
the posting?

Yeah. And I said there I would try out Encode. Have you done that?

After all, have you read the two threads carefully and seen the giant
difference between them?

Nope. Clear it up for me.
 
RedGrittyBrick

* Tong * said:
Hi,

When I select from non-English web sites and paste into my emacs,
sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
"English" in Big5 encoding.

I'm confused. Unicode and Big5 are completely different, aren't they? For
one thing, Unicode is a character set, for which there are several
encodings such as UTF-8.

U+82F1 and U+6587 are Chinese characters in Unicode. They are within the
CJK Unified Ideographs block, 4E00-9FFF.
http://www.unicode.org/charts/PDF/U4E00.pdf
Together they form the Chinese word whose English translation is the
word "English".

I'm wondering how I can decode such strings and return the 8-bit character.

An 8-bit character set would surely not be large enough to contain a
usable subset of the Chinese ideographs. Big5 has roughly 13,000
ideographs. An 8-bit character set has room for 256 at most.

When you say "the 8 bit character" are you thinking of something like
the ISO 8859-1 Latin-1 character set?

Without a Chinese-English dictionary, there's no way to "decode" the two
Chinese ideographs U+82F1 U+6587 into the seven English letters U+0045
U+006E U+0067 U+006C U+0069 U+0073 U+0068.

So far I've been looking into the man pages of the following Perl modules
and tried each one of them: Unicode::UTF8simple, Unicode::String,
Unicode::Lite. None of them seems to be able to do that. They handle
unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
difference between the two representations is that
the \u82f1 represents one 8-bit character,

No it doesn't!

while in Perl it is represented in two U+00xx values.

Two U+00xx values represent *TWO* Latin-1 characters.
 
Alan J. Flavell

* Tong * said:
the \u82f1 represents one Chinese character,

Yes.

which is in two 8-bit characters

No way. As written, it's six *characters*. Encoded, it might be
two *bytes* (depends on the encoding).

Anyway, I figured out a way to do it, without any of the
aforementioned unicode packages.

But you're not going to tell us what it is?
 
* Tong *

Alan J. Flavell said:
But you're not going to tell us what it is?

Well, it actually has nothing to do with unicode. Here is what I did to
decode such a string:

perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;
 
Alan J. Flavell

* Tong * said:
Well, it actually has nothing to do with unicode.

Actually, it has a great deal to do with Unicode...

Here is what I did to decode such a string:

perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;

Fine. chr(hex($1)) is the Unicode character in question - in Perl's
native representation.

Thanks. It just goes to show how seamless Perl's Unicode
implementation is, when one can use it without even believing in it
;-)

Perhaps our questioner on another thread, who's determined to prevent
Perl's unicode from working for him, could take a lesson from this.

all the best
 
