How to decode this unicode-hex string

* Tong * · Feb 25, 2005

Hi,

When I select from non-English web sites and paste into my emacs,
sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
"English" in Big5 encoding.

I'm wondering how I can decode such strings and return the 8-bit character.

So far I've looked through the man pages of the following Perl modules and
tried each of them: Unicode::UTF8simple, Unicode::String,
Unicode::Lite. None of them seems to be able to do that. They handle
unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
difference between the two representations is that \u82f1 represents
one 8-bit character, while in Perl it is represented as two U+00xx values.

I have also played with Tcl's decoding facilities, but without success. Please help.

Thanks a lot!

tong

phaylon · Feb 25, 2005

* Tong * said:
I'm wondering how I can decode such strings and return the 8-bit
character.

Sometimes I think all some people read from this group before posting is
the name. Look at the thread right before yours.

* Tong * · Feb 25, 2005

Sometimes I think all some people read from this group before posting is
the name. Look at the thread right before yours.

Can you at least specify the thread subject if you want to help? Did you
mean the thread "How to convert latin1 to utf8"? Did you see that I've tried the
Unicode::String (and much more) before the posting? After all, have you
read the two threads carefully and seen the giant difference between them?

phaylon · Feb 25, 2005

* Tong * said:
Can you at least specify the thread subject if you want to help?

No, that's your job. My job is to code. But sometimes I take breaks. And,
I'm sorry if this is offensive to you, but I'm not willing to spend my
breaks doing someone else's work.

Did you mean the thread "How to convert latin1 to utf8"?

Bingo.

Did you see that I've tried the Unicode::String (and much more) before
the posting?

Yeah. And I said there I would try out Encode, have you done that?

After all, have you read the two threads carefully and seen the giant
difference between them?

Nope, clear me up.

RedGrittyBrick · Feb 25, 2005

* Tong * said:
Hi,

When I select from non-English web sites and paste into my emacs,
sometimes I get a unicode-hex string like this: \u82f1\u6587, which was
"English" in Big5 encoding.

I'm confused. Unicode and Big5 are completely different, aren't they? For
one thing, Unicode is a character set that has several encodings, such
as UTF-8.

u82f1 and u6587 are Chinese characters in Unicode. They are within the
CJK Unified Ideographs 4E00-9FAF.
http://www.unicode.org/charts/PDF/U4E00.pdf
Together they form the Chinese word whose English translation is the
word "English".

I'm wondering how I can decode such strings and return the 8-bit character.

An 8-bit character set would surely not be large enough to contain a
usable subset of the Chinese ideographs. Big5 has about 13,000 ideographs.
An 8-bit character set has room for 256 characters at most.

When you say "the 8 bit character" are you thinking of something like
the ISO 8859-1 Latin-1 character set?

Without a Chinese-English dictionary, there's no way to "decode" the two
Chinese ideographs u82f1 u6587 into the seven English letters u0045 u006e
u0067 u006c u0069 u0073 u0068.

So far I've looked through the man pages of the following Perl modules and
tried each of them: Unicode::UTF8simple, Unicode::String,
Unicode::Lite. None of them seems to be able to do that. They handle
unicode-hex strings like this: "U+00d6 U+00d0 U+00b9 U+00fa". The
difference between the two representations is that

\u82f1 represents one 8-bit character,

No it doesn't!

while in Perl it is represented as two U+00xx values.

Two U+00xx values represent *TWO* Latin-1 characters.
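
To make the point concrete, here is a minimal sketch that counts characters in both notations. The helper name chars_from_uplus is made up for this illustration and is not part of any of the modules mentioned above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a "U+00d6 U+00d0"-style list into a Perl
# character string, one character per U+XXXX token.
sub chars_from_uplus {
    my ($text) = @_;
    return join '', map { chr(hex($_)) } $text =~ /U\+([0-9A-Fa-f]+)/g;
}

my $latin = chars_from_uplus('U+00d6 U+00d0 U+00b9 U+00fa');
my $cjk   = chars_from_uplus('U+82f1 U+6587');

# Only counts are printed, so no output encoding is needed here.
printf "U+00xx list: %d characters\n", length($latin);
printf "CJK list:    %d characters\n", length($cjk);
```

Each U+00xx token becomes one character in the Latin-1 range, so four tokens give four characters, while U+82f1 on its own is already a single character.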

Alan J. Flavell · Feb 25, 2005

the \u82f1 represent one Chinese character,

Yes

which is in two 8-bit characters

No way. As written, it's six *characters*. Encoded, it might be
two *bytes* (depends on the encoding).
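
The distinction can be checked with a short sketch using the core Encode module. The helper byte_length_in is a name invented for this example, and the encoding names are the ones Encode ships with (big5-eten is assumed to be the Big5 variant meant in this thread).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes needed to store one character
# in the named encoding.
sub byte_length_in {
    my ($encoding, $char) = @_;
    return length(encode($encoding, $char));
}

my $escape = '\u82f1';      # single quotes: a backslash, 'u' and four hex digits
my $char   = chr(0x82f1);   # the one character that the escape stands for

printf "escape as written: %d characters\n", length($escape);
printf "%-9s %d bytes\n", $_, byte_length_in($_, $char)
    for qw(UTF-8 UTF-16BE big5-eten);
```

The escape text and the character it names are different strings, and the byte count of the character only exists once an encoding has been chosen.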

Anyway, I figured out a way to do it, without any of the
aforementioned Unicode packages.

But you're not going to tell us what it is?

* Tong * · Feb 27, 2005

But you're not going to tell us what it is?

Well, it actually has nothing to do with unicode. Here is what I did to
decode such a string:

perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;

Alan J. Flavell · Feb 28, 2005

Well, it actually has nothing to do with unicode.

Actually, it has a great deal to do with Unicode...

Here is what I did to decode such a string:

perl -pe 's / \\u([0-9a-f]+) / chr(hex($1)) /giex;' 2>/dev/null;

Fine. chr(hex($1)) is the Unicode character in question - in Perl's
native representation.
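
As a hedged variation on the one-liner, the sketch below makes two assumptions that are not in the original post: each escape is exactly four hex digits (so hex letters following an escape are left alone), and the 2>/dev/null is presumably there to hide "Wide character in print" warnings, which go away once the output is encoded explicitly. The sub name decode_u_escapes is invented for this example, and surrogate pairs for code points above U+FFFF are not handled.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: replace each \uXXXX escape (exactly four hex
# digits) with the character it names, using the same chr(hex($1)) idea.
sub decode_u_escapes {
    my ($text) = @_;
    $text =~ s/\\u([0-9A-Fa-f]{4})/chr(hex($1))/ge;
    return $text;
}

my $decoded = decode_u_escapes('\u82f1\u6587');

# Encode on the way out instead of discarding warnings on stderr.
binmode STDOUT, ':raw';
print encode('UTF-8', $decoded), "\n";         # for a UTF-8 terminal
# print encode('big5-eten', $decoded), "\n";   # for a Big5 terminal instead
```

The decoded string holds two characters in Perl's native representation; which bytes reach the terminal depends only on the encoding chosen in the final print.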

Thanks. It just goes to show how seamless Perl's Unicode
implementation is, when one can use it without even believing in it
;-)

Perhaps our questioner on another thread, who's determined to prevent
Perl's unicode from working for him, could take a lesson from this.

all the best
