Convert UTF-8 to Unicode?

Andreas Schmidt · Aug 6, 2003

Hi,

my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
symbol.

However, the unicode for this symbol is 0x20AC.

How can I convert from UTF-8 to Unicode?

I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){
print "used euro";
}

but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...

Thanks for every hint!
Andi

Jürgen Exner · Aug 6, 2003

Andreas said:
my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the
Euro symbol.
However, the unicode for this symbol is 0x20AC.
How can I convert from UTF-8 to Unicode?

Text::Iconv does a good job in converting between pretty much any encoding.

jue

Bart Lateur · Aug 6, 2003

Andreas said:
my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
symbol.

I assume you mean the UTF-8 looks like "\xE2\x82\xAC"?

However, the unicode for this symbol is 0x20AC.

How can I convert from UTF-8 to Unicode?

I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){
print "used euro";
}

but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...

If you're sure the string contains valid UTF-8, all you have to do is
enable the UTF-8 flag of the string. If you're using Perl 5.8.0 or
above, you have the Encode module at your disposal. See the last
section in its POD, "The UTF-8 flag" and "Messing with Perl's
Internals". You'll see the function _utf8_on($scalar) mentioned there.

<http://www.perldoc.com/perl5.8.0/lib/Encode.html>

If you're using a Perl 5.6.x, you can emulate that function using pack()
(most likely it will work for 5.8.x, too):

$utf8 = pack "U0a*", $bytes;

$utf8 will contain a string with exactly the same bytes as $bytes, but
having the UTF-8 flag on.
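
For readers on Perl 5.8 or later, here is a minimal sketch of the same idea using Encode's documented decode() function rather than _utf8_on() or pack(). Note that this swaps in a different technique from the one described above: decode() validates the bytes and returns a character string, instead of just flipping the flag. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three octets the CGI script received for the Euro sign.
my $bytes = "\xE2\x82\xAC";

# Turn the UTF-8 octets into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "octets: %d, characters: %d\n", length($bytes), length($chars);

# The match from the original question now works on the decoded string.
print "used euro\n" if $chars =~ m/\x{20AC}/;
```

This should print "octets: 3, characters: 1" followed by "used euro". Matching against $bytes instead of $chars would not find \x{20AC}, because the undecoded string holds three separate characters with values below 256.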

Alan J. Flavell · Aug 6, 2003

my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
symbol.

Then it would be best to use at least Perl 5.8.0 ...

However, the unicode for this symbol is 0x20AC.

"Unicode" is an abstract concept - an identification of particular
characters with particular integer numbers ("code points" in the
Unicode character set). In order to actually _use_ those abstract
Unicode characters, it's necessary to have a way of representing them.
utf-8 is one particular way of representing them (and it just happens
to be Perl's own internal representation of Unicode, although you
don't need to know that in order to use it). Writing 0x20AC (or,
as the Unicode folks would write it, U+20AC) is just another way of
giving a concrete representation to the abstract characters. None of
them is "Unicode" per se: all of them are representations of Unicode.

How can I convert from UTF-8 to Unicode?

utf-8 already _is_ (a representation of) Unicode.
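
To make the distinction between a code point and its representations concrete, here is a small sketch (assuming Perl 5.8 or later with the Encode module) that takes the single character U+20AC and shows its octets in three encodings. The hash name %octets_for is just an illustrative choice.

```perl
use strict;
use warnings;
use Encode qw(encode);

# One abstract character: EURO SIGN, code point U+20AC.
my $euro = "\x{20AC}";
printf "code point: U+%04X\n", ord $euro;

# The same character in several concrete representations.
my %octets_for;
for my $encoding ('UTF-8', 'UTF-16BE', 'UTF-16LE') {
    my $octets = encode($encoding, $euro);
    $octets_for{$encoding} =
        join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
    printf "%-9s %s\n", $encoding, $octets_for{$encoding};
}
```

The UTF-8 line shows E2 82 AC, the octets from the original question, while UTF-16BE gives 20 AC and UTF-16LE gives AC 20. All three are representations of the same code point.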

I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){

Yup, that's another way of representing Unicode: it's Perl's way of
writing a "wide character" in source code.

Perhaps you could be a bit more precise about how this script
"receives" Unicode characters. Is it reading them directly from a
file (then it's easy in 5.8.0, you just open the file with :utf8), or
is it that you've decoded some HTML form submission data, and got
yourself a string of bytes which contains some utf-8 representations
of characters?

If it's the latter, and you really have to handle this yourself by
hand (it appears that recent versions of CGI.pm handle it for you, but
I have to admit to not trying that myself yet), then I think you want
pack() with a template of U0, as others have said.
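
As a rough sketch of the "by hand" case, the following decodes one percent-encoded form field and then turns the resulting octets into characters. It uses Encode's decode() rather than pack(), and the query string 'price=%E2%82%AC100' is a made-up example, not data from the original script. A real script would normally let CGI.pm or a similar module do the parsing.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical form submission: a Euro sign followed by "100".
my $raw = 'price=%E2%82%AC100';

my ($name, $value) = split /=/, $raw, 2;

# Undo the URL encoding: '+' means space, %XX is one octet.
$value =~ tr/+/ /;
$value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

# $value now holds raw UTF-8 octets; decode them into characters.
my $text = decode('UTF-8', $value);

printf "%s: %d octets, %d characters\n",
    $name, length($value), length($text);
print "used euro\n" if $text =~ m/\x{20AC}/;
```

Here the field is six octets long before decoding and four characters long afterwards, and the regex match succeeds only on the decoded string.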

but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...

Sort-of; but I'd still recommend taking a bit of time out to study
relevant parts of
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html and then
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html

to get a firmer understanding of what's going on, and how it's meant
to be used.

Alan J. Flavell · Aug 6, 2003

utf-8 already _is_ (a representation of) Unicode.

I'm glad to see now that you got much the same answer to this point
when you posted the same question to the German-language Perl group.

But it's not nice to post the same question in several places without
informing the respective participants that you are doing that. It
leads to pointless duplication of effort by people who were trying to
help you.

Ted Zlatanov · Aug 6, 2003

my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the
Euro symbol.

However, the unicode for this symbol is 0x20AC.

How can I convert from UTF-8 to Unicode?

I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){
print "used euro";
}

but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of
course...

Start with "perldoc utf8" and "perldoc perlunicode." That will
probably do a large chunk of what you need.

The detailed answer depends a *lot* on your Perl version, your goals,
and your Unicode programming experience. You can also check CPAN for
UTF-8 modules that may be helpful:

http://search.cpan.org/search?query=UTF-8&mode=all

Ted

Alan J. Flavell · Aug 7, 2003

I have been sent a file in UTF format,

If we're to believe your subject header, it's utf-8 (as opposed to
utf-16LE or utf-16BE or whatever...)

that is a file with UTF characters.

utf-8 is a representation of Unicode characters. I don't know what
the term "UTF characters" would mean.

If I cat(1) the file I correctly see the Japanese characters.

It sounds as if you have a utf-8-capable terminal, then.

How do I display the same characters in Perl?

Not to be too trite, but you'd read them in and then you'd print them
out. Just where are you experiencing a problem?

An "od -x" of the file looks like
this:

0000000 a4e6 e79c a2b4

I think that's OK; I'm not too good with doing utf-8 in my head.

I don't grasp your problem yet. Where did you get so far? Are you
using Perl 5.8 ? Have you read the relevant perldoc pages? Are you
opening output and input with ":utf8"?

Alan J. Flavell · Aug 7, 2003

On Wed, Aug 6, Nigel Horne inscribed on the eternal scroll:

I think that's OK; I'm not too good with doing utf-8 in my head.

The only octets in there which could be the first octet of a utf-8
character are the "e6" and "e7", and, since they are both of the form
"1110xxxx", each would be followed by two non-first octets (see the
utf-8 spec if you don't get this). Non-first octets have to be of the
form "10xxxxxx" i.e. one of 8x, 9x, ax or bx. The bytes appear to be in
the wrong order for that.

I think this is because od printed little-endian 16-bit units instead
of printing bytes in sequence. Could it be that the actual byte
sequence in question is:

e6 a4 9c , e7 b4 a2

If so, then that could indeed be a legal utf-8 sequence, representing
two CJK-unified characters, namely U+691C and U+7D22.

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=691C
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=7d22
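
The reasoning above can be checked with a short sketch, assuming the od output really was three little-endian 16-bit words. It swaps the words back into byte order, classifies each octet by its leading bits, and decodes the result with Encode. The classification only distinguishes the two patterns discussed here, not every possible utf-8 lead octet.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The 16-bit words exactly as "od -x" printed them.
my @words = map { hex $_ } qw(a4e6 e79c a2b4);

# 'v' packs each word low byte first, restoring the file's byte order.
my $bytes  = pack 'v*', @words;
my @octets = unpack 'C*', $bytes;
print join(' ', map { sprintf '%02x', $_ } @octets), "\n";

# 1110xxxx starts a three-octet character; 10xxxxxx is a non-first octet.
for my $octet (@octets) {
    my $kind = ($octet & 0xF0) == 0xE0 ? 'first octet of three'
             : ($octet & 0xC0) == 0x80 ? 'non-first octet'
             :                           'other';
    printf "%02x  %s\n", $octet, $kind;
}

# Decode the octets and list the resulting code points.
my @code_points =
    map { sprintf 'U+%04X', ord $_ } split //, decode('UTF-8', $bytes);
print "@code_points\n";
```

The first line of output is "e6 a4 9c e7 b4 a2" and the last is "U+691C U+7D22", which agrees with the hand calculation.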

I don't read CJK myself, sorry, so this is sheer guesswork, I have no
idea whether it makes sense in the original.

But I come back to the original question. It looks as if in Perl 5.8
you can simply read this in and print it out (having opened the files
with :utf8 if you hope for the data to make any kind of sense in the
program). So, at which point are you experiencing a problem?
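
For what it's worth, a minimal sketch of "read it in and print it out" in Perl 5.8 or later might look like the following. It writes the six octets to a temporary file first so that the example is self-contained, and it uses the stricter :encoding(UTF-8) layer in place of plain :utf8. The temporary file stands in for the file that was sent.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Stand-in for the received file: six raw octets of utf-8.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xE6\xA4\x9C\xE7\xB4\xA2";
close $out or die "Cannot close $path: $!";

# Read the file through a decoding layer, so Perl sees characters.
open my $in, '<:encoding(UTF-8)', $path
    or die "Cannot open $path: $!";
my $text = do { local $/; <$in> };
close $in;

# Encode again on the way out, for a utf-8-capable terminal.
binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters: %s\n", length($text), $text;
```

Inside the program $text holds two characters, so length(), regexes and substr() work per character rather than per octet.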

Philip Newton · Aug 31, 2003

If so, then that could indeed be a legal utf-8 sequence, representing
two CJK-unified characters, namely U+691C and U+7D22.

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=691C
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=7d22

I don't read CJK myself, sorry, so this is sheer guesswork, I have no
idea whether it makes sense in the original.

In Japanese, that would make the word "kensaku", which EDICT translates
as "retrieval (vs), looking up (a word in a dictionary), searching for,
referring to", which looks very sensical to me.

Cheers,
Philip

Alan J. Flavell · Sep 1, 2003

On Thu, 7 Aug 2003 14:09:18 +0200, "Alan J. Flavell" ^^^

In Japanese, that would make the word "kensaku", which EDICT translates
as "retrieval (vs), looking up (a word in a dictionary), searching for,
referring to", which looks very sensical to me.

Well, that was a real slow-burner of a thread, but thanks! ;-))

The O.P. never did come back with any further details, as far
as I can see. I hope he got a workable solution.

cheers
