Convert UTF-8 to Unicode?

  • Thread starter Andreas Schmidt

Andreas Schmidt

Hi,

my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
symbol.

However, the unicode for this symbol is 0x20AC.

How can I convert from UTF-8 to Unicode?

I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){
print "used euro";
}

but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...

Thanks for every hint!
Andi
 
Jürgen Exner

Andreas said:
my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the
Euro symbol.
However, the unicode for this symbol is 0x20AC.
How can I convert from UTF-8 to Unicode?

Text::Iconv does a good job in converting between pretty much any encoding.

jue
 
Bart Lateur

Andreas said:
my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
symbol.

I assume you mean the UTF-8 looks like "\xE2\x82\xAC"?

However, the unicode for this symbol is 0x20AC.

How can I convert from UTF-8 to Unicode?


I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){
print "used euro";
}

but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...

If you're sure the string contains valid UTF-8, all you have to do is
enable the UTF-8 flag of the string. If you're using Perl 5.8.0 or
above, you have the Encode module at your disposal. See the last
section in its POD, "The UTF-8 flag" and "Messing with Perl's
Internals". You'll see the function _utf8_on($scalar) mentioned there.

<http://www.perldoc.com/perl5.8.0/lib/Encode.html>


If you're using Perl 5.6.x, you can emulate that function using pack()
(most likely it will work for 5.8.x, too):

$utf8 = pack "U0a*", $bytes;

$utf8 will contain a string with exactly the same bytes as $bytes, but
having the UTF-8 flag on.
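
For Perl 5.8 and later, a possible alternative to switching the flag by hand is to let Encode's decode() do the conversion, since it also checks the bytes as it goes. The following is only a minimal sketch of that approach, not what the poster above used; the variable names are made up for illustration, and the byte string stands in for whatever the CGI script actually received.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Illustrative input: the three octets from the question.
my $bytes = "\xE2\x82\xAC";

# decode() turns UTF-8 octets into a Perl character string.
my $chars = decode('UTF-8', $bytes);

printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# The test from the question now matches the single character U+20AC.
print "used euro\n" if $chars =~ m/\x{20AC}/;
```

Run as-is, this should report 3 bytes and 1 character and then print "used euro".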
 
Alan J. Flavell

my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
symbol.

Then it would be best to use at least Perl 5.8.0 ...

However, the unicode for this symbol is 0x20AC.

"Unicode" is an abstract concept - an identification of particular
characters with particular integer numbers ("code points" in the
Unicode character set). In order to actually _use_ those abstract
Unicode characters, it's necessary to have a way of representing them.
utf-8 is one particular way of representing them (and it just happens
to be Perl's own internal representation of Unicode, although you
don't need to know that in order to use it). Writing 0x20AC (or,
as the Unicode folks would write it, U+20AC) is just another way of
giving a concrete representation to the abstract characters. None of
them is "Unicode" per se: all of them are representations of Unicode.

How can I convert from UTF-8 to Unicode?

utf-8 already _is_ (a representation of) Unicode.

I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){

Yup, that's another way of representing Unicode: it's Perl's way of
writing a "wide character" in source code.
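
To make the "different representations of the same character" point concrete, here is a small sketch, assuming Perl 5.8 or later with the Encode module. It starts from the \x{20AC} escape and prints both the code point notation and the UTF-8 octets; the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro   = "\x{20AC}";               # one character, written as a Perl escape
my $octets = encode('UTF-8', $euro);   # the same character as UTF-8 bytes

# Format the code point the way the Unicode folks write it.
my $code_point = sprintf 'U+%04X', ord($euro);

# Format each octet of the UTF-8 form as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $octets;

print "$code_point is stored in UTF-8 as: $hex\n";
```

This should print "U+20AC is stored in UTF-8 as: E2 82 AC", which are the three octets from the original question.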

Perhaps you could be a bit more precise about how this script
"receives" Unicode characters. Is it reading them directly from a
file (then it's easy in 5.8.0, you just open the file with :utf8), or
is it that you've decoded some HTML form submission data, and got
yourself a string of bytes which contains some utf-8 representations
of characters?

If it's the latter, and you really have to handle this yourself by
hand (it appears that recent versions of CGI.pm handle it for you, but
I have to admit to not trying that myself yet), then I think you want
pack() with a template of U0, as others have said.
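
If the bytes come from a form submission, they are untrusted, so it may be worth rejecting malformed input rather than assuming it is valid. The sketch below assumes Perl 5.8 or later and uses Encode's decode() with the FB_CROAK check instead of pack(); decode_form_value is a hypothetical helper name, not part of CGI.pm or Encode.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded character string, or undef
# when the octets are not well-formed UTF-8.
sub decode_form_value {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; eval catches that.
    my $chars = eval { Encode::decode('UTF-8', $octets, Encode::FB_CROAK()) };
    return $chars;
}

my $good = decode_form_value("\xE2\x82\xAC");   # the Euro sign
my $bad  = decode_form_value("\xFF\xFE");       # 0xFF never occurs in UTF-8

print defined $good ? "good: accepted\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"  : "bad: rejected\n";
```

The helper works on its own copy of the argument, which matters because decode() may modify its input when a check value is given.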
but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...

Sort-of; but I'd still recommend taking a bit of time out to study
relevant parts of
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html and then
http://www.perldoc.com/perl5.8.0/pod/perlunicode.html

to get a firmer understanding of what's going on, and how it's meant
to be used.
 
Alan J. Flavell

utf-8 already _is_ (a representation of) Unicode.

I'm glad to see now that you got much the same answer to this point
when you posted the same question to the German-language Perl group.

But it's not nice to post the same question in several places without
informing the respective participants that you are doing that. It
leads to pointless duplication of effort by people who were trying to
help you.
 
Ted Zlatanov

my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the
Euro symbol.

However, the unicode for this symbol is 0x20AC.

How can I convert from UTF-8 to Unicode?

I'd like to do sth like:

if( $str =~ m/\x{20AC}/ ){
print "used euro";
}

but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of
course...

Start with "perldoc utf8" and "perldoc perlunicode." That will
probably do a large chunk of what you need.

The detailed answer depends a *lot* on your Perl version, your goals,
and your Unicode programming experience. You can also check CPAN for
UTF-8 modules that may be helpful:

http://search.cpan.org/search?query=UTF-8&mode=all

Ted
 
Alan J. Flavell

I have been sent a file in UTF format,

If we're to believe your subject header, it's utf-8 (as opposed to
utf-16LE or utf-16BE or whatever...)

that is a file with UTF characters.

utf-8 is a representation of Unicode characters. I don't know what
the term "UTF characters" would mean.

If I cat(1) the file I correctly see the Japanese characters.

It sounds as if you have a utf-8-capable terminal, then.

How do I display the same characters in Perl?

Not to be too trite, but you'd read them in and then you'd print them
out. Just where are you experiencing a problem?

An "od -x" of the file looks like
this:

0000000 a4e6 e79c a2b4

I think that's OK; I'm not too good with doing utf-8 in my head.

I don't grasp your problem yet. Where did you get so far? Are you
using Perl 5.8 ? Have you read the relevant perldoc pages? Are you
opening output and input with ":utf8"?
 
Alan J. Flavell

On Wed, Aug 6, Nigel Horne inscribed on the eternal scroll:

I think that's OK; I'm not too good with doing utf-8 in my head.

The only octets in there which could be the first octet of a utf-8
character are the "e6" and "e7", and, since they are both of the form
"1110xxxx", each would be followed by two non-first octets (see the
utf-8 spec if you don't get this). Non-first octets have to be of the
form "10xxxxxx", i.e. one of 8x, 9x, ax or bx. The bytes appear to be in
the wrong order for that.

I think this is because od printed little-endian 16-bit units instead
of printing bytes in sequence. Could it be that the actual byte
sequence in question is:

e6 a4 9c , e7 b4 a2

If so, then that could indeed be a legal utf-8 sequence, representing
two CJK-unified characters, namely U+691C and U+7D22.

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=691C
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=7d22
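
The reasoning above can be checked mechanically. The following is a rough sketch, assuming Perl 5.8 or later and assuming od really did print little-endian 16-bit words: it swaps the words back into byte order and decodes the result. Variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The 16-bit words exactly as "od -x" printed them.
my @words = (0xa4e6, 0xe79c, 0xa2b4);

# pack 'v' writes each word low byte first, which restores the
# byte order of the file on a little-endian machine.
my $bytes = pack 'v*', @words;
my $hex   = join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
print "bytes: $hex\n";

# Decode the octets as UTF-8 and list the resulting code points.
my $chars       = decode('UTF-8', $bytes);
my @code_points = map { sprintf 'U+%04X', ord($_) } split //, $chars;
print "code points: @code_points\n";
```

This should print "bytes: e6 a4 9c e7 b4 a2" followed by "code points: U+691C U+7D22", matching the hand calculation.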

I don't read CJK myself, sorry, so this is sheer guesswork, I have no
idea whether it makes sense in the original.

But I come back to the original question. It looks as if in Perl 5.8
you can simply read this in and print it out (having opened the files
with :utf8 if you hope for the data to make any kind of sense in the
program). So, at which point are you experiencing a problem?
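
As a sketch of "read it in and print it out" in Perl 5.8 or later: the example below reads the six octets through an input layer and writes them through an output layer. To stay self-contained it uses an in-memory filehandle in place of the real file, and it uses the stricter :encoding(UTF-8) layer rather than the :utf8 spelling mentioned above; both choices are assumptions made for this illustration.

```perl
use strict;
use warnings;

# Stand-in for the file on disk: the six octets from the od dump.
my $file_octets = "\xE6\xA4\x9C\xE7\xB4\xA2";

# The input layer decodes UTF-8 octets into characters as they are read.
open my $in, '<:encoding(UTF-8)', \$file_octets or die "open: $!";
my $text = <$in>;
close $in;

# The output layer encodes characters back into UTF-8 octets.
binmode STDOUT, ':encoding(UTF-8)';
printf "%d characters read\n", length $text;
print "$text\n";
```

With a real file, the reference to $file_octets would be replaced by the file name. On a utf-8-capable terminal the second line of output should show the two Japanese characters.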
 
Philip Newton

If so, then that could indeed be a legal utf-8 sequence, representing
two CJK-unified characters, namely U+691C and U+7D22.

http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=691C
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=7d22

I don't read CJK myself, sorry, so this is sheer guesswork, I have no
idea whether it makes sense in the original.

In Japanese, that would make the word "kensaku", which EDICT translates
as "retrieval (vs), looking up (a word in a dictionary), searching for,
referring to", which looks very sensical to me.

Cheers,
Philip
 
Alan J. Flavell

On Thu, 7 Aug 2003 14:09:18 +0200, "Alan J. Flavell" ^^^


In Japanese, that would make the word "kensaku", which EDICT translates
as "retrieval (vs), looking up (a word in a dictionary), searching for,
referring to", which looks very sensical to me.

Well, that was a real slow-burner of a thread, but thanks! ;-))

The O.P. never did come back with any further details, as far
as I can see. I hope he got a workable solution.

cheers
 
