Convert UTF-8 to Unicode?

Discussion in 'Perl Misc' started by Andreas Schmidt, Aug 6, 2003.

  1. Hi,

    my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
    symbol.

    However, the unicode for this symbol is 0x20AC.

    How can I convert from UTF-8 to Unicode?

    I'd like to do something like:

    if( $str =~ m/\x{20AC}/ ){
        print "used euro";
    }

    but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...

    Thanks for every hint!
    Andi
     
    Andreas Schmidt, Aug 6, 2003
    #1

  2. Andreas Schmidt wrote:
    > my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the
    > Euro symbol.
    > However, the unicode for this symbol is 0x20AC.
    > How can I convert from UTF-8 to Unicode?


    Text::Iconv does a good job of converting between pretty much any pair of encodings.

    jue
     
    Jürgen Exner, Aug 6, 2003
    #2

  3. Andreas Schmidt wrote:

    >my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
    >symbol.


    I assume you mean the UTF-8 looks like "\xE2\x82\xAC"?

    >However, the unicode for this symbol is 0x20AC.
    >
    >How can I convert from UTF-8 to Unicode?
    >
    >
    >I'd like to do sth like:
    >
    >if( $str =~ m/\x{20AC}/ ){
    > print "used euro";
    >}
    >
    >but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...


    If you're sure the string contains valid UTF-8, all you have to do is
    enable the UTF-8 flag of the string. If you're using Perl 5.8.0 or
    above, you have the Encode module at your disposal. See the last
    section in its POD, "The UTF-8 flag" and "Messing with Perl's
    Internals". You'll see the function _utf8_on($scalar) mentioned there.

    <http://www.perldoc.com/perl5.8.0/lib/Encode.html>


    If you're using Perl 5.6.x, you can emulate that function using pack()
    (most likely it will work for 5.8.x, too):

    $utf8 = pack "U0a*", $bytes;

    $utf8 will contain a string with exactly the same bytes as $bytes, but
    having the UTF-8 flag on.
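
    For readers on Perl 5.8 or later, here is a minimal sketch of the
    question's regex test. Note that it uses Encode's decode() rather than
    _utf8_on() or pack(): decode() checks that the bytes really are valid
    UTF-8, whereas switching the flag on does not. The variable names and
    the FB_CROAK | LEAVE_SRC check flags are choices made for this
    illustration, not something taken from the original script.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw bytes as a CGI script might receive them: the UTF-8 encoding of U+20AC.
my $bytes = "\xE2\x82\xAC";

# decode() turns the byte string into a character string.
# FB_CROAK makes it die on malformed input; LEAVE_SRC keeps $bytes unmodified.
my $str = decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());

# The decoded string holds one character, so the original test now matches.
my $used_euro = ($str =~ m/\x{20AC}/) ? 1 : 0;
print "used euro\n" if $used_euro;
```

    If the input may be malformed, wrapping the decode() call in eval { }
    lets the script reject the request instead of dying.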

    --
    Bart.
     
    Bart Lateur, Aug 6, 2003
    #3
  4. On Wed, Aug 6, Andreas Schmidt inscribed on the eternal scroll:

    > my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the Euro
    > symbol.


    Then it would be best to use at least Perl 5.8.0 ...

    > However, the unicode for this symbol is 0x20AC.


    "Unicode" is an abstract concept - an identification of particular
    characters with particular integer numbers ("code points" in the
    Unicode character set). In order to actually _use_ those abstract
    Unicode characters, it's necessary to have a way of representing them.
    utf-8 is one particular way of representing them (and it just happens
    to be Perl's own internal representation of Unicode, although you
    don't need to know that in order to use it). Writing 0x20AC (or,
    as the Unicode folks would write it, U+20AC) is just another way of
    giving a concrete representation to the abstract character. None of
    them is "Unicode" per se: all of them are representations of Unicode.
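
    To make the distinction concrete, the sketch below takes the single
    code point U+20AC and prints the bytes that three different encodings
    produce for it. It assumes Perl 5.8 or later with the Encode module;
    the choice of encodings and the %hex variable are just for
    illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $euro = "\x{20AC}";   # one character, code point U+20AC

# Hex dump of the byte sequence each encoding produces for that character.
my %hex;
for my $enc ('UTF-8', 'UTF-16BE', 'UTF-16LE') {
    $hex{$enc} = uc unpack 'H*', encode($enc, $euro);
    printf "%-8s %s\n", $enc, $hex{$enc};
}
```

    The same abstract character comes out as E282AC in UTF-8, 20AC in
    UTF-16BE and AC20 in UTF-16LE.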

    > How can I convert from UTF-8 to Unicode?


    utf-8 already _is_ (a representation of) Unicode.

    > I'd like to do sth like:
    >
    > if( $str =~ m/\x{20AC}/ ){


    Yup, that's another way of representing Unicode: it's Perl's way of
    writing a "wide character" in source code.

    Perhaps you could be a bit more precise about how this script
    "receives" Unicode characters. Is it reading them directly from a
    file (then it's easy in 5.8.0, you just open the file with :utf8), or
    is it that you've decoded some HTML form submission data, and got
    yourself a string of bytes which contains some utf-8 representations
    of characters?

    If it's the latter, and you really have to handle this yourself by
    hand (it appears that recent versions of CGI.pm handle it for you, but
    I have to admit to not trying that myself yet), then I think you want
    pack() with a template of U0, as others have said.
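
    For the file case, a rough sketch of reading through a decoding layer
    might look like this. It assumes Perl 5.8 or later; the scratch file
    made with File::Temp only stands in for real input, and it uses the
    stricter :encoding(UTF-8) layer in place of the :utf8 shorthand.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file holding the raw UTF-8 bytes for U+20AC (illustration only).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "\xE2\x82\xAC\n";
close $out or die "close failed: $!";

# Reopen it with a decoding layer: readline now returns characters, not bytes.
open my $in, '<:encoding(UTF-8)', $path or die "cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

printf "%d character(s), first is U+%04X\n", length($line), ord($line);
```

    Without the layer, the same read would return a three-byte string and
    a match against \x{20AC} would fail.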

    > but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of course...


    Sort-of; but I'd still recommend taking a bit of time out to study
    relevant parts of
    http://www.perldoc.com/perl5.8.0/pod/perluniintro.html and then
    http://www.perldoc.com/perl5.8.0/pod/perlunicode.html

    to get a firmer understanding of what's going on, and how it's meant
    to be used.
     
    Alan J. Flavell, Aug 6, 2003
    #4
  5. On Wed, Aug 6, Alan J. Flavell inscribed on the eternal scroll:

    > On Wed, Aug 6, Andreas Schmidt inscribed on the eternal scroll:
    >
    > > How can I convert from UTF-8 to Unicode?

    >
    > utf-8 already _is_ (a representation of) Unicode.


    I'm glad to see now that you got much the same answer to this point
    when you posted the same question to the German-language Perl group.

    But it's not nice to post the same question in several places without
    informing the respective participants that you are doing that. It
    leads to pointless duplication of effort by people who were trying to
    help you.
     
    Alan J. Flavell, Aug 6, 2003
    #5
  6. On Wed, 06 Aug 2003, Andreas Schmidt wrote:
    > my CGI script receives UTF-8 strings, like "0xE2 0x82 0xAC" for the
    > Euro symbol.
    >
    > However, the unicode for this symbol is 0x20AC.
    >
    > How can I convert from UTF-8 to Unicode?
    >
    > I'd like to do sth like:
    >
    > if( $str =~ m/\x{20AC}/ ){
    > print "used euro";
    >}
    >
    > but first, I have to convert "0xE2 0x82 0xAC" to Unicode, of
    > course...


    Start with "perldoc utf8" and "perldoc perlunicode". That will
    probably do a large chunk of what you need.

    The detailed answer depends a *lot* on your Perl version, your goals,
    and your Unicode programming experience. You can also check CPAN for
    UTF-8 modules that may be helpful:

    http://search.cpan.org/search?query=UTF-8&mode=all

    Ted
     
    Ted Zlatanov, Aug 6, 2003
    #6
  7. Re: Convert UTF-8 to unicode?

    On Wed, Aug 6, Nigel Horne inscribed on the eternal scroll:

    > I have been sent a file in UTF format,


    If we're to believe your subject header, it's utf-8 (as opposed to
    utf-16LE or utf-16BE or whatever...)

    > that is a file with UTF characters.


    utf-8 is a representation of Unicode characters. I don't know what
    the term "UTF characters" would mean.

    > If I cat(1) the file I correctly see the Japanese characters.


    It sounds as if you have a utf-8-capable terminal, then.

    > How do I display the same characters in Perl?


    Not to be too trite, but you'd read them in and then you'd print them
    out. Just where are you experiencing a problem?

    > An "od -x" of the file looks like
    > this:
    >
    > 0000000 a4e6 e79c a2b4


    I think that's OK; I'm not too good with doing utf-8 in my head.

    I don't grasp your problem yet. Where did you get so far? Are you
    using Perl 5.8 ? Have you read the relevant perldoc pages? Are you
    opening output and input with ":utf8"?
     
    Alan J. Flavell, Aug 7, 2003
    #7
  8. Re: Convert UTF-8 to unicode?

    On Thu, Aug 7, Alan J. Flavell inscribed on the eternal scroll:

    > On Wed, Aug 6, Nigel Horne inscribed on the eternal scroll:


    > > An "od -x" of the file looks like
    > > this:
    > >
    > > 0000000 a4e6 e79c a2b4

    >
    > I think that's OK; I'm not too good with doing utf-8 in my head.


    The only octets in there which could be the first octet of a utf-8
    character are the "e6" and "e7", and, since they are both of the form
    "1110xxxx", each would be followed by two non-first octets (see the
    utf-8 spec if you don't get this). Non-first octets have to be of the
    form "10xxxxxx" i.e. one of 8x, 9x, ax or bx. The bytes appear to be in
    the wrong order for that.

    I think this is because od printed little-endian 16-bit units instead
    of printing bytes in sequence. Could it be that the actual byte
    sequence in question is:

    e6 a4 9c , e7 b4 a2

    If so, then that could indeed be a legal utf-8 sequence, representing
    two CJK-unified characters, namely U+691C and U+7D22.

    http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=691C
    http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=7d22
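
    The arithmetic above can be checked with a short sketch. It assumes
    the od output came from a little-endian machine, as guessed, and it
    handles only three-byte sequences. It is a hand decoder for
    illustration (no checks for overlong forms or surrogates), not a
    replacement for Encode.

```perl
use strict;
use warnings;

# 16-bit words exactly as "od -x" printed them (assumed little-endian host).
my @words = qw(a4e6 e79c a2b4);

# od shows the high byte of each word first; in the file the low byte comes first.
my @bytes = map { my $w = hex($_); ($w & 0xFF, $w >> 8) } @words;

# Decode three-byte sequences by hand: 1110xxxx 10xxxxxx 10xxxxxx.
my @queue = @bytes;
my @code_points;
while (@queue) {
    my ($lead, $c1, $c2) = splice @queue, 0, 3;
    die "not a three-byte lead octet" unless ($lead & 0xF0) == 0xE0;
    die "bad continuation octet"
        unless ($c1 & 0xC0) == 0x80 && ($c2 & 0xC0) == 0x80;
    # 4 payload bits from the lead octet, 6 from each continuation octet.
    push @code_points, (($lead & 0x0F) << 12) | (($c1 & 0x3F) << 6) | ($c2 & 0x3F);
}

printf "bytes: %s\n", join ' ', map { sprintf '%02x', $_ } @bytes;
printf "U+%04X\n", $_ for @code_points;
```

    This prints the byte sequence e6 a4 9c e7 b4 a2 followed by U+691C and
    U+7D22.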

    I don't read CJK myself, sorry, so this is sheer guesswork, I have no
    idea whether it makes sense in the original.

    But I come back to the original question. It looks as if in Perl 5.8
    you can simply read this in and print it out (having opened the files
    with :utf8 if you hope for the data to make any kind of sense in the
    program). So, at which point are you experiencing a problem?
     
    Alan J. Flavell, Aug 7, 2003
    #8
  9. Re: Convert UTF-8 to unicode?

    On Thu, 7 Aug 2003 14:09:18 +0200, "Alan J. Flavell" wrote:

    > If so, then that could indeed be a legal utf-8 sequence, representing
    > two CJK-unified characters, namely U+691C and U+7D22.
    >
    > http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=691C
    > http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=7d22
    >
    > I don't read CJK myself, sorry, so this is sheer guesswork, I have no
    > idea whether it makes sense in the original.


    In Japanese, that would make the word "kensaku", which EDICT translates
    as "retrieval (vs), looking up (a word in a dictionary), searching for,
    referring to", which looks very sensical to me.

    Cheers,
    Philip
    --
    Philip Newton <>
    That really is my address; no need to remove anything to reply.
    If you're not part of the solution, you're part of the precipitate.
     
    Philip Newton, Aug 31, 2003
    #9
  10. Re: Convert UTF-8 to unicode?

    On Sun, Aug 31, Philip Newton inscribed on the eternal scroll:

    > On Thu, 7 Aug 2003 14:09:18 +0200, "Alan J. Flavell"

    ^^^

    > > I don't read CJK myself, sorry, so this is sheer guesswork, I have no
    > > idea whether it makes sense in the original.

    >
    > In Japanese, that would make the word "kensaku", which EDICT translates
    > as "retrieval (vs), looking up (a word in a dictionary), searching for,
    > referring to", which looks very sensical to me.


    Well, that was a real slow-burner of a thread, but thanks! ;-))

    The O.P. never did come back with any further details, as far
    as I can see. I hope he got a workable solution.

    cheers

    --
    The following corrective action will be
    taken in 0 milliseconds: No action
    - seen in Win2K event viewer
     
    Alan J. Flavell, Sep 1, 2003
    #10
