unicode: is decode-process-encode a "good" approach?

Discussion in 'Perl Misc' started by peter pilsl, Sep 28, 2004.

  1. peter pilsl

    peter pilsl Guest

    Thanks to Alan and Shawn for their replies to my last posting. I read a lot
    of docs before and after, and still do, but it's all very confusing.

    Finally I found an approach that actually works for me, and I wanted
    to ask you whether it makes sense and *might* even keep working, or
    whether it is just asking for trouble.

    I read parameters delivered by the web browser (the HTML header always
    declares UTF-8 !!), and want to sort and lowercase them and print them
    out again. I don't set STDIN and STDOUT to ":utf8", because this does not
    work with mod_perl.


    .....
    my $input = $cgi->param('myfield');
    utf8::decode($input);
    utf8::downgrade($input);  # otherwise sort will not sort according to
                              # my LC_COLLATE setting, and I need a
                              # localized sort (mainly German data)

    my $value = do_a_lot($input);  # do some data processing including sorting

    utf8::upgrade($value);    # otherwise the lc() in the next line would
                              # not lowercase chars like German umlauts
    $value = lc($value);
    utf8::downgrade($value);  # to make sort work again

    $value = do_a_lot_more($value);  # do some more data processing and sorting

    utf8::encode($value);
    print $value;


    So is it OK to get the data somehow "raw" from the web interface, then
    decode it, process it and encode it again to print it out, or is this a
    rather stupid approach?

    Is it normal that I need to decode values delivered by a webpage that
    has the UTF-8 charset in its header?

    Is it OK to clear the UTF-8 flag to make sorting work in a locale way
    and set the flag again to make lc() work? Or does this just show that
    there is something wrong in my script?
    If I used Unicode::Collate I would not need this fiddling with the UTF-8
    flag, but it is very slow (because it loads the big allkeys.txt file) and
    might cause trouble in multithreaded applications (as I read somewhere).

    I did not provide a full script, because this posting is long enough as
    it is. I hope this is OK.


    I also tried to replace utf8::encode/decode with Encode::from_to, but
    failed so far, because I don't actually know from what to what I want to
    convert. One side is utf8, but what is the other side?


    thnx a lot,
    peter





    --
    http://www2.goldfisch.at/know_list
    http://leblogsportif.sportnation.at
    peter pilsl, Sep 28, 2004
    #1

  2. Ben Morrow

    Ben Morrow Guest

    Quoth peter pilsl:
    >
    > I read parameters delivered by the webbrowser (html-header is always
    > UTF-8 !!), and want to sort and lowercase them and print them out again.
    > I don't set STDIN and STDOUT to ":utf8",


    I will say, as I often have: I would recommend using :encoding(utf8)
    rather than :utf8, as you can then handle malformed UTF-8 properly.

    > cause this does not work with
    > mod_perl.
    >
    >
    > ....
    > my $input=$cgi->param('myfield');
    > utf8::decode($input);


    I would use Encode::decode here, as you'll get better error handling.
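
    As a rough sketch of what that error handling could look like: the
    helper name decode_param and the sample octets below are made up for
    illustration, and returning undef on failure is just one possible policy.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn raw UTF-8 octets into a Perl character string.
# Returns undef if the octets are not valid UTF-8.
sub decode_param {
    my ($octets) = @_;
    return unless defined $octets;

    # FB_CROAK makes decode() die on malformed input. Work on a copy,
    # because decode() may modify its source argument when a CHECK
    # value is passed.
    my $copy  = $octets;
    my $chars = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()) };
    return $chars;    # undef when decoding failed
}

my $ok  = decode_param("Gr\xC3\xBC\xC3\x9Fe");   # "Grüße" as UTF-8 octets
my $bad = decode_param("Gr\xFC\xDFe");           # Latin-1 octets, not valid UTF-8

print defined $bad ? "decoded\n" : "rejected malformed input\n";
```

    With utf8::decode you only get a true/false return value; with
    Encode::decode and a CHECK argument you can choose between dying,
    warning, or substituting a replacement character.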

    <snip>
    > So is it ok to get the data somehow "raw" from the webinterface, then
    > decode it, process it and encode it again to print it out or is this a
    > rather stupid approach?
    >
    > Is it normal that I need to decode values delivered by a webpage that
    > has UTF-8 charset in its header?


    If you haven't specified that the FH is utf8, then you'll have to decode it
    by hand.
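
    To illustrate the two routes, here is a small sketch that reads the same
    octets once through a handle with an encoding layer and once through a
    raw handle followed by a manual decode. The in-memory filehandle and the
    sample text are assumptions made for the example; real input would come
    from a socket, file or STDIN.

```perl
use strict;
use warnings;
use Encode ();

# "Äpfel und Öl" as raw UTF-8 octets, standing in for data that arrives
# on a filehandle (sample data invented for this sketch).
my $octets = "\xC3\x84pfel und \xC3\x96l\n";

# Route 1: the handle carries an :encoding layer, so reads yield characters.
open my $in, '<:encoding(UTF-8)', \$octets or die "cannot open: $!";
my $via_layer = <$in>;
close $in;

# Route 2: the handle is raw, so the octets have to be decoded by hand.
open my $raw, '<:raw', \$octets or die "cannot open: $!";
my $line = <$raw>;
close $raw;
my $by_hand = Encode::decode('UTF-8', $line);

print "same string\n" if $via_layer eq $by_hand;
```

    Either way the program ends up with a character string; the only
    difference is where the decoding step happens.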

    > Is it ok to clear the utf-8 flag to make sorting work in a locale-way
    > and set the flag again to make lc() work? Or does this just show that
    > there is something wrong in my script?


    Hmmmmmmm..... I think this is a bad idea. What if you have chars outside
    ISO8859-1? I would strongly recommend using Encode::encode to convert it
    to ISO8859-1 explicitly, and be prepared to handle errors.

    If you read perlunicode it tells you that Unicode and locales currently
    don't play nicely together; I'd probably recommend doing something like
    this:

    my $iso = Encode::encode 'iso8859-1' => $utf8;
    {
        use locale;
        do_stuff_with($iso);
    }
    $utf8 = Encode::decode 'iso8859-1' => $iso;

    so that you don't try and use unicode data when locales are switched on.
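
    A slightly fuller sketch of the same round trip, with the error handling
    spelled out. The function name sort_as_latin1 is hypothetical, and the
    sketch assumes the process is running under an ISO-8859-1 locale (the
    exact locale name depends on the system); without one, the sort still
    runs but the order is whatever the active locale gives.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: takes character strings, sorts them as ISO-8859-1
# octets under locale rules, and returns character strings again.
# Dies if any string contains a character outside ISO-8859-1.
sub sort_as_latin1 {
    my @chars = @_;
    my @iso;
    for my $s (@chars) {
        my $copy   = $s;   # encode() may modify its source when CHECK is set
        my $octets = eval {
            Encode::encode('iso-8859-1', $copy, Encode::FB_CROAK());
        };
        die "string not representable in ISO-8859-1\n" unless defined $octets;
        push @iso, $octets;
    }

    # "use locale" is lexically scoped, so it only affects the sort
    # written inside this do block.
    my @sorted = do { use locale; sort @iso };

    return map { Encode::decode('iso-8859-1', $_) } @sorted;
}

my @sorted = sort_as_latin1("Zebra", "\x{C4}pfel", "Apfel");
print scalar(@sorted), " strings sorted\n";
```

    Because the pragma is lexical, any sorting or case mapping that should
    follow the locale has to be written inside the block itself (or inside
    code that has its own "use locale").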

    Ben

    --
    We do not stop playing because we grow old;
    we grow old because we stop playing.
    Ben Morrow, Sep 30, 2004
    #2

  3. peter pilsl

    peter pilsl Guest

    >> Is it ok to clear the utf-8 flag to make sorting work in a locale-way
    >> and set the flag again to make lc() work? Or does this just show that
    >> there is something wrong in my script?
    >
    > Hmmmmmmm..... I think this is a bad idea. What if you have chars outside
    > ISO8859-1? I would strongly recommend using Encode::encode to convert it
    > to ISO8859-1 explicitly, and be prepared to handle errors.
    >


    Thanks. I got around all these problems by finding an appropriate
    locale for my needs: "de_AT.UTF-8". I get the input from a
    non-UTF-8 filehandle, decode it, and then everything works smoothly,
    including sorting, lowercasing and pattern matching (see below). Then I
    encode and print to a non-UTF-8 filehandle again.
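
    Roughly, such a pipeline might look like the sketch below. This is not
    the actual script: the helper name lower_and_sort and the sample words
    are invented, and how well "use locale" copes with UTF-8 locales has
    changed across Perl releases, so the resulting order should be checked
    on the Perl version actually in use.

```perl
use strict;
use warnings;
use Encode ();
use POSIX qw(setlocale LC_COLLATE);

# Ask for the locale named above. setlocale() returns undef if the system
# does not provide it, in which case the previous locale stays active.
my $have_locale = defined setlocale(LC_COLLATE, 'de_AT.UTF-8');

# Hypothetical helper: UTF-8 octets in, UTF-8 octets out.
sub lower_and_sort {
    my @octets = @_;

    # Decode at the input boundary: octets become character strings.
    my @chars = map { Encode::decode('UTF-8', $_) } @octets;

    # Lowercase outside the locale scope, so Unicode rules apply.
    my @lower = map { lc($_) } @chars;

    # Only the sort runs under the (lexically scoped) locale pragma.
    my @sorted = do { use locale; sort @lower };

    # Encode at the output boundary: characters become octets again.
    return map { Encode::encode('UTF-8', $_) } @sorted;
}

my @result = lower_and_sort("Zebra", "\xC3\x84pfel", "Apfel");
print "de_AT.UTF-8 ", ($have_locale ? "in use" : "not available"), "\n";
```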


    > If you read perlunicode it tells you that Unicode and locales currently
    > don't play nicely together; I'd probably recommend doing something like
    > this:
    >
    > my $iso = Encode::encode 'iso8859-1' => $utf8;
    > {
    > use locale;
    > do_stuff_with($iso);
    > }
    > $utf8 = Encode::decode 'iso8859-1' => $iso;
    >
    > so that you don't try and use unicode data when locales are switched on.
    >


    perlunicode states that this is discouraged, but it also explains a bit
    of what can happen, and in the end I don't have much choice but to use
    Unicode and locales together.
    The data I need to process can definitely include many different
    languages and charsets, and the handling (especially collation) should
    definitely follow German rules (German text that can include words from
    any other language, including Chinese and Hindi and others I have never
    heard of). And it should be fast ....

    Your idea above looks very smart and I'll definitely give it a very
    close look. Currently all my locale stuff works (almost all: see my
    other new posting, where one construct makes $s=~/$s/i fail !!).


    thnx a lot,
    peter



    --
    http://www2.goldfisch.at/know_list
    http://leblogsportif.sportnation.at
    peter pilsl, Oct 1, 2004
    #3
