unicode: is decode-process-encode a "good" approach?


peter pilsl

Thanks to Alan and Shawn for their replies to my last posting. I read a lot
of docs before and after, and still do, but it's all very confusing.

Finally I found an approach that is actually working for me, and I wanted
to ask you whether this makes sense and *might* even keep working in the
long run, or whether it is just asking for trouble.

I read parameters delivered by the web browser (the HTML header is always
UTF-8!), and I want to sort and lowercase them and print them out again.
I don't set STDIN and STDOUT to ":utf8", because that does not work with
mod_perl.


.....
my $input = $cgi->param('myfield');
utf8::decode($input);
utf8::downgrade($input);        # otherwise sort will not sort according to
                                # my LC_COLLATE setting, and I need a
                                # localized sort (mainly German data)

my $value = do_a_lot($input);   # do some data processing, including sorting

utf8::upgrade($value);          # otherwise the lc() in the next line would
                                # not lowercase chars like German umlauts
$value = lc($value);
utf8::downgrade($value);        # to make sort work again

$value = do_a_lot_more($value); # do some more data processing and sorting

utf8::encode($value);
print $value;


So is it OK to get the data somehow "raw" from the web interface, then
decode it, process it, and encode it again to print it out, or is this a
rather stupid approach?

Is it normal that I need to decode values delivered by a webpage that
has the UTF-8 charset in its header?

Is it OK to clear the UTF-8 flag to make sorting work in a locale-aware
way and set the flag again to make lc() work? Or does this just show that
there is something wrong in my script?
If I used Unicode::Collate I would not need this fiddling with the UTF-8
flag, but it is very slow (because it loads the big allkeys.txt file) and
might cause trouble in multithreaded applications (as I read somewhere).
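
(For reference, a minimal sketch of the Unicode::Collate route mentioned
above; the slow part is constructing the collator, which loads allkeys.txt,
so it would be built once and reused. @decoded is an assumed list of
already-decoded strings.)

use Unicode::Collate;

# build the expensive collator once, then reuse it for every sort
my $collator = Unicode::Collate->new();

# @decoded is a hypothetical list of already-decoded character strings
my @sorted = $collator->sort(@decoded);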

I did not provide a full script, because this posting is long enough as
it is. I hope that's OK.


I also tried to replace the utf8::encode/decode calls with Encode::from_to,
but failed so far, because I don't actually know from what to what I want
to convert. One side is UTF-8, but what is the other side?
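
(A hedged illustration of the difference: Encode::from_to() converts octets
from one named encoding to another in place, so both sides must be byte
encodings, while decode()/encode() go between octets and Perl's internal
character strings, which is the unnamed "other side". $utf8_octets is an
assumed variable holding raw UTF-8 bytes.)

use Encode qw(decode encode from_to);

# decode()/encode() name only one side; the other side is Perl's
# internal character-string form:
my $chars  = decode('UTF-8', $utf8_octets);   # bytes -> characters
my $octets = encode('UTF-8', $chars);         # characters -> bytes

# from_to() instead converts between two byte encodings, in place:
from_to($octets, 'UTF-8', 'iso-8859-1');      # UTF-8 bytes -> Latin-1 bytes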


thnx a lot,
peter
 

Ben Morrow

Quoth peter pilsl:
> I read parameters delivered by the web browser (the HTML header is always
> UTF-8!), and I want to sort and lowercase them and print them out again.
> I don't set STDIN and STDOUT to ":utf8",

I will say, as I often have: I would recommend using :encoding(utf8)
rather than :utf8, as you can then handle malformed UTF-8 properly.

> because that does not work with mod_perl.


> ....
> my $input = $cgi->param('myfield');
> utf8::decode($input);

I would use Encode::decode here, as you'll get better error handling.
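
(A minimal sketch of that suggestion, reusing the $cgi object from the
snippet above: with FB_CROAK, decode() dies on malformed UTF-8 instead of
silently substituting characters, so the error can be caught and handled.)

use Encode qw(decode FB_CROAK);

my $input = eval {
    decode('UTF-8', scalar $cgi->param('myfield'), FB_CROAK)
};
if ($@) {
    # malformed UTF-8 in the request: reject it, log it, or fall back
}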

> So is it OK to get the data somehow "raw" from the web interface, then
> decode it, process it, and encode it again to print it out, or is this a
> rather stupid approach?
>
> Is it normal that I need to decode values delivered by a webpage that
> has the UTF-8 charset in its header?

If you haven't specified that the FH is utf8, then you'll have to decode it
by hand.
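
(A hedged sketch of the two options; 'data.txt' is just a placeholder file
name.)

use Encode qw(decode);

# option 1: declare the encoding on the filehandle, so reads return
# character strings directly
open my $fh, '<:encoding(UTF-8)', 'data.txt' or die $!;
my $line = <$fh>;                    # already decoded

# option 2: read raw bytes and decode by hand, as with CGI parameters,
# which CGI.pm delivers as bytes
open my $raw, '<', 'data.txt' or die $!;
my $bytes = <$raw>;
my $chars = decode('UTF-8', $bytes);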
> Is it OK to clear the UTF-8 flag to make sorting work in a locale-aware
> way and set the flag again to make lc() work? Or does this just show that
> there is something wrong in my script?

Hmmmmmmm..... I think this is a bad idea. What if you have chars outside
ISO8859-1? I would strongly recommend using Encode::encode to convert it
to ISO8859-1 explicitly, and be prepared to handle errors.

If you read perlunicode it tells you that Unicode and locales currently
don't play nicely together; I'd probably recommend doing something like
this:

my $iso = Encode::encode 'iso8859-1' => $utf8;
{
    use locale;
    do_stuff_with($iso);
}
$utf8 = Encode::decode 'iso8859-1' => $iso;

so that you don't try and use unicode data when locales are switched on.
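
(A hedged variant of the snippet above with the error handling mentioned
earlier: FB_CROAK makes encode() die when the string contains characters
outside ISO-8859-1, so the failure can be caught rather than silently
turning into question marks.)

use Encode qw(encode FB_CROAK);

my $iso = eval { encode('iso8859-1', $utf8, FB_CROAK) };
if ($@) {
    # $utf8 contains characters with no Latin-1 representation:
    # skip the locale path for this string, or replace those chars first
}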

Ben
 

peter pilsl

> Hmmmmmmm..... I think this is a bad idea. What if you have chars outside
> ISO8859-1? I would strongly recommend using Encode::encode to convert it
> to ISO8859-1 explicitly, and be prepared to handle errors.

Thanks. I got around all these problems now by finding an appropriate
locale for my needs: "de_AT.UTF-8". I get the input from a non-UTF-8
filehandle and decode it, and then everything works smoothly, including
sorting, lowercasing, and pattern matching (see below). Then I encode and
print to a non-UTF-8 filehandle again.
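
(A minimal sketch of that flow as described, with assumed variable names:
the UTF-8 locale is set for collation and character classification, input
is decoded to characters, processed under "use locale", and re-encoded on
output.)

use POSIX qw(setlocale LC_COLLATE LC_CTYPE);
use Encode qw(decode encode);

setlocale(LC_COLLATE, 'de_AT.UTF-8');
setlocale(LC_CTYPE,   'de_AT.UTF-8');

# @raw_params is an assumed list of byte strings from the web interface
my @fields = map { decode('UTF-8', $_) } @raw_params;

my @sorted = do {
    use locale;                      # locale-aware sort and lc()
    sort map { lc } @fields;
};

print encode('UTF-8', "$_\n") for @sorted;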

> If you read perlunicode it tells you that Unicode and locales currently
> don't play nicely together; I'd probably recommend doing something like
> this:
>
> my $iso = Encode::encode 'iso8859-1' => $utf8;
> {
>     use locale;
>     do_stuff_with($iso);
> }
> $utf8 = Encode::decode 'iso8859-1' => $iso;
>
> so that you don't try and use unicode data when locales are switched on.

perlunicode states that this is discouraged, but it also explains a bit of
what can happen, and in the end I don't have much of a choice but to use
Unicode and locales.
The data I need to process can definitely include many different
languages and charsets, and the handling (especially collation) should
definitely follow German rules (German text that can include words from
any other language, including Chinese and Hindi and other things I've
never heard of). And it should be fast ...

Your idea above looks very smart, and I'll definitely give it a very
close look. Currently almost all of my locale stuff works (see my other
new posting, where there is one construct that makes $s=~/$s/i fail!).


thnx a lot,
peter
 
