Hmm, so even if I've specified the html page containing the form
with a charset='utf-8' and form with accept-charset='utf-8', you
say, I have to do some checking before "$value = decode('utf8',
$value);" ?
Never trust external data! Not only could there be browser bugs, but
in a WWW context, somebody may be submitting deliberately defective
data in the hope of compromising your server-side script.
Do you mean I have to use the optional [CHECK] argument in the
decode() call as explained in the Encode module's source? If yes,
what's the best CHECK value? WARN_ON_ERR?
Well, I mean you need to think about it. The decision what to do may
depend on circumstances. Perhaps it is enough to allow the decode to
insert the bad-character marker; perhaps it is more appropriate to
break off altogether; it depends on your assessment of the
consequences.
Raw server-side warnings aren't much use in the web client/server
context. You need to catch anything that is serious enough to be
caught, and report its implications to the client in terms which make
sense to the client (i.e. not just passing on the text of some
Perl-specific diagnostic).
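
As a rough illustration of the two policies mentioned above (let the decoder insert the bad-character marker, or break off and tell the client something meaningful), here is a minimal sketch. The helper name decode_or_explain and the wording of the message are my own assumptions for illustration, not anything taken from Encode or CGI.pm.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns (characters, undef) on success,
# or (undef, message) when the octets are not well-formed UTF-8.
sub decode_or_explain {
    my ($octets) = @_;
    my $chars = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC stops it from modifying $octets in place.
        decode('UTF-8', $octets, Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return ($chars, undef) if defined $chars;
    # Report in the client's terms, not as a raw Perl diagnostic.
    return (undef, 'The submitted text was not valid UTF-8; please resubmit the form.');
}

my ($ok,  $err) = decode_or_explain("caf\xC3\xA9");   # well-formed UTF-8
my ($bad, $why) = decode_or_explain("caf\xE9");       # stray Latin-1 octet

# The lenient alternative: with no CHECK argument, decode() replaces
# the malformed octet with U+FFFD REPLACEMENT CHARACTER instead of dying.
my $lenient = decode('UTF-8', "caf\xE9");
```

Which of the two is right for you depends, as I said, on what happens downstream with the data.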
Well, so, what you are saying is that:
- If I have to handle a POST of type 'application/x-www-form-urlencoded',
I can just read it like this:
read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
# ... parsing code here ...
rather than:
binmode(STDIN, ':utf8');
read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
# ... parsing code here ...
What you are reading from STDIN at this point is (or should be) in the
form-urlencoded format, and certainly *NOT* in utf-8 at this protocol
level. The data that you are reading *should* contain only us-ascii
characters, with everything else replaced by %xx hexadecimal notation
as I already said - and as you surely must already know!
http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4.1
You can read it in text mode or in raw binmode - the difference will
be in the handling of newlines, but your code can easily handle that.
Having read it, then, logically, you need to decode the
form-urlencoded format into bytes (octets); and then you need to turn
those byte-sequences into Perl's own characters.
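
To make those two steps concrete, here is a minimal sketch: first undo the form-urlencoding to get octets, then decode the octets into Perl characters. The helper name parse_urlencoded is hypothetical, it assumes the form was submitted as UTF-8, and it simply dies on malformed data (you would wrap it in eval and decide what to tell the client). CGI.pm does considerably more than this.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turns a form-urlencoded string into a list of
# [name, value] pairs holding Perl character strings.
sub parse_urlencoded {
    my ($raw) = @_;
    my @pairs;
    for my $field (split /[&;]/, $raw) {
        next unless length $field;                # skip empty fields ("a=1&&b=2")
        my ($name, $value) = split /=/, $field, 2;
        $value = '' unless defined $value;        # "name" with no "=value"
        for ($name, $value) {
            tr/+/ /;                              # '+' stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # step 1: %xx -> one octet
            # Step 2: octets -> characters. Dies on malformed UTF-8
            # (an assumed policy; choose your own CHECK value).
            $_ = decode('UTF-8', $_, Encode::FB_CROAK | Encode::LEAVE_SRC);
        }
        push @pairs, [$name, $value];
    }
    return @pairs;
}

my @pairs = parse_urlencoded('q=%D0%9F%D1%80%D0%B8&lang=ru');
```

Note that nothing here needs binmode(STDIN, ':utf8'): the input is ASCII, and the UTF-8 octets only appear after the %xx step.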
If you look in CGI.pm you will also find how to handle EBCDIC, but
perhaps you will never want to do that.
but never (as in the multipart example at the bottom):
read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
# ... parsing code here ...
and for each value found:
decode('utf8', $value);
In general, each unicode character results from a variable number
of bytes in your $data. decode_utf8() knows how many it needs for each
utf8 character which it will output.
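
A small sketch of that point, comparing lengths before and after decoding. The sample octets are just my choice of illustration (the UTF-8 form of three Cyrillic letters).

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Six octets: the UTF-8 encoding of three Cyrillic letters.
my $octets = "\xD0\x9F\xD1\x80\xD0\xB8";
my $chars  = decode_utf8($octets);

my $octet_count = length $octets;   # length counts octets before decoding
my $char_count  = length $chars;    # and characters after decoding

print "$octet_count octets became $char_count characters\n";
```

Here each character happened to need two octets; ASCII letters need one, and many other scripts need three or more.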
There's parsing code in CGI.pm for this, why not take a look at what
it does, though it might be overkill for you since it has to deal with
other encodings as well as with EBCDIC.
(I've lost sight of just what it is that you are doing which CGI.pm
would not do for you anyway...?)
Encode::_utf8_on($value) unless Encode::is_utf8($value);
Please, don't play around with _utf8_on() without *very* good reason,
which you certainly don't seem to have here. Just use Perl's own
natural character formats, and they will take care of the internal
detail, in all but the most specialised of cases.
- If I have to treat POST being a 'multipart/form-data' one,
Sorry, I've run out of time for now; but again, if you make the right
moves with Perl, it *will* give you your text characters, in its
natural representation, you do *not* normally have to do the low level
work of _utf8_on() for yourself, and you can cause harm if you do it
wrongly.
- What about GET params when the browser doesn't send the URL as UTF-8
(e.g. the user didn't check "send as utf-8" in the IE options):
do I have to decode() them from $ENV{'QUERY_STRING'}?
Query string handling is not the same as encoding in URLs.
Try this experiment: send this sample query to google (by typing it in
the URL bar):
http://www.google.co.uk/search?q=При ,
and remember the result.
Then change the "send as utf-8" option to the other setting, re-start
IE, and try it again. I reckon you will get the same as I did: the
same results with either setting. (Three Cyrillic letters
corresponding to Latin "Pri"). The URL in this case is in ASCII,
even though the ASCII characters are encoding some utf-8 data. See
the distinction?
And this would also apply if you code those same parameters into a
form.
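
To illustrate the distinction, here is a sketch of what those three Cyrillic letters look like once they are utf-8 encoded and then %xx-escaped for a query string. I am assuming utf-8 as the encoding and the usual set of unreserved URL characters; this is only meant to show the layering, not what any particular browser does internally.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $query  = "\x{41F}\x{440}\x{438}";     # the three Cyrillic letters
my $octets = encode('UTF-8', $query);     # characters -> utf-8 octets

# Escape every octet that is not an unreserved URL character as %XX.
(my $escaped = $octets) =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;

# What travels in the URL contains nothing outside us-ascii.
my $is_ascii = $escaped !~ /[^\x00-\x7F]/;

print "q=$escaped\n";
```

The string on the wire is ASCII; the utf-8 is one layer down, inside the %xx notation.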
- What about the names in name-value pairs from a web form: do I have
to decode() them, knowing they will surely be in the us-ascii set
(a-z, A-Z, 0-9)?
How can you be so sure? In a web context, browser bugs or malicious
users can deliver defective data; this applies just as much to the
names as to the values. Many server-side insecurities have resulted
from scripts which did not sufficiently validate the submitted data.
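
One way to act on that for the names, sketched under my own assumptions: the list of expected field names and the helper check_field_names are hypothetical, and you would substitute whatever your form really sends.

```perl
use strict;
use warnings;

# Hypothetical whitelist: the field names this form is expected to send.
my %expected = map { $_ => 1 } qw(name email comment);

# Returns the submitted names that are either not plain ASCII word
# characters or not among the fields the form is supposed to contain.
sub check_field_names {
    my (@names) = @_;
    return grep { !/\A[A-Za-z0-9_]{1,32}\z/ || !$expected{$_} } @names;
}

# A NUL byte smuggled into a name, and a field the form never had.
my @rejected = check_field_names('name', 'email', "comment\0", 'admin');
```

Anything that comes back in @rejected is a reason to refuse the request, or at least to ignore that field.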
[...]
Sorry, now I'm really out of time for the moment. Maybe someone else
will want to comment. Good luck.