How to mark UTF-8 string as being UTF-8

Yohan N. Leder · Jun 2, 2006

Hi,

In a script I'm revising to support UTF-8 everywhere ("use utf8",
"binmode STDOUT, ':utf8'", generated HTML's content-type set to utf-8),
I'm parsing STDIN containing form data (urlencoded or multipart). This
part of the code starts with the usual:

binmode STDIN; # raw, since it may contain binary data
read (STDIN, $data, $ENV{CONTENT_LENGTH});
... here parsing code ...

But the text field values are not recognized as UTF-8 after parsing (a
print shows two characters rather than one accented character), even
though they are. So how do I mark these strings as being UTF-8 (no
conversion is needed, since they are already UTF-8 as captured by the
raw read)?

And, of course, I can't do a "binmode STDIN, ':utf8'" because STDIN
mixes binary and text data.

Dr.Ruud · Jun 2, 2006

Yohan N. Leder schreef:

how do I mark these strings as being UTF-8

See `perldoc Encode`.

Bart Van der Donck · Jun 3, 2006

Yohan said:
In a script I'm revising to support UTF-8 everywhere ("use utf8",
"binmode STDOUT, ':utf8'", generated HTML's content-type set to utf-8),
I'm parsing STDIN containing form data (urlencoded or multipart). This
part of the code starts with the usual:

binmode STDIN; # raw, since it may contain binary data
read (STDIN, $data, $ENV{CONTENT_LENGTH});
... here parsing code ...

But the text field values are not recognized as UTF-8 after parsing (a
print shows two characters rather than one accented character), even
though they are. So how do I mark these strings as being UTF-8 (no
conversion is needed, since they are already UTF-8 as captured by the
raw read)?

Are you sure of that? Maybe the web form didn't give you utf8 in the
first place. As a basic rule, the browser performs the form submission
in the same charset as the web page containing the form, but only IF the
sent characters can be represented in that charset. Setting the web page
of the form to "charset=utf-8" should help.

In my experience, Perl + utf8 = headache. But I have a feeling you're
looking at a form encoding issue rather than a Perl issue.

Encoding at CGI/browser level (tip):
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

Encoding at Perl level:
http://perldoc.perl.org/Encode.html
http://perldoc.perl.org/Encode/Supported.html

Yohan N. Leder · Jun 3, 2006

Are you sure of that? Maybe the web form didn't give you utf8 in the
first place. As a basic rule, the browser performs the form submission
in the same charset as the web page containing the form, but only IF the
sent characters can be represented in that charset. Setting the web page
of the form to "charset=utf-8" should help.

That's what I've done, using both the HTTP header and the HTML head:

print "Content-type: text/html; charset=utf-8\n\n";
print "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>";

In my experience, Perl + utf8 = headache.

From what I've seen so far, it becomes a headache when you want to
manage several charsets. That's why I've decided to switch entirely to
UTF-8: that covers literal strings ("use utf8;" and taking care to use a
Unicode-enabled editor), processing ("require 5.8.0;"), browser output
("binmode(STDOUT, ":utf8");" plus the content-type and meta tag above),
browser input (the subject of this thread), and file I/O (use open
':utf8').

But I have a feeling you're
looking at a form encoding issue rather than a Perl issue.
Encoding at CGI/browser level (tip):
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

I've already read this page. Unless I'm mistaken, that's not the issue.

Encoding at Perl level:
http://perldoc.perl.org/Encode.html
http://perldoc.perl.org/Encode/Supported.html

That's certainly where I'll find how to mark the input text values as
UTF-8... But I have to read these pages first: not done yet.

Yohan N. Leder · Jun 3, 2006

Yohan N. Leder schreef:

See `perldoc Encode`.

Thanks. I've started reading. I'm not sure I understand it yet: I'll
have to run some real trials to understand it properly.

Dr.Ruud · Jun 3, 2006

Yohan N. Leder schreef:

rvtol:

Thanks. I've started reading. I'm not sure I understand it yet: I'll
have to run some real trials to understand it properly.

You shouldn't fall back on trials, just read again until you understand.

The article talks about octets, bytes and characters; see the
TERMINOLOGY section.

Perl's utf8 is not the same as UTF-8, but don't worry about the
difference.

To set the UTF-8/utf8 flag on for a $string, you can use $string =
decode_utf8( $string ), but read the CAVEAT section on decode(), about
the case where $string contains only ASCII data.
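
As an illustration of the point above, here is a minimal sketch of what
decode_utf8() does to raw octets. The sample bytes are an assumption
chosen for the example (the UTF-8 encoding of an accented "e"); they are
not taken from the original poster's form. It compares lengths and code
points rather than inspecting the internal flag.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Two raw octets, as they would arrive from a binmode'd STDIN:
# the UTF-8 encoding of U+00E9 (e with acute accent).
my $octets = "\xC3\xA9";

# decode_utf8() turns the octets into one Perl character.
my $text = decode_utf8($octets);

printf "octets: length %d\n", length $octets;
printf "text:   length %d, code point U+%04X\n", length $text, ord $text;
```

Before decoding, length() counts two bytes; afterwards it counts one
character, which is the "two chars rather than one" symptom described at
the start of the thread.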

Yohan N. Leder · Jun 3, 2006

You shouldn't fall back on trials, just read again until you understand.

OK, better understood now.

Perl's utf8 is not the same as UTF-8, but don't worry about the
difference.

Indeed, I hadn't understood this specific point. Thank you!

To set the UTF-8/utf8 flag on for a $string, you can use $string =
decode_utf8( $string ), but read the CAVEAT section on decode(), about
the case where $string contains only ASCII data.

OK, "$value = decode("utf8", $value);" gives me the right $value, but I
have another problem. To work around the caveat about all-ASCII strings,
I thought of doing something like this (from memory, since I'm not at my
workstation):

use Encode;
$value = decode("utf8", $value);
if (!Encode::is_utf8($value)){Encode::_utf8_on($value);}

And I was surprised to see that all values have to go through
Encode::_utf8_on(). What does that mean? I ran my test with form field
strings containing characters such as an accented 'i', accented
uppercase letters, etc. The strings are correctly decoded to internal
utf8, but all of them return false for Encode::is_utf8($value).

Also, I tried utf8::is_utf8() and it gives the same result. Have I made
a mistake somewhere?

Alan J. Flavell · Jun 4, 2006

That's what I've done, using both the HTTP header and the HTML head:

print "Content-type: text/html; charset=utf-8\n\n";

Looks good

print "<meta http-equiv='Content-Type' content='text/html; charset=utf-8'>";

Should have no effect - the real HTTP header has priority. But does
no great harm in practice.

(e-mail address removed) says...

There are some genuine problems; but many practical problems turn out
to be a misunderstanding of how things are supposed to work.

Forms submission plus i18n plus *non*-utf-8 encoding is a recipe for
even greater headaches, and my cited page shows some practical cases.
With current browsers it's definitely to be recommended to use utf-8.
You can deduce that from the fact that the main search engines have
(now that NN4.* is effectively dead) gone over to using utf-8 for
their query submissions, and no longer offer the range of localised
8-bit character codings that they used to offer on their international
query pages.

From what I've seen so far, it becomes a headache when you want to
manage several charsets. That's why I've decided to switch entirely to
UTF-8:

Yes, good choice nowadays.

[...]

file I/O (use open ':utf8').

Yes, but the forms submission data is not directly "in" utf-8. You
have to decode it first, and then ask Perl to represent it in its
natural text format (which, if the characters are more than just
ASCII, happens to be based on utf-8, but don't let that fool you).

If we consider for example the form-url-encoded format, you could have
a string like for example %D0%9F%D1%80%D0%B8 (that happens to be
three Cyrillic characters in utf-8).

When you "decode" that, you get a sequence of bytes whose contents
(seen as raw bytes) are xD0, x9F, xD1, etc. You need to feed this to
Perl's appropriate function to get them represented as Perl
characters. That's decode_utf8, but since you don't have any way to
be sure that the submitted data is valid, you'd also need to make some
preparation to handle any errors reported by decode_utf8.

Sorry, I don't have a practical example to hand of actually doing this
in a production situation - all my Perl code so far for this is just
for testing and diagnostics :-}
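
For readers who want something concrete, the two steps described above
(percent-decoding into bytes, then decoding those bytes as UTF-8 with
error handling) might be sketched as follows. The helper name
decode_form_value is hypothetical, and returning undef on malformed
input is only one possible policy, assumed here for brevity.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn one form-urlencoded value into Perl characters.
# Returns undef if the resulting bytes are not valid UTF-8.
sub decode_form_value {
    my ($raw) = @_;
    (my $bytes = $raw) =~ tr/+/ /;                    # '+' stands for a space
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;    # %XX -> one octet
    my $text = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC keeps decode() from modifying $bytes.
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text;    # undef when decode() died
}

my $text = decode_form_value('%D0%9F%D1%80%D0%B8');
printf "%d characters\n", length $text;
```

The sample input is the string quoted above; after both steps it is
three Perl characters rather than six bytes.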

That's certainly where I'll find how to mark the input text values
as UTF-8...

I think you're putting the wrong emphasis here. What you (should)
want to do is to express the input text values in Perl's appropriate
character representation. On the occasions when the input contains
only ASCII characters, Perl does not want to mark these strings as
unicode, and you should not want to mark them either. On the
occasions when the input contains nothing more than iso-8859-1, then
you need to take care that *those* characters are stored in Perl's
unicode format - which quite naturally will have its utf8 flag set -
and not just in raw iso-8859-1 octets (bytes) which would not.

If you process the characters as text, all should then work naturally.

But forcibly setting the utf8 flag when the data is inappropriate for
it, would be no kind of progress. I hope that wasn't what you meant.
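
A small sketch may make the point about "processing the characters as
text" more tangible. It assumes the same accented character arrives once
as iso-8859-1 octets and once as UTF-8 octets; both inputs are invented
for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same character, e with acute accent, arriving in two encodings.
my $from_latin1 = decode('iso-8859-1', "\xE9");
my $from_utf8   = decode('UTF-8',      "\xC3\xA9");

# Once decoded, both are the single character U+00E9 and compare equal,
# so ordinary text operations need no knowledge of the internal flag.
print "same text\n" if $from_latin1 eq $from_utf8;
printf "U+%04X, length %d\n", ord $from_latin1, length $from_latin1;
```

Neither string needed _utf8_on(); decoding with the correct source
encoding was enough.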

Yohan N. Leder · Jun 4, 2006

Thank you for your encouragement, Alan!

Yes, but the forms submission data is not directly "in" utf-8. You
have to decode it first, and then ask Perl to represent it in its
natural text format (which, if the characters are more than just
ASCII, happens to be based on utf-8, but don't let that fool you).
[...]
When you "decode" that, you get a sequence of bytes whose contents
(seen as raw bytes) are xD0, x9F, xD1, etc. You need to feed this to
Perl's appropriate function to get them represented as Perl
characters. That's decode_utf8, but since you don't have any way to
be sure that the submitted data is valid, you'd also need to make some
preparation to handle any errors reported by decode_utf8.

Hmm, so even though I've served the HTML page containing the form with
charset='utf-8' and given the form accept-charset='utf-8', you're saying
I have to do some checking before "$value = decode('utf8', $value);"?

Do you mean I have to use the optional [CHECK] argument in the decode()
call, as explained in the Encode module's source? If so, what's the best
CHECK value? WARN_ON_ERR?

Or did you have something else in mind?

On the occasions when the input contains
only ASCII characters, Perl does not want to mark these strings as
unicode, and you should not want to mark them either. On the
occasions when the input contains nothing more than iso-8859-1, then
you need to take care that *those* characters are stored in Perl's
unicode format - which quite naturally will have its utf8 flag set -
and not just in raw iso-8859-1 octets (bytes) which would not.

If you process the characters as text, all should then work naturally.

But forcibly setting the utf8 flag when the data is inappropriate for
it, would be no kind of progress. I hope that wasn't what you meant.

Well, so what you're saying is:

- If I have to handle a POST of type 'application/x-www-form-urlencoded',
I can just read it like this:

read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
... here parsing code ...

rather than:

binmode(STDIN, ':utf8');
read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
... here parsing code ...

but never (as in the multipart example at the bottom):

read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
... here parsing code ...
and for each value found:
$value = decode('utf8', $value);
Encode::_utf8_on($value) unless Encode::is_utf8($value);

- If I have to handle a POST of type 'multipart/form-data', then this
approach, which forces the utf8 flag in all cases, is wrong:

binmode(STDIN); # or binmode(STDIN, ':raw');
read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
... here parsing code ...
and for each value found:
$value = decode('utf8', $value);
Encode::_utf8_on($value) unless Encode::is_utf8($value);

What's the right way?

And, after that, I have three extra questions to close the thread:

- What about GET parameters when the browser doesn't send the URL as
UTF-8 (e.g. the user hasn't checked "send as utf-8" in the IE options):
do I have to decode() them from $ENV{'QUERY_STRING'}?

- What about the name in a name-value pair from a web form: do I have to
decode() the names, knowing they are surely in the US-ASCII set (a-z,
A-Z, 1-9)?

- What about paths and filenames: can I create a file with a filename
taken from a utf8 string, or do I have to convert it to the local server
charset before writing it to disk (knowing that, in my case, all the
files my scripts generate have lowercase names using only the characters
a-z, 1-9 and _)?

That's all

Alan J. Flavell · Jun 5, 2006

Hmm, so even though I've served the HTML page containing the form
with charset='utf-8' and given the form accept-charset='utf-8',
you're saying I have to do some checking before "$value =
decode('utf8', $value);"?

Never trust external data! Not only could there be browser bugs, but
in a WWW context, somebody may be submitting deliberately defective
data in the hope of compromising your server-side script.

Do you mean I have to use the optional [CHECK] argument in the
decode() call, as explained in the Encode module's source? If so,
what's the best CHECK value? WARN_ON_ERR?

Well, I mean you need to think about it. The decision what to do may
depend on circumstances. Perhaps it is enough to allow the decode to
insert the bad-character marker; perhaps it is more appropriate to
break off altogether; it depends on your assessment of the
consequences.

Raw server-side warnings aren't much use in the web client/server
context. You need to catch anything that is serious enough to be
caught, and report its implications to the client in terms which make
sense to the client (i.e not just passing-on the text of some
Perl-specific diagnostic).
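
The two policies mentioned above (let decode() substitute a
bad-character marker, or break off and tell the client something
meaningful) could look roughly like this. The function names, the field
name and the wording of the message are all hypothetical; which policy
fits is a judgment call for the application.

```perl
use strict;
use warnings;
use Encode ();

# Lenient policy: malformed bytes are replaced by U+FFFD,
# the Unicode replacement character (decode()'s default behaviour).
sub decode_lenient {
    my ($bytes) = @_;
    return Encode::decode('UTF-8', $bytes);
}

# Strict policy: returns (text, undef) on success, or
# (undef, message) with wording meant for the client, not a Perl diagnostic.
sub decode_strict {
    my ($field, $bytes) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ($text, undef) if defined $text;
    return (undef, "The field '$field' contained text that could not be read.");
}

# Latin-1 bytes for "naive" with a diaeresis, sent where UTF-8 was expected.
my $bad = "na\xEFve";
my ($text, $error) = decode_strict(comment => $bad);
print "Status: 400 Bad Request\n" if defined $error;
```

The Perl-level error text stays on the server (it is available in $@
inside decode_strict if it needs logging); only the plain message
reaches the client.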

Well, so what you're saying is:

- If I have to handle a POST of type 'application/x-www-form-urlencoded',
I can just read it like this:

read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
... here parsing code ...

rather than:

binmode(STDIN, ':utf8');
read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
... here parsing code ...

What you are reading from STDIN at this point is (or should be) in the
form-urlencoded format, and certainly *NOT* in utf-8 at this protocol
level. The data that you are reading *should* contain only us-ascii
characters, with everything else replaced by %xx hexadecimal notation
as I already said - and as you surely must already know!
http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13.4.1

You can read it in text mode or in raw binmode - the difference will
be in the handling of newlines, but your code can easily handle that.

Having read it, then, logically, you need to decode the
form-urlencoded format into bytes (octets); and then you need to turn
those byte-sequences into Perl's own characters.
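
Put together, the sequence just described (split the us-ascii body,
turn %xx notation into octets, then turn the octets into Perl
characters) might be sketched like this. parse_urlencoded is a
hypothetical name, and dying on malformed UTF-8 is an assumption made to
keep the example short; in a real script $body would come from the
read() on STDIN.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical parser for an application/x-www-form-urlencoded body.
# Returns a list of [name, value] pairs holding Perl character strings;
# dies if any name or value is not valid UTF-8.
sub parse_urlencoded {
    my ($body) = @_;
    my @pairs;
    for my $pair (split /&/, $body) {
        next unless length $pair;                 # skip empty segments
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;        # "name" with no "=value"
        for ($name, $value) {
            tr/+/ /;                                   # '+' stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;       # %XX -> one octet
            $_ = Encode::decode('UTF-8', $_,
                Encode::FB_CROAK() | Encode::LEAVE_SRC());
        }
        push @pairs, [$name, $value];
    }
    return @pairs;
}

my @pairs = parse_urlencoded('name=Jos%C3%A9&city=S%C3%A3o+Paulo');
printf "%s has %d characters\n", $_->[0], length $_->[1] for @pairs;
```

Names are decoded the same way as values, since nothing guarantees that
a client sends only the names the form defined.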

If you look in CGI.pm you will also find how to handle EBCDIC, but
perhaps you will never want to do that.

but never (as in the multipart example at the bottom):

read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
... here parsing code ...
and for each value found:
$value = decode('utf8', $value);

In general, each unicode character results from a variable number
of bytes in your $data. decode_utf8() knows how many it needs for each
utf8 character which it will output.
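
To see the variable byte count for yourself, a short sketch along these
lines could help. The code points are arbitrary samples picked for the
illustration, and utf8_byte_count is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: how many octets does one code point need in UTF-8?
sub utf8_byte_count {
    my ($code_point) = @_;
    return length encode('UTF-8', chr $code_point);
}

# Latin A, e-acute, Cyrillic Pe, euro sign, and a character beyond the BMP.
for my $cp (0x41, 0xE9, 0x41F, 0x20AC, 0x1F600) {
    printf "U+%04X -> %d byte(s)\n", $cp, utf8_byte_count($cp);
}
```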

There's parsing code in CGI.pm for this, why not take a look at what
it does, though it might be overkill for you since it has to deal with
other encodings as well as with EBCDIC.

(I've lost sight of just what it is that you are doing which CGI.pm
would not do for you anyway...?)

Encode::_utf8_on($value) unless Encode::is_utf8($value);

Please, don't play around with _utf8_on() without *very* good reason,
which you certainly don't seem to have here. Just use Perl's own
natural character formats, and they will take care of the internal
detail, in all but the most specialised of cases.

- If I have to handle a POST of type 'multipart/form-data',

Sorry, I've run out of time for now; but again, if you make the right
moves with Perl, it *will* give you your text characters, in its
natural representation, you do *not* normally have to do the low level
work of _utf8_on() for yourself, and you can cause harm if you do it
wrongly.

- What about GET parameters when the browser doesn't send the URL as
UTF-8 (e.g. the user hasn't checked "send as utf-8" in the IE
options): do I have to decode() them from $ENV{'QUERY_STRING'}?

Query string handling is not the same as encoding in URLs.

Try this experiment: send this sample query to google (by typing it in
the URL bar): http://www.google.co.uk/search?q=При ,
and remember the result.

Then change the "send as utf-8" option to the other setting, re-start
IE, and try it again. I reckon you will get the same as I did: the
same results with either setting. (Three Cyrillic letters
corresponding to Latin "Pri"). The URL in this case is in ASCII.
Even though the ASCII characters are encoding some utf-8 data. See
the distinction?

And this would also apply if you code those same parameters into a
form.
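
The distinction can be sketched from the sending side: the query
parameter is first encoded to UTF-8 octets, and those octets are then
written in %xx notation, so what travels in the URL is pure ASCII. The
helper name escape_query_value is hypothetical, and the set of
characters left unescaped is an assumption (letters, digits and
"-", ".", "_", "~").

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a Perl character string
# for use as a query-string value.
sub escape_query_value {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);          # characters -> UTF-8 octets
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $octets;                               # ASCII only from here on
}

# The three Cyrillic letters from the experiment above.
print escape_query_value("\x{41F}\x{440}\x{438}"), "\n";   # %D0%9F%D1%80%D0%B8
```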

- What about the name in a name-value pair from a web form: do I have
to decode() the names, knowing they are surely in the US-ASCII set
(a-z, A-Z, 1-9)?

How can you be so sure? In a web context, browser bugs or malicious
users can deliver defective data, this applies just as much to the
names as to the values. Many server-side insecurities have resulted
from scripts which did not sufficiently validate the submitted data.
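
One way to act on this for field names is to accept only names the
script already expects. The sketch below is an assumption about how that
might be done; the field names, the length limit and the function name
is_expected_name are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical whitelist of the field names this script expects.
my %known_field = map { $_ => 1 } qw(name email comment);

# True only for a short, lowercase ASCII name that is on the whitelist.
sub is_expected_name {
    my ($name) = @_;
    return 0 unless defined $name && $name =~ /\A[a-z0-9_]{1,32}\z/;
    return exists $known_field{$name} ? 1 : 0;
}

for my $candidate ('email', "email\n", '../etc/passwd') {
    printf "%s\n", is_expected_name($candidate) ? 'accepted' : 'rejected';
}
```

Using \A and \z rather than ^ and $ means a trailing newline does not
slip through the pattern.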

[...]

Sorry, now I'm really out of time for the moment. Maybe someone else
will want to comment. Good luck.
