Differences in UTF-8 html form inputs

Realbot · Jan 8, 2005

Hi,

I'm having some problems with a web application of mine.
To make things clearer, here is an HTML input form that demonstrates the problem.
It submits two strings, one with GET and one with POST, and it uses HTML::Mason.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Test utf</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<form name="formutfget" method="GET">
Enter text (get):<br>
<input type="text" name="textget" size="20" maxlength="30">
</form>
<form name="formutfpost" method="POST">
Enter text (post):<br>
<input type="text" name="textpost" size="20" maxlength="30">
</form>
Value of GET: <% $textget %><br>
Hex of GET: <% $hexget %><br>
Value of POST: <% $textpost %><br>
Hex of POST: <% $hexpost %><br>
</body>
</html>
<%args>
$textget => ''
$textpost => ''
$hexget => ''
$hexpost => ''
</%args>
<%init>
$hexget = unpack('H*', $textget);
$hexpost = unpack('H*', $textpost);
</%init>

The strange thing is that running this form under these environments
Debian Woody - perl 5.6.1 - Mozilla 1.4.3/Firefox 1.0
Debian Sid - perl 5.8.4 - Mozilla 1.4.3/Firefox 1.0
using as input the string "Δωδεκανήσων" (I don't know what it means btw...), I get as output

Value of GET: Δωδεκανήσων
Hex of GET: 26233931363b26233936393b26233934383b26233934393b26233935343b26233934353b26233935373b26233934323b26233936333b26233936393b26233935373b
Value of POST: Δωδεκανήσων
Hex of POST: 26233931363b26233936393b26233934383b26233934393b26233935343b26233934353b26233935373b26233934323b26233936333b26233936393b26233935373b

while in OpenBSD - perl 5.8.0 - Mozilla 1.4.3/Firefox 1.0 with the same input string I get

Value of GET: Δωδεκανήσων
Hex of GET: ce94cf89ceb4ceb5cebaceb1cebdceaecf83cf89cebd
Value of POST: Δωδεκανήσων
Hex of POST: ce94cf89ceb4ceb5cebaceb1cebdceaecf83cf89cebd

So it seems that in the former case I get escaped Unicode characters (numeric character references such as &#916;) and in the latter raw UTF-8 bytes.
I thought it could be a 5.6 vs 5.8 difference, but as you can see, even under Debian Sid I got the same escaped characters.
Could it be an OpenBSD peculiarity? I've Googled but with no luck; maybe someone can shed some light on it...
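
For anyone comparing the two dumps, here is a minimal sketch that reverses the unpack('H*') step and checks whether both dumps carry the same characters. It assumes Perl 5.8 or later with the core Encode module; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hex dumps copied from the two outputs above.
my $hex_debian  = '26233931363b26233936393b26233934383b26233934393b26233935343b26233934353b26233935373b26233934323b26233936333b26233936393b26233935373b';
my $hex_openbsd = 'ce94cf89ceb4ceb5cebaceb1cebdceaecf83cf89cebd';

# pack('H*', ...) is the inverse of the unpack('H*', ...) used in the component.
my $bytes_debian  = pack('H*', $hex_debian);
my $bytes_openbsd = pack('H*', $hex_openbsd);

# Debian case: plain ASCII text made of numeric character references.
print "Debian bytes: $bytes_debian\n";

# Turn each &#NNN; reference into the character with that code point.
(my $chars_from_ncr = $bytes_debian) =~ s/&#(\d+);/chr($1)/ge;

# OpenBSD case: raw UTF-8 octets, decoded into Perl characters.
my $chars_from_utf8 = decode('UTF-8', $bytes_openbsd);

# List the code points and compare the two results.
my @code_points = map { sprintf('U+%04X', ord($_)) } split(//, $chars_from_utf8);
print "Code points:  @code_points\n";
print "Same string?  ", ($chars_from_ncr eq $chars_from_utf8 ? "yes" : "no"), "\n";
```

Both dumps decode to the same eleven code points, starting with U+0394, so the difference is only in how the form data was escaped on its way to the server.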

Thanks!

Chris Mattern · Jan 8, 2005

Realbot wrote:

using as input the string "???????????" (I don't know what it means
btw...),

"Dodecahedron"--i.e., a solid shape with 12 faces. If you're a gamer
who owns "funny dice", your 12-sided dice are dodecahedrons (or, if
you prefer, dodecahedra).
--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"

Alan J. Flavell · Jan 10, 2005

I'm having some problems with a web application of mine.

Forms submission including characters outside of us-ascii is
non-trivial, and isn't in itself a Perl problem.

OT: commentary of mine at
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

Until one can get that part sorted out to one's satisfaction, any
fiddling around that one might do in one's Perl script would be a bit
pointless, IMHO. And discussion of the web part would be more at home
on comp.infosystems.www.authoring.cgi (beware the automoderation bot).

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

If we assume that the page itself is really coded in utf-8 (note that
in the event of a dispute, the server's actual HTTP Content-type
header wins over anything that you might secrete in a meta
http-equiv), then you can expect current browsers to submit
utf-8-encoded form data. But not-quite-so-new browsers - even some
which support utf-8 display - get utf-8 forms submission sadly wrong.

<form name="formutfget" method="GET">

In -theory- the method GET supports nothing better than the us-ascii
character coding. But see my commentary for further discussion.
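
To make the GET case concrete: a browser that submits UTF-8 normally sends each non-ASCII octet percent-encoded, so the query string itself stays within us-ascii. A rough sketch in Perl, assuming the core Encode module; the helper names percent_encode and percent_decode are made up here, and real code would more likely use a module such as URI::Escape.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: percent-encode every octet outside the unreserved set.
# (Real form submissions usually write a space as '+', which this sketch does not.)
sub percent_encode {
    my ($octets) = @_;
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $octets;
}

# Hypothetical inverse: turn '+' and %XX sequences back into octets.
sub percent_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Two Greek characters, U+0394 and U+03C9, as a Perl character string.
my $chars  = "\x{0394}\x{03C9}";
my $octets = encode('UTF-8', $chars);
my $query  = 'textget=' . percent_encode($octets);
print "$query\n";    # textget=%CE%94%CF%89

# Server side: undo the percent-encoding, then decode the octets as UTF-8.
my ($value) = $query =~ /=(.*)/;
my $round_trip = decode('UTF-8', percent_decode($value));
print $round_trip eq $chars ? "round trip ok\n" : "round trip failed\n";
```

The server only sees the octets; nothing in the request says they are UTF-8, which is why the page's declared encoding matters so much.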

The strange thing is that running this form under these environments [...]

So, it seems that in the former I get escaped unicode character and
in the latter UTF-8 ones.

It looks as if somebody is trying to ape the misbegotten behaviour of
MSIE.

In a practical sense there isn't one right answer - there are several
compromises, depending on which browsers support what. But none of
the details here are features of the Perl programming language,
AFAICS.

good luck

Realbot · Jan 10, 2005

Alan said:
On Sat, 8 Jan 2005, Realbot wrote:

Forms submission including characters outside of us-ascii is
non-trivial, and isn't in itself a Perl problem.

OT: commentary of mine at
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

I read it avidly before posting, very well written.

Until one can get that part sorted out to one's satisfaction, any
fiddling around that one might do in one's Perl script would be a bit
pointless, IMHO. And discussion of the web part would be more at home
on comp.infosystems.www.authoring.cgi (beware the automoderation bot).

If we assume that the page itself is really coded in utf-8 (note that
in the event of a dispute, the server's actual HTTP Content-type
header wins over anything that you might secrete in a meta
http-equiv), then you can expect current browsers to submit
utf-8-encoded form data.

I found out that this was exactly the problem. Apache as installed on all Debian versions is configured with

AddDefaultCharset on

which makes Apache send its default charset (ISO-8859-1) in the HTTP Content-Type header, so the browser ignores the encoding given in the META tag.
The Apache installation under OpenBSD did not have this directive, so the META tag took effect there.
When I removed that nasty directive, everything worked on Debian too...
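
A simplified sketch of the precedence at work, in Perl: when the HTTP Content-Type header names a charset, that value is used and the META tag is not consulted. The function name effective_charset and the regexes are illustrative assumptions, not how any browser actually parses pages, and the iso-8859-1 value assumes Apache's built-in default for AddDefaultCharset on.

```perl
use strict;
use warnings;

# Hypothetical helper: pick the charset a browser would assume for a page.
sub effective_charset {
    my ($http_content_type, $html) = @_;

    # 1. A charset in the HTTP Content-Type header wins.
    if (defined $http_content_type
        && $http_content_type =~ /charset\s*=\s*"?([\w.:-]+)/i) {
        return lc $1;
    }

    # 2. Otherwise fall back to a META http-equiv declaration, if any.
    if ($html =~ /<meta[^>]+charset\s*=\s*"?([\w.:-]+)/i) {
        return lc $1;
    }

    return;    # nothing declared; the browser would have to guess
}

my $page = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

# Header carries a charset (the Debian setup): the META tag is overridden.
my $with_header = effective_charset('text/html; charset=iso-8859-1', $page);
print "$with_header\n";    # iso-8859-1

# Header carries no charset (the OpenBSD setup): the META tag is used.
my $meta_only = effective_charset('text/html', $page);
print "$meta_only\n";      # utf-8
```

With the page treated as ISO-8859-1, the Greek input cannot be represented in the form's encoding, which would explain the escaped characters seen on the Debian machines.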

In a practical sense there isn't one right answer - there are several
compromises, depending on which browsers support what. But none of
the details here are features of the Perl programming language,
AFAICS.

Now I know!

Thanks a lot.
