Differences in UTF-8 html form inputs

Realbot

Hi,

I'm having some problems with a web application of mine.
To make things clearer, here is an HTML input form that demonstrates the problem.
It submits two strings, one via GET and one via POST, and it uses HTML::Mason.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Test utf</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<form name="formutfget" method="GET">
Enter text (get):<br>
<input type="text" name="textget" size="20" maxlength="30">
</form>
<form name="formutfpost" method="POST">
Enter text (post):<br>
<input type="text" name="textpost" size="20" maxlength="30">
</form>
Value of GET: <% $textget %><br>
Hex of GET: <% $hexget %><br>
Value of POST: <% $textpost %><br>
Hex of POST: <% $hexpost %><br>
</body>
</html>
<%args>
$textget => ''
$textpost => ''
$hexget => ''
$hexpost => ''
</%args>
<%init>
$hexget = unpack('H*', $textget);
$hexpost = unpack('H*', $textpost);
</%init>

The strange thing is that when I run this form under these environments

Debian Woody - perl 5.6.1 - Mozilla 1.4.3/Firefox 1.0
Debian Sid - perl 5.8.4 - Mozilla 1.4.3/Firefox 1.0

with the input string "Δωδεκανήσων" (I don't know what it means btw...), I get this output:

Value of GET: Δωδεκανήσων
Hex of GET: 26233931363b26233936393b26233934383b26233934393b26233935343b26233934353b26233935373b26233934323b26233936333b26233936393b26233935373b
Value of POST: Δωδεκανήσων
Hex of POST: 26233931363b26233936393b26233934383b26233934393b26233935343b26233934353b26233935373b26233934323b26233936333b26233936393b26233935373b

while in OpenBSD - perl 5.8.0 - Mozilla 1.4.3/Firefox 1.0 with the same input string I get

Value of GET: Δωδεκανήσων
Hex of GET: ce94cf89ceb4ceb5cebaceb1cebdceaecf83cf89cebd
Value of POST: Δωδεκανήσων
Hex of POST: ce94cf89ceb4ceb5cebaceb1cebdceaecf83cf89cebd

So, it seems that in the former I get escaped Unicode characters (decimal numeric character references such as &#916;) and in the latter raw UTF-8 bytes.
I thought that it could be a 5.6 vs 5.8 difference but as you can see even under Debian Sid I got the same unicode chars.
Could it be an OpenBSD peculiarity? I've Googled but with no luck, maybe someone can shed some light on it...
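
As a quick check of what the two dumps actually contain, the hex can be turned back into bytes. This is only a sketch: the variable names are made up for illustration, and only the first character of each dump is used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# First 12 hex digits of the Debian dump and first 4 of the OpenBSD dump,
# copied from the output above (variable names are illustrative only).
my $debian_hex  = '26233931363b';
my $openbsd_hex = 'ce94';

# pack('H*') is the inverse of unpack('H*'): it turns hex back into bytes.
my $debian_bytes  = pack('H*', $debian_hex);
my $openbsd_bytes = pack('H*', $openbsd_hex);

# The Debian bytes are plain ASCII: a numeric character reference.
print "Debian:  $debian_bytes\n";

# The OpenBSD bytes are UTF-8; decode them and report the code point.
my $char = decode('UTF-8', $openbsd_bytes);
printf "OpenBSD: U+%04X (decimal %d)\n", ord($char), ord($char);
```

Both dumps describe the same character, code point 916 (U+0394): once as the ASCII text of a character reference, once as two UTF-8 bytes.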

Thanks!
 
Chris Mattern

Realbot wrote:

using as input the string "Δωδεκανήσων" (I don't know what it means
btw...),

"Dodecahedron"--i.e., a solid shape with 12 faces. If you're a gamer
who owns "funny dice", your 12-sided dice are dodecahedrons (or, if
you prefer, dodecahedra).
--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"
 
Alan J. Flavell

I'm having some problems with a web application of mine.

Forms submission including characters outside of us-ascii is
non-trivial, and isn't in itself a Perl problem.

OT: commentary of mine at
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

Until one can get that part sorted out to one's satisfaction, any
fiddling around that one might do in one's Perl script would be a bit
pointless, IMHO. And discussion of the web part would be more at home
on comp.infosystems.www.authoring.cgi (beware the automoderation bot).

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

If we assume that the page itself is really coded in utf-8 (note that
in the event of a dispute, the server's actual HTTP Content-type
header wins over anything that you might secrete in a meta
http-equiv), then you can expect current browsers to submit
utf-8-encoded form data. But not-quite-so-new browsers - even some
which support utf-8 display - get utf-8 forms submission sadly wrong.
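
The precedence described here can be sketched in a few lines of Perl. Both `charset_of` and `effective_charset` are hypothetical helpers written for this illustration, not part of any library, and the ISO-8859-1 header value is just an assumed example of a server-supplied charset.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns the charset in lower case, or undef if there is none.
sub charset_of {
    my ($content_type) = @_;
    return undef unless defined $content_type;
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i) {
        return lc($1);
    }
    return undef;
}

# Hypothetical helper: the charset in the real HTTP header wins;
# the meta http-equiv value is used only when the header names none.
sub effective_charset {
    my ($http_header, $meta_content) = @_;
    my $from_http = charset_of($http_header);
    return defined $from_http ? $from_http : charset_of($meta_content);
}

# Header names a charset: the meta value is overridden.
print effective_charset('text/html; charset=iso-8859-1',
                        'text/html; charset=UTF-8'), "\n";

# Header names no charset: the meta value takes effect.
print effective_charset('text/html', 'text/html; charset=UTF-8'), "\n";
```

The first call prints `iso-8859-1` and the second `utf-8`, which is why a page can declare UTF-8 in its markup and still be treated as something else.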

<form name="formutfget" method="GET">

In -theory- the method GET supports nothing better than the us-ascii
character coding. But see my commentary for further discussion.
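
To see why GET is limited to us-ascii on the wire, here is a minimal sketch of how a UTF-8-submitting browser would percent-encode the first letter of the test string for a query string. `percent_encode_utf8` is a hypothetical helper using only the core Encode module; real browsers differ in which characters they leave unescaped.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a character string as UTF-8 bytes,
# leaving only the RFC 3986 unreserved characters untouched.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# "\x{0394}" is GREEK CAPITAL LETTER DELTA, the first letter of the test string.
print 'textget=', percent_encode_utf8("\x{0394}"), "\n";   # textget=%CE%94
```

The URL itself stays pure ASCII; the non-ASCII character only survives as escaped bytes, and the server has to know (or guess) that those bytes are UTF-8.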

The strange thing is that running this form under these environments [...]

So, it seems that in the former I get escaped unicode character and
in the latter UTF-8 ones.

It looks as if somebody is trying to ape the misbegotten behaviour of
MSIE.

In a practical sense there isn't one right answer - there are several
compromises, depending on which browsers support what. But none of
the details here are features of the Perl programming language,
AFAICS.
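
One possible compromise on the receiving side, offered only as a sketch and not as a recommendation from the commentary above, is to accept both submission styles and normalise them to the same Perl character string. `normalize_form_value` is a hypothetical helper; it assumes the raw value is either UTF-8 bytes or ASCII containing decimal character references.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a raw form value into a Perl character string,
# whichever of the two submission styles the browser used.
sub normalize_form_value {
    my ($raw) = @_;
    # Treat the raw bytes as UTF-8; pure ASCII input passes through unchanged.
    my $text = decode('UTF-8', $raw);
    # Expand decimal numeric character references such as &#916;.
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

# First character of each hex dump from the original post.
my $from_refs  = normalize_form_value(pack('H*', '26233931363b'));
my $from_utf8  = normalize_form_value(pack('H*', 'ce94'));
print $from_refs eq $from_utf8 ? "same\n" : "different\n";
```

The catch is ambiguity: a user who really typed the seven characters `&#916;` is indistinguishable from a browser that escaped a Delta, which is one reason there is no single right answer.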

good luck
 
Realbot

Alan said:
On Sat, 8 Jan 2005, Realbot wrote:

Forms submission including characters outside of us-ascii is
non-trivial, and isn't in itself a Perl problem.

OT: commentary of mine at
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

I read it avidly before posting; it is very well written.

Until one can get that part sorted out to one's satisfaction, any
fiddling around that one might do in one's Perl script would be a bit
pointless, IMHO. And discussion of the web part would be more at home
on comp.infosystems.www.authoring.cgi (beware the automoderation bot).




If we assume that the page itself is really coded in utf-8 (note that
in the event of a dispute, the server's actual HTTP Content-type
header wins over anything that you might secrete in a meta
http-equiv), then you can expect current browsers to submit
utf-8-encoded form data.

I found out that this was exactly the problem. Apache as installed on all Debian versions is configured with

AddDefaultCharset on

which makes Apache add a charset parameter (ISO-8859-1 unless another value is given) to the HTTP Content-Type header, so the browser ignores the encoding given in the META tag and always uses the default encoding.
In the Apache installation under OpenBSD the directive was not present, so the META encoding was honoured.
When I removed that nasty directive everything worked on Debian too...

In a practical sense there isn't one right answer - there are several
compromises, depending on which browsers support what. But none of
the details here are features of the Perl programming language,
AFAICS.

Now I know! :)

Thanks a lot.
 
