How to get UTF-8 from a urlencoded web form

Yohan N. Leder · Jul 15, 2006

Hello.

All my tests are done using ActivePerl 5.8.8.817 under Win2K FR and
Apache2.

I'm trying to obtain (and display) user data which comes from a web form
with enctype 'application/x-www-form-urlencoded', without success. I
can do it if the form is 'multipart/form-data', but not if it is
'application/x-www-form-urlencoded'.

Here is a script to show the difference :

---- BEGIN ----
#!/usr/bin/perl -w
my $this = "utf8_and_webform.pl";

require 5.8.0;
use utf8;
binmode(STDOUT, ':utf8');
print "Content-type: text/html; charset=UTF-8\n\n";
if (defined $ENV{'QUERY_STRING'} && length($ENV{'QUERY_STRING'}) > 0)
{&see;}
else {&ask;}
exit 0;

sub ask
{ # provide web forms for user to enter data
print <<PAGE
<html><head><title>Test about UTF-8 and web form</title></head><body>
Use the form you want and see the resulting data.
<p>
FORM with enctype as 'application/x-www-form-urlencoded' :<br>
<form action='$this?x' method='post' accept-charset='UTF-8'
enctype='application/x-www-form-urlencoded'>
<textarea name='msg' rows='4' cols='30' wrap='virtual'></textarea>
<input type='submit' value='send'>
</form></p>
<p>
FORM with enctype as 'multipart/form-data' :<br>
<form action='$this?x' method='post' accept-charset='UTF-8'
enctype='multipart/form-data'>
<textarea name='msg' rows='4' cols='30' wrap='virtual'></textarea>
<input type='submit' value='send'>
</form></p>
</body></html>
PAGE

}

sub see
{ # display data which come from user form
my $data='';

binmode(STDIN, ':utf8'); # or ':encoding('UTF-8')'
read(STDIN, $data, $ENV{'CONTENT_LENGTH'});

# OR
#use Encode qw(decode);
#read(STDIN, $data, $ENV{'CONTENT_LENGTH'});
#$data = decode('UTF-8', $data);

print $data;

}
----- END ----

For example, if I submit the 'urlencoded' form (the first one, at the
top of the generated web page, if you run the script without any URL
parameter) with the letter 'é' (accented e) inside the textarea, I get
'msg=%C3%A9' displayed in the browser (after it has been processed by
the see() sub).

Whereas if I submit the same 'é' from the 'multipart/form-data' form
(the second one, at the bottom of the generated web page), I get a
correctly interpreted UTF-8 'é' as expected.

How can I get this same UTF-8 'é' when the form uses the
'application/x-www-form-urlencoded' enctype? How should I modify the
see() sub for this urlencoded form case?

Gunnar Hjalmarsson · Jul 15, 2006

Yohan said:
if I submit the 'urlencoded' form (the first one, at top of
generated web page, if you run the script without any url parameter)
with the letter 'é' (accentuated e) inside the textarea, I get 'msg=%C3%
A9' displayed in the browser (knowing this has been proceeded through
the see() sub).

While, if I submit the same 'é' from the 'multipart/form-data' form (the
second one, at bottom of generated web page), I get a well interpreted
UTF-8 'é' as expected.

How to get this same UTF-8 'é' when form uses 'application/x-www-form-
urlencoded' enctype ?

The problem is covered by this FAQ entry:
http://faq.perl.org/perlfaq9.html#How_do_I_decode_a_CG

Yohan N. Leder · Jul 15, 2006

The problem is covered by this FAQ entry:
http://faq.perl.org/perlfaq9.html#How_do_I_decode_a_CG

It doesn't explain the problem, it just makes it go away by using
CGI.pm, and I would like to understand the problem.

Gunnar Hjalmarsson · Jul 15, 2006

Yohan said:
It doesn't explain the problem, but remove the problem using CGI.pm, and
I would like to understand the problem.

Excellent learning approach.

The browser automatically URI-escapes 'unsafe' characters when you make
a GET or an x-www-form-urlencoded POST request. Hence those characters
need to be unescaped on the server side, by the program that handles
the request. CGI.pm, as well as other modules for parsing CGI data,
takes care of that.

You can study the docs for the Perl module URI::Escape for a better
explanation.
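
For what it's worth, the unescaping can also be done by hand, which
shows why decoding STDIN as UTF-8 alone changes nothing: the request
body 'msg=%C3%A9' is plain ASCII until the %XX sequences are turned back
into bytes. A minimal sketch, assuming the body really is UTF-8
underneath the percent-encoding; parse_urlencoded is a made-up helper
name, not something provided by CGI.pm or URI::Escape:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn an urlencoded body into a hash of
# character strings, assuming the underlying bytes are UTF-8.
sub parse_urlencoded {
    my ($body) = @_;
    my %param;
    for my $pair (split /[&;]/, $body) {
        next unless length $pair;
        my ($name, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;
        for ($name, $value) {
            tr/+/ /;                              # '+' stands for a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;  # %XX -> one raw byte
            $_ = decode('UTF-8', $_);             # UTF-8 bytes -> characters
        }
        $param{$name} = $value;
    }
    return \%param;
}

my $param = parse_urlencoded('msg=%C3%A9');
printf "U+%04X\n", ord $param->{msg};    # prints U+00E9
```

In the see() sub this would mean reading the body from STDIN as raw
bytes (without the ':utf8' layer) and passing it to the helper. For
anything beyond experiments, CGI.pm or a similar module is probably the
safer choice, since it handles corner cases this sketch ignores.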

I suppose you should also read up on the HTTP protocol.

HTH

Bart Van der Donck · Jul 16, 2006

Yohan said:
All my tests are done using ActivePerl 5.8.8.817 under Win2K FR and
Apache2.

I'm trying to obtain (and display) user data which come from a web form
with enctype as 'application/x-www-form-urlencoded' and don't succeed. I
can do-it if the form is a 'multipart/form-data' but not a
'application/x-www-form-urlencoded'.

[snip code ]

For example, if I submit the 'urlencoded' form (the first one, at top of
generated web page, if you run the script without any url parameter)
with the letter 'é' (accentuated e) inside the textarea, I get 'msg=%C3%
A9' displayed in the browser (knowing this has been proceeded through
the see() sub).

While, if I submit the same 'é' from the 'multipart/form-data' form (the
second one, at bottom of generated web page), I get a well interpreted
UTF-8 'é' as expected.

How to get this same UTF-8 'é' when form uses 'application/x-www-form-
urlencoded' enctype ? How to modify the see() sub for this urlencoded
form case ?

That shouldn't be particularly mysterious. You're specifying the page's
charset as UTF-8 in its header (where you say "Content-type: text/html;
charset=UTF-8"), causing the 'é' character to be sent as its UTF-8 byte
sequence, which looks like the literal 'Ã©' when read as ISO-8859-1.
'é' is U+00E9 (dec 233/hex E9/eacute/LATIN SMALL LETTER E WITH ACUTE),
and its UTF-8 encoding is the two bytes C3 A9. In ISO-8859-1, C3 is the
code for Ã and A9 the code for ©, thus the expected value becomes
%C3%A9.
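
To see this at the byte level, here is a small sketch using the core
Encode module (the variable names are only illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $e_acute = chr(0xE9);                   # 'é', U+00E9
my $bytes   = encode('UTF-8', $e_acute);   # its UTF-8 encoding, two bytes
my $hex     = join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
print "UTF-8 bytes: $hex\n";               # prints UTF-8 bytes: C3 A9

# Reading the same two bytes as ISO-8859-1 yields one character per
# byte, which is where the 'Ã©' comes from.
my $misread = decode('ISO-8859-1', $bytes);
printf "misread as: U+%04X U+%04X\n", map { ord($_) } split //, $misread;
```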

Encoding é -> Ã© -> %C3%A9 :

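#!/usr/bin/perl -w
# Echo the raw (still URL-encoded) POST body above the form.
# Read exactly CONTENT_LENGTH bytes; empty on the first (GET) request.
my $posteddata = '';
read(STDIN, $posteddata, $ENV{'CONTENT_LENGTH'} || 0);
print <<PAGE
Content-type: text/html; charset=UTF-8

<html><body>
Posted data: $posteddata<hr>
<form action='f.pl' method='post'>
<textarea name='msg'></textarea>
<input type='submit'>
</form></body></html>
PAGE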

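Whereas the "normal" form encoding, for a page without a declared UTF-8
charset (browsers typically fall back to ISO-8859-1 or Windows-1252),
would be é -> %E9: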

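#!/usr/bin/perl -w
# Same script, but without a charset in the Content-type header.
# Read exactly CONTENT_LENGTH bytes; empty on the first (GET) request.
my $posteddata = '';
read(STDIN, $posteddata, $ENV{'CONTENT_LENGTH'} || 0);
print <<PAGE
Content-type: text/html

<html><body>
Posted data: $posteddata<hr>
<form action='f.pl' method='post'>
<textarea name='msg'></textarea>
<input type='submit'>
</form></body></html>
PAGE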

P.S. 'application/x-www-form-urlencoded' is the default form encoding
type anyhow, so there is actually no need to set this as a form
argument.

Recommended literature:
http://home.tiscali.nl/t876506/utf8tbl.html (search for string C3A9 on
that page)
Table CPs < 256: http://en.wikipedia.org/wiki/ISO_8859-1
And of course Perl FAQ/docs, as Gunnar pointed out.

Yohan N. Leder · Jul 16, 2006

Excellent learning approach.

Thanks. It's better than treating everything as an eternally mysterious
black box in my mind.

The browser automatically URI escapes 'unsafe' characters when you make
a GET or an x-www-form-urlencoded POST request. Hence those characters
need to be unescaped by the web server. CGI.pm as well as other modules
for parsing CGI data takes care of that.

Hm, understood !

You can study the docs for the Perl module URI::Escape for a better
explanation.

I'll do it for sure ;-)

Yohan N. Leder · Jul 16, 2006

That shouldn't be particularly mysterious. You're specifying the page's
charset as UTF-8 in its header (where you say "Content-type: text/html;
charset=UTF-8"), causing the 'é'- character to be sent as Unicode's
literal 'Ã©'

That is effectively what I want. However, Gunnar's explanation shows the
key to the problem: URI escaping when the form's enctype is *urlencoded*.

Bart Van der Donck · Jul 16, 2006

Yohan said:
That is effectively what I want. However, Gunnar's explanation shows the
key to the problem: URI escaping when the form's enctype is *urlencoded*.

Yes, the URL encoding is done on the browser's side by default, before
the name/value pairs are sent. This behaviour can be altered by adding
enctype="multipart/form-data" as an extra attribute to
<form method="post">. The main reason for this feature to exist is the
transfer of (binary) files to the gateway software on the server.
Thus, if you want to send 'é' from a page in ISO-8859-1, the browser
will pass it as "%E9" by default. It's up to your Perl script to decode
it back to 'é'. In the multipart/form-data encoding type, 'é' is just
passed as 'é', that is, as raw bytes in the form's charset. When the
form is submitted as UTF-8, the browser first encodes 'é' as the UTF-8
bytes C3 A9, and then passes the URL-encoded value of those bytes,
%C3%A9.

Bart Van der Donck · Jul 16, 2006

Gunnar said:
[...]
The browser automatically URI escapes 'unsafe' characters when you make
a GET or an x-www-form-urlencoded POST request. Hence those characters
need to be unescaped by the web server. CGI.pm as well as other modules
for parsing CGI data takes care of that.

#!/usr/bin/pedant
I think the correct terminology is actually URL-encoding here (or
percent-encoding) instead of URI-escaping
(http://en.wikipedia.org/wiki/URL_encoding).

Bart Van der Donck · Jul 16, 2006

A. Sinan Unur said:
[...]
Escaping is a general method of changing the meaning of the characters
following a designated special character. In this case, % is the special
character, and it changes the meaning of the characters following it.
Characters not allowed in URIs are replaced with these escape sequences.

Yes, but escaping would then only refer to the %-sign, not to what
follows. In '%E9', '%' is the escape character and 'E9' the encoded
value of 'é'. E9 has nothing to do with escaping; otherwise it would
have been %é (or \é).

So I think we're both 50% right here

Yohan N. Leder · Jul 16, 2006

Yes, the URL encoding is done at the browser's side by default, before
and apart from the sendout of the name/value pairs. This behaviour can
be altered by adding enctype="multipart/form-data" as an extra argument
to <form method="post">. The main reason for this feature to exist, is
the transfer of (binary) files to the gateway software on the server.
Thus, if you want to send 'é', the browser will pass it as "%E9" by
default. It's up to your Perl script to decode it back to 'é'. In the
multipart/form-data encoding type, 'é' is just passed as 'é'. In
UTF-8 sets, the browser looks for the literal equivalent of 'é', and
then passes the URL-encoded value of that literal equivalent.

Well understood, Bart. Thanks

Bart Van der Donck · Jul 16, 2006

A. Sinan Unur said:
Not really. In Perl, \n is the "escape sequence" for the platform
dependent end-of-line character.

While that is absolutely true, E9 is still an encoded value of é. It
might or might not serve inside a notation that uses a designated
escape character.

Clearly, 'n' in that escape sequence is just like the E9 above.

There is no encoding involved in \n, but there is when you write é as
%E9

Bart Van der Donck · Jul 16, 2006

A. Sinan Unur said:
That does not make sense. 'n' all by itself is not the end of line
character anywhere. In the realm of Perl's interpolated strings, the
letter 'n' that follows '\' is the encoding of the EOL.

You can't compare (n vs. \n) to (é vs. %E9). There is simply no
relation between the characters that are represented by "n" and "\n".
There is no encoding or conversion or whatever.

The idea is totally different when _going_from_ é to %E9. You have a
clear encoding algorithm there that takes its data from some code
table. There is no "going from" involved in "n" versus "\n".

é to %E9 consists of 2 parts:
(1) Encode from é to E9
(2) Put a % before E9 to make clear it's an escape sequence
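
The two steps can be sketched in Perl as follows. percent_encode is a
hypothetical helper written for this post, and it simplifies real form
encoding (for instance, a browser sends a space as '+'):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper; $charset is any encoding name known to Encode.
sub percent_encode {
    my ($text, $charset) = @_;
    # Step 1: encode the character(s) into byte(s) of the chosen charset.
    my $bytes = encode($charset, $text);
    # Step 2: write every byte outside the unreserved set as '%' + hex.
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

print percent_encode(chr(0xE9), 'ISO-8859-1'), "\n";   # prints %E9
print percent_encode(chr(0xE9), 'UTF-8'), "\n";        # prints %C3%A9
```

Step (1) is where the charset matters: the same 'é' becomes one byte in
ISO-8859-1 and two bytes in UTF-8, and step (2) is identical in both
cases.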

Gunnar Hjalmarsson · Jul 16, 2006

Bart said:
#!/usr/bin/pedant
I think the correct terminology is actually URL-encoding here (or
percent-encoding) instead of URI-escaping
(http://en.wikipedia.org/wiki/URL_encoding).

I simply chose to use the term from URI::Escape. My English isn't good
enough for arguing about it. ;-)

Dr.Ruud · Jul 16, 2006

Bart Van der Donck wrote:

You can't compare (n vs. \n) to (é vs. %E9).

You can compare (<LF> vs. "\n") to (<é> vs. "%E9"). Often, such
translations use a table in memory.

Bart Van der Donck · Jul 16, 2006

Dr.Ruud said:
Bart Van der Donck wrote:

You can compare (<LF> vs. "\n") to (<é> vs. "%E9"). Often, such
translations use a table in memory.

Yes, exactly, like:
LF -> %0A
é -> %E9

<LF> refers to hex 0A by definition, but I'm not sure whether "\n"
always refers to hex 0A on various operating systems.

Dr.Ruud · Jul 16, 2006

Bart Van der Donck wrote:

Dr.Ruud:

Yes, exactly, like:
LF -> %0A
é -> %E9

<LF> refers to hex 0A by definition,

Yes, but when going the other way around, "%E9" can be translated to <é>
(if that's what the character is at position 0xE9 in the current
charset), and "\n" to LF (or CR or CRLF or whatever, depending on the
platform). That the "%E9" and 0xE9 look a lot alike, and "\n" and 0x0A
don't, doesn't really matter.

If the current charset is UTF-8, the "%E9" is translated to a specific
multibyte sequence. In this context of escape characters, you could say
that UTF-8 has an escape bit.
If the current charset is ASCII, the "%E9" might be translated to "?" or
"e", or "'e" or "e'" or whatever is feasible.

but I'm not sure whether "\n"
always refers to hex 0A on various operating systems.

It doesn't, but that doesn't matter. The escape-character, at the start
of the escape sequence, brings up a special mode, that just eats the
escape-sequence and inserts/returns an equally or more specific
translation.
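
A rough sketch of that reverse direction, assuming the sender used
ISO-8859-1 (that assumption is mine; the escape sequence itself does not
say which charset was meant):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $escaped = '%E9';

# Eat the escape sequence and put the raw byte 0xE9 in its place.
(my $byte = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# Interpret that byte in the assumed source charset.
my $char = decode('ISO-8859-1', $byte);     # U+00E9, 'é'

# Re-encode the character for two different target charsets.
my $utf8  = encode('UTF-8', $char);         # a multibyte sequence
my $ascii = encode('ascii', $char);         # no 'é' in ASCII, so Encode
                                            # substitutes a placeholder

print 'UTF-8: ', join(' ', map { sprintf('%02X', ord($_)) } split //, $utf8), "\n";
print "ASCII: $ascii\n";
```

With Encode's default settings the ASCII result is a substitution
character rather than an error; a stricter CHECK argument would make
encode() die instead.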

robic0 · Jul 17, 2006

[ You should not snip attributions. You create the impression that I
wrote something I did not. That is not nice. ]

[ You made the statement above. I did not. ]

You can't compare (n vs. \n) to (é vs. %E9). There is simply no
relation between the characters that are represented by "n" and "\n".
There is no encoding or conversion or whatever.

This is absurd. "\n" is an encoding of EOL.

Since you like quoting Wikipedia:
<http://en.wikipedia.org/wiki/Encoding>

<blockquote>
Encoding is the process of transforming information from one format into
another. The opposite operation is called decoding.
</blockquote>

Any mapping of one thing on to another is an encoding.

This is getting tedious. I am out of this thread.

table. There is no "going from" involved in "n" versus "\n".

Sinan

The definition of "encoding" from Wikipedia is a broad definition.
For example, encoding/decoding digital media, say MPEG-2, involves a
linear language to, on the encoding side, take out redundancy in
adjacent frames, then reconstruct the full frames on the decoding side.
The JPEG compression layer is on top of the MPEG layer. Finally a full
bitmap frame. This is by default the Wikipedia definition. This is macro.

In the case of a single document character, it's an either/or argument.
Either the character is escaped, in which case it loses its binary form,
or it's not, in which case it retains it.

There's a "reason" why certain Unicode characters are reserved as control
codes. Here is the reason: THE DATA TRANSFERRED ON ALL COMPUTERS ARE BINARY.
There is no spoon Neo, there is no spoon.

I guess in that sense, all files read/written are encoded/decoded.
The formula for encoding is the same as decoding. Think of it as a train
of boxcars. The decoder waits for the marker cars, then grabs the next
series of cars as its formula requires. Those cars grabbed are then sent
on to the next decoder (possibly the same one) where finite smaller cars
are extracted. The iteration continues (possibly several times).

All for what? The first thing done was to standardize on how many bytes in a Unicode char
(there may be small/big). Then the encoding, but what is an encoding statement?
Why, it's nothing more than an offset into a binary blob of data that contains the
"character bitmap" to display. Of course, displaying a different encoding's bitmapped characters
doesn't translate into a language translator. The only thing the Rosetta Stone did was
to connect alphabet characters across languages, i.e. it didn't do language translation.

Back to the larger issue of html/xml encoding/decoding, and I don't know what the
OP had in mind (looks like html/xml embedded URLs) but he would seem to have to
know ahead of time how many times a parser will iterate over his data, and how many
times and with what he needs to escape it.

The idea of an argument on encoding/decoding as it relates to character sets is just
absurd! Encoding/decoding was not only done on the very first computer display device
but is a concept over 4,000 years old!

Don't make this extremely simple concept out to be something just invented.

robic0
-get ur act together

John W. Kennedy · Jul 17, 2006

Bart said:
Yes, exactly, like:
LF -> %0A
é -> %E9

<LF> refers to hex 0A by definition, but I'm not sure whether "\n"
always refers to hex 0A on various operating systems.

"\n" is 0x0A on all systems.

However, when writing a file in text mode, 0x0A may be translated on the
file into an 0x0D, an 0x0D0A, or (on an IBM mainframe) a logical record end.
Similarly, an 0x0D, an 0x0D0A, or a logical record end can be translated
into an 0x0A when reading a file in text mode.

Add the extra zeroes for Unicode as needed.
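
A quick way to check this on a given system is a sketch like the
following, which writes through Perl's ':crlf' layer and reads the file
back raw. The temporary file and the variable names are only for
illustration:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# What "\n" is inside the program, before any I/O translation.
printf "%s = 0x%02X\n", 'ord("\n")', ord("\n");

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write in "text mode" with an explicit CRLF translation layer.
open(my $out, '>:crlf', $path) or die "open for write: $!";
print {$out} "a\nb\n";
close $out or die "close: $!";

# Read the file back without any translation to see the stored bytes.
open(my $in, '<:raw', $path) or die "open for read: $!";
my $bytes = do { local $/; <$in> };
close $in;

# With the :crlf layer in effect this should show 61 0D 0A 62 0D 0A.
print join(' ', map { sprintf('%02X', ord($_)) } split //, $bytes), "\n";
```

The first line of output shows the in-memory value of "\n" on the
platform where the script runs, so it reports rather than assumes it.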

John Bokma · Jul 17, 2006

John W. Kennedy said:
"\n" is 0x0A on all systems.

According to the table on p 161-162 of Programming Perl:

"Match the newline character (usually NL, but CR on Macs)."

I doubt that the book talks about file level here, but I have no Mac to
test this :-D.
