CGI.pm and special characters in hidden inputs

tsunami

Hello,

I use CGI.pm to parse forms, and I am running into issues with certain
special characters.

Say I have a form element, with a value of "Mom's House". It is a
hidden input, passed in from a previous page, so the HTML is something
like this:

<INPUT TYPE="hidden" NAME="location" VALUE="Mom&apos;s House">

I was given to understand that, for ' " > < and &, you need to use the
encoded value to denote the character when it appears in a tag. I know
this is the case for normal XML files, and the parsers take care of it.
However, CGI.pm's param() function does NOT seem to be interpreting
the special characters. In the CGI script that processes this form, I
would have:

$location = param('location');

and $location would be: "Mom&apos;s House". While I could, in this
instance, simply NOT encode the apostrophe and it would probably work,
if it were a double quote, I know it would break it. Any ideas?
Thanks!
 
Alan J. Flavell

> I use CGI.pm to parse forms, and I am running into issues

However, you don't appear to have a Perl problem...

> with certain special characters.

I'm afraid you've triggered a raw nerve there. Considering the many
thousands of Unicode characters which have been defined, what do you
suppose is so "special" about a us-ascii apostrophe?

> Say I have a form element, with a value of "Mom's House". It is a
> hidden input, passed in from a previous page, so the HTML is
> something like this:
>
> <INPUT TYPE="hidden" NAME="location" VALUE="Mom&apos;s House">

Could be...

> I was given to understand that, for ' " > < and &, you need to use
> the encoded value to denote the character when it appears in a tag.

Not exactly - for details consult a group with comp.infosystems.www...
in its name. But that's irrelevant, because the client agent has to
parse that. So it makes no difference which of the ways you choose to
represent your characters in the HTML source (the coded character
itself, its numerical character reference, or its character entity).
At submission time they're all the same.
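
To make that concrete, here is a minimal Perl sketch (not browser code, and not part of CGI.pm) of what a client agent does with an attribute value before submission: it resolves the character references first, then encodes the resulting characters for the request. The helper names decode_refs_sketch and form_encode_sketch are hypothetical, the table of named references is deliberately tiny, and whether a particular browser recognises a given named reference is a separate question.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tiny, illustrative table of named references; real browsers know many more.
my %named = (amp => '&', lt => '<', gt => '>', quot => '"', apos => "'");

# Hypothetical helper: resolve decimal, hexadecimal and a few named
# character references. Unknown names are left untouched.
sub decode_refs_sketch {
    my ($text) = @_;
    $text =~ s{&(?:#(\d+)|#[xX]([0-9a-fA-F]+)|(\w+));}{
        defined $1 ? chr($1)
      : defined $2 ? chr(hex $2)
      : exists $named{$3} ? $named{$3}
      : "&$3;"
    }ge;
    return $text;
}

# Hypothetical helper: percent-encode a value roughly the way an
# application/x-www-form-urlencoded submission would, with spaces as "+".
sub form_encode_sketch {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9_.\- ])/sprintf('%%%02X', ord $1)/ge;
    $value =~ tr/ /+/;
    return $value;
}

# The same value written three ways in the HTML source.
for my $source ("Mom's House", "Mom&#39;s House", "Mom&apos;s House") {
    my $value = decode_refs_sketch($source);
    print "$source => ", form_encode_sketch($value), "\n";
}
```

In this sketch all three spellings end up as the same submitted string, so a server-side param() call would have no way to tell which one the HTML source used.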

> However, CGI.pm's param() function does NOT seem to be interpreting
> the special characters.

What do you mean by "interpreting"?

> In the CGI script that processes this form, I would have:
>
> $location = param('location');
>
> and $location would be: "Mom&apos;s House"

It would??? Let's have a URL which demonstrates this behaviour!

But you're off-topic here. You'd be better on a WWW authoring group
(namely, comp.infosystems.www.authoring.cgi, but beware its
automoderation bot).
 
ioneabu

> Hello,
>
> I use CGI.pm to parse forms, and I am running into issues with certain
> special characters.
>
> Say I have a form element, with a value of "Mom's House". It is a
> hidden input, passed in from a previous page, so the HTML is something
> like this:
>
> <INPUT TYPE="hidden" NAME="location" VALUE="Mom&apos;s House">

print hidden(-name=>'location', -value=>"Mom's House");

Should work fine if you use CGI.pm like this.

> I was given to understand that, for ' " > < and &, you need to use the
> encoded value to denote the character when it appears in a tag. I know
> this is the case for normal XML files, and the parsers take care of it.
> However, CGI.pm's param() function does NOT seem to be interpreting
> the special characters. In the CGI script that processes this form, I
> would have:
>
> $location = param('location');
>
> and $location would be: "Mom&apos;s House" While I could, in this
> instance, simply NOT encode the apostrophe and it would probably work,
> if it were a double quote, I know it would break it. Any ideas?
> Thanks!

<quote>
AUTOESCAPING HTML
By default, all HTML that are emitted by the form-generating functions
are passed through a function called escapeHTML():
$escaped_string = escapeHTML("unescaped string");



Provided that you have specified a character set of ISO-8859-1 (the
default), the standard HTML escaping rules will be used. The "<"
character becomes "&lt;", ">" becomes "&gt;", "&" becomes "&amp;", and
the quote character becomes "&quot;". In addition, the hexadecimal 0x8b
and 0x9b characters, which many windows-based browsers interpret as the
left and right angle-bracket characters, are replaced by their numeric
HTML entities ("&#139" and "&#155"). If you manually change the
charset, either by calling the charset() method explicitly or by
passing a -charset argument to header(), then all characters will be
replaced by their numeric entities, since CGI.pm has no lookup table
for all the possible encodings.

Autoescaping does not apply to other HTML-generating functions, such as
h1(). You should call escapeHTML() yourself on any data that is passed
in from the outside, such as nasty text that people may enter into
guestbooks.

To change the character set, use charset(). To turn autoescaping off
completely, use autoescape():
$charset = charset([$charset]);  # Get or set the current character set.

$flag = autoEscape([$flag]);     # Get or set the value of the autoescape flag.
</quote>
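
As a rough illustration of the four replacements the quoted documentation lists, here is a minimal hand-written sketch. It is not CGI.pm's own code: escape_attr_sketch is a hypothetical name, and the sketch ignores the charset-dependent cases the documentation mentions. It only shows why a double quote inside a value need not break a double-quoted attribute.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for the four documented replacements.
sub escape_attr_sketch {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # ampersand first, so later entities survive
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;    # keeps the value inside its "..." delimiters
    return $text;
}

my $value = q{Mom's "big" House};
printf qq{<input type="hidden" name="location" value="%s" />\n},
    escape_attr_sketch($value);
```

The apostrophe is left alone here because the attribute is delimited with double quotes; only the embedded double quotes need a reference.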

Hope this helps.

wana
 
Gunnar Hjalmarsson

> <INPUT TYPE="hidden" NAME="location" VALUE="Mom&apos;s House">
>
> In the CGI script that processes this form, I would have:
>
> $location = param('location');
>
> and $location would be: "Mom&apos;s House"

No, it wouldn't. Before submission, that character entity would be
converted by the browser to "'", so you don't have the problem you think
you have. Try and see for yourself!
 
Matt Garrish

Gunnar Hjalmarsson said:
> No, it wouldn't. Before submission, that character entity would be
> converted by the browser to "'", so you don't have the problem you think
> you have. Try and see for yourself!

Huh? Did you test that yourself? I've never heard of a browser converting
entities in a hidden form field.

test.htm
------------------------------

<html>
<head>
<title></title>
</head>
<body>
<form name="test" action="/cgi-bin/test.cgi" method="post">
<input type="hidden" name="location" value="what&apos;s wrong with this?" />
<input type="submit" />
</form>
</body>
</html>



test.cgi
------------------

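#!/usr/bin/perl
use strict;
use warnings;
use CGI qw/param/;

# Read the submitted hidden field and echo it back as plain text.
my $location = param('location');

print "Content-type: text/plain\n\n";
print $location;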


Output:
 
ioneabu

> In what way is that quote related to the OP's concern?

For example, I put this in my Perl program using CGI.pm:

print textfield({name=>'Name', value=>"bob's"});

When I view source in my browser it looks like this:

<input type="text" name="Name" value="bob&#39;s" />

CGI.pm handled the HTML escaping automatically as promised in the
section I quoted. I think that's what he was asking about.

wana
 
Gunnar Hjalmarsson

Matt said:
> Huh? Did you test that yourself?

No.

> I've never heard of a browser converting entities in a hidden form field.


When running your code, I get:
what's wrong with this?

Hmm.. Guess Alan has to clarify again. :)
 
Gunnar Hjalmarsson

> For example, I put this in my Perl program using CGI.pm:
>
> print textfield({name=>'Name', value=>"bob's"});
>
> When I view source in my browser it looks like this:
>
> <input type="text" name="Name" value="bob&#39;s" />
>
> CGI.pm handled the HTML escaping automatically as promised in the
> section I quoted. I think that's what he was asking about.

CGI.pm converted the ' character to a character entity.

The OP had already a character entity, and I think he was asking about
how to get the original character back.
 
Gunnar Hjalmarsson

Gunnar said:
> No, it wouldn't. Before submission, that character entity would be
> converted by the browser to "'", so you don't have the problem you think
> you have. Try and see for yourself!

Matt's objection made me do some testing, and Firefox understands
"&apos;", while MSIE does not, which explains the confusion. (MSIE does
understand the other: "&quot;", "&lt;", "&gt;" and "&amp;".)

So use the entity number "&#39;" instead of "&apos;" to avoid problems.
 
Matt Garrish

Gunnar Hjalmarsson said:
> When running your code, I get:
> what's wrong with this?
>
> Hmm.. Guess Alan has to clarify again. :)

Something I've never considered before, if that is the case (not that I
spend a lot of time with web forms). I can see some benefit in automatically
converting the entities, however I don't think I'd ever want a browser
making that decision for me.

I'll see if I can find an explanation of this behaviour, even if it is
getting off topic...

Matt
 
Matt Garrish

Matt Garrish said:
> I'll see if I can find an explanation of this behaviour, even if it is
> getting off topic...

Alas, Google has let me down (or I can't find the right combination of
terms, at least). I still can't see much benefit in translating the entities
back to characters automatically when the form is submitted. The only
advantage would seem to be that it means transferring slightly less data on
the form submission. I suspect it has something to do with the attempts to
render entities as characters within visible form fields, but I would have
thought the hidden input type's value would be more along the lines of a
single-quoted string in Perl.

If someone has an official version of this behaviour, however, I'd be
interested in hearing what it is.

Matt
 
Peter Wyzl

: Gunnar Hjalmarsson wrote:
: > (e-mail address removed) wrote:
: >>
: >> <INPUT TYPE="hidden" NAME="location" VALUE="Mom&apos;s House">
: >
: > <snip>
: >
: >> In the CGI script that processes this form, I would have:
: >>
: >> $location = param('location');
: >>
: >> and $location would be: "Mom&apos;s House"
: >
: > No, it wouldn't. Before submission, that character entity would be
: > converted by the browser to "'", so you don't have the problem you think
: > you have. Try and see for yourself!
:
: Matt's objection made me do some testing, and Firefox understands
: "&apos;", while MSIE does not, which explains the confusion. (MSIE does
: understand the other: "&quot;", "&lt;", "&gt;" and "&amp;".)
:
: So use the entity number "&#39;" instead of "&apos;" to avoid problems.

Or, given that it's a hidden field, change its name to something which
avoids special characters, and only use those where you need to deal with
displays.
 
Alan J. Flavell

> Matt's objection made me do some testing, and Firefox understands
> "&apos;", while MSIE does not, which explains the confusion. (MSIE
> does understand the other: "&quot;", "&lt;", "&gt;" and "&amp;".)

and said:
> Hmm.. Guess Alan has to clarify again. :)

Oops. I must admit that for the moment, I forgot this twilight
position of the &apos; character entity. I guess I wasn't properly in
the mood for off-topic details :-}

Thanks for supplying the missing piece. Although &apos; is fairly
widely supported by browsers, for some reason it doesn't seem to be
included in the list of character entities defined in W3C HTML
specifications. So I guess it shouldn't really be used in a WWW
context.

> So use the entity number "&#39;" instead of "&apos;" to avoid
> problems.

That's true; but I think it's fair to say that it can always be
avoided. If an attribute value contains both " and ' characters, then
it can be enclosed in "...", and the included " characters represented
as &quot; (which -is- in the HTML/4.01 and HTML/2.0 specifications,
and is supported by pretty-much any browser, although it seems to have
been accidentally omitted from HTML/3.2): the ASCII apostrophes can
then be included literally.

Did/does CGI.pm really emit &apos; on its own initiative? Or was this
something that the hon. Usenaut had done deliberately?
 
Gunnar Hjalmarsson

Alan said:
> Did/does CGI.pm really emit &apos; on its own initiative?

No, the escapeHTML() method in CGI.pm replaces ' with &#39; (but only
when the charset is ISO-8859-1 or WINDOWS-1252, if I understand it
correctly).
 
Alan J. Flavell

> No, the escapeHTML() method in CGI.pm replaces ' with &#39;

So it does. I'm sorry, I realise now that I should have taken the
time to look before posting...

> (but only when the charset is ISO-8859-1 or WINDOWS-1252, if I
> understand it correctly).

You do - pasting from the version of CGI.pm that I happen to have to
hand:

| my $latin = uc $self->{'.charset'} eq 'ISO-8859-1' ||
|             uc $self->{'.charset'} eq 'WINDOWS-1252';
| if ($latin) { # bug in some browsers
|     $toencode =~ s{'}{&#39;}gso;
|     $toencode =~ s{\x8b}{&#8249;}gso;
|     $toencode =~ s{\x9b}{&#8250;}gso;

But what you omitted to mention was that comment. There is *no
theoretical need* for that code: it's meant to work-around bugs in
specific browsers (probably now outdated, but the workarounds are
harmless to properly-behaved client agents, so there's no particular
need to remove the workarounds).

I distinctly remember the (security-relevant!) bug which the \x8b and
\x9b workarounds are meant to address, and tests confirmed that the
bug indeed seemed to be confined to documents coded in those specific
character encodings; but I must confess I'm not exactly familiar with
the one which prompted L.S to reformulate the apostrophe character.
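
For anyone who wants to see the effect of that excerpt in isolation, here is a small standalone sketch of the same three substitutions. It is not the module's code: latin_workaround_sketch is a hypothetical name, and the charset is passed as a plain argument instead of being read from a CGI object.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the three substitutions from the excerpt.
sub latin_workaround_sketch {
    my ($toencode, $charset) = @_;

    # The workaround only applies to these two charsets.
    my $latin = uc $charset eq 'ISO-8859-1'
             || uc $charset eq 'WINDOWS-1252';

    if ($latin) {
        $toencode =~ s{'}{&#39;}g;        # apostrophe as a numeric reference
        $toencode =~ s{\x8b}{&#8249;}g;   # byte some browsers treated as "<"
        $toencode =~ s{\x9b}{&#8250;}g;   # byte some browsers treated as ">"
    }
    return $toencode;
}

print latin_workaround_sketch("bob's", 'iso-8859-1'), "\n";
print latin_workaround_sketch("bob's", 'utf-8'), "\n";
```

With a Latin charset the apostrophe comes out as a numeric reference; with any other charset the sketch leaves the string as it was.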
 
