html form metacharacters?

Dfenestr8

Hi.

I've written a cgi script that puts funny looking links to other cgi
scripts on a page, like:

<a href = "pptopic.py?what does "better than a turnip" mean?">what does
"better than a turnip" mean?"</a>

Unfortunately, of course, ? and " are both metacharacters with special
meaning in HTML. And this of course completely screws up the execution of
my script.

I just need to write some regular expressions which substitute these
metacharacters with character combinations a browser better understands,
such as the character combinations on this page:
http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php

But I'd like a list of what ALL the metacharacters used on forms are. Does
anyone know of such a list?
 
Toby Inkster

Dfenestr8 said:
<a href = "pptopic.py?what does "better than a turnip" mean?">what does
"better than a turnip" mean?"</a>

You need to URL-escape them. e.g. " becomes %22. See RFC 1738 for an idea
of which characters need to be escaped, and which don't. See "man ascii"
for the hex codes you need.
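
In Python, which the script name suggests, the escaping does not have to be done by hand. A minimal sketch, assuming Python 3's standard library; make_link is a hypothetical helper name, not something from the original script:

```python
import html
from urllib.parse import quote


def make_link(script, question):
    """Build an <a> element whose query string is URL-escaped."""
    # quote() percent-encodes characters that are unsafe in a URL;
    # safe="" also encodes "/", which quote() leaves alone by default.
    href = script + "?" + quote(question, safe="")
    # html.escape() guards the attribute value and the link text
    # against &, <, > and quotation marks.
    return '<a href="{}">{}</a>'.format(html.escape(href), html.escape(question))


print(make_link("pptopic.py", 'what does "better than a turnip" mean?'))
```

The href in the printed link contains %22 for each quotation mark, %20 for each space and %3F for the trailing question mark.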
 
Jukka K. Korpela

Dfenestr8 said:
<a href = "pptopic.py?what does "better than a turnip" mean?">what does
"better than a turnip" mean?"</a>

Unfortunately, of course, ? and " are both metacharacters with special
meaning in HTML.

No, the question mark has no special meaning _in HTML_. It has a special
meaning in a URL, though. The quotation mark has a special meaning in HTML,
but _only_ in an attribute value (which you have here).

Dfenestr8 said:
I just need to write some regular expressions which substitute these
metacharacters with character combinations a browser better understands,

Hopefully not. If you cannot find a library routine for that (often called
"urlencode" or "urlescape" or something like that), find a library that has
such a routine, or switch to a programming language that has such a
library. Remember the four virtues of a programmer: laziness, impatience,
hubris, and short memory.

Dfenestr8 said:
But I'd like a list of what ALL the metacharacters used on forms are.
Does anyone know of such a list?

Forms are not the issue here, but URLs. See URL specifications if you
_really_ must know exactly which characters need to be URL encoded and
when. Normally only people who write library routines like "urlencode"
need to know such things. And even they can apply the rules somewhat
simplistically, since it's not wrong to URL encode a character that does
not need to be URL encoded in particular context, provided that it is not
used in a specific meaning where the unencoded character is semantically
different from the encoded character.
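
The point about over-encoding can be checked with a short sketch, assuming Python 3's urllib.parse; the strings are made up for illustration. Encoding an ordinary letter changes nothing after decoding, while encoding an "&" that was acting as a separator changes what the query means:

```python
from urllib.parse import parse_qs, unquote

# Over-encoding an ordinary character is harmless: "%74" decodes to "t".
harmless = unquote("%74urnip")

# "&" is different: unencoded it separates two parameters,
# encoded it is just data inside one value.
plain = parse_qs("a=1&b=2")
encoded = parse_qs("a=1%26b=2")

print(harmless)
print(plain)
print(encoded)
```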

The current specification of generic URL syntax, including URL encoding
requirements, is RFC 2396, available as HTMLized by me at
http://www.cs.tut.fi/~jkorpela/rfc/2396/
It superseded the generic part of RFC 1738 in 1998, so RFC 1738 should be
referred to _only_ in matters of specific URL schemes such as the specific
constraints on http: and ftp: URLs.
 
