character encoding in CGI.pm

David Lee Lambert

I noticed that, without setting any options, CGI.pm output of a
simple page starts as follows:

Content-Type: text/html; charset=ISO-8859-1

<?xml version="1.0" encoding="utf-8"?>


Now, is the webpage in ISO-8859-1, utf8, or some other encoding? Or
is XML defined such that this is a perfectly valid situation? If I
send a string containing Unicode characters (with \x{}), IE 6 detects
the page as Latin-1 and doesn't show those characters properly; if I
manually tell it that the encoding is UTF-8, it displays the
characters properly.

This is using perl 5.6.1; I'm not sure what version of CGI.pm I have.
 
Alan J. Flavell

I noticed that, without setting any options, CGI.pm output of a
simple page starts as follows:

Content-Type: text/html; charset=ISO-8859-1

<?xml version="1.0" encoding="utf-8"?>

Oh dear, does it really? Can we have a CGI.pm version number on that
please?
Now, is the webpage in ISO-8859-1, utf8, or some other encoding?

Well, the only way it can be in both is if it's *really* in
us-ascii. Seriously, that's the truth.
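
To make the us-ascii point concrete, here is a minimal sketch. It assumes Perl 5.8 or later, where the core Encode module is available, and the helper name same_bytes is made up for illustration. It encodes a string both ways and compares the bytes: pure ASCII text comes out identical, while anything beyond ASCII does not.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: true if the text has the same byte sequence
# in ISO-8859-1 and in UTF-8.
sub same_bytes {
    my ($text) = @_;
    return encode('ISO-8859-1', $text) eq encode('UTF-8', $text);
}

my @samples = (
    [ 'ASCII only',      'plain text' ],
    [ 'e-acute, U+00E9', "caf\x{e9}"  ],
);

for my $sample (@samples) {
    my ($label, $text) = @$sample;
    printf "%-16s %s\n", $label,
        same_bytes($text) ? 'same bytes' : 'different bytes';
}
```

So a page labelled with both charsets is only consistent as long as nothing outside ASCII appears in it.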
Or is XML defined such that this is a perfectly valid situation?

Absolutely not. Your authoritative reverence (excuse me, I meant
"reference", but the inadvertent typo was too good to take out) is
the XHTML/1.0 specification, Appendix C, since we're dealing here with
the text/html compatibility feature of XHTML/1.0

I personally think leaping into XHTML without an overwhelming cause
was a bit premature. You can tell CGI.pm that you don't want
XHTML-flavoured HTML. But opinions vary, and this is the wrong forum
to dispute that.
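
For reference, the switch in question is the -no_xhtml import pragma described in perldoc CGI. A minimal sketch follows; the page title is a placeholder, and CGI.pm is loaded at run time here only because it is no longer bundled with recent Perls and may not be installed everywhere. In an ordinary script, `use CGI qw(:standard -no_xhtml);` at the top is the usual form.

```perl
use strict;
use warnings;

my $page = '';

# Load CGI.pm at run time and ask for plain HTML instead of XHTML.
if ( eval { require CGI; CGI->import(qw(:standard -no_xhtml)); 1 } ) {
    $page = header( -charset => 'UTF-8' )
          . start_html('Placeholder title')
          . end_html();
}

print $page ne '' ? $page : "CGI.pm is not installed\n";
```

With the pragma in effect, the output should start with an HTML doctype and carry no XML declaration.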
This is using perl 5.6.1; I'm not sure what version of CGI.pm I have.

*Upgrade*. 5.6.1 is now old; and the version of CGI.pm that comes
bundled with Perl is generally somewhat back-level compared to the
author's latest version at any given moment. Do I need to refer you
to the FAQ if you need a private version installed due to
foot-dragging by your sysadmin?

Btw. CGI.pm will happily tell you what version it is if you ask it
nicely. It's in the source code too, of course.
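
Asking nicely can be as short as the following sketch. It relies only on the standard VERSION method that every Perl module inherits; the eval guard is an addition here so that the script still says something useful where CGI.pm is not installed.

```perl
use strict;
use warnings;

# CGI->VERSION returns $CGI::VERSION; eval guards against a missing module.
my $version = eval { require CGI; CGI->VERSION };

print defined $version
    ? "CGI.pm version $version\n"
    : "CGI.pm is not installed\n";
```

From a shell, perl -MCGI -e 'print "$CGI::VERSION\n"' does the same job when the module is present.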
 
Shawn Corey

David said:
I noticed that, without setting any options, CGI.pm output of a
simple page starts as follows:

Content-Type: text/html; charset=ISO-8859-1

<?xml version="1.0" encoding="utf-8"?>


Now, is the webpage in ISO-8859-1, utf8, or some other encoding? Or
is XML defined such that this is a perfectly valid situation? If I
send a string containing Unicode characters (with \x{}), IE 6 detects
the page as Latin-1 and doesn't show those characters properly; if I
manually tell it that the encoding is UTF-8, it displays the
characters properly.

This is using perl 5.6.1; I'm not sure what version of CGI.pm I have.

The web page is both. The ISO-8859-1 encoding is used for the HTTP
transfer. All bytes, including the web page, will be interpreted as
ISO-8859-1 encoded until handed off to the display engine in the
browser. Then it will be interpreted as UTF-8. This normally does not
mean much since the bytes after the blank line are usually not processed
by the HTTP decoding code; they are simply passed to the next part.

If you are using Perl 5.6, add 'use utf8;' to the code. For any Perl,
you can add:

print header( -charset => 'UTF-8' );

for the Content-Type header.

See perldoc CGI for details.
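
As a hedged illustration of keeping the declared charset and the actual bytes in agreement, the sketch below builds the Content-Type line by hand and encodes the body to match. It assumes Perl 5.8 or later (the Encode module is not available on 5.6.1), and the sample text is invented. With CGI.pm, header( -charset => 'UTF-8' ) should produce an equivalent Content-Type line.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $charset = 'UTF-8';

# Sample text (invented): e-acute (U+00E9) and a smiley (U+263A).
my $text = "caf\x{e9} \x{263a}";

# The charset named in the HTTP header ...
my $http_header = "Content-Type: text/html; charset=$charset\r\n\r\n";

# ... must be the one actually used to turn characters into bytes.
my $body = encode( $charset, "<p>$text</p>\n" );

binmode STDOUT;    # send the bytes through untouched
print $http_header, $body;
```

If the page also carries an XML declaration or a meta element naming an encoding, that name would need to agree with $charset as well.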

--- Shawn
 
Ben Morrow

Alan J. Flavell said:
Oh dear, does it really? Can we have a CGI.pm version number on that
please?


Well, the only way it can be in both is if it's *really* in
us-ascii. Seriously, that's the truth.


Absolutely not. Your authoritative reverence (excuse me, I meant
"reference", but the inadvertent typo was too good to take out) is
the XHTML/1.0 specification, Appendix C, since we're dealing here with
the text/html compatibility feature of XHTML/1.0

Correct me if I'm wrong, but surely XHTML cannot be served under a
text/html content type anyway? It isn't valid HTML (take this document,
for example,

<html>
<head>
<link rel="stylesheet" type="text/css" href="css"/>
</head>
<body></body>
</html>

: the '/>' on the <link> is not valid HTML, and validator.w3.org will
reject it under any HTML DTD). This means this header is wrong in three
ways:

1. the content should be labelled application/xhtml+xml

2. the charsets should match

3. the charset shouldn't be specified in the HTTP header anyway, for
precisely this reason (unlike HTML, XML has strict rules for determining
its charset; in this case, the charset given in the HTTP header
overrides that in the document, but this is Not A Good Thing). See
recent discussions on (e-mail address removed) for this; the next version of
RFC3023 (the registration for XML media types) will (probably) state
that XML entities should not be given a charset parameter.

Ben
 
Alan J. Flavell

By sheer chance, Google Groups pointed out to me that:

On Wed, 24 Nov 2004, Shawn Corey wrote:

[I'm trimming the comprehensive quote down to what I suppose you must
have interpreted as the significant part. There's no extra charge for
doing this yourself, you know...]
The web page is both.

Impossible, unless it happens to be in us-ascii, in which case it's a
valid instance of all three.
The ISO-8859-1 encoding is used for the HTTP transfer. All bytes,
including the web page, will be interpreted as ISO-8859-1 encoded
until handed off to the display engine in the browser. Then it will
be interpreted as UTF-8. This normally does not mean much since the
bytes after the blank line are usually not processed by the HTTP
decoding code; they are simply passed to the next part.

A truly remarkable castle that you've built in the air there; have you
read XHTML/1.0 Appendix C, by any chance?
See perldoc CGI for details.

Whimper.

Once again, I suppose this brings home the importance of not going
into technical detail on matters that are off-topic for the group.
 
Alan J. Flavell

Oh dear, this is desperately off-topic...

Correct me if I'm wrong, but surely XHTML cannot be served under a
text/html content type anyway?

Technically, you're right. Practically, I'd have to refer you to
XHTML/1.0 Appendix C. Well, I already did, but you seem to have
resisted the temptation to mention it.
It isn't valid HTML

Correct. Appendix C is in theory self-contradictory, but in practice
it gets away with it, since almost all "web browsers" implement
tag-soup rather than HTML "per se".

emacs-w3 indeed had to be deliberately broken in order to be
compatible with Appendix C, since it had taken the HTML specification
just a bit more seriously than anyone else (aside from SGML-conforming
browsers such as softquad panorama, but who uses those as www
browsers?).
1. the content should be labelled application/xhtml+xml

"should". Right. XHTML/1.0 Appendix C is a (misguided, IMHO)
exception to that rule.
2. the charsets should match

"must" match, except in a few degenerate cases (since us-ascii can be
validly labelled as iso-8859-anything as well as utf-8, whatever
happens to be convenient).
3. the charset shouldn't be specified in the HTTP header anyway,

Disagree; but this isn't the place to argue the point.

all the best
 
John W. Kennedy

Ben said:
Correct me if I'm wrong, but surely XHTML cannot be served under a
text/html content type anyway? It isn't valid HTML (take this document,
for example,

If you want Internet Explorer to display it, you /must/ serve it as
text/html. Internet Explorer refuses outright to render a document that
it knows to be XHTML. Fortunately, most browsers will produce acceptable
results for XHTML 1.0 served as HTML. XHTML 2.0 served as HTML, on the
other hand, will go straight into the toilet.

In short, XHTML is dead, murdered by Bill Gates' arrogance.

Ain't monopolies great?
 
Alan J. Flavell

Oh dear. Off topic, but I can't resist at least a reply... with
apologies up-front

If you want Internet Explorer to display it, you /must/ serve it as
text/html.

IE, as normally used, does not support XHTML, and it would be better
not to send it any. Faking XHTML as HTML brings no benefits at all at
the web interface, and adds a few disbenefits. It's sometimes claimed
that XML-based tools at the authoring side are a valuable benefit, and
therefore the result will be XHTML - but that is a half-truth:
XML-based tools can also emit HTML/4.01 as their end-product.
Internet Explorer refuses outright to render a document that it
knows to be XHTML.

Right from the start of the WWW, browsers which can't render a
particular MIME content-type have been configured to fire up a
suitable "helper application" to view that content type.

More recently there's been a tendency to define "plug-ins", which
render certain content types but display them in the window of the
browser.

Either of these mechanisms should be available in IE (after
sacrificing a suitable animal to XP SP2, I suppose). Years back I
configured Windows/IE to use a "helper application" for opening XHTML
MIME-types, and I defined the helper application to be Mozilla. It
worked fine. OK, I'm not promoting it in that form as a practical
solution for end-users, just offering an in-principle refutation that
if the browser-like object doesn't support it then it can't be used.

The original idea of XML was to make a clean break with "tag soup".
Fortunately, most browsers will produce acceptable results for XHTML
1.0 served as HTML.

Unfortunately, that's led to the unwashed masses of web deezyners
simply converting their HTML-flavoured tag soup into XHTML-flavoured
tag soup, and tossing the potential benefits of the clean break out of
the window (no pun intended).
XHTML 2.0 served as HTML, on the other hand, will go
straight into the toilet.

So the bottom line is:

- XHTML/1.0 Appendix C is functionally identical to HTML/4.01, and
almost - but not quite - as compatible with tag-soup slurpers. So
what's the point of deploying XHTML/1.0 to browsers which were never
designed to process it? If the original isn't HTML, XHTML/1.0 can be
converted by rote into HTML/4.01, and the result is slightly more
compatible with the browsers out there.

No other version of XHTML offers that easement. By definition, if you
serve it out as text/html it cannot be XHTML(tm), other than this
pointless, self-contradictory and counter-productive backwater:
XHTML/1.0-Appendix-C. What it would be is XHTML-flavoured tag soup,
which is no kind of improvement from what we already had.

I say choose one of:

* stay with HTML/4.01 - there's no point in XHTML/1.0; or

* make a clean break and move to Real XHTML(tm), with some kind of
Accept-type negotiation for client agents which don't grok it.
In short, XHTML is dead, murdered by Bill Gates' arrogance.

XHTML is alive and well in a subset of client agents, with useful
extras like SVG. Content-type negotiation (Accept: header) has been
working for years; IE contrives (like so much else) to get it only
vaguely right, but with a bit of sleight of hand at the server it can
be made to work with IE's default settings, and the more-aware can
adjust the Accept: header (or have it adjusted for them) to get better
results.
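
A rough sketch of the server-side negotiation described above, for a CGI script. The function name choose_content_type is hypothetical, and the parsing is deliberately simplified: it ignores wildcards such as */*, so a client that sends only */* gets text/html. It is not a full implementation of HTTP content negotiation.

```perl
use strict;
use warnings;

# Hypothetical helper: pick a media type from the Accept header value.
# Returns application/xhtml+xml only if the client names it explicitly
# with a non-zero q value; otherwise falls back to text/html.
sub choose_content_type {
    my ($accept) = @_;
    $accept = '' unless defined $accept;
    $accept =~ s/^\s+|\s+$//g;

    for my $range ( split /\s*,\s*/, $accept ) {
        my ( $type, @params ) = split /\s*;\s*/, $range;
        next unless defined $type && lc $type eq 'application/xhtml+xml';

        my $q = 1;    # default quality when no q parameter is given
        for my $param (@params) {
            $q = $1 if $param =~ /^q=(\d(?:\.\d+)?)$/i;
        }
        return 'application/xhtml+xml' if $q > 0;
    }
    return 'text/html';
}

my $type = choose_content_type( $ENV{HTTP_ACCEPT} );
print "Content-Type: $type; charset=UTF-8\r\n";
print "Vary: Accept\r\n\r\n";
```

The Vary line tells caches that the response depends on the Accept header; the body that follows would of course have to be written in whichever flavour was chosen.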

IMHO and YMMV.
 
Shawn Corey

Alan said:
By sheer chance, Google Groups pointed out to me that:

On Wed, 24 Nov 2004, Shawn Corey wrote:

[I'm trimming the comprehensive quote down to what I suppose you must
have interpreted as the significant part. There's no extra charge for
doing this yourself, you know...]

[Yes, now the whole world knows what a hero you are.]
A truly remarkable castle that you've built in the air there; have you
read XHTML/1.0 Appendix C, by any chance?

Please explain what XHTML/1.0 Appendix C has to do with HTTP.

--- Shawn
 
chris-usenet

Alan J. Flavell said:
Oh dear, does it really? Can we have a CGI.pm version number on that
please?

Perl 5.6.1. CGI 2.752

It's been fixed by 5.8.4 (CGI 3.04)
Chris
 
Tad McClellan

Shawn Corey said:
Alan said:
By sheer chance, Google Groups pointed out to me that:

On Wed, 24 Nov 2004, Shawn Corey wrote:

[I'm trimming the comprehensive quote down to what I suppose you must
have interpreted as the significant part. There's no extra charge for
doing this yourself, you know...]

[Yes, now the whole world knows what a hero you are.]


And now the whole world knows what an inconsiderate type of
poster you are. You shift work from yourself to others.
 
Shawn Corey

Tad said:
And now the whole world knows what an inconsiderate type of
poster you are. You shift work from yourself to others.

If you don't like these types of comments you should criticize the first
one.

BTW Tad, I thought I was on your permanent kill file.

--- Shawn
 
Michele Dondi

[I'm trimming the comprehensive quote down to what I suppose you must
have interpreted as the significant part. There's no extra charge for
doing this yourself, you know...]

[Yes, now the whole world knows what a hero you are.]

I don't think so. OTOH *most* clpmisc users will thank him anyway.
Now, if you could be so kind as to avoid wasting your energy writing
irrelevant comments with that attitude, I, for one, will thank you too, and
I think many others will as well.


Michele
 
Matt Garrish

Shawn Corey said:
Please explain what XHTML/1.0 Appendix C has to do with HTTP.

In other words, you haven't read the appendix. See section C.9 if it's so
painful to you to actually read something in its entirety.

Matt
 
