character encoding in CGI.pm

Discussion in 'Perl Misc' started by David Lee Lambert, Nov 24, 2004.

  1. I noticed that, without setting any options, CGI.pm output of a
    simple page starts as follows:

    Content-Type: text/html; charset=ISO-8859-1

    <?xml version="1.0" encoding="utf-8"?>


    Now, is the webpage in ISO-8859-1, utf8, or some other encoding? Or
    is XML defined such that this is a perfectly valid situation? If I
    send a string containing Unicode characters (with \x{}), IE 6 detects
    the page as Latin-1 and doesn't show those characters properly; if I
    manually tell it that the encoding is UTF-8, it displays the
    characters properly.

    This is using perl 5.6.1; I'm not sure what version of CGI.pm I have.

    --
    DLL
     
    David Lee Lambert, Nov 24, 2004
    #1

  2. On Wed, 24 Nov 2004, David Lee Lambert wrote:

    > I noticed that, without setting any options, CGI.pm output of a
    > simple page starts as follows:
    >
    > Content-Type: text/html; charset=ISO-8859-1
    >
    > <?xml version="1.0" encoding="utf-8"?>


    Oh dear, does it really? Can we have a CGI.pm version number on that
    please?

    > Now, is the webpage in ISO-8859-1, utf8, or some other encoding?


    Well, the only way it can be in both is if it's *really* in
    us-ascii. Seriously, that's the truth.

    > Or is XML defined such that this is a perfectly valid situation?


    Absolutely not. Your authoritative reverence (excuse me, I meant
    "reference", but the inadvertent typo was too good to take out) is
    the XHTML/1.0 specification, Appendix C, since we're dealing here with
    the text/html compatibility feature of XHTML/1.0

    I personally think leaping into XHTML without an overwhelming cause
    was a bit premature. You can tell CGI.pm that you don't want
    XHTML-flavoured HTML. But opinions vary, and this is the wrong forum
    to dispute that.

    > This is using perl 5.6.1; I'm not sure what version of CGI.pm I have.


    *Upgrade*. 5.6.1 is now old; and the version of CGI.pm that comes
    bundled with Perl is generally somewhat back-level compared to the
    author's latest version at any given moment. Do I need to refer you
    to the FAQ if you need a private version installed due to
    foot-dragging by your sysadmin?

    Btw. CGI.pm will happily tell you what version it is if you ask it
    nicely. It's in the source code too, of course.
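    For instance, a minimal sketch of asking it, written so that it also
    copes with CGI.pm being absent. The helper name cgi_version is made up
    for illustration; only the package's version lookup is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the installed CGI.pm version,
# or undef if the module cannot be loaded.
sub cgi_version {
    return eval { require CGI; CGI->VERSION };
}

my $version = cgi_version();
print defined $version
    ? "CGI.pm version $version\n"
    : "CGI.pm is not installed\n";
```

    From a shell, perl -MCGI -e 'print $CGI::VERSION' should report the
    same number.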
     
    Alan J. Flavell, Nov 24, 2004
    #2

  3. David Lee Lambert wrote:
    > I noticed that, without setting any options, CGI.pm output of a
    > simple page starts as follows:
    >
    > Content-Type: text/html; charset=ISO-8859-1
    >
    > <?xml version="1.0" encoding="utf-8"?>
    >
    >
    > Now, is the webpage in ISO-8859-1, utf8, or some other encoding? Or
    > is XML defined such that this is a perfectly valid situation? If I
    > send a string containing Unicode characters (with \x{}), IE 6 detects
    > the page as Latin-1 and doesn't show those characters properly; if I
    > manually tell it that the encoding is UTF-8, it displays the
    > characters properly.
    >
    > This is using perl 5.6.1; I'm not sure what version of CGI.pm I have.
    >
    > --
    > DLL


    The web page is both. The ISO-8859-1 encoding is used for the HTTP
    transfer. All bytes, including the web page, will be interpreted as
    ISO-8859-1 encoded until handed off to the display engine in the
    browser. Then it will be interpreted as UTF-8. This normally does not
    mean much since the bytes after the blank line are usually not processed
    by the HTTP decoding code; they are simply passed to the next part.

    If you are using Perl 5.6, add 'use utf8;' to the code. For any Perl,
    you can add:

    print header( -charset => 'UTF-8' );

    for the Content-Type header.

    See perldoc CGI for details.
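
    To make the declared charset and the actual bytes agree, here is a
    rough sketch using only core modules. It writes the Content-Type line
    by hand, which is meant to mirror what CGI.pm's
    header( -charset => 'UTF-8' ) prints, and encodes the body itself. The
    sub name utf8_response is invented for the example, and the Encode
    module assumes Perl 5.8 or later.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build a complete CGI response whose declared charset matches its bytes.
# $body_text is a Perl character string (it may contain \x{...} characters).
sub utf8_response {
    my ($body_text) = @_;
    my $header = "Content-Type: text/html; charset=UTF-8\r\n\r\n";
    return $header . encode('UTF-8', $body_text);
}

binmode STDOUT;    # emit the already-encoded bytes untouched
print utf8_response("<p>caf\x{e9} \x{263A}</p>\n");
```

    The point of the design is that the label and the encoding step sit
    in one place, so they cannot drift apart.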

    --- Shawn
     
    Shawn Corey, Nov 25, 2004
    #3
  4. Quoth "Alan J. Flavell":
    > On Wed, 24 Nov 2004, David Lee Lambert wrote:
    >
    > > I noticed that, without setting any options, CGI.pm output of a
    > > simple page starts as follows:
    > >
    > > Content-Type: text/html; charset=ISO-8859-1
    > >
    > > <?xml version="1.0" encoding="utf-8"?>

    >
    > Oh dear, does it really? Can we have a CGI.pm version number on that
    > please?
    >
    > > Now, is the webpage in ISO-8859-1, utf8, or some other encoding?

    >
    > Well, the only way it can be in both is if it's *really* in
    > us-ascii. Seriously, that's the truth.
    >
    > > Or is XML defined such that this is a perfectly valid situation?

    >
    > Absolutely not. Your authoritative reverence (excuse me, I meant
    > "reference", but the inadvertent typo was too good to take out) is
    > the XHTML/1.0 specification, Appendix C, since we're dealing here with
    > the text/html compatibility feature of XHTML/1.0


    Correct me if I'm wrong, but surely XHTML cannot be served under a
    text/html content type anyway? It isn't valid HTML (take this document,
    for example,

    <html>
    <head>
    <link rel="stylesheet" type="text/css" href="css"/>
    </head>
    <body></body>
    </html>

    : the '/>' on the <link> is not valid HTML, and validator.w3.org will
    reject it under any HTML DTD). This means this header is wrong in three
    ways:

    1. the content should be labelled application/xhtml+xml

    2. the charsets should match

    3. the charset shouldn't be specified in the HTTP header anyway, for
    precisely this reason (unlike HTML, XML has strict rules for determining
    its charset; in this case, the charset given in the HTTP header
    overrides that in the document, but this is Not A Good Thing). See
    recent discussions on for this; the next version of
    RFC3023 (the registration for XML media types) will (probably) state
    that XML entities should not be given a charset parameter.

    Ben

    --
    "The Earth is degenerating these days. Bribery and corruption abound.
    Children no longer mind their parents, every man wants to write a book,
    and it is evident that the end of the world is fast approaching."
    -Assyrian stone tablet, c.2800 BC
     
    Ben Morrow, Nov 25, 2004
    #4
  5. By sheer chance, Google Groups pointed out to me that:

    On Wed, 24 Nov 2004, Shawn Corey wrote:

    [I'm trimming the comprehensive quote down to what I suppose you must
    have interpreted as the significant part. There's no extra charge for
    doing this yourself, you know...]

    > > Now, is the webpage in ISO-8859-1, utf8, or some other encoding?

    >
    > The web page is both.


    Impossible, unless it happens to be in us-ascii, in which case it's a
    valid instance of all three.
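
    A quick way to see this for yourself, as a sketch with the core
    Encode module (Perl 5.8 or later assumed; hex_bytes is just a name
    picked for the example): pure ASCII text comes out as the same bytes
    under all three labels, while a single accented character does not.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of $text after encoding it as $encoding.
sub hex_bytes {
    my ($text, $encoding) = @_;
    return unpack 'H*', encode($encoding, $text);
}

# Plain ASCII: identical bytes whichever label is used.
for my $enc ('US-ASCII', 'ISO-8859-1', 'UTF-8') {
    printf "%-10s abc      -> %s\n", $enc, hex_bytes('abc', $enc);
}

# One non-ASCII character (e-acute): the two encodings disagree.
for my $enc ('ISO-8859-1', 'UTF-8') {
    printf "%-10s cafe+acute -> %s\n", $enc, hex_bytes("caf\x{e9}", $enc);
}
```

    "abc" is 616263 every time; "caf\x{e9}" is 636166e9 as ISO-8859-1 but
    636166c3a9 as UTF-8, which is exactly the mismatch a browser trips
    over when the label is wrong.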

    > The ISO-8859-1 encoding is used for the HTTP transfer. All bytes,
    > including the web page, will be interpreted as ISO-8859-1 encoded
    > until handed off to the display engine in the browser. Then it will
    > be interpreted as UTF-8. This normally does not mean much since the
    > bytes after the blank line are usually not processed by the HTTP
    > decoding code; they are simply passed to the next part.


    A truly remarkable castle that you've built in the air there; have you
    read XHTML/1.0 Appendix C, by any chance?

    > See perldoc CGI for details.


    Whimper.

    Once again, I suppose this brings home the importance of not going
    into technical detail on matters that are off-topic for the group.
     
    Alan J. Flavell, Nov 25, 2004
    #5
  6. Oh dear, this is desperately off-topic...

    On Thu, 25 Nov 2004, Ben Morrow wrote:

    > Correct me if I'm wrong, but surely XHTML cannot be served under a
    > text/html content type anyway?


    Technically, you're right. Practically, I'd have to refer you to
    XHTML/1.0 Appendix C. Well, I already did, but you seem to have
    resisted the temptation to mention it.

    > It isn't valid HTML


    Correct. Appendix C is in theory self-contradictory, but in practice
    it gets away with it, since almost all "web browsers" implement
    tag-soup rather than HTML "per se".

    emacs-w3 indeed had to be deliberately broken in order to be
    compatible with Appendix C, since it had taken the HTML specification
    just a bit more seriously than anyone else (aside from SGML-conforming
    browsers such as softquad panorama, but who uses those as www
    browsers?).

    > 1. the content should be labelled application/xhtml+xml


    "should". Right. XHTML/1.0 Appendix C is a (misguided, IMHO)
    exception to that rule.

    > 2. the charsets should match


    "must" match, except in a few degenerate cases (since us-ascii can be
    validly labelled as iso-8859-anything as well as utf-8, whatever
    happens to be convenient).

    > 3. the charset shouldn't be specified in the HTTP header anyway,


    Disagree; but this isn't the place to argue the point.

    all the best
     
    Alan J. Flavell, Nov 25, 2004
    #6
  7. Ben Morrow wrote:
    > Correct me if I'm wrong, but surely XHTML cannot be served under a
    > text/html content type anyway? It isn't valid HTML (take this document,
    > for example,


    If you want Internet Explorer to display it, you /must/ serve it as
    text/html. Internet Explorer refuses outright to render a document that
    it knows to be XHTML. Fortunately, most browsers will produce acceptable
    results for XHTML 1.0 served as HTML. XHTML 2.0 served as HTML, on the
    other hand, will go straight into the toilet.

    In short, XHTML is dead, murdered by Bill Gates' arrogance.

    Ain't monopolies great?

    --
    John W. Kennedy
    "Only an idiot fights a war on two fronts. Only the heir to the throne
    of the kingdom of idiots would fight a war on twelve fronts"
    -- J. Michael Straczynski. "Babylon 5", "Ceremonies of Light and Dark"
     
    John W. Kennedy, Nov 26, 2004
    #7
  8. Oh dear. Off topic, but I can't resist at least a reply... with
    apologies up-front

    On Thu, 25 Nov 2004, John W. Kennedy wrote:

    > Ben Morrow wrote:
    > > Correct me if I'm wrong, but surely XHTML cannot be served under a
    > > text/html content type anyway? It isn't valid HTML (take this
    > > document, for example,

    >
    > If you want Internet Explorer to display it, you /must/ serve it as
    > text/html.


    IE, as normally used, does not support XHTML, and it would be better
    not to send it any. Faking XHTML as HTML brings no benefits at all at
    the web interface, and adds a few disbenefits. It's sometimes claimed
    that XML-based tools at the authoring side are a valuable benefit, and
    therefore the result will be XHTML - but that is a half-truth:
    XML-based tools can also emit HTML/4.01 as their end-product.

    > Internet Explorer refuses outright to render a document that it
    > knows to be XHTML.


    Right from the start of the WWW, browsers which can't render a
    particular MIME content-type have been configured to fire up a
    suitable "helper application" to view that content type.

    More recently there's been a tendency to define "plug-ins", which
    render certain content types but display them in the window of the
    browser.

    Either of these mechanisms should be available in IE (after
    sacrificing a suitable animal to XP SP2, I suppose). Years back I
    configured Windows/IE to use a "helper application" for opening XHTML
    MIME-types, and I defined the helper application to be Mozilla. It
    worked fine. OK, I'm not promoting it in that form as a practical
    solution for end-users, just offering an in-principle refutation that
    if the browser-like object doesn't support it then it can't be used.

    The original idea of XML was to make a clean break with "tag soup".

    > Fortunately, most browsers will produce acceptable results for XHTML
    > 1.0 served as HTML.


    Unfortunately, that's led to the unwashed masses of web deezyners
    simply converting their HTML-flavoured tag soup into XHTML-flavoured
    tag soup, and tossing the potential benefits of the clean break out of
    the window (no pun intended).

    > XHTML 2.0 served as HTML, on the other hand, will go
    > straight into the toilet.


    So the bottom line is:

    - XHTML/1.0 Appendix C is functionally identical to HTML/4.01, and
    almost - but not quite - as compatible with tag-soup slurpers. So
    what's the point of deploying XHTML/1.0 to browsers which were never
    designed to process it? If the original isn't HTML, XHTML/1.0 can be
    converted by rote into HTML/4.01, and the result is slightly more
    compatible with the browsers out there.

    No other version of XHTML offers that easement. By definition, if you
    serve it out as text/html it cannot be XHTML(tm), other than this
    pointless, self-contradictory and counter-productive backwater:
    XHTML/1.0-Appendix-C. What it would be is XHTML-flavoured tag soup,
    which is no kind of improvement from what we already had.

    I say choose one of:

    * stay with HTML/4.01 - there's no point in XHTML/1.0; or

    * make a clean break and move to Real XHTML(tm), with some kind of
    Accept-type negotiation for client agents which don't grok it.

    > In short, XHTML is dead, murdered by Bill Gates' arrogance.


    XHTML is alive and well in a subset of client agents, with useful
    extras like SVG. Content-type negotiation (Accept: header) has been
    working for years; IE contrives (like so much else) to get it only
    vaguely right, but with a bit of sleight of hand at the server it can
    be made to work with IE's default settings, and the more-aware can
    adjust the Accept: header (or have it adjusted for them) to get better
    results.
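
    As a rough illustration of the server-side part, here is a simplified
    sketch of choosing the media type from the Accept header in a CGI
    script. The sub name choose_media_type and the policy are assumptions
    for the example: it offers application/xhtml+xml only when the client
    lists that type explicitly with a non-zero q-value, treats wildcards
    such as */* as "send text/html", and does not attempt full q-value
    ranking.

```perl
use strict;
use warnings;

# Pick a media type from the raw Accept header value (may be undef).
sub choose_media_type {
    my ($accept) = @_;
    $accept = '' unless defined $accept;
    for my $range (split /\s*,\s*/, $accept) {
        my ($type, @params) = split /\s*;\s*/, $range;
        next unless defined $type && lc $type eq 'application/xhtml+xml';
        # Look for an explicit q-value; absent means q=1.
        my ($q) = map { /^q=([01](?:\.\d{0,3})?)$/i ? $1 : () } @params;
        return 'application/xhtml+xml' if !defined $q || $q > 0;
    }
    return 'text/html';    # safe fallback for everything else
}

# In a CGI environment the Accept header arrives as HTTP_ACCEPT.
my $type = choose_media_type($ENV{HTTP_ACCEPT});
print "Content-Type: $type\r\n\r\n";
```

    A client that sends only wildcards therefore gets text/html, and the
    charset question is left out of this sketch on purpose.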

    IMHO and YMMV.
     
    Alan J. Flavell, Nov 26, 2004
    #8
  9. Re: [OT] Difference between XHTML and HTTP (WAS: character encoding in CGI.pm)

    Alan J. Flavell wrote:
    > By sheer chance, Google Groups pointed out to me that:
    >
    > On Wed, 24 Nov 2004, Shawn Corey wrote:
    >
    > [I'm trimming the comprehensive quote down to what I suppose you must
    > have interpreted as the significant part. There's no extra charge for
    > doing this yourself, you know...]


    [Yes, now the whole world knows what a hero you are.]

    > A truly remarkable castle that you've built in the air there; have you
    > read XHTML/1.0 Appendix C, by any chance?


    Please explain what XHTML/1.0 Appendix C has to do with HTTP.

    --- Shawn
     
    Shawn Corey, Nov 26, 2004
    #9
  10. Alan J. Flavell wrote:
    > Oh dear, does it really? Can we have a CGI.pm version number on that
    > please?


    Perl 5.6.1. CGI 2.752

    It's been fixed by 5.8.4 (CGI 3.04)
    Chris
     
    , Nov 26, 2004
    #10
  11. Re: [OT] Difference between XHTML and HTTP (WAS: character encoding in CGI.pm)

    Shawn Corey wrote:
    > Alan J. Flavell wrote:
    >> By sheer chance, Google Groups pointed out to me that:
    >>
    >> On Wed, 24 Nov 2004, Shawn Corey wrote:
    >>
    >> [I'm trimming the comprehensive quote down to what I suppose you must
    >> have interpreted as the significant part. There's no extra charge for
    >> doing this yourself, you know...]

    >
    > [Yes, now the whole world knows what a hero you are.]



    And now the whole world knows what an inconsiderate type of
    poster you are. You shift work from yourself to others.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Nov 26, 2004
    #11
  12. Re: [OT] Difference between XHTML and HTTP (WAS: character encoding in CGI.pm)

    Tad McClellan wrote:
    > And now the whole world knows what an inconsiderate type of
    > poster you are. You shift work from yourself to others.


    If you don't like these types of comments you should criticize the first
    one.

    BTW Tad, I thought I was on your permanent kill file.

    --- Shawn
     
    Shawn Corey, Nov 26, 2004
    #12
  13. Re: [OT] Difference between XHTML and HTTP (WAS: character encoding in CGI.pm)

    Shawn Corey wrote:

    > BTW Tad, I thought I was on your permanent kill file.



    I went "slumming".


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Nov 26, 2004
    #13
  14. Re: [OT] Difference between XHTML and HTTP (WAS: character encoding in CGI.pm)

    On Fri, 26 Nov 2004 07:08:09 -0500, Shawn Corey wrote:

    >> [I'm trimming the comprehensive quote down to what I suppose you must
    >> have interpreted as the significant part. There's no extra charge for
    >> doing this yourself, you know...]

    >
    >[Yes, now the whole world knows what a hero you are.]


    I don't think so. OTOH *most* clpmisc users will thank him anyway.
    Now, if you could be so gentle and avoid wasting your energies writing
    irrelevant cmts with that attitude I, for one, will thank you too, and
    I think many others will as well.


    Michele
    --
    {$_=pack'B8'x25,unpack'A8'x32,$a^=sub{pop^pop}->(map substr
    (($a||=join'',map--$|x$_,(unpack'w',unpack'u','G^<R<Y]*YB='
    ..'KYU;*EVH[.FHF2W+#"\Z*5TI/ER<Z`S(G.DZZ9OX0Z')=~/./g)x2,$_,
    256),7,249);s/[^\w,]/ /g;$ \=/^J/?$/:"\r";print,redo}#JAPH,
     
    Michele Dondi, Nov 26, 2004
    #14
  15. Re: [OT] Difference between XHTML and HTTP (WAS: character encoding in CGI.pm)

    "Shawn Corey" wrote in message
    news:A0Fpd.52178$...
    > Alan J. Flavell wrote:
    >> A truly remarkable castle that you've built in the air there; have you
    >> read XHTML/1.0 Appendix C, by any chance?

    >
    > Please explain what XHTML/1.0 Appendix C has to do with HTTP.
    >


    In other words, you haven't read the appendix. See section C.9 if it's so
    painful to you to actually read something in its entirety.

    Matt
     
    Matt Garrish, Nov 27, 2004
    #15