Odd character display / UTF issue ?

Discussion in 'HTML' started by still me, Oct 15, 2007.

  1. still me

    still me Guest

    I am working with a simple Paypal shopping cart and having an issue
    with an odd character. The cart is called from a web page (what Paypal
    normally expects) with a FORM/SUBMIT. I am also calling it via POST
    from a cgi program that mimics the FORM submit and just passes the
    HTTP headers and content back as received. Tests are from MSIE and
    Firefox on Windows XP.

    It all works fine, and the pages return identically, with one little
    glitch. When the call comes from the CGI program, I see a few funky
    characters displayed on screen. The code that is causing them is easy
    to find in the source:

    <tr class="summary">
    <td>Â </td>
    <td>Â </td>

    Here's the strange part: the exact same characters appear in the
    source that returns from the regular FORM/SUBMIT, yet they don't
    display in either browser in that case. They only display with the
    CGI call. I've verified that the returned page is identical, with
    identical source code. The HTTP headers are the same. The funky
    characters are the same at the hex level. Both pages use style files,
    but the references are all absolute from the server on down, so I
    think they should resolve the same. Both pages contain both an HTTP
    Content-Type header and a META tag that specify UTF-8, as follows:
    HTTP:
    Content-Type: text/html; charset=UTF-8
    HTML:
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

    Any thoughts? Is this a character set issue? I could change the header
    before I issue the page from the CGI if there is a character set that
    would work better.

    Thanks for any insight,
     
    still me, Oct 15, 2007
    #1

  2. Scripsit still me:
    Since you don't include a URL, you're effectively asking for a lump of
    guesses.

    You have some error in your code.

    Excellent! Now remove those funky characters! Problem solved.

    Sounds like character encoding confusion. Anything that _looks_ like "Â " is
    probably something UTF-8 encoded (or distorted UTF-8) interpreted by some
    8-bit encoding.

    Now that might be relevant, but is this really the case? Which encoding do
    browsers actually use in interpreting the data?

    That is, why don't you include some URLs so that the actual HTTP headers as
    well as page content can be analyzed?

    If you are thinking about "trying" different charset parameters, you surely
    have an issue with character sets. What is the _actual_ encoding of the
    pages? That is, the encoding of the data itself, as opposed to what headers
    or meta tags say about it.
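
    One way to approach the "actual encoding" question without trusting
    the headers is to decode the raw bytes strictly and see whether they
    are well-formed UTF-8. A minimal Python sketch; the helper name and
    the two sample bodies are made up for illustration and are not taken
    from the poster's pages:

```python
def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is well-formed in `encoding` (strict decoding)."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical page fragments: the same text stored in two encodings.
utf8_body = "<td>\u00a0</td>".encode("utf-8")         # NBSP stored as C2 A0
latin1_body = "<td>\u00a0</td>".encode("iso-8859-1")  # NBSP stored as A0

print(decodes_as(utf8_body, "utf-8"))    # True
print(decodes_as(latin1_body, "utf-8"))  # False: a lone A0 is not valid UTF-8
```

    ISO-8859-1 accepts every byte sequence, so only the UTF-8 check is
    informative here; passing it does not prove the data was meant as
    UTF-8, but failing it shows the UTF-8 label is wrong.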
     
    Jukka K. Korpela, Oct 15, 2007
    #2

  3. Andy Dingley

    Andy Dingley Guest

    No, characters in a UTF-8 encoding interpreted by a tool using a
    non-UTF-8 encoding will generally generate garbage characters that are
    still displayable (the tool thinks that it received two good
    characters, they just don't mean anything). Typically it's a pair of
    characters, the first of which is some variant of an accented "A"
    (they won't all be, but if you see lots of spurious "A"s on a page,
    look to UTF-8).
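
    The "pair of characters starting with an accented A" effect can be
    reproduced in a few lines of Python. This is only a sketch: the
    choice of U+00A0 (no-break space) as the original character is an
    assumption, since the thread never establishes what the page
    actually contains.

```python
# Assumption for illustration: the page contains U+00A0 (no-break space).
nbsp = "\u00a0"
raw = nbsp.encode("utf-8")
print(raw.hex())                  # c2a0

# A tool that assumes ISO-8859-1 maps every byte to exactly one character.
misread = raw.decode("iso-8859-1")
print(len(misread))               # 2
print(misread == "\u00c2\u00a0")  # True: A-circumflex followed by NBSP

# Same effect with another character: e-acute becomes A-tilde + copyright sign.
e_acute = "\u00e9".encode("utf-8").decode("iso-8859-1")
print(e_acute == "\u00c3\u00a9")  # True
```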

    To get the unrecognizable character "?" displayed, your tool must
    have been able to automatically recognise garbage, i.e. bad encodings,
    not just bad characters. This usually indicates non-UTF-8 characters
    being served as UTF-8, with the tool then unable to process them as
    UTF-8. Since pure ASCII is simultaneously valid UTF-8 and ISO-8859-*,
    the most likely cause is non-ASCII characters in an ISO-8859-*
    encoding served with a UTF-8 content-type.
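
    The opposite direction can be sketched the same way: ISO-8859-1
    bytes decoded as UTF-8 are malformed, and a lenient decoder
    substitutes a replacement character. The sample string is
    hypothetical, and Python's errors="replace" handler only stands in
    for whatever the browser or tool does with bad input.

```python
REPLACEMENT = "\ufffd"  # U+FFFD, the Unicode replacement character

# Hypothetical text stored as ISO-8859-1 but labelled as UTF-8.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
print(latin1_bytes)                  # b'caf\xe9'

# E9 starts a multi-byte UTF-8 sequence that never completes,
# so a lenient decoder substitutes U+FFFD for it.
shown = latin1_bytes.decode("utf-8", errors="replace")
print(shown == "caf" + REPLACEMENT)  # True

# Pure ASCII is byte-for-byte identical in both encodings.
ascii_only = "cafe"
print(ascii_only.encode("utf-8") == ascii_only.encode("iso-8859-1"))  # True
```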
     
    Andy Dingley, Oct 15, 2007
    #3
  4. Scripsit Andy Dingley:
    That's what I wrote about, using the (iso-8859-1 encoded) character Â
    (letter A with circumflex accent) as in the original question. I wonder what
    piece of software munged it, but it wasn't anything I was using.

    Two, three or four.

    Yes, at least when the 8-bit encoding is ISO-8859-1.

    The combination "Â " also indicates some other error, since the octet
    combination C2 20 must not appear in UTF-8 encoded data. We have little way
    of knowing what happened, but I'd guess that 20 (which looks like space when
    interpreted according to ISO-8859-1) was some octet in the range 80..9F,
    maybe something that isn't allocated in windows-1252.
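
    The claim about C2 20 can be checked mechanically. In this Python
    sketch (the helper name is made up), the lead byte C2 is followed by
    every possible second byte and passed through a strict UTF-8 decoder:

```python
def valid_after_c2(second: int) -> bool:
    """Return True if the two bytes C2, `second` form well-formed UTF-8."""
    try:
        bytes([0xC2, second]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(valid_after_c2(0x20))  # False: a space cannot follow the lead byte C2
print(valid_after_c2(0xA0))  # True: C2 A0 is U+00A0

# Only continuation bytes (80..BF) are accepted after C2.
accepted = [b for b in range(256) if valid_after_c2(b)]
print(hex(accepted[0]), hex(accepted[-1]), len(accepted))  # 0x80 0xbf 64
```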
    Which unrecognizable "?"? The question mark is recognizable, and so is the
    character "Â", which is what was actually included in the original question.
     
    Jukka K. Korpela, Oct 15, 2007
    #4
  5. still me

    still me Guest

    What's confusing me is that these same characters are present both in
    the HTML page that returns from Paypal via a standard FORM/POST and in
    the page I get when I simulate that with my CGI call. However, they
    only display in the browser as the accented A on the page from the
    CGI; they don't show up at all on the FORM/POST page. I've checked at
    the hex level and it's definitely the same character, so there's no
    difference there.

    Any thoughts as to what would cause these characters to display on
    screen in one case but not in the other? Would overriding the encoding
    spec with something else help with the problem?
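
    A hex-level comparison like the one described can be scripted so
    that the whole body is covered rather than a spot check. A minimal
    Python sketch; the function name and both sample payloads are
    hypothetical, and it compares bodies only, not the HTTP headers that
    tell the browser which encoding to use:

```python
from typing import Optional


def first_difference(a: bytes, b: bytes) -> Optional[int]:
    """Return the offset of the first differing byte, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))  # one payload is a prefix of the other
    return None


# Hypothetical captured bodies from the two request paths.
form_body = b"<td>\xc2\xa0</td>"
cgi_body = b"<td>\xc2\x20</td>"

offset = first_difference(form_body, cgi_body)
print(offset)                                         # 5
print(form_body[offset:offset + 1].hex(), cgi_body[offset:offset + 1].hex())  # a0 20
```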
     
    still me, Oct 15, 2007
    #5
  6. Scripsit still me:
    Yeah, you're doing something wrong.

    To get more specific help, post a more specific question. That means the
    URL. It might not be enough - we might need the CGI script code as well -
    but it would be a start. Without a URL, this is basically just babbling, and
    not even particularly entertaining.
     
    Jukka K. Korpela, Oct 18, 2007
    #6