Odd character display / UTF issue ?

still me

I am working with a simple Paypal shopping cart and having an issue
with an odd character. The cart is called from a web page (what Paypal
normally expects) with a FORM/SUBMIT. I am also calling it via POST
from a cgi program that mimics the FORM submit and just passes the
HTTP headers and content back as received. Tests are from MSIE and
Firefox on Windows XP.

It all works fine, the pages return identically, with one little
glitch. In the case of the call from the CGI program, I see a few
funky characters displayed on screen. The code that is causing them
is easy to find in the source:

<tr class="summary">
<td>Â </td>
<td>Â </td>

Here's the strange part: the exact same characters appear in the
source that returns from the regular FORM/SUBMIT, yet the characters
don't appear in either browser with the FORM/POST. They only appear in
the CGI call. I've verified that the returned page is identical, with
identical source code. HTTP Headers are the same. The funky characters
are the same at the hex level. Both pages use style files, but the
references are all absolute from the server on down and I think they
should resolve the same. Both pages contain both an HTTP content type
header and a META header that specify UTF8, as follows:
HTTP:
Content-Type: text/html; charset=UTF-8
HTML:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Any thoughts? Is this a character set issue? I could change the header
before I issue the page from the CGI if there is a character set that
would work better.

Thanks for any insight,
 
Jukka K. Korpela

Scripsit still me:
> I am working with a simple Paypal shopping cart and having an issue
> with an odd character.

Since you don't include a URL, you're effectively asking for a lump of
guesses.
> In the case of the call from the CGI program, I see a few
> funky characters displayed on screen.

You have some error in your code.
> The code that is causing them
> is easy to find in the source:
>
> <tr class="summary">
> <td>Â </td>
> <td>Â </td>

Excellent! Now remove those funky characters! Problem solved.
> Here's the strange part: the exact same characters appear in the
> source that returns from the regular FORM/SUBMIT, yet the characters
> don't appear in either browser with the FORM/POST.

Sounds like character encoding confusion. Anything that _looks_ like "Â " is
probably something UTF-8 encoded (or distorted UTF-8) interpreted by some
8-bit encoding.
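
To illustrate the kind of confusion meant here, a minimal Python sketch follows. It assumes, purely as a hypothesis, that the table cell holds a no-break space (U+00A0); nothing in the thread confirms what the cell actually contains.

```python
# Hypothetical cell content: a no-break space (U+00A0). This is an assumption.
nbsp = "\u00a0"

# UTF-8 encodes U+00A0 as the two octets C2 A0.
utf8_bytes = nbsp.encode("utf-8")

# A reader that assumes ISO-8859-1 maps each octet to one character:
# C2 becomes "Â" and A0 becomes a no-break space, which looks like "Â ".
misread = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes.hex(), repr(misread))
```

If the browser decodes the same octets as UTF-8, only the invisible no-break space is rendered, which would be consistent with identical bytes displaying differently.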
> HTTP:
> Content-Type: text/html; charset=UTF-8
> HTML:
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Now that might be relevant, but is this really the case? Which encoding do
browsers actually use in interpreting the data?

That is, why don't you include some URLs so that the actual HTTP headers as
well as page content can be analyzed?
> Any thoughts? Is this a character set issue? I could change the header
> before I issue the page from the CGI if there is a character set that
> would work better.

If you are thinking about "trying" different charset parameters, you surely
have an issue with character sets. What is the _actual_ encoding of the
pages? That is, the encoding of the data itself, as opposed to what headers
or meta tags say about it.
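
One way to look at the data itself rather than its labels is to list the non-ASCII octets in the raw response body. The sketch below is only an illustration: `non_ascii_runs` is a hypothetical helper, and the sample bytes are an assumed stand-in for the real page, which was not posted.

```python
import re

def non_ascii_runs(body: bytes):
    """Return (offset, hex string) for each run of octets >= 0x80 in body."""
    return [(m.start(), m.group().hex())
            for m in re.finditer(rb"[\x80-\xff]+", body)]

# Assumed sample, standing in for the bytes actually received from the server.
sample = b"<td>\xc2\xa0</td>"
print(non_ascii_runs(sample))
```

Running this on the bytes received in both cases, before any decoding, shows what was actually sent, independent of the charset declarations.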
 
Andy Dingley

> Sounds like character encoding confusion. Anything that _looks_ like "? " is
> probably something UTF-8 encoded (or distorted UTF-8) interpreted by some
> 8-bit encoding.

No, characters in a UTF-8 encoding interpreted by a tool using non-
UTF-8 encoding will generally generate garbage characters that are
still displayable (the tool thinks that it received two good
characters, they just don't mean anything). Typically it's a pair of
characters, the first of these is some variant of an accented
"A" (they won't all be, but if you see lots of spurious "A"s on a
page, look to UTF-8).

To get the unrecognizable character "?" displayed, then your tool must
have been able to automatically recognise garbage, i.e. bad encodings,
not just bad characters. This usually indicates non-UTF-8 characters
being served as UTF-8, then the tool being unable to process them as
UTF-8. As ASCII is also simultaneously UTF-8 and ISO-8859-*, this is
caused (most likely) by non-ASCII characters with ISO-8859-* encodings
and a UTF-8 content-type.
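
A minimal Python sketch of that situation, assuming an illustrative string ("café") that does not come from the page in question: ISO-8859-1 bytes are decoded as UTF-8, and the decoder substitutes U+FFFD for the octet it cannot interpret. Pure ASCII decodes identically under both encodings.

```python
# Illustrative text only; not taken from the page under discussion.
latin1_bytes = "café".encode("iso-8859-1")   # the "é" becomes the single octet E9

# Decoding as UTF-8: E9 starts a multi-byte sequence that never completes,
# so a tolerant decoder emits the replacement character U+FFFD instead.
shown = latin1_bytes.decode("utf-8", errors="replace")

# ASCII-only data is the same under both encodings.
ascii_bytes = "cafe".encode("ascii")
same = ascii_bytes.decode("utf-8") == ascii_bytes.decode("iso-8859-1")

print(repr(shown), same)
```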
 
Jukka K. Korpela

Scripsit Andy Dingley:
> No, characters in a UTF-8 encoding interpreted by a tool using non-
> UTF-8 encoding will generally generate garbage characters that are
> still displayable

That's what I wrote about, using the (iso-8859-1 encoded) character Â
(letter A with circumflex accent) as in the original question. I wonder what
piece of software munged it, but it wasn't anything I was using.
> (the tool thinks that it received two good
> characters, they just don't mean anything).

Two, three or four.
> Typically it's a pair of
> characters, the first of these is some variant of an accented
> "A"

Yes, at least when the 8-bit encoding is ISO-8859-1.

The combination "Â " also indicates some other error, since the octet
combination C2 20 must not appear in UTF-8 encoded data. We have little way
of knowing what happened, but I'd guess that 20 (which looks like space when
interpreted according to ISO-8859-1) was some octet in the range 80..9F,
maybe something that isn't allocated in windows-1252.
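
The point about C2 20 can be checked with a strict decoder. In this sketch `is_valid_utf8` is a hypothetical helper name; it relies only on Python's built-in strict UTF-8 decoding.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 under strict error handling."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# C2 is a lead byte and must be followed by a continuation byte (80..BF).
print(is_valid_utf8(b"\xc2\x20"))  # 20 is not a continuation byte
print(is_valid_utf8(b"\xc2\xa0"))  # C2 A0 is U+00A0, a well-formed sequence
```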
> To get the unrecognizable character "?" displayed,

Which unrecognizable "?"? The question mark is recognizable, and so is the
character "Â", which is what was actually included in the original question.
 
still me

> No, characters in a UTF-8 encoding interpreted by a tool using non-
> UTF-8 encoding will generally generate garbage characters that are
> still displayable (the tool thinks that it received two good
> characters, they just don't mean anything). Typically it's a pair of
> characters, the first of these is some variant of an accented
> "A" (they won't all be, but if you see lots of spurious "A"s on a
> page, look to UTF-8).

What's confusing me is that these same characters are present in the
HTML page that returns from Paypal via a standard FORM/POST as when I
simulate that with my CGI call. However, they only display in the
browser as the accented A on the page from the CGI; they don't show up
at all on the FORM/POST page. I've checked at the hex level and it's
definitely the same character so there's no difference there.

Any thoughts as to what would cause these characters to display on
screen in one case but not in the other? Would overriding the encoding
spec with something else help with the problem?
 
Jukka K. Korpela

Scripsit still me:
> Any thoughts as to what would cause these characters to display on
> screen in one case but not in the other?

Yeah, you're doing something wrong.

To get more specific help, post a more specific question. That means the
URL. It might not be enough - we might need the CGI script code as well -
but it would be a start. Without a URL, this is basically just babbling, and
not even particularly entertaining.
 
