accidental chinese char

Wim Roffal

When I look at my website, one piece of text is accidentally replaced by a
Chinese character.

The text is (some spaces inserted in the second version to prevent
accidents): toccata+fugue-bwv (toccata + fugue - bwv)
The word fugue is replaced by a Chinese character.

I have Chinese fonts installed in my browser, so that may be part of the
problem.

I tried to repair this with <html lang=en> but that didn't help. What should
I do to get this page displayed correctly?

Thanks,
Wim
 
Steve Pugh

Wim Roffal said:
When I look at my website, one piece of text is accidentally replaced by a
Chinese character.

The text is (some spaces inserted in the second version to prevent
accidents): toccata+fugue-bwv (toccata + fugue - bwv)
The word fugue is replaced by a Chinese character.

Interesting. But you need to post the URL. In order to dissect this
problem we need to be able to see things such as what HTTP headers the
server is sending with the page. Nothing short of the URL will do.

I have Chinese fonts installed in my browser, so that may be part of the
problem.

Well, if you didn't have Chinese fonts (and support for the
appropriate character sets) installed you wouldn't be able to see
Chinese characters, so it's certainly part of the problem in that
sense.

I tried to repair this with <html lang=en> but that didn't help.

Language and character set are two different things. Setting the
language will have no effect on what characters are displayed.

Steve
 
PeterMcC

I see it ok here - "toccata+fugue-bwv" - but are you sure that the test
replicates the conditions that produce the problem on the real page? Can you
see "toccata+fugue-bwv" on your machine when you look at the test page from
the server?
 
Steve Pugh

Wim Roffal said:
http://classiccat.net/test2.html

I guess you see this effect only with Chinese fonts installed.

Please don't top post.

Initially, I see the effect in IE but not in Opera.

Looking at your page's HTTP headers I see that you're not setting a
charset.
http://www.delorie.com/web/headers.cgi?url=http://classiccat.net/test2.html

The browser is then guessing which character encoding to use. IE
guesses UTF-7. If I manually select UTF-7 in Opera then I see the same
thing. If I manually select anything else (within reason) in IE then I
see the characters properly.

I'm no expert on these things so I have no idea why some of the text
in your page is interpreted as Chinese characters under UTF-7.
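
A possible explanation can be worked out by hand. In UTF-7, a "+" opens a run of modified base64 that encodes UTF-16 code units, and a "-" closes it, so "+fugue-" would be read as base64 data rather than as text. The sketch below is a hand-rolled illustration and not IE's actual decoder; the function name is made up for this example. It takes the first 16 bits of the run and checks whether the resulting code unit lies in the CJK Unified Ideographs block (U+4E00 to U+9FFF).

```python
import string

# Standard base64 alphabet: A-Z, a-z, 0-9, then "+" and "/".
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def first_utf7_code_unit(shifted: str) -> int:
    """Return the first UTF-16 code unit encoded by a UTF-7 base64 run.

    `shifted` is the text between "+" and "-" (hypothetical helper,
    written only for this illustration).
    """
    # Each base64 character contributes 6 bits.
    bits = "".join(format(B64_ALPHABET.index(ch), "06b") for ch in shifted)
    # The first 16 bits form one UTF-16 code unit.
    return int(bits[:16], 2)


unit = first_utf7_code_unit("fugue")
print(hex(unit))                  # 0x7ee8
print(0x4E00 <= unit <= 0x9FFF)   # True: inside the CJK Unified Ideographs block
```

The five letters carry 30 bits, so 14 bits are left over after the first code unit. A strict decoder would be expected to reject that remainder, while a lenient one might simply drop it, which would match a single Chinese character appearing in place of "fugue".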

The solution is simple - make sure that your server is sending out an
appropriate charset in the HTTP headers. How you go about this depends
on how much control you have over the server.
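
As a rough illustration of what "sending a charset" means, here is a minimal sketch of a Python WSGI application that declares the encoding in its Content-Type header. Both the choice of ISO-8859-1 and the use of Python are assumptions made for this example; the thread does not say what server or encoding the real site uses.

```python
# Assumed page body for the example; the real page is not shown in the thread.
PAGE = b"<html><body>toccata+fugue-bwv</body></html>"


def app(environ, start_response):
    """Minimal WSGI app that states its character encoding explicitly."""
    headers = [
        # The charset parameter tells the browser how to decode the bytes,
        # so it no longer has to guess. ISO-8859-1 is an assumed choice.
        ("Content-Type", "text/html; charset=ISO-8859-1"),
        ("Content-Length", str(len(PAGE))),
    ]
    start_response("200 OK", headers)
    return [PAGE]
```

To try it locally, the app could be passed to `wsgiref.simple_server.make_server("127.0.0.1", 8000, app)` and served with `serve_forever()`. On a shared host the same header would normally be set through the server's own configuration instead.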

Steve
 
Andrew Tang

Steve Pugh said:
I'm no expert on these things so I have no idea why some of the text
in your page is interpreted as Chinese characters under UTF-7.

Steve

I think it all depends on your operating system. I have my Windows XP set to
use the Chinese character set as default, and on pages where no charset is
defined, it automatically assumes a Chinese font set and would accidentally
interpret certain combinations of characters as a single Chinese character.

Of course, this only affects IE because it's integrated into Windows. :)

Andy
 
Andreas Prilop

Steve Pugh said:
The browser is then guessing which character encoding to use. IE
guesses UTF-7.

One of the zillion bugs in Internet Explorer. It not only guesses
the MIME type of documents but also their encoding ("charset").
If possible, IE takes UTF-7:
<http://groups.google.com/groups?q=IE+UTF-7>
<http://schneegans.de/bugs/ie-utf-7/>

Even more grotesque are the bugs of Outlook Express with UTF-7:
<http://groups.google.com/groups?q=ADW+ACQ>
<http://groups.google.com/groups?th=d0332b6e31d94e97>
See the "References" header line in
<http://google.com/[email protected]&output=gplain>
 
Bertilo Wennergren

One of the zillion bugs in Internet Explorer. It not only guesses
the MIME type of documents but also their encoding ("charset").

Well, what's it supposed to do then, when there is no info about the
encoding, neither in the HTTP header nor in the HTML code? It must
guess, doesn't it?

The HTML 4.01 specification says: "Therefore, user agents must not assume
any default value for the "charset" parameter."

If possible, IE takes UTF-7:

UTF-7 is of course a bad choice, most of the time. But IE still has to
guess. That's not a bug.

The MIME type guessing is another matter. IE sometimes guesses and
overrides what the HTTP header does say. That's a bug.
 
Andreas Prilop

Bertilo Wennergren said:
Well, what's it supposed to do then, when there is no info about the
encoding, neither in the HTTP header nor in the HTML code? It must
guess, doesn't it?

No, the fallback is, of course, ASCII. Remember, such documents that
*could* be UTF-7 theoretically contain only ASCII characters.

UTF-7 is of course a bad choice, most of the time. But IE still has to
guess. That's not a bug.

It *is* a bug. It is nonsense to treat a document with ASCII characters
only (i.e. x00 to x7F) as UTF-7 when the encoding UTF-7 is nowhere
specified.

Of course you would need to guess if you have some bytes above x7F
without any encoding information. But here we have only bytes in the
range x00 to x7F.
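
The point about the byte range can be checked directly. This is a small sketch, assuming the page text from the original post is stored as plain bytes; the helper name is invented for the example.

```python
data = b"toccata+fugue-bwv"


def is_ascii_only(raw: bytes) -> bool:
    """True if every byte is in the range 0x00 to 0x7F."""
    return all(byte <= 0x7F for byte in raw)


print(is_ascii_only(data))    # True
print(data.decode("ascii"))   # toccata+fugue-bwv
```

Read as ASCII, the bytes come back as the text the author typed, with "+" and "-" treated as ordinary characters.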
 
Bertilo Wennergren

No, the fallback is, of course, ASCII.

Of course? "Fallback" sounds very much like "assuming a default
character set", something user agents _must not do_ according to the
HTML 4.01 specification.

Remember, such documents that
*could* be UTF-7 theoretically contain only ASCII characters.

Contain only bytes that could represent ASCII characters if ASCII is
assumed.

What about guessing UTF-8 if certain combinations of bytes are present?

It *is* a bug. It is nonsense to treat a document with ASCII characters
only

It can't be said to have any ASCII characters at all until ASCII has
been assumed as the encoding.

Of course you would need to guess if you have some bytes above x7F
without any encoding information. But here we have only bytes in the
range x00 to x7F.

So guessing is not a bug, per se, only the actual choice of UTF-7 in
that guessing? That's what you mean?

Which encodings could it guess at without it being a bug? Which should
it never guess? Is UTF-7 the only guess that should never happen?
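
To make the question concrete, here is one possible guessing policy, sketched under loud assumptions: it is not what IE or any other browser does, and nothing in HTML 4.01 endorses it. It treats all-low bytes as ASCII, accepts UTF-8 only when the bytes form a valid UTF-8 sequence, and otherwise declines to guess. The function name is hypothetical.

```python
from typing import Optional


def guess_encoding(raw: bytes) -> Optional[str]:
    """Illustrative guess at an encoding when none is declared."""
    # Only bytes 0x00-0x7F: read them as ASCII, never as UTF-7.
    if all(byte <= 0x7F for byte in raw):
        return "ascii"
    # High bytes present: accept UTF-8 only if the sequence is valid.
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return None  # No safe guess; a real browser would need another rule.
    return "utf-8"
```

Under this policy the test page would be read as ASCII, while a page with stray high bytes that are not valid UTF-8 would still leave the guess open.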
 
Leif K-Brooks

Andreas said:
No, the fallback is, of course, ASCII. Remember, such documents that
*could* be UTF-7 theoretically contain only ASCII characters.

The fallback according to who?

It *is* a bug. It is nonsense to treat a document with ASCII characters
only (i.e. x00 to x7F) as UTF-7 when the encoding UTF-7 is nowhere
specified.

May be nonsense, but it isn't a bug. If you don't specify an encoding, a
browser can do *whatever* it likes. That can mean guessing ASCII, or
guessing the encoding I invented the other day that won't work with any
existing text.
 
