accidental Chinese char

Wim Roffal · Nov 13, 2003

When I look at my website, one piece of text is accidentally replaced by a
Chinese character.

The text is (some spaces inserted in the second version to prevent
accidents): toccata+fugue-bwv (toccata + fugue - bwv)
The word fugue is replaced by a Chinese character.

I have Chinese fonts installed in my browser, so that may be part of the
problem.

I tried to repair this with <html lang=en> but that didn't help. What should
I do to get this page displayed correctly?

Thanks,
Wim

Steve Pugh · Nov 13, 2003

Wim Roffal said:
When I look at my website, one piece of text is accidentally replaced by a
Chinese character.

The text is (some spaces inserted in the second version to prevent
accidents): toccata+fugue-bwv (toccata + fugue - bwv)
The word fugue is replaced by a Chinese character.

Interesting. But you need to post the URL. In order to dissect this
problem we need to be able to see things such as what HTTP headers the
server is sending with the page. Nothing short of the URL will do.

I have Chinese fonts installed in my browser, so that may be part of the
problem.

Well, if you didn't have Chinese fonts (and support for the
appropriate character sets) installed you wouldn't be able to see
Chinese characters, so it's certainly part of the problem in that
sense.

I tried to repair this with <html lang=en> but that didn't help.

Language and character set are two different things. Setting the
language will have no effect on what characters are displayed.

Steve

Wim Roffal · Nov 13, 2003

http://classiccat.net/test2.html

I guess you see this effect only with Chinese fonts installed.

Wim

PeterMcC · Nov 13, 2003

I see it ok here - "toccata+fugue-bwv" - but are you sure that the test
replicates the conditions that produce the problem on the real page? Can you
see "toccata+fugue-bwv" on your machine when you look at the test page from
the server?

Andrew Tang · Nov 13, 2003

Wim Roffal said:
http://classiccat.net/test2.html

I guess you see this effect only with Chinese fonts installed.

Wim

Try using the meta tag content-type to define the character set that you
want:
Example:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

Steve Pugh · Nov 13, 2003

Wim Roffal said:
http://classiccat.net/test2.html

I guess you see this effect only with Chinese fonts installed.

Please don't top post.

Initially, I see the effect in IE but not in Opera.

Looking at your page's HTTP headers I see that you're not setting a
charset.
http://www.delorie.com/web/headers.cgi?url=http://classiccat.net/test2.html
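
The same check can be made locally by testing whether a Content-Type header value carries a charset parameter at all. The sketch below uses only Python's standard library. The helper name declared_charset is made up for illustration, and the header strings are examples, not the actual response from classiccat.net.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None if absent.
    return msg.get_content_charset()


print(declared_charset("text/html"))                      # no charset declared
print(declared_charset("text/html; charset=ISO-8859-1"))  # charset declared
```

When the function returns None, the browser has nothing to go on from the header and falls back to a meta tag in the page or to its own guess.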

The browser is then guessing which character encoding to use. IE
guesses UTF-7. If I manually select UTF-7 in Opera then I see the same
thing. If I manually select anything else (within reason) in IE then I
see the characters properly.

I'm no expert on these things so I have no idea why some of the text
in your page is interpreted as Chinese characters under UTF-7.
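
One plausible explanation comes from how UTF-7 (RFC 2152) works: a "+" opens a run of modified Base64 that encodes UTF-16 code units, and a "-" closes it. In toccata+fugue-bwv the letters of "fugue" are all valid Base64 digits, so "+fugue-" can be read as encoded data. The sketch below is a deliberately lenient decoder written for illustration. It assumes that leftover bits which do not fill a 16-bit unit are silently dropped, which is a guess about IE's behaviour and not something confirmed in this thread. The function name is hypothetical.

```python
import string

# Modified Base64 alphabet used inside UTF-7 shifted runs.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def lenient_utf7_decode(text: str) -> str:
    """Decode UTF-7-style shifted runs, ignoring incomplete trailing bits."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "+":
            out.append(text[i])  # directly encoded character
            i += 1
            continue
        # "+" opens a shifted run: collect 6 bits per Base64 digit.
        j = i + 1
        bits = ""
        while j < len(text) and text[j] in B64:
            bits += format(B64.index(text[j]), "06b")
            j += 1
        if not bits:
            out.append("+")  # "+-" (or a bare "+") stands for a literal plus
        else:
            # Keep only complete 16-bit UTF-16 code units; drop the remainder.
            units = [int(bits[k:k + 16], 2) for k in range(0, len(bits) - 15, 16)]
            raw = b"".join(u.to_bytes(2, "big") for u in units)
            out.append(raw.decode("utf-16-be", errors="replace"))
        if j < len(text) and text[j] == "-":
            j += 1  # the closing "-" is absorbed
        i = j
    return "".join(out)


decoded = lenient_utf7_decode("toccata+fugue-bwv")
print([hex(ord(c)) for c in decoded if ord(c) > 0x7F])  # code points outside ASCII
print(lenient_utf7_decode("toccata + fugue - bwv"))      # spaces break the run
```

Under these assumptions the five Base64 digits give 30 bits. The first 16 form the code unit 0x7EE8, which lies in the CJK Unified Ideographs block, and the remaining 14 bits are discarded. That would account for "fugue" and the surrounding "+" and "-" collapsing into a single Chinese character, and for the spaced version being unaffected.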

The solution is simple - make sure that your server is sending out an
appropriate charset in the HTTP headers. How you go about this depends
on how much control you have over the server.
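
As one hedged illustration of what "sending a charset" means, here is a tiny Python standard-library server that puts an explicit charset on every HTML response. The handler class, the make_server helper, the page body and the choice of ISO-8859-1 are all assumptions for this sketch. On a hosted site the equivalent would normally be a setting in the server configuration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Example page body; an assumption standing in for the real test page.
PAGE = "<html><body>toccata+fugue-bwv</body></html>"


class CharsetHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("iso-8859-1")
        self.send_response(200)
        # The charset parameter tells the browser how to decode the bytes,
        # so it does not have to guess.
        self.send_header("Content-Type", "text/html; charset=ISO-8859-1")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; remove to see request logs


def make_server(port: int = 0) -> HTTPServer:
    """Bind to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), CharsetHandler)
```

Calling make_server(8000).serve_forever() would serve the page on port 8000 with the charset declared in the header.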

Steve

Andrew Tang · Nov 13, 2003

Steve Pugh said:
I'm no expert on these things so I have no idea why some of the text
in your page is interpreted as Chinese characters under UTF-7.

Steve

I think it all depends on your operating system. I have my Windows XP use
the Chinese character set as default, and on pages where no charset is
defined, it automatically assumes a Chinese font set and would accidentally
interpret certain combinations of characters as a single Chinese character.

Of course, this only affects IE because it is integrated into Windows.

Andy

Andreas Prilop · Nov 13, 2003

Steve Pugh said:
The browser is then guessing which character encoding to use. IE
guesses UTF-7.

One of the zillion bugs in Internet Explorer. It not only guesses
the MIME type of documents but also their encoding ("charset").
If possible, IE takes UTF-7:
<http://groups.google.com/groups?q=IE+UTF-7>
<http://schneegans.de/bugs/ie-utf-7/>

Even more grotesque are the bugs of Outlook Express with UTF-7:
<http://groups.google.com/groups?q=ADW+ACQ>
<http://groups.google.com/groups?th=d0332b6e31d94e97>
See the "References" header line in
<http://google.com/[email protected]&output=gplain>

Bertilo Wennergren · Nov 13, 2003

One of the zillion bugs in Internet Explorer. It not only guesses
the MIME type of documents but also their encoding ("charset").

Well, what's it supposed to do then, when there is no info about the
encoding, neither in the HTTP header nor in the HTML code? It must
guess, doesn't it?

From the HTML 4.01 specification:
Therefore, user agents must not assume any default value for the
"charset" parameter.

If possible, IE takes UTF-7:

UTF-7 is of course a bad choice, most of the time. But IE still has to
guess. That's not a bug.

The MIME type guessing is another matter. IE sometimes guesses and
overrides what the HTTP header does say. That's a bug.

Andreas Prilop · Nov 13, 2003

Bertilo Wennergren said:
Well, what's it supposed to do then, when there is no info about the
encoding, neither in the HTTP header nor in the HTML code? It must
guess, doesn't it?

No, the fallback is, of course, ASCII. Remember, such documents that
*could* be UTF-7 theoretically contain only ASCII characters.

UTF-7 is of course a bad choice, most of the time. But IE still has to
guess. That's not a bug.

It *is* a bug. It is nonsense to treat a document with ASCII characters
only (i.e. x00 to x7F) as UTF-7 when the encoding UTF-7 is nowhere
specified.

Of course you would need to guess if you have some bytes above x7F
without any encoding information. But here we have only bytes in the
range x00 to x7F.
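
The point about the byte range can be illustrated with a short check. This is a minimal sketch, assuming the page content is the string from the original post: every byte is below 0x80, so ASCII, ISO-8859-1 and UTF-8 all decode it to the same text.

```python
data = b"toccata+fugue-bwv"

# True when no byte has the high bit set (0x00 to 0x7F only).
print(all(b < 0x80 for b in data))

# ASCII-compatible encodings agree on such input.
decodings = {enc: data.decode(enc) for enc in ("ascii", "iso-8859-1", "utf-8")}
print(len(set(decodings.values())) == 1)
```

UTF-7 is the odd one out because it gives "+" and "-" a special meaning within that same byte range.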

Bertilo Wennergren · Nov 13, 2003

No, the fallback is, of course, ASCII.

Of course? "Fallback" sounds very much like "assuming a default
character set", something user agents _must not do_ according to the
HTML 4.01 specification.

Remember, such documents that
*could* be UTF-7 theoretically contain only ASCII characters.

Contain only bytes that could represent ASCII characters if ASCII is
assumed.

What about guessing UTF-8 if certain combinations of bytes are present?

It *is* a bug. It is nonsense to treat a document with ASCII characters
only

It can't be said to have any ASCII characters at all until ASCII has
been assumed as the encoding.

Of course you would need to guess if you have some bytes above x7F
without any encoding information. But here we have only bytes in the
range x00 to x7F.

So guessing is not a bug, per se, only the actual choice of UTF-7 in
that guessing? That's what you mean?

Which encodings could it guess at without it being a bug? Which should
it never guess? Is UTF-7 the only guess that should never happen?

Leif K-Brooks · Nov 13, 2003

Andreas said:
No, the fallback is, of course, ASCII. Remember, such documents that
*could* be UTF-7 theoretically contain only ASCII characters.

The fallback according to who?

It *is* a bug. It is nonsense to treat a document with ASCII characters
only (i.e. x00 to x7F) as UTF-7 when the encoding UTF-7 is nowhere
specified.

May be nonsense, but it isn't a bug. If you don't specify an encoding, a
browser can do *whatever* it likes. That can mean guessing ASCII, or
guessing the encoding I invented the other day that won't work with any
existing text.
