Setting language to UTF-8

Terence Parker

I currently have at the beginning of my sites:

<html lang="utf-8">
<head>
<title>Some imaginative title....</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

Scouring the web for several other websites that set the character set, this
is the exact tag used. However, it doesn't work for me. And not just me -
when my code is viewed on any browser by any person the system doesn't use
UTF-8 to render the page... it uses whatever default is on that system.

How do I force a browser to use the correct character set? This seems to
work with other languages... just not Unicode.

Any ideas anyone?

Thanks,
Terence
 
Leif K-Brooks

Terence said:
I currently have at the beginning of my sites:

<html lang="utf-8">
<head>
<title>Some imaginative title....</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>

The lang attribute is for human languages (English, French, etc.), not
character sets. Set the character set in the HTTP Content-Type header or
your XML declaration.
How do I force a browser to use the correct character set? This seems to
work with other languages... just not Unicode.

You can't force a browser to do anything, period.
 
Terence Parker

The <html lang="utf-8"> tag was something I added in more recently in
desperation, seeing as the Content-Type tag didn't work. I do, separately,
have a tag that reads:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

- but that doesn't do anything. Yes, I can't 'force' a browser to do
anything, but assuming that one has their browser configured to
automatically detect the character set it should change to UTF-8 upon seeing
the above tag. But it doesn't.

The text (Chinese) on the page *does* work... i.e. if you manually set the
encoding to UTF-8 on the browser - but it's just not selected automatically,
which the browser does seem to do for other sets (like big5 for example).

What I don't understand is why no browser would set the character set to
UTF-8 when viewing my pages.

Terence
 
Toby A Inkster

Terence said:
The <html lang="utf-8"> tag was something I added in more recently in
desperation,

Well, get rid of it quick! As Leif said, lang is for human languages, eg
"en-GB" (English), "fr" (French) or "de" (German).
seeing as the Content-Type tag didn't work.

Then set the Content-Type HTTP header.
 
Jukka K. Korpela

Terence Parker said:
The <html lang="utf-8"> tag was something I added in more recently in
desperation, seeing as the Content-Type tag didn't work.

It's generally not productive to throw tag salad around just because you
don't know what's going on. You will just make things worse. And why did you
write the Subject line the way you did? It does not describe the problem at
all, just a misguided attempt to solve an unspecified problem. More hints on
how to post constructively:
http://www.cs.tut.fi/~jkorpela/usenet/dont.html
I do,
separately, have a tag that reads:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

- but that doesn't do anything.

Why don't you post the URL instead of code snippets?

Believe me, the URL _is_ relevant. You may not realize this yet, but that
would indicate that you don't understand the basics of this "charset" thing.
Yes, I can't 'force' a browser to do
anything, but assuming that one has their browser configured to
automatically detect the character set it should change to UTF-8 upon
seeing the above tag.

No, a browser _must not_ do that _except_ when the HTTP headers do not
indicate the encoding. And the headers _should_ indicate the encoding.
The text (Chinese) on the page *does* work... i.e. if you manually set
the encoding to UTF-8 on the browser - but it's just not selected
automatically, which the browser does seem to do for other sets (like
big5 for example).

Well, _if_ the HTTP headers fail to indicate the encoding, then browsers
_should_ use the plastic imitation, the <meta> tag. Now there's the
possibility that your actual document contains a typo in that tag (sorry,
the crystal ball is dim). Or there might be something odd in the situation, but
we really need the URL for a starter. In future, please save a few rounds of
iteration and post the URL in the original question.
 
Terence Parker

Not to be ungrateful, but I think the productivity of this original post
hasn't been what I originally planned.
It's generally not productive to throw tag salad around just because you
don't know what's going on. You will just make things worse. And why did you
write the Subject line the way you did? It does not describe the problem at
all, just a misguided attempt to solve an unspecified problem. More hints on
how to post constructively:
http://www.cs.tut.fi/~jkorpela/usenet/dont.html

When people ask questions on forums before they are willing to try out
various things and try the obvious, they are flamed for not taking the time
to attempt to solve it themselves. And when I experiment by trying out
another tag which I feel might solve the problem? The result is that people
still complain. It seems that nobody can be pleased these days.

My use of the lang clause in the HTML tag did not cause the problem nor did
it interfere with the problem - and since it did not fix the problem either,
it has since been removed from my code. I really don't see what a big deal
it is... it's not going to cause the self-destruction of my website or
anything.

And regarding the 'unspecified problem' - I feel it was quite well specified
to begin with. The offending tag was included in the code, with the exact
problem described in my original posting : browsers aren't using UTF-8 to
display the page despite the Content-Type setting telling it to do so. What
more does one need to know? While I'm at it there are several PHP-related
questions I can ask, but they would probably be of no interest to anyone in
this forum.

And no doubt I would only get bashed for asking something off-topic as a
result then anyway.
Why don't you post the URL instead of code snippets?
Believe me, the URL _is_ relevant. You may not realize this yet, but that
would indicate that you don't understand the basics of this "charset"
thing.

The reason I don't post a URL is because I'm working on something that
interfaces with a database and as the code was in its early stages at the time
the site was not something I wanted to be publicly accessible. Even so, I do
not see how any of the other tags on the page would be of any use /
relevance to anyone. I have, actually, gone through the trouble of looking
at other sites first before posting my original message - just in case
anyone thinks I posted it straight away and was so lazy I couldn't even be
bothered to do research. All websites looked at (including google groups -
which uses utf-8) use the exact same tag I used, and nothing else anywhere
in the page. Do you mean to tell me that by some remarkably strange
coincidence all my pages happen to require an extra tag which nobody else's
does? I highly doubt it.

So, at the moment, I really don't see how the URL is relevant, beyond being
able to view the source and see the same tag which I have included in my
original posting anyway. However, for the benefit of everyone's curiosity,
the pages can be found at http://intranet.shatincollege.edu.hk/prm/index.php
No, a browser _must not_ do that _except_ when the HTTP headers do not
indicate the encoding. And the headers _should_ indicate the encoding.

Thank-you... well... at least I learned something in that paragraph.

Though okay, I'm still puzzled as to why the headers I'm using failed to
indicate the encoding.
Well, _if_ the HTTP headers fail to indicate the encoding, then browsers
_should_ use the plastic imitation, the <meta> tag. Now there's the
possibility that your actual document contains a typo in that tag (sorry,
the crystal ball is dim). Or there might be something odd in the situation, but
we really need the URL for a starter. In future, please save a few rounds of
iteration and post the URL in the original question.

I believe I went through the reasons why I didn't post the URL already but,
again, I have included it above if it really is going to be of help. Unless
I am totally blind (which I wouldn't entirely rule out) I cannot see a typo
in the META tag used as compared to META tags used in most other websites.

Terence
 
Terence Parker

.... and please excuse the typos. It's past midnight and way past my bedtime!

Terence
 
Steve Pugh

Terence Parker said:
So, at the moment, I really don't see how the URL is relevant, beyond being
able to view the source and see the same tag which I have included in my
original posting anyway. However, for the benefit of everyone's curiosity,
the pages can be found at http://intranet.shatincollege.edu.hk/prm/index.php

Good, now we can see what the HTTP headers are, and so can you.

http://www.delorie.com/web/headers.cgi?url=http://intranet.shatincollege.edu.hk/prm/index.php
Thank-you... well... at least I learned something in that paragraph.

Now all you need to do is apply it.
Though okay, I'm still puzzled as to why the headers I'm using failed to
indicate the encoding.

The headers you use say that the page is ISO-8859-1. So that's the
encoding that browsers use.
Unless I am totally blind (which I wouldn't entirely rule out) I cannot see a typo
in the META tag used as compared to META tags used in most other websites.

Remember that paragraph that you learned something from?
Your meta tag is 100% irrelevant.
Browsers MUST ignore it as there is a character set specified in the
HTTP header.

The solution to your problem is to change your HTTP header.
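
As a rough illustration of the rule described above (not any browser's actual implementation), the Python sketch below takes the charset from the HTTP Content-Type header when one is present and falls back to the <meta> tag only otherwise. The function names and the sample page are made up for the example, and the <meta> scan is deliberately simplified.

```python
import re
from email.message import Message


def charset_from_http(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased; None when absent


def charset_from_meta(body):
    """Very simplified scan for charset=... near the start of the page."""
    match = re.search(rb'charset=["\']?([A-Za-z0-9_.:-]+)', body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type, body):
    """The HTTP header wins; the <meta> tag is only a fallback."""
    return charset_from_http(content_type) or charset_from_meta(body)


# A hypothetical UTF-8 page carrying the same <meta> tag as in the question.
page = ('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        '沙田學院').encode("utf-8")

with_header = effective_charset("text/html; charset=ISO-8859-1", page)
without_header = effective_charset("text/html", page)

# Decoding UTF-8 bytes as ISO-8859-1 does not fail, it just yields garbage.
print(with_header, "->", repr(page.decode(with_header)[-12:]))
print(without_header, "->", page.decode(without_header)[-4:])
```

With the ISO-8859-1 header the <meta> tag is never consulted, which matches the symptom in this thread: the Chinese text only appears once the encoding is switched to UTF-8 by hand.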

Steve
 
DU

Terence Parker wrote:

http://intranet.shatincollege.edu.hk/prm/index.php

Terence, I don't understand why you are not setting the charset via http
headers to "en" instead of utf-8 and then setting lang attribute of the
single line of Chinese text to BIG5. Many so far told you that lang
attribute only takes human languages as defined by iso-639 norm; yet,
you keep using utf-8.

http://lcweb.loc.gov/standards/iso639-2/langcodes.html

Only 1 sentence is in Chinese in your file. So there is really no need
to set the whole document charset to utf-8 in the first place.
Finally, for the sake of web interoperability across multiple charset
and language, I really think you should write an entirely validated html
file. As written, your file is not valid and resorts to a bad design
technique (tables) and several deprecated html elements (center, font).

My 2 cents

DU
 
Jukka K. Korpela

DU said:
Terence Parker wrote:

http://intranet.shatincollege.edu.hk/prm/index.php

Terence, I don't understand why you are not setting the charset via http
headers to "en" instead of utf-8 and then setting lang attribute of the
single line of Chinese text to BIG5.
[ corrected to lang="zh" in a later posting - I wonder why you didn't
supersede ]

Sorry, but this is astonishingly strange. We've discussed the issue at
length, and the page _still_ contains the nonsensical lang="utf-8", and now
you are adding to the confusion by suggesting that charset be set to "en",
which would be just as nonsensical but with more serious consequences.
Many so far told you that lang
attribute only takes human languages as defined by iso-639 norm;

Well, basically so, although in principle you can use x-whatever-you-like
too, it just won't (normally) be useful at all.
yet, you keep using utf-8.

That's strange indeed, but hardly causes much damage. Setting charset to
"en" in HTTP would mean setting it to an undefined value and letting the
browser play its guessing game.

What the page (the server) does _right_ is the HTTP header that specifies
UTF-8. This is absolutely the right thing, when the data is UTF-8 encoded,
no matter what language (if any) the content is.

And setting lang="zh" would have nothing to do with the character encoding
issue. It would be adequate in principle, even a priority 1 requirement in
WAI guidelines, to declare the language of a fragment that way. But that
does _not_ affect the encoding issues.
Only 1 sentence is in Chinese in your file. So there is really no need
to set the whole document charset to utf-8 in the first place.

Yes there is. The encoding is a property of a document, and it's always the
same for the entire document. You cannot switch the encoding. (It is true
that there is an encoding that permits certain switching _inside_ it, but
that's encoding-level issue, not very useful, not much used, and has nothing
to do with HTML markup.)

What _could_ be done is writing the document in, say, US-Ascii encoding,
using character references (&#number;) for anything outside US-Ascii.
But there's no special reason to do that, especially after the page has been
written in UTF-8.
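
For illustration only, here is a small Python sketch of that alternative: the Chinese line from the page is turned into numeric character references so the markup itself stays pure ASCII. The variable names are invented for the example.

```python
import html

text = "沙田學院 :: 家長通訊系統"

# Replace every character outside ASCII with a numeric reference (&#NNNNN;).
ascii_markup = text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
print(ascii_markup)

# A browser resolves the references back to the same characters,
# whatever encoding the document itself is labelled with.
restored = html.unescape(ascii_markup)
```

The result is harder to read and edit in source form, which is one reason there is little point in doing this once the page is already written in UTF-8.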

And since the page contains a form, there can be a special reason to use
UTF-8. Browsers normally send form data in the character encoding of the
page containing the form. In fact, this is the only way in practice to
set the character encoding of the form data. In this case, the data is
probably all Ascii, so this doesn't matter, but it's important on pages
containing e.g. search forms.
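
A minimal Python sketch of why this matters, assuming a hypothetical search term typed into a form: the same two characters are percent-encoded into different bytes depending on the encoding of the page that contains the form, so the server has to decode with the same encoding.

```python
from urllib.parse import quote, unquote

query = "沙田"  # hypothetical search term

# What would be submitted from a UTF-8 page versus a Big5 page.
sent_from_utf8_page = quote(query, encoding="utf-8")
sent_from_big5_page = quote(query, encoding="big5")
print(sent_from_utf8_page)
print(sent_from_big5_page)

# Decoding with the matching encoding recovers the term ...
ok = unquote(sent_from_big5_page, encoding="big5")
# ... while a mismatched decode silently produces different text.
wrong = unquote(sent_from_utf8_page, encoding="big5", errors="replace")
```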
Finally, for the sake of web interoperability across multiple charset
and language, I really think you should write an entirely validated html
file. As written, your file is not valid and resorts to a bad design
technique (tables) and several deprecated html elements (center, font).

It's tag soup, alright. But this doesn't really affect the encoding issues.
 
DU

Jukka said:
Terence Parker wrote:

http://intranet.shatincollege.edu.hk/prm/index.php

Terence, I don't understand why you are not setting the charset via http
headers to "en" instead of utf-8 and then setting lang attribute of the
single line of Chinese text to BIG5.

[ corrected to lang="zh" in a later posting - I wonder why you didn't
supersede ]

Sorry, but this is astonishingly strange. We've discussed the issue at
length, and the page _still_ contains the nonsensical lang="utf-8", and now
you are adding to the confusion by suggesting that charset be set to "en",
which would be just as nonsensical but with more serious consequences.

DOH!! I got confused, mixed up myself!
That's strange indeed, but hardly causes much damage. Setting charset to
"en" in HTTP would mean setting it to an undefined value and letting the
browser play its guessing game.

Sorry. Meant to say charset=iso-8859-1
and then only use character entities for the single line of Chinese.
That is what I would have tried.

Sorry I got mixed up! What I meant to say is this:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

<html lang="en">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Language" content="en">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="Content-Script-Type" content="text/javascript">

<title>Sha Tin College PRM:: Main Menu</title>
</head>

<body>

(...)

<b><span lang="zh">沙田學院 :: 家長通訊系統</span></b><br>

and sorry I couldn't reply to your post earlier. My apologies!

DU
 
DU

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">

<html lang="en">

<head>

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta http-equiv="Content-Language" content="en">
<meta http-equiv="Content-Style-Type" content="text/css">
<meta http-equiv="Content-Script-Type" content="text/javascript">

<title>Sha Tin College PRM:: Main Menu</title>
</head>

<body>

(...)

<b><span lang="zh">沙田學院 :: 家長通訊系統</span></b><br>

Argh... I fumbled again !! Gulp!

<b><span lang="zh">沙田學院 ::
家長通訊系統</span></b><br>

:)

DU
 
Toby A Inkster

DU said:
<b><span lang="zh">沙田學院 :: 家長通訊系統</span></b><br>

I am very pleased with my newsreader being able to display this correctly!
And my browser too.
 
Terence Parker

This thread is quite old now but I forgot to check back on it -
actually I went on a trip abroad.

I admit defeat in that the header is defined by the webserver - when
it was first mentioned, the word 'webserver' wasn't used and 'header'
to me simply meant the header tags of the HTML - not the HTTP header
sent out by Apache. I have now fixed the problem by, as someone
suggested, setting the header explicitly in my Apache configuration.
However - a few responses to the replies I received.

1. lang="utf-8" - yes, alright, I set it wrong ... but as someone did
mention that shouldn't cause any problems. And indeed it didn't. It
may confuse the browser, but it won't screw anything up. Anyways that
has now been corrected.

2. Of course I have to define the language set even if just for one
sentence. If I don't define the character set, even that one sentence
won't be visible. Then... what's the point?

3. My problem is because I have upgraded to Apache2 from Apache 1.3.x
- and I notice that Apache now explicitly sends off language
information to the browser. I actually find this annoying. Why?
Because for some sites I host there are multiple languages between the
pages - and if Apache were to force the browser to be one language,
then it effectively can't serve different pages of differing languages
under the same virtual host (no, I'm not talking about multiple
languages within one page here - I know you can't do that). Ideally I
want this shut off completely and for my HTML pages to resume the job
of defining the charset. I don't want Apache doing it for me.

And why don't I use UTF-8 for everything? Because, while that is the
ideal for compatibility between languages, fact of the matter is UTF-8
has entered the world too late. Languages such as BIG5 / GB have
become so dominant in Asia that these are native to most software, NOT
UTF. And that goes for websites in this part of the world too.

Anyways - thanks to all that replied. At least my problem is partially
solved now.

Terence
 
Jukka K. Korpela

1. lang="utf-8" - yes, alright, I set it wrong ...

Honestly, I think a period would be the right punctuation here, not
ellipsis (three dots).
2. Of course I have to define the language set even if just for one
sentence. If I don't define the character set, even that one
sentence won't be visible. Then... what's the point?

Presumably "language" means "character" here. Otherwise the statement
does not make sense. And you should _always_ make sure your server
sends character encoding information (charset parameter), though the
need becomes really apparent if you use an encoding other than
iso-8859-1 or relatives.
3. My problem is because I have upgraded to Apache2 from Apache
1.3.x - and I notice that Apache now explicitly sends off language
information to the browser.

Which language information? I think you are confusing language with
character encoding, again. This is actually _very_ common, but that
doesn't make the confusion any less problematic.

I don't see any _language_ headers (Content-Language) if I access e.g.
http://parker.com.hk (which resides on an Apache 2 server). Just quite
normal and common HTTP headers.
I actually find this annoying. Why?

A good question. You shouldn't be annoyed, if it's really the charset
you mean. It should always be included. If your problem is that the
server does not send the _correct_ parameter value, then this needs to
be fixed, in a server-dependent manner, which is probably rather easy
as soon as you have the correct documentation and have a picture
(figuratively speaking) of your web site structure. You cannot override
the HTTP charset parameter in any HTML tag, since the former by
definition has preference.
Because for some sites I host there are multiple languages between
the pages

Again, languages are not the issue; character encodings are, though
naturally the language has an impact on the repertoire of feasible
encodings. If you have pages with different encodings, then the
simplest way, on Apache, is to put files in one encoding into one
directory and create a .htaccess file in that directory, with a
suitable directive to Apache in it, e.g.
AddType text/html;charset=utf-8 HTML
Ideally I want this shut off completely and for my
HTML pages to resume the job of defining the charset.

Whether you can do that depends on Apache 2. Have you checked its
documentation? I would guess that using an AddType without a charset
parameter would do it. But that's really _not_ the WWW way. The WWW way
is to specify the encoding in actual HTTP headers, and <meta> tags are
just surrogates that some people need to resort to (and that _might_ be
included for certain reasons even when you have made the server send
adequate headers).
And why don't I use UTF-8 for everything? Because, while that is
the ideal for compatibility between languages, fact of the matter
is UTF-8 has entered the world too late.

Or too early. But it is true that UTF-8 is _inefficient_ for most East
Asian languages.
Languages such as BIG5 /
GB have become so dominant in Asia that these are native to most
software, NOT UTF.

Again, encodings, not languages. And the software needs to grow up.
UTF-8 is the way the WWW and the Internet are going, in the sense that
support for UTF-8 is the primary goal (according to official IETF
policy) - any new protocols and software _should_ support it and
_may_ support other encodings.

Support for BIG5 and GB is probably so widespread in situations where
Chinese can be read in the first place that it's probably practical to
encode your documents in Chinese using either of them, so I'm not
arguing against the point that there are good reasons to use different
encodings for pages on a server.
 
Andreas Prilop

Jukka K. Korpela said:
But it is true that UTF-8 is _inefficient_ for most East
Asian languages.

No, why? Because you need three bytes instead of two bytes for one
character? Nested tables, e.g., are a heavier crime than big files.
And a simple image will usually be bigger than your HTML text.
Let's not forget that some editors blow up your source by inserting
90 % space characters.
All this is, IMHO, more severe than using three bytes for one
Chinese character.
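
To put numbers on the "three bytes instead of two" point, a quick Python check on the four characters of the school's name from the page discussed above (the sample string is just an example; other text, especially text mixed with ASCII markup, will give different ratios):

```python
text = "沙田學院"

utf8_size = len(text.encode("utf-8"))  # 3 bytes per character for this text
big5_size = len(text.encode("big5"))   # 2 bytes per character for this text

print(len(text), "characters:", utf8_size, "bytes in UTF-8,",
      big5_size, "bytes in Big5")
```

Any ASCII markup around the text takes one byte per character in both encodings, so the overall difference for a whole page is smaller than the per-character figure suggests.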
 
