Content-Language or lang attribute?

Michael Wilcox

When sending an HTML 4.01 document, which is preferable for identifying
the primary language of the document body, the Content-Language header
field or the lang attribute for the <html> element?

In my case, I can send the Content-Language with the PHP header()
function, but I cannot do anything more sophisticated.
 
Jukka K. Korpela

Michael Wilcox said:
When sending an HTML 4.01 document, which is preferable for
identifying the primary language of the document body, the
Content-Language header field or the lang attribute for the <html>
element?

The lang attribute, because it specifies by definition the language of
the content and is known to be recognized by some user agents, though not
without problems. The Content-Language header specifies - if you read it
carefully - the language(s) that the intended audience is expected to
know. The distinction might sound fairly theoretical, of course. But on
the practical side, is there any user agent that reacts to a
Content-Language header?

Besides, the lang attribute is preserved when the document is saved
locally. The HTTP headers are usually not preserved, regrettably.
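For illustration, here is a minimal sketch of sending both signals at once. It uses Python's WSGI interface rather than the PHP header() call mentioned in the question, and the names app and PAGE as well as the page text are made up for the example.

```python
# Minimal WSGI application that declares the language twice:
# once in the HTTP header and once in the markup itself.
PAGE = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n'
    '<html lang="en">\n'  # travels with the file when it is saved locally
    "<head><title>Example</title></head>\n"
    "<body><p>Hello.</p></body>\n"
    "</html>\n"
)


def app(environ, start_response):
    """Serve PAGE with a Content-Language header alongside the lang attribute."""
    body = PAGE.encode("utf-8")
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        # Lives only in the HTTP response; it is lost once the page is saved.
        ("Content-Language", "en"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

Only the lang attribute is part of the document itself, which is the point made above about saved copies.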
 
Michael Wilcox

Jukka said:
But on
the practical side, is there any user agent that reacts to a
Content-Language header?

None that I know of, but I don't know much about a browser's inner
workings. Is there a good reason to send the Content-Language header, then?

Besides, the lang attribute is preserved when the document is saved
locally. The HTTP headers are usually not preserved, regrettably.

Would specifying the attribute and the header be overkill?
 
David Dorward

Jukka said:
But on the practical side, is there any user agent that reacts to a
Content-Language header?

You're tempting me to fire up vim and write one.

Does anybody need a tool that will go through their bookmarks, make a few
HEAD requests, and then inform them that nobody bothered putting language
information there? :)
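A rough sketch of what such a checker might look like, in Python rather than anything written in vim. The function names (html_lang, report, check_url) are invented for this example, and check_url makes an extra GET request because a HEAD response carries no markup to inspect.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _HtmlLangFinder(HTMLParser):
    """Records the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None
        self.seen_html = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "html" and not self.seen_html:
            self.seen_html = True
            self.lang = dict(attrs).get("lang")


def html_lang(markup):
    """Return the lang attribute of the <html> element, or None."""
    finder = _HtmlLangFinder()
    finder.feed(markup)
    return finder.lang


def report(headers, markup):
    """Summarise the language information found in headers and markup."""
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "content_language": lowered.get("content-language"),
        "html_lang": html_lang(markup),
    }


def check_url(url):
    """Network part: HEAD for the header, GET for the markup."""
    with urlopen(Request(url, method="HEAD"), timeout=10) as head:
        headers = dict(head.headers.items())
    with urlopen(url, timeout=10) as page:
        markup = page.read().decode("utf-8", errors="replace")
    return report(headers, markup)
```

Running check_url over a bookmark list would show which pages declare nothing at all.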
 
Toby A Inkster

Michael said:
When sending an HTML 4.01 document, which is preferable for identifying
the primary language of the document body, the Content-Language header
field or the lang attribute for the <html> element?

What's wrong with using both?
 
Mitja

Michael Wilcox said:
None that I know of, but I don't know much about a browser's inner
workings. Is there a good reason to send the Content-Language header, then?

Would specifying the attribute and the header be overkill?

It'd just mean being sure. No specs say anything against it, AFAIK - why
would they? Although I must admit I don't really understand why supplying
the content language is so crucial. It is a Thing To Be Done, all right, so
I do it, too, but I often ask myself why, really.
 
Inger Helene Falch-Jacobsen

Michael said:
This is discussed at
http://diveintoaccessibility.org/day_7_identifying_your_language.html

It only mentions the lang attribute, but it points out that
screenreaders and search engines will benefit.

I have been told that the content language should be in the header if
possible.
So now I have added <?php header("Content-Language: en"); ?>.
This is working fine, but -
Google does not obey <HTML lang="en">, which I have used all along on my
pages http://home.no.net/ingernet/
They are Danish and German (shrug) in Google's opinion. Most names and
places are Norwegian, and Danish is very similar to Norwegian, but the
"skeleton" is in English. Why is this happening?
 
Jukka K. Korpela

Michael Wilcox said:
http://diveintoaccessibility.org/day_7_identifying_your_language.html

It only mentions the lang attribute, but it points out that
screenreaders and search engines will benefit.

I'm afraid the search engine part is wishful thinking. I have seen no
evidence of search engines actually paying any attention to lang
attributes, or to Content-Language headers. My tests have shown just the
opposite: Google guesses the language from the content, even against an
author-supplied lang attribute.

The screenreader part is real, though of limited applicability. For the
page's language as a whole, most screenreaders that support several
languages probably expect the user to set the language manually after
hearing that the content sounds odd. And I guess IBM Home Page Reader is
still the only speech browser that makes real use of lang attributes,
switching reading methods in the midst of a document automatically when
it encounters lang attributes.
 
brucie

Jukka K. Korpela said:
Google guesses the language from the content, even against an
author-supplied lang attribute.

and instead of looking at the accept-language header when returning
results they decide what the language should be based on IP and
automatically redirect.

it's bloody annoying.
 
Jukka K. Korpela

Inger Helene Falch-Jacobsen said:
Google does not obey <HTML lang="en">, which I have used all along on
my pages http://home.no.net/ingernet/
They are Danish and German (shrug) in Google's opinion.

Google classifies the main page as English but the rest as something
else, indeed.

Most names
and places are Norwegian, and Danish is very similar to Norwegian,
but the "skeleton" is in English. Why is this happening?

Actually your markup _should_ contain lang="no" (or maybe an attribute
indicating one of the two forms of Norwegian - they have separate codes)
for the parts that list Norwegian names. They are Norwegian words and
should be marked up accordingly. Using <HTML lang="en"> is OK, but you
should indicate all language changes. It won't matter much in practice,
I'm afraid, but it would matter e.g. to a speech browser that can
pronounce both English and Norwegian.
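To make the inheritance concrete, here is a small sketch of how a user agent could work out the language in effect for each run of text. It assumes well-formed markup with explicit end tags (HTML 4.01 allows some to be omitted, which this sketch does not handle), and the names LangRuns and lang_runs are made up for the example.

```python
from html.parser import HTMLParser

# HTML 4.01 elements that never have an end tag.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}


class LangRuns(HTMLParser):
    """Collects (language, text) pairs, inheriting lang from ancestors."""

    def __init__(self):
        super().__init__()
        self.stack = [None]  # language in effect for each open element
        self.runs = []  # (language, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        inherited = self.stack[-1]
        # An element's own lang attribute overrides the inherited one.
        self.stack.append(dict(attrs).get("lang") or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.runs.append((self.stack[-1], text))


def lang_runs(markup):
    """Return the text runs of markup with the language that applies to each."""
    parser = LangRuns()
    parser.feed(markup)
    parser.close()
    return parser.runs
```

A single lang="no" on the element that wraps a name list is enough for every name inside it to be reported as Norwegian, while the surrounding text stays English.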

Google's guesswork is understandable up to a point. Authors seldom give
language information in markup, and if they do, it might be wrong.
The names should be treated as Norwegian in indexing. Google's feature of
limiting searches to pages "in" a particular language can be useful, but
it can also be misleading. What we would really need quite often is an
ability to search for _words_ in a particular language, no matter what
the overall language of the page is.

Your surname list pages are in fact a good example of the difference
between lang="..." and Content-Language: ... in practice. Using <HTML
lang="en"> is OK, but you could just as well use <HTML lang="no"> and
declare lang="en" for the short texts in English. In any case, the
appropriate header would be Content-Language: en, since one needs to
understand English to understand the page's content, and Norwegian is not
needed for that, even though most words (names) are in Norwegian.
 
Jukka K. Korpela

Jukka K. Korpela said:
In any case, the
appropriate header would be Content-Language: en, since one needs to
understand English to understand the page's content, and Norwegian is not
needed for that, even though most words (names) are in Norwegian.

Well, that's at least my understanding of RFC 2616, the HTTP/1.1
protocol, see
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.12

On the other hand, the newer RFC 3282, which is flagged as obsoleting RFC
1766 but with no mention of obsoleting any part of RFC 2616, describes
the Content-Language header as specifying the language(s) of the
_document_. This results in conflicts in some cases.

I think RFC 2616, as a more specific protocol, trumps RFC 3282,
especially since RFC 3282 had the opportunity to declare part of RFC 2616
as obsolete but didn't do so.
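Whichever reading one prefers, the syntax of the field value is the same: a comma-separated list of language tags, which are case-insensitive. A minimal sketch of reading such a value follows; the function name parse_content_language is made up for the example, and no validation of the tags themselves is attempted.

```python
def parse_content_language(value):
    """Split a Content-Language field value into lowercase language tags."""
    tags = []
    for item in value.split(","):
        tag = item.strip().lower()  # language tags are case-insensitive
        if tag:
            tags.append(tag)
    return tags
```

For the surname pages, parse_content_language("en") yields just the one tag; a page meant for readers of both languages would list both.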
 
Inger Helene Falch-Jacobsen

Jukka said:
Actually your markup _should_ contain lang="no" (or maybe an attribute
indicating one of the two forms of Norwegian - they have separate
codes) for the parts that list Norwegian names. They are Norwegian
words and should be marked up accordingly. Using <HTML lang="en"> is
OK, but you should indicate all language changes. It won't matter
much in practice, I'm afraid, but it would matter e.g. to a speech
browser that can pronounce both English and Norwegian.

It would be too much work to do that with the approx. 8000 Norwegian names!
And what about those who aren't Norwegians?
The dates are in English format, too. I thought I was smart when I
removed "Norway" from the Norwegian place names to save space... Maybe I
have to use that, even though I tell my visitors that "My ancestors are
mainly from ... Norway" on the index page. And of course replace Danmark,
Sverige and Tyskland with Denmark, Sweden and Germany. My priority now is to
optimize for Google.

Google's guesswork is understandable up to a point. Authors seldom
give language information in markup, and if they do, it might be
wrong.
The names should be treated as Norwegian in indexing. Google's
feature of limiting searches to pages "in" a particular language can
be useful, but it can also be misleading. What we would really need
quite often is an ability to search for _words_ in a particular
language, no matter what the overall language of the page is.

But you do have the option to search for pages *located* in a specific
country...

Your surname list pages are in fact a good example of the difference
between lang="..." and Content-Language: ... in practice. Using <HTML
lang="en"> is OK, but you could just as well use <HTML lang="no"> and
declare lang="en" for the short texts in English. In any case, the
appropriate header would be Content-Language: en, since one needs to
understand English to understand the page's content, and Norwegian is not
needed for that, even though most words (names) are in Norwegian.

So the problem is partly that the English text is too short? I don't want to
put in too much stuff, either.
 
Inger Helene Falch-Jacobsen

brucie said:
and instead of looking at the accept-language header when returning
results they decide what the language should be based on IP and
automatically redirect.

it's bloody annoying.

My pages are on a Norwegian server, or what?
http://home.no.net/ingernet/
If so:
Why Danish and German, and not Norwegian?
 
Jukka K. Korpela

Inger Helene Falch-Jacobsen said:
My pages are on a Norwegian server, or what?
http://home.no.net/ingernet/

Probably Google makes no guess based on the domain name other than the
top-level domain (here ".net").

Google plays a guessing game without telling us the rules.

If so:
Why Danish and German, and not Norwegian?

Beats me. Google's "heuristics" (i.e., guessing game) mostly give the
right guess for "big" languages, but they may have difficulties in
distinguishing between "small" languages that are close to each other. It
might use character frequency tables, common word frequencies, etc.
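To show why a word-frequency approach can stumble here, a toy sketch follows. It is emphatically not Google's method, which is not public; the word lists are tiny hand-picked samples and the names COMMON_WORDS, score and guess are invented for the example.

```python
import re

# Tiny, hand-picked samples of very common words. Illustrative only.
COMMON_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "for", "with", "are"},
    "de": {"der", "die", "und", "das", "ist", "nicht", "von", "mit", "den", "ein"},
    "no": {"og", "det", "er", "som", "på", "av", "til", "ikke", "med", "jeg"},
    "da": {"og", "det", "er", "som", "på", "af", "til", "ikke", "med", "jeg"},
}


def score(text):
    """Count, per language, how many words of text are in its common-word list."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return {
        lang: sum(word in common for word in words)
        for lang, common in COMMON_WORDS.items()
    }


def guess(text):
    """Return every language tied for the best score, or [] if nothing matched."""
    scores = score(text)
    best = max(scores.values())
    if best == 0:
        return []
    return sorted(lang for lang, hits in scores.items() if hits == best)
```

With these lists, Norwegian and Danish differ in a single word, so a short Norwegian sentence easily ties between the two, and a page consisting mostly of surnames gives such a guesser almost nothing to go on.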
 
Jukka K. Korpela

Inger Helene Falch-Jacobsen said:
It would be too much work to do that with the approx. 8000 Norwegian
names!

Not at all. On page http://home.no.net/ingernet/surnfreq.php for example
you have the list of name frequencies inside a DIV element. It is

And what about those who aren't Norwegians?

To comply with the recommendations, you would need to set the lang
attribute for those <a> elements that contain such names, then.
The language of names is often debatable, though - but the names are
not in English, which is what the markup now says.

The dates are in English format, too.

Then lang="en" is the adequate markup for the part where the dates
appear. On the other hand, on the Web, especially on pages that are more
or less bi- or multilingual, the ISO 8601 notation (e.g., 2004-04-06) is
often superior, since it is unique and understandable irrespective of
language.
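A short sketch of the "unique" part, using Python's datetime module. The all-numeric string "04/06/2004" is an invented example, not a notation taken from the pages under discussion.

```python
from datetime import date, datetime

# An all-numeric date can be read in two ways, depending on convention.
ambiguous = "04/06/2004"
month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()
day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()

# The ISO 8601 form has only one reading: year, month, day.
iso = date.fromisoformat("2004-04-06")
```

Here month_first and day_first come out as two different days, two months apart, while the ISO string round-trips through isoformat() unchanged.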
I thought I was smart when I
removed "Norway" from the Norwegian place names to save space...
Maybe I have to use that, even though I tell my visitors that "My
ancestors are mainly from ... Norway" on the index page. And of
course replace Danmark, Sverige and Tyskland with Denmark, Sweden and
Germany. My priority now is to optimize for Google.

Sorry, but that's worse than pointless. Use the language that best serves
the purpose of the pages, instead of trying to tune it to please Google.
Besides, it might well fail - unless you put lots of English on the
pages, creating more distraction among human readers. Anyone who wants to
find genealogy information on the Web and uses a language filter in
Google will miss quite a lot anyway.

But you do have the option to search for pages *located* in a
specific country...

No I don't. Google just misleads us into thinking that way. It offers a
language filter and a domain filter, but neither of them is a country
filter. Pick up a .com or .org domain and try to deduce the country from
the domain name, or from the language (as announced or as actually used),
and you get wrong guesses. Does someone think that .to pages are really
located in Tonga, or .tv pages in Tuvalu?

So the problem is partly that the English text is too short?

For the purposes of language markup, the question is irrelevant.

As regards Google's guessing game, the answer is obviously that to the
extent it uses the actual text content, having a larger proportion of
English text increases the odds of its guessing English. But I would
strongly advise against such games. A little reductio ad absurdum:
To maximize the odds of its guessing English, the entire page should
contain nothing but very simple statements in English, using the most
common words. Such as
"All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
All work and no play makes Jack a dull boy.
..."
This would do no good for the purpose of the page. And the same applies
to less drastic moves in that direction.
 
Inger Helene Falch-Jacobsen

Jukka said:
The language of names is often debatable, though

Indeed! I have 1 American Smith, the rest are Norwegians. von Rasbech sounds
German, but Jørgen Nielsen von Rasbech was a Dane. Grüner is probably of
German origin, but my Grüners are Norwegians. -dotter and -son can be
Norwegian, but are more often Swedish. And so on. There are quite a
lot of names that are "international", so it is impossible to say that they
belong to this or that language.

- but the names
are not in English, which is what the markup now says.

I believe that only names that can be translated can be in a specific
language. That includes country names, and kings' names, like William the
Conqueror (Vilhelm Erobreren in Norwegian).

on the Web, especially on pages that are
more or less bi- or multilingual, the ISO 8601 notation (e.g.,
2004-04-06) is often superior, since it is unique and understandable
irrespective of language.

I think it is easier to understand dates with a month name, but that's just
me!

Use the language that best
serves the purpose of the pages, instead of trying to tune it to
please Google. Besides, it might well fail - unless you put lots of
English on the pages, creating more distraction among human readers.

I should present the complete location, and it's natural to have the English
version of a country's name on an English page.

How about this page, does this look German:
http://www.rideau-info.com/ken/genealogy/idxa.htm
That's what Google thinks!
 
