Dynamically setting the Greek charset in Firefox?

J

Julien

Hi,

Trying to set the charset dynamically in an inline framed document, I
observed that the following script works in MSIE but not in Firefox. I
thought that document.charset was more universal...
Any idea what I can do?

<head> /* This is in the inline framed document */
(skip)
<script language="JavaScript" type="text/javascript"><!--
if (document.frames.parent.lang == 'el')   /* if the parent document is in Greek */
  { document.charset = 'iso-8859-7'; }     /* has no effect in Firefox 2.0 */
else
  { document.charset = 'iso-8859-1'; }     /* use the Latin-1 character set if the content is not Greek */
</script>
</head>
</head>

Thanks.

N.B.: I need to set the charset dynamically. I cannot use a <meta> tag.
 
J

Julien

Julien said the following on 11/15/2007 4:20 AM:



Yes. Set it on the server.

Thanks. Is there no other way?
The problem is that there are 4 versions of the parent document (because
there are 4 languages), but only one inner frame for all 4 languages,
which is a Javascript-powered multilingual wizard. This is to make
maintenance easier.

If possible, it would be nice if, when the user switches from one
language to another, the inner frame remained the same document and was
not loaded again.

An idea could be to just refresh the display (without reloading the
document) once the charset is set, but I cannot find out how. Maybe what
I am asking is impossible in Firefox, but I would rather be sure, as
everything works perfectly with MSIE.
 
G

Gregor Kofler

Julien meinte:
N.B.: I need to set the charset dynamically. I cannot use a <meta> tag.

And why can't you use UTF-8 and forget about all this encoding-switching?

Gregor
 
J

Julien

Julien meinte:


And why can't you use UTF-8 and forget about all this encoding-switching?

Gregor

I could, but I would prefer to avoid it for 3 reasons:
1) On my computer, my favorite code editor (Crimson) does not save
UTF-8 (although it claims it can).
2) UTF-8 would increase the size of the multilingual Javascript wizard
which already is quite large.
I don't want to increase download times.
3) Greek users often have their browser's encoding set to
'windows-1253' or 'iso-8859-1'. Depending on their browser settings,
the text might display badly. Many people don't know how to switch from
one encoding to another.

Currently, the Greek "parent" page is UTF-8 encoded because it has
mixed-language content. Maybe one day I'll switch it to 'iso-8859-1'.
Would this solve the problem? Do inline frames inherit the language from
their parent frame in Firefox?

For those interested who have Microsoft Internet Explorer, the Greek
page is here:
http://www.altipoint.com/cgiContouring/docContouring.el.htm
With the small flags, one can switch from one language to another.
The wizard (i.e. the inner frame doing DHTML with Javascript) is the
same whatever the language.
 
T

Thomas 'PointedEars' Lahn

Julien said:
Julien meinte:
N.B.: I need to set the charset dynamically. I cannot use a <meta> tag.
And why can't you use UTF-8 and forget about all this encoding-switching?
[...]

Please don't quote signatures and other stuff you are not referring to.
I could, but I would prefer to avoid it for 3 reasons:
1) On my computer, my favorite code editor (Crimson) does not save
UTF-8 (although it claims it can).

Use another editor. Even Wordpad is able to save in a UTF.
2) UTF-8 would increase the size of the multilingual Javascript wizard
which already is quite large.

How did you get that idea?

http://www.unicode.org/faq/
I don't want to increase download times.

It is unlikely that you will.
3) Greek users often have their browser's encoding set to
'windows-1253' or 'iso-8859-1'. Depending on their browser settings,
the text might display badly. Many people don't know how to switch from
one encoding to another.

Declare the correct encoding, and Greek users' preferences do not matter
anymore. Even in IE.
Currently, the Greek "parent" page is UTF-8 encoded because it has
mixed-language content. Maybe one day I'll switch it to 'iso-8859-1'.
Would this solve the problem?

Wake up, even NNTP is UTF-8 safe nowadays.
Do inline frames inherit the language from their parent frame in Firefox?

A language is not an encoding. And no, every document resource has its own
encoding.


PointedEars
 
G

Gregor Kofler

Julien meinte:
I could, but I would prefer to avoid it for 3 reasons:
1) On my computer, my favorite code editor (Crimson) does not save
UTF-8

Use Notepad++ instead. I replaced Crimson with it some time ago;
pretty good.
2) UTF-8 would increase the size of the multilingual Javascript wizard
which already is quite large.

Why? Now you have to maintain two or several different encodings. Sounds
much "larger" to me.
3) Greek users often have their browser's encoding set to
'windows-1253' or 'iso-8859-1'. Depending on their browser settings,
the text might display badly. Many people don't know how to switch from
one encoding to another.

The header of your document (I'm not referring to the head section of
your html document) can be set to whatever you want. No user needs to
adjust anything then.
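For example, the relevant response header line would simply be something
like this (how exactly you set it depends on your server):

Content-Type: text/html; charset=utf-8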
Currently, the Greek "parent" page is UTF-8 encoded because it has
mixed-language content. Maybe one day I'll switch it to 'iso-8859-1'.

Given my suggestion and remarks: Why would you want to do that?

For those interested who have Microsoft Internet Explorer, the Greek
page is here:
http://www.altipoint.com/cgiContouring/docContouring.el.htm

And I thought we were well past "browser-specific websites". Perhaps
not in Greece...

Gregor
 
J

Julien

And why can't you use UTF-8 and forget about all this encoding-switching?
Use another editor. Even Wordpad is able to save in a UTF.

As far as I know, Wordpad doesn't do syntax coloring, and neither does
Notepad. They also cannot edit in column mode.
How did you get that idea?
http://www.unicode.org/faq/

I don't have time to read the whole FAQ, but I thought that with UTF-8,
Latin characters would be encoded with one byte and Greek characters
with 2 bytes. Isn't that so?
Declare the correct encoding, and Greek users' preferences do not matter
anymore. Even in IE.
Thanks.


Wake up, even NNTP is UTF-8 safe nowadays.

I know that UTF-8 is safe nowadays, except for spiders. Google
displays UTF-8 pages badly.
Also, most Greek crawlers looking for Greek content do not accept
UTF-8.
I was referring to the question below.
A language is not an encoding. And no, every document resource has its own
encoding.
Thanks. I was thinking of encoding, not language.

If I could specify the encoding of the linked resource, it could
solve the problem.
Something like this:
<iframe src="blabla.el.htm" charset="iso-8859-7">
But <iframe> doesn't have this attribute.
 
T

Thomas 'PointedEars' Lahn

Except the attribution lines, of course.
As far as I know, Wordpad doesn't do syntax coloring, and neither does
Notepad. They also cannot edit in column mode.

Use vim(1); or Eclipse 3.3+, WST, and JSEclipse; both can do UTF-8 and SHL.

What do you mean by "edit in column mode"?
I don't have time to read the whole FAQ, but I thought that with UTF-8,
Latin characters would be encoded with one byte and Greek characters
with 2 bytes. Isn't that so?

Not quite correct. With UTF-8, characters within the ASCII range will be
encoded with one byte/code unit, and characters outside of that range with
two or more bytes/code units.

http://people.w3.org/rishida/scripts/uniview/conversion
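A quick way to see this for yourself (a sketch only; encodeURIComponent
happens to expose the UTF-8 bytes of a character as percent-escapes):

encodeURIComponent('a');       // "a"      -> one byte for ASCII 'a'
encodeURIComponent('\u03B3');  // "%CE%B3" -> two bytes for Greek gamma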

However, you do not need to put the Greek locale into the same resource as
the script that accesses it. In fact, I advise you not to, so as to allow
for uncomplicated extension with another language. That is, unless you use
identifiers outside the ASCII range (which ECMAScript 3 allows but which is
not universally supported anyway), there is no significant
increase to be expected.
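A minimal sketch of that separation (the file names and the `strings'
object are made up for illustration, not taken from the actual wizard):

/* strings.el.js -- only the Greek sentences; can be served as UTF-8
   or iso-8859-7 independently of the wizard */
var strings = {
  greeting: '\u0393\u03b5\u03b9\u03b1 \u03c3\u03b1\u03c2'
};

/* wizard.js -- language-independent logic, pure ASCII */
function showGreeting(el) {
  el.innerHTML = strings.greeting;
}

The parent document would then include one strings file per language and
the same wizard.js for all of them.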
Thanks. I was thinking of encoding, not language.

If I could specify the encoding of the linked resource, it could
solve the problem.
Something like this:
<iframe src="blabla.el.htm" charset="iso-8859-7">
But <iframe> doesn't have this attribute.

However, your server can declare the proper encoding. For example with
Apache in httpd.conf or .htaccess:

AddLanguage el .el
AddCharset ISO-8859-7 .el

http://httpd.apache.org/docs/2.2/mod/mod_mime.html#addcharset

I use server-side PHP for most cases, though:

<?php header('Content-Type: text/html; charset=iso-8859-7'); ?>


HTH

PointedEars
 
V

VK

Trying to set the charset dynamically in an inline framed document, I
observed that the following script works in MSIE but not in Firefox. I
thought that document.charset was more universal...

IE has a scriptable document.charset property which is officially
read/write - I have never used it in my life, so I cannot comment on its
behavior.

Gecko-based browsers do have a read-only document.characterSet property.
It returns the currently effective encoding, which may differ from the
one specified by the server/META (if overridden via View - Character
Encoding). From MDC: "The related, nonstandard method document.charset
and the property document.defaultCharset are not supported by Gecko."
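So the effective encoding can at least be read with a simple feature test
(a sketch; this only reads the value, it does not change it):

var enc = document.characterSet   // Gecko, read-only
       || document.charset        // IE, read/write
       || '';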
Any idea what I can do?

If it is a question of a JavaScript-driven wizard, not of the page
itself, then you could specify encoding for the external script
itself:

<script type="text/javascript" charset="iso-8859-7"
src="MyScript.js"></script>

Otherwise, the only option I see - if keeping the current approach - is
to use an XSLT transformation, applying the relevant XSL to the same XML
data.
 
T

Thomas 'PointedEars' Lahn

VK said:
If it is a question of a JavaScript-driven wizard, not of the page
itself, then you could specify encoding for the external script
itself:

<script type="text/javascript" charset="iso-8859-7"
src="MyScript.js"></script>

Otherwise, the only option I see - if keeping the current approach - is
to use an XSLT transformation, applying the relevant XSL to the same XML
data.

Unsurprisingly, both are probably among the worst possible solutions to
this problem.


PointedEars
 
B

Bart Van der Donck

Thomas said:
Use another editor. Even Wordpad is able to save in a UTF.

You can't use Wordpad for programming Javascript - it saves documents
in a binary encoding. Notepad should be okay, as XP (and maybe earlier)
allows Notepad to save in Unicode, Unicode big endian and UTF-8.
Javascript wizard which already is quite large.

How did you get that idea?
http://www.unicode.org/faq/

He's right; this is even considered to be one of the most serious
drawbacks of Unicode. Suppose the following text in Greek (*):

&gamma;&epsilon;&lambda;

With the HTTP header and <meta> set to ISO/IEC 8859-7, you send 3 bytes
of traffic in an 8-bit range; so 24 bits in total. But under UTF-8 you
need 2 bytes to represent a Greek character. In this example, you need
to send (**):

Î³ÎµÎ»

So you consume twice as much traffic (3*2*8 bits). If you needed 3 bytes
(or exceptionally 4), the traffic would correspondingly be multiplied by
three or four.
It is unlikely that you will.

He will. By definition.

I think you mean 'iso-8859-7'.
Declare the correct encoding, and Greek users' preferences
do not matter anymore. Even in IE.

UTF-8 is probably the best choice for a webpage nowadays, but ISO/IEC
8859-7 would be okay too.
Wake up, even NNTP is UTF-8 safe nowadays.

No it isn't.
http://groups.google.com/group/comp.lang.perl.misc/msg/7d2619d69e59fb40
A language is not an encoding. And no, every document
resource has its own encoding.

That is correct.

(*) HTML character entities for maximum Usenet-compatibility.
(**) These characters can be written on usenet as such (ISO/IEC
8859-1).
 
J

Julien

Thanks to all.

As is often the case with computer matters, I think there is no single
good solution. There are several possibilities, and the best one depends
on what has already been done, the size of the files, and so on.
I often find myself thinking that I would have done things differently
had I known something earlier.
"Experience is the sum of all the errors a man makes during his life."
(Unfortunately, I cannot remember which famous person said this.)

Thank you Mr Van der Donck for the details about Unicode. I knew that
Greek characters required more than 1 byte in UTF-8 and assumed they
required 2 (not 4), as they are quite commonly used, but could not give
the details.
If I remember, Unicode is a set of character maps (tables) and UTF-8
works like it was using shortcuts to Unicode character maps. The
first 127 chars are the same as in ANSI and are then coded on one
byte. For other characters like Greek ones, at least one other byte is
required to tell which Unicode character map to use. Then, UTF-8 and
Unicode are not exactly the same. UTF-8 is more compact as it uses 1
byte every time it can.

Thank you Mr Kofler for mentioning Notepad++.
Concerning the encodings, I prefer having all my 4 translations in one
file with switch{} instructions to serve the right language.
I know that this makes the file more than 4 times bigger than a
monolingual one.
But this lets me change the wizard more easily, without having to open 4
files and risk forgetting to change something in one of them.
I just open the wizard script, copy some switch and change the
sentences simultaneously in the 4 languages.
This makes the translation process much easier, as I can see the
translations just above and below one another.
I also avoid mistakes when linking the files, as the same file is always
linked whatever the language.
I won't change the wizard to Unicode, as it is already saved in
ANSI. Greek texts display improperly and would still display badly if
copy-pasted into a UTF-8 editor. To edit the file, I use Notepad as an
external editor to write the sentences in Greek and then copy-paste
them into the ANSI editor.
Everything already works fine. I just have to change the encoding for
Firefox.

Unless a better solution is found, I'll make 4 copies of the small
HTML document that calls the multilingual JavaScript wizard...
docDisplayArea.fr.htm
docDisplayArea.en.htm
docDisplayArea.de.htm
docDisplayArea.el.htm
(or just 2: one for Greek and one for the other languages)

...set different <meta> tags for these documents...
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />

...and call the same multilingual script.

I "played" quite a lot with my code. I tried to put an id to the meta
and to set its content or just the charset later, but it was
impossible.

I also observed that it is possible to specify the charset on a <script>
element. So I tried this:
<script><!--
function GiveMeTheCharset() {
if (window.parent.lang=='el')
return 'iso-8859-7';
else
return 'iso-8859-1';
}
--></script>
<script type="text/javascript" charset="JavaScript:GiveMeTheCharset()"
src="wizard.js" />

... but this did not work. Setting the charset statically works:
<script type="text/javascript" charset="iso-8859-7" src="wizard.js" />
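A possible workaround along the same lines (an untested sketch; it relies
on document.write running while the page is still being parsed, and on
`lang' being set on the parent as above):

<script type="text/javascript">
  // Pick the charset from the parent's language, then write the
  // script element that loads the wizard with that charset.
  var cs = (window.parent.lang == 'el') ? 'iso-8859-7' : 'iso-8859-1';
  document.write('<script type="text/javascript" charset="' + cs +
                 '" src="wizard.js"><\/script>');
</script>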
Use vim(1); or Eclipse 3.3+, WST, and JSEclipse; both can do UTF-8 and SHL.

I have heard that Eclipse is quite a large piece of software. I prefer
smaller editors. I know vim but prefer Crimson.
What do you mean by "edit in column mode"?

Software that works as if the text were stored in a matrix. It is
possible to highlight a rectangle of text, for example from rows 31 to
35 and columns 5 to 8, and replace text in this block only when typing.
The editing is simultaneous on all rows. Crimson Editor (free) does
this. UltraEdit can too.
UltraEdit is also a very good text editor, but I really dislike the
fact that it wants to be the default editor for many file types. For
instance, after installation, it opens instead of Notepad when doing
"View source" in MSIE.

One last thing for Mr. Kofler, who wrote:
"And I thought we were well past "browser-specific websites". Perhaps
not in Greece..."

My answer is:
1) I am not in Greece, I am not Greek, and my site is not a Greek
website but a multilingual one.
2) I wouldn't have asked my question if I wanted to keep my site
browser-specific. It already works fine with MSIE. As you could see,
my question concerned how to make it Firefox-compatible.
3) By extension, we could say: "we're past the English-specific
websites" or "we're past the German-specific websites", ...
4) What's better: a website that works with one browser now and with
others later, OR no website at all?

Please don't take it badly.

Julien
 
T

Thomas 'PointedEars' Lahn

Julien said:
If I remember, Unicode is a set of character maps (tables)

You remember incorrectly. Unicode (currently version 4.0) is *one*
character set (a code-point-to-glyph mapping) that is composed of subsets
for different scripts (as in writing, not as in programming) and kinds of
punctuation and symbols. It is only that those subsets can be viewed online
as separate tables (in PDF documents).
and UTF-8 works like it was using shortcuts to Unicode character maps.

UTF-8 is the Unicode Transformation Format that uses code units of 8 bit
length. A sequence of UTF-8 code units designates the code point for a
character or glyph in the Unicode character set. Details can be found in
the Unicode FAQ which you really should read (at least the Basic section)
*before* you continue.
The first 127 chars are the same as in ANSI and are then coded on one
byte.

ANSI is not a character set or encoding; it is a standards-overseeing (but
not standards-making) organization (the American National Standards
Institute). It is a common misconception (supported by Microsoft's
continued mislabelling) that Windows-1252 and the like had been
standardized by ANSI, that there was something like an "ANSI code page".
AFAIK, the only real "ANSI standard" that made it into computing is the
ANSI escape codes, which are e.g. used to format text in text terminals.

http://en.wikipedia.org/wiki/Windows-1252
http://en.wikipedia.org/wiki/ANSI_escape_code

As I have explained already, characters with code points within the range
of the ASCII (American Standard Code for Information Interchange) character
set require only one UTF-8 code unit; those are the characters at the code
points from 0x00 to 0x7F (decimal 127); only in that respect are you
correct.
For other characters like Greek ones, at least one other byte is
required to tell which Unicode character map to use.

The other byte is required to encode the higher code point. I repeat, that
is _not_ another character map.
Then, UTF-8 and Unicode are not exactly the same.

Of course they are not. The former is a transformation encoding, and the
other one is the character set that provides the corresponding code points
and glyphs.
UTF-8 is more compact as it uses 1 byte every time it can.
Wrong.

Concerning the encodings, I prefer having all my 4 translations in one
file with switch{} instructions to serve the right language.
I know that this makes the file more than 4 times bigger than a
monolingual one.
But this lets me change the wizard more easily, without having to open 4
files and risk forgetting to change something in one of them.
I just open the wizard script, copy some switch and change the
sentences simultaneously in the 4 languages.

Your reasoning still is not sound. Not considering the counter-arguments
already presented, here is another one: What if there is no script support?
This makes the translation process much easier, as I can see the
translations just above and below one another.
I also avoid mistakes when linking the files, as the same file is always
linked whatever the language.
I won't change the wizard to Unicode, as it is already saved in
ANSI.

You have yet to grasp the difference between character encoding and
character set.
...set different <meta> tags for these documents...
<meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1" />

This looks like XHTML; however, IE still does not support XHTML, and the
proper declaration of the encoding looks different in XML documents:

<?xml version="1.0" encoding="iso-8859-1"?>

That has to come before any other markup in the message body, but only if
the server does not already provide the encoding in the Content-Type
header. By default, XML documents use UTF-8 as the document encoding.

However, I think you do not want XHTML:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Still, this declaration is merely defined to be a *fallback* when the
resource is not served via HTTP (hence "http-equiv" -- HTTP equivalent).
User agents are expected to ignore it in all other cases, so that hardly
serves your purpose.
...and call the same multilingual script.

I "played" quite a lot with my code. I tried to put an id to the meta
and to set its content or just the charset later, but it was
impossible.

I also observed that it is possible to specify the charset on a <script>
element. So I tried this:
<script>
http://validator.w3.org/

<!--

Unnecessary in HTML. Error-prone always, especially in XHTML where it
indeed comments the script out for the markup parser.
function GiveMeTheCharset() {

It is not a constructor, so the identifier should not start with an
uppercase letter.
if (window.parent.lang=='el')
return 'iso-8859-7';
else
return 'iso-8859-1';
}

See below.

-->

This is a syntax error, remove it.
</script>
<script type="text/javascript" charset="JavaScript:GiveMeTheCharset()"

The value of the `charset' attribute here constitutes plain fantasy syntax
instead of informed programming. IOW: That attribute value is _not supposed
to be executed_ by any client-side script engine at all.
src="wizard.js" />

IE's tagsoup parser will choke on that because you have not closed the
`script' element in its eyes. As I have said, IE does _not_ support XHTML.
... but this did not work.

Of course it did not.
Setting the charset statically works:
<script type="text/javascript" charset="iso-8859-7" src="wizard.js" />

See above.
Thomas wrote:

Thomas who?
I have heard that Eclipse is quite a large piece of software.

The download is quite large and so is the memory footprint. However, AIUI
Eclipse's advantage over other editors is not being small; it is being Open
Source, cross-platform (because it is Java-based), very extensible, and most
production-quality plugins are for free. And it is not only an editor but
an IDE for all kinds of code (with the Java IDE built in the Platform SDK).
Software that works as if the text were stored in a matrix. It is
possible to highlight a rectangle of text, for example from rows 31 to
35 and columns 5 to 8, and replace text in this block only when typing.
The editing is simultaneous on all rows. Crimson Editor (free) does
this. UltraEdit can too.

So can vim, and IIRC Eclipse since version 3.2.
UltraEdit is also a very good text editor, but I really dislike the
fact that it wants to be the default editor for many file types. For
instance, after installation, it opens instead of Notepad when doing
"View source" in MSIE.

I stopped using UltraEdit and DreamWeaver in favor of Eclipse and I have not
been disappointed by the latter to date.
Please don't take it badly.

Why, in the end *you* /will have/ the problems caused by *your* design
decisions driven by the uninformed guesses and irrational reasoning *you*
are making. The only problem for the regulars here would probably be that
you have wasted their precious time then.


Next time, please reply to each followup individually. This is not a Web
forum or bulletin board, it is Usenet. The way you replied you are ripping
the discussion apart.

http://www.jibbering.com/faq/faq_notes/clj_posts.html


PointedEars
 
J

Julien

You remember incorrectly. Unicode (currently version 4.0) is *one*
character set (a code-point-to-glyph mapping) that is composed of subsets
for different scripts (as in writing, not as in programming) and kinds of
punctuation and symbols. It is only that those subsets can be viewed online
as separate tables (in PDF documents).


UTF-8 is the Unicode Transformation Format that uses code units of 8 bit
length. A sequence of UTF-8 code units designates the code point for a
character or glyph in the Unicode character set. Details can be found in
the Unicode FAQ which you really should read (at least the Basic section)
*before* you continue.


ANSI is not a character set or encoding; it is a standards-overseeing (but
not standards-making) organization (the American National Standards
Institute). It is a common misconception (supported by Microsoft's
continued mislabelling) that Windows-1252 and the like had been
standardized by ANSI, that there was something like an "ANSI code page".
AFAIK, the only real "ANSI standard" that made it into computing is the
ANSI escape codes, which are e.g. used to format text in text terminals.

http://en.wikipedia.org/wiki/Windows-1252
http://en.wikipedia.org/wiki/ANSI_escape_code

As I have explained already, characters with code points within the range
of the ASCII (American Standard Code for Information Interchange) character
set require only one UTF-8 code unit; those are the characters at the code
points from 0x00 to 0x7F (decimal 127); only in that respect are you
correct.

Thank you for all the details above. I'm really not an encoding guru,
and even less a terminology one!
I was mistaken in saying that Unicode is a set of character maps. You
are right, it's *one* character set. I can imagine that miscellaneous
character sets like the Greek one were "pasted" one after another when
the Unicode charset was built. That's why, in my mind, Unicode was like
a set of character maps.
The other byte is required to encode the higher code point. I repeat, that
is _not_ another character map.


Of course they are not. The former is a transformation encoding, and the
other one is the character set that provides the corresponding code points
and glyphs.


Your reasoning still is not sound. Not considering the counter-arguments
already presented, here is another one: What if there is no script support?

My website is not a standard *information* website. It requires
Javascript and could not work without it. People who have disabled
Javascript are informed of the problem.
As explained (see below), having all translations one above the other
is very comfortable when programming and limits mistakes. I keep my
opinion. Things could be different for another website; mine is
different from most that you can see, and this explains my choice.
You have yet to grasp the difference between character encoding and
character set.

This looks like XHTML; however, IE still does not support XHTML, and the
proper declaration of the encoding looks different in XML documents:

My website was originally written with XHTML syntax. I also did an
HTML version, as it was easier to debug with the W3C validator. Maybe IE
doesn't support XHTML, but it displays it very well as HTML and I did
not encounter any problem. My site also works fine with Firefox now
(offline version); my server will be updated soon.
<?xml version="1.0" encoding="iso-8859-1"?>

That has to come before any other markup in the message body, but only if
the server does not already provide the encoding in the Content-Type
header. By default, XML documents use UTF-8 as the document encoding.

Yes, I also put the <meta> at the top of the <head>, before the
<title>.
Honestly, my website does not really need to be XML (XHTML).
But I thought that this could make maintenance easier in the future, as
XML parsers can display elements as nodes. This could help editing.
However, I think you do not want XHTML:

<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">

Still, this declaration is merely defined to be a *fallback* when the
resource is not served via HTTP (hence "http-equiv" -- HTTP equivalent).
User agents are expected to ignore it in all other cases, so that hardly
serves your purpose.




Unnecessary in HTML. Error-prone always, especially in XHTML where it
indeed comments the script out for the markup parser.

Thanks! I'll note what you said about XHTML. I had read that those
comments were useful to prevent script-disabled browsers from displaying
the scripts, as they ignore the <script></script> markups.
It is not a constructor, so the identifier should not start with an
uppercase letter.

I'll remember.
See below.


This is a syntax error, remove it.


The value of the `charset' attribute here constitutes plain fantasy syntax
instead of informed programming. IOW: That attribute value is _not supposed
to be executed_ by any client-side script engine at all.

Learning is also trying...
IE's tagsoup parser will choke on that because you have not closed the
`script' element in its eyes. As I have said, IE does _not_ support XHTML.

OK. So <script ...></script> would be better.

(Skipped discussion about editors and posting rules.)
 
T

Thomas 'PointedEars' Lahn

Julien said:
My website is not a standard *information* website. It requires
Javascript and could not work without it. People who have disabled
Javascript are informed of the problem.

But they cannot necessarily do anything about that (maybe the network admin
has configured the proxy to filter out some content?), so ultimately you are
annoying and excluding visitors when not necessary. Instead, you could
provide a Web site that always works, with an additional benefit for
visitors that use user agents with sufficient client-side script support
without the need for maintaining more than one document for the same content.
As explained (see below), having all translations one above the other is
very comfortable when programming and limits mistakes.

Server-side scripting and maybe database access would be even more
convenient, and would allow for graceful degradation.
I keep my opinion. Things could be different for another website; mine
is different from most that you can see, and this explains my choice.

So far uninformedness explains your choices.
My website was originally written with XHTML syntax. I also did an HTML
version, as it was easier to debug with the W3C validator.

XHTML, as it is parsed by the Validator's XML parser, would be much easier
to debug than any HTML document. Of course, one would need to understand
the restrictions XML imposes to allow documents to be well-formed. For
example, one would have to understand that in XHTML the content model of
the `script' element is PCDATA, not CDATA (as in HTML), and so special
precautions have to be taken if the script content could be parsed as
markup.
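The usual precaution looks like this (a sketch; `a', `b' and doSomething
are placeholders). The CDATA markers keep an XML parser from treating `<'
inside the script as markup:

<script type="text/javascript">
//<![CDATA[
if (a < b) { doSomething(); }
//]]>
</script>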
Maybe IE doesn't support XHTML, but it displays it very well as HTML and

You are easily deceived. For example, you may have to script for an HTML
DOM in one UA and an XHTML DOM in another for it to work.
I did not encounter any problem.

The simple reason for that is that you have not understood the actual
problem and the potential problems yet.
My site also works fine with Firefox now (offline version); my server
will be updated soon.

Why use a feature that is not universally supported and introduces a number
of problems when that feature is not required?
Yes, I also put the <meta> at the top of the <head>, before the <title>.

But an XML parser does not care about that. It can only work in IE because
you are serving XHTML with the wrong media type, and so the standard
tag-soup parser is used. And all the benefits that X(HT)ML provides
regarding parsing are gone. You are serving more markup to achieve less.
That does not sound very reasonable to me.
Honestly, my website does not really need to be XML (XHTML).

Don't use it then.
But I thought that this could make maintenance easier in the future, as
XML parsers can display elements as nodes. This could help editing.

The question whether in the future maintenance would be easier with an XML
editor becomes moot when thinking about the present disadvantages of XHTML.
If, and only if, you have been accustomed to an XML editor *now*, that
would be a valid argument in favor of using XHTML.

That said, the parser of Eclipse's Web Standard Tools plugin can display the
document tree for HTML as well, and it is probably not the only one.
Thanks! I'll note what you said about XHTML. I had read that those
comments were useful to prevent script-disabled browsers from displaying
the scripts, as they ignore the <script></script> markups.

Those comments were either written long ago or by people who don't know
what they are writing about. The `script' element has been standardized
since HTML 3.2 (1997-01 CE) and was probably supported even longer, as
that standard is defined to be a reflection of current practice at the
time; in fact, client-side scripting was introduced in 1996-03
(JavaScript 1.0) and 1996-08 (JScript 1.0). Any UA that displays the
content of that element (especially within the `head' element) is utterly
broken and should not be used anymore.
This means that it was most certainly never necessary in the history of
Web development to pseudo-comment out scripts that way, and it is
certainly not necessary for "script-disabled browsers", as they must know
the `script' element in order to ignore it when script support is
disabled.
Learning is also trying...

Trying a futile thing might help someone to learn whether it is futile or
not, but RTFM first and then trying out what could be learned from that
usually helps understanding a great deal more and therefore prevents
misconceptions in the first place.
OK. So <script ...></script> would be better.

It is *required* if your document is served as text/html.
(Skipped discussion about editors and posting rules.)

It is a pity that you did not read the FAQ Notes section I pointed you to,
because if you had, you could have avoided the next mistake made here:
quoting a large amount of text as-is, especially when it was irrelevant to
your reply.


PointedEars
 
B

Bart Van der Donck

Thomas said:

You appear to miss the fundamentals. UTF-8 was created only for that
purpose.

You would be right under UTF-32, as every character is represented
there by 32 bits (four bytes). The Unicode organization obviously wanted
to avoid all internet traffic being quadrupled - and that's the sole
reason why UTF-7, 8 and 16 were created.

UTF-7 is the safest since it uses a representation of 1 to 4 bytes of
7 bits each. UTF-8 has more options, and the eighth bit is mostly
considered safe nowadays [*]. UTF-16 would already mean a doubling of
all traffic.

[*] Though I remember
http://groups.google.com/group/comp.lang.javascript/msg/e2b25cf9e7e99de8
where the eighth bit was not transferred correctly, but hey, this is
Usenet, probably the most ancient in that regard :)
 
B

Bart Van der Donck

Julien said:
Thank you Mr Van der Donck for the details about Unicode. I knew that
Greek characters required more than 1 byte in UTF-8 and assumed they
required 2 (not 4), as they are quite commonly used, but could not give
the details.

You're welcome. Yes, a Greek character consumes two bytes in UTF-8 and
one in ISO 8859-7.
If I remember, Unicode is a set of character maps (tables) and UTF-8
works like it was using shortcuts to Unicode character maps. The
first 127 chars are the same as in ANSI and are then coded on one
byte. For other characters like Greek ones, at least one other byte is
required to tell which Unicode character map to use. Then, UTF-8 and
Unicode are not exactly the same. UTF-8 is more compact as it uses 1
byte every time it can.

I'm afraid it's not that simple. The core reason, as with all low-
level computing, can be brought down to the binary system.

Take 3 sets of two lamps and you have created a 2-bit computer having
3 bytes of RAM. Every byte can hold 4 values:
00
01
10
11

Then you can "a-locate" (allocate) your brain (memory) to the RAM of
that primitive computer: you can put the lamps into 00 10 11 (100%
CPU) and go take a shower. Your computer will have remembered what you
stored when you come back.

You can't jump very far with these 2 bits, of course. Suppose you
extended this 2^2 (2-bit) to 2^7 (7-bit); then you would have 128
possibilities, allowing the full Latin set, numbers and symbols
(ASCII). The only thing you need is an agreement so that everybody
knows the meaning of each combination. We could agree that 11
stands for 'b' under 2^2. Taking this into real-life ASCII, 1100001 is
agreed to correspond to the letter 'a'.

7-bit became too narrow, so 2^8 opened new doors to 256. Just add '0'
on the eighth bit for the old 7-bit characters, and turn it into '1'
for characters in the 128-255 range. Then you need a character
set to define how you want to use the 128-255 code points. ISO 8859
made 16 possible maps for that, starting from 8859-1 ('Western
European') to 8859-16 ('South-Eastern European').
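A concrete illustration of what such a map decides (just an example byte,
not one taken from the thread):

byte 0xE3 under ISO 8859-1: U+00E3 ('ã', a with tilde)
byte 0xE3 under ISO 8859-7: U+03B3 (Greek small letter gamma)

The byte on the wire is the same; only the agreed-upon table differs.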

This is where Unicode (not UTF) begins. If 256 isn't enough, you could
do 2^9 and repeat the same game. But developers didn't want to make
the same mistake again, so they went to 32 bits in one jump.
Letter 'a' in 7-bit: 1100001
Letter 'a' in 8-bit: 01100001
Letter 'a' in 32-bit Unicode: 00000000 00000000 00000000 01100001

So a computer supporting Unicode this way needs four times more memory
than it would with a classic 8-bit range!

In the old days every bit was represented by 'bulb on' (1/closed
circuit) or 'bulb off' (0/open circuit). My father's first job was
replacing broken bulbs in a giant computer hall filled to the ceiling
with bulbs flickering on and off :) But IMHO it's still a very clear
image for grasping the inner workings of bits and bytes.

Also, regarding the internet, 32-bit would be an unwise strategy, since
all traffic would be quadrupled in one move.

This is where the story of UTF begins, with its credo "only use what you
need". Much of the content that is sent is ASCII-safe anyway: e.g. markup
like HTML, ASCII content, headers, programming code... It would have been
an irresponsible waste of resources to put everything into 32-bit; most
of the time 75% of it would have been pure overhead, serving no purpose.

Today it looks like UTF-8 is winning the battle on the internet, I think
because it is the most economical byte consumer combined with the
almost-universal support for 8-bit transfers.

I hope you haven't fallen off your chair. Use UTF-8 or ISO 8859-7 and
you'll probably be okay :)
 
