javascript strings not 7 bit like I expected

Stevo

I'd always been under the misimpression that JavaScript strings were
7-bit ASCII like C strings and the issue had never come up before. It
seems that I'm wrong (happily). Does anyone know if I can assume this
will work regardless of code page, HTML doctype, quirks mode etc? Is it
only because I've made this a super simple page running on Windows XP
US-en that this works? Would it fail in a lot of configurations (e.g.
Mac, Linux, Asian codebases) ?

<html>
<body>
<script>
var s="Hellö Würld";
document.write(s);
alert(s);
</script>
</body>
</html>

That s string contains two German characters (just in case usenet is
still restricted to 7-bit characters like it was 20 years ago).
 
Bart Van der Donck

ö and ü should be safe in the Western world if nothing else is
specified; they are still in the ISO/IEC 8859-1 range ("Latin-1"),
which was the historical default character set on the internet.

There are 3 elements:
- the charset in which the page was served (not visible in HTML)
- how it was saved (in a Unicode encoding or not)
- the charset in a meta-tag in the header

If none of those are present, ö and ü can normally not go wrong on
Western systems. But maybe there are browsers that assume something
other than ISO-8859-1 for bytes with the 8th bit set (Russian,
Turkish...), or that even dare to assume UTF-8 these days if nothing
else is said (I would be surprised, but you never know).

If you, for example, save the file as ANSI and add
<meta http-equiv="Content-Type" content="text/html; charset=KOI8-R">,
then the bytes for ö and ü (F6 and FC) are shown as the characters at
those positions in the KOI8-R table:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
http://en.wikipedia.org/wiki/KOI8-R

Another issue is UTF-8, e.g.:
- save file as ANSI
- serve with <meta http-equiv="Content-Type" content="text/html;
charset=UTF-8">
The test fails because the ANSI bytes for ö and ü are not valid UTF-8
byte sequences; if you save the file as UTF-8, the problem disappears.
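
The UTF-8 mismatch can be checked the same way. Again a sketch rather than a definitive test, assuming TextDecoder and TextEncoder are available (they are in current browsers and Node.js); the variable names are illustrative.

```javascript
// The ANSI bytes for the test string (ö = 0xF6, ü = 0xFC).
var savedAsAnsi = new Uint8Array([
  0x48, 0x65, 0x6C, 0x6C, 0xF6, 0x20, 0x57, 0xFC, 0x72, 0x6C, 0x64
]);

// Declared as UTF-8: 0xF6 and 0xFC cannot start a UTF-8 sequence,
// so each one is replaced by U+FFFD (the replacement character).
var misread = new TextDecoder("utf-8").decode(savedAsAnsi);

// Saved as UTF-8 instead: ö becomes the two bytes C3 B6, ü becomes C3 BC.
var savedAsUtf8 = new TextEncoder().encode("Hell\u00F6 W\u00FCrld");
var reread = new TextDecoder("utf-8").decode(savedAsUtf8);

console.log(misread === "Hell\uFFFD W\uFFFDrld"); // true
console.log(savedAsUtf8.length);                  // 13 bytes for 11 characters
console.log(reread);                              // Hellö Würld
```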

Suggested solution for your code:

1. Traditional approach (easiest):
- <meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
- serve the page from server as ISO/IEC 8859-1 (AFAIK this is the
assumed default for Western web servers, normally no need to change)

2. Modern approach
- <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
- save file in UTF-8
- serve the page from server as UTF-8

Hope this helps,
 
Stevo

That is an incredibly helpful response, Bart. Thanks a lot. As far as the
meta tags go, there's nothing I can do about that as my code may appear
(with its foreign characters in strings) on any page. I have no control
over that. I do, however, have total control over how my files are
written and served and will give this info to our server guys to make
sure we're on the right track.

Thanks again :)
 
Bart Van der Donck

Stevo said:
That is an incredibly helpful response Bart. Thanks a lot. As far as the
meta tags go, there's nothing I can do about that as my code may appear
(with its foreign characters in strings) on any page. I have no control
over that. I do, however, have total control over how my files are
written and served and will give this info to our server guys to make
sure we're on the right track.

It depends which foreign characters you mean. As long as you stay
inside ISO-8859-1, you should normally be fine. The charset in which
the page is served should be more important than the <meta>
information. But I would also run browser tests with a default other
than ISO-8859-1, to make sure that everything is also correct in
those cases.
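
When the script may land on pages whose charset is unknown, one way to take the encoding out of the picture is to keep the source 7-bit ASCII and write the umlauts as \uXXXX escapes. The sketch below assumes that approach; escapeNonAscii is a hypothetical helper name, not part of any library, that could be run over string literals before the file is published.

```javascript
// Hypothetical helper: rewrite every non-ASCII UTF-16 code unit as a \uXXXX
// escape, so the resulting source text contains 7-bit ASCII only.
function escapeNonAscii(text) {
  return text.replace(/[^\x00-\x7F]/g, function (ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return "\\u" + ("0000" + hex).slice(-4);
  });
}

// Written with escapes, the literal means the same string in any charset.
var s = "Hell\u00F6 W\u00FCrld";

// What the literal looks like as plain ASCII source text.
var asSource = escapeNonAscii(s);
console.log(asSource); // Hell\u00F6 W\u00FCrld
```

The escaped literal and the literal typed with real umlauts produce the same string at run time; only the bytes in the file differ.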

Thanks again :)

You're welcome... Alan Flavell has done more interesting studies in
this field (see Google Groups archive).
 
Bart Van der Donck

Bart said:
It depends which foreign characters you mean. As long as you stay
inside ISO-8859-1, you should normally be fine. The charset in which
the page is served should be more important than the <meta>
information.

...of which here are some examples:

http://www.dotinternet.be/temp/a1.pl
http://www.dotinternet.be/temp/a2.pl

They are served as ISO-8859-1 with an attempt to override that from
<meta>; ISO-8859-1 should win according to the docs, and it does, as
tested here in IE/FF/Opera/Chrome.

One thing I'm not entirely sure of is the default charset in which
pages are served. I would assume ISO-8859-1 or just nothing on most
web servers (thus leaving the actual decision to <meta>).

Conclusion: you should normally be fine, but adding ISO-8859-1 in
<meta> would still be a bit better, if possible. If you're working
with external files, you might also add <script type="text/javascript"
charset="ISO-8859-1">.
 
Stevo

Thanks for the additional info. That last suggestion with the charset on
the script tag *IS* something I can do. This is great info.

Oh, and the characters I'll be expecting are just Western European
umlauts and accents (Germany, Spain, France, Italy and maybe Benelux and
Scandinavia too). We won't be getting any Asian or Arabic characters.
 
Dr J R Stockton

In comp.lang.javascript, Stevo said:
I'd always been under the misimpression that JavaScript strings were
7-bit ASCII like C strings and the issue had never come up before. It
seems that I'm wrong (happily). Does anyone know if I can assume this
will work regardless of code page, HTML doctype, quirks mode etc? Is it
only because I've made this a super simple page running on Windows XP
US-en that this works? Would it fail in a lot of configurations (e.g.
Mac, Linux, Asian codebases) ?

<html>
<body>
<script>
var s="Hellö Würld";
document.write(s);
alert(s);
</script>
</body>
</html>

That s string contains two German characters (just in case usenet is
still restricted to 7-bit characters like it was 20 years ago).

You should not assume that something which looks like "Hellö Würld"
(umlauted) on your Windows, in an unspecified editor or viewer, will
necessarily have the right international representation for the umlaut-
bearers, although it probably will.

However, if that source code generates the umlauted characters when
executed in a standards-compliant browser in the USA or Germany, then it
will do so in such browsers everywhere: character sets such as the Asian
ones are often built in, but are otherwise, AFAIK, add-ons rather than
substitutes.

JavaScript strings are sequences of 16-bit Unicode (UTF-16) code units
internally, and browsers consistently appear to work in Unicode too; but
Windows copy'n'paste translates those to what the destination can handle.
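
How strings are stored can be checked directly. Per the ECMAScript specification a string is a sequence of 16-bit code units, so the umlauts take one unit each, while characters outside the Basic Multilingual Plane take two. A small sketch (the variable names are illustrative, and codePointAt needs an ES2015 engine):

```javascript
var s = "Hell\u00F6 W\u00FCrld";

console.log(s.length);                         // 11: one code unit per character here
console.log(s.charCodeAt(4).toString(16));     // f6: the code unit for ö

// A character beyond U+FFFF is stored as two 16-bit code units (a surrogate pair).
var clef = "\uD834\uDD1E";                     // U+1D11E MUSICAL SYMBOL G CLEF
console.log(clef.length);                      // 2
console.log(clef.codePointAt(0).toString(16)); // 1d11e
```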

CAVEAT : I don't know how [languages like] Chinese work in Unicode,
needing a different character for every word.

The following code should show the characters available on your system;
I'd expect a US version to have fewest, unless it has add-on Native
American. Beware - Unicode has what might or might not be a design
fault, in that the last "now write forwards" character precedes the last
"now write backwards" character, so the output after it is reversed.
Smarter code might suppress that effect.
However, as most people can read Urdu, Tamil, etc. just as ineffectively
in either direction, it matters little in the present context.

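// Print code units 0..65535 as 1024 rows of 64 characters each.
var B = ["<pre>"];
for (var K = 0; K < 1024; K++) {
  var A = [(1e6 + K) + " "];                  // row label: 1000000 + row number
  for (var J = 0; J < 64; J++) {
    A.push(String.fromCharCode(64 * K + J));  // code unit 64*K + J
  }
  B.push(A.join(""));
}
B.push("<\/pre>");
document.write(B.join("\n"));
document.close();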

NOTE : in the reversed output line apparently numbered 041201 I see
.... IIIIIIIVVVIVIIVIIIIXXXIXIILCDMiiiiiiivvviviiviiiixxxixiilcdm
(de-reversed by copy'n'paste) where the roman numerals for I to XII,
i to xii are single characters (those are Number Forms, \u2160 - \u217F).
Interesting.

As I wrote before, some people like to write dates like "31 III 2009" or
"2009-III-31" ; that opens up a new class of Date Formatting &
Validation. The DATE2 Object can now read and write those, with single-
character months.

Alas, those characters are not constant width in monospace, and IE7 only
has I-XII i-x (FF has I-XII i-xii).


Refer to the Unicode site to find out more.
 
Dr J R Stockton

In comp.lang.javascript message
<08433ccc-530a-40d6-a6c0-040c44550066@o6g2000yql.googlegroups.com>,
Tue, 31 Mar 2009 01:48:37, Bart Van der Donck said:
You're welcome... Alan Flavell has done more interesting studies in
this field (see Google Groups archive).

See also
24 Feb 2007
Message-ID: <[email protected]> in
comp.infosystems.www.authoring.html, Subject: Alan J Flavell, RIP ;
<http://bytes.com/groups/html/834141-announce-alan-flavells-web-pages>
<http://www.alanflavell.org.uk/>
 
Thomas 'PointedEars' Lahn

Bart said:
Suggested solution for your code:

1. Traditional approach (easiest):
- <meta http-equiv="Content-Type" content="text/html;
charset=iso-8859-1">
- serve the page from server as ISO/IEC 8859-1 (AFAIK this is the
assumed default for Western web servers, normally no need to change)

2. Modern approach
- <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
- save file in UTF-8
- serve the page from server as UTF-8

As the `meta' element is rendered irrelevant to a conforming implementation
once a real HTTP Content-Type header says otherwise, that is (still) bad
advice. The recommended way is to use one encoding, and declare, in the
HTTP Content-Type header that the resource is served with, the same encoding.
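
The precedence described here can be written down as a small function. This is a simplified sketch, not how any particular browser implements it: charsetFromContentType and effectiveCharset are hypothetical names, and byte order marks, user overrides and content sniffing are deliberately ignored.

```javascript
// Extract the charset parameter from a Content-Type value, or null if absent.
function charsetFromContentType(value) {
  var match = /;\s*charset\s*=\s*"?([^";\s]+)"?/i.exec(value || "");
  return match ? match[1].toLowerCase() : null;
}

// The real HTTP header wins; <meta> is consulted only when the header
// names no charset; otherwise the caller's fallback applies.
function effectiveCharset(httpContentType, metaContentType, fallback) {
  return charsetFromContentType(httpContentType) ||
         charsetFromContentType(metaContentType) ||
         fallback;
}

console.log(effectiveCharset("text/html; charset=ISO-8859-1",
                             "text/html; charset=KOI8-R",
                             "windows-1252")); // iso-8859-1
console.log(effectiveCharset("text/html",
                             "text/html; charset=UTF-8",
                             "windows-1252")); // utf-8
```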

Also, as for ISO-8859-1 vs. UTF-8, there is nothing traditional vs. modern
about it. You should use UTF-8 (or another Unicode Transformation Format,
but UTF-8 is probably supported best) when you expect glyphs from different
Unicode/UCS-2 character ranges than Basic Latin to be formed through raw
sequences of code units in the resource (as opposed to character [entity]
references like &uuml; where the corresponding glyphs will be displayed
correctly regardless of the used encoding -- UCS-2 is already defined as the
HTML Document Character Set per HTML 4.01 Specification), and ISO-8859-1
when you are sure all possible non-ISO-8859-1 glyphs are formed through
character references or character entity references (programming languages
like PHP provide convenience functions to that end).

Obviously, in a globally deployed Web application you can't be sure of
anything about the user's environment, and you would want to speed up
server-side processing, so using a UTF would be the wise, if not even
logical, course of action.


PointedEars
 
Bart Van der Donck

Thomas said:
As the `meta' element is rendered irrelevant to a conforming implementation
once a real HTTP Content-Type header says otherwise, that is (still) bad
advice.  The recommended way is to use one encoding, and declare, in the
HTTP Content-Type header that the resource is served with, the same encoding.

True, but I think that this is exactly the problem. Apache doesn't
serve any charset by default [*]. But it appears that some distros
(Red Hat, Ubuntu, Debian) might ship their Apache versions with UTF-8
or ISO-8859-1. One does this, another does that... and often Apache is
blamed as the culprit, not the distro which actually made the change
[**]. Imagine the army of web developers hunting for the bug when their
charsets don't work :)

Thomas said:
Also, as for ISO-8859-1 vs. UTF-8, there is nothing traditional vs. modern
about it.

Well, ISO-8859-1 is older and used to be the default charset on the
internet (and still is for HTTP/1.1); UTF-8 is more recent and
attempts to take over this role. I understand your point of view of
course; UTF-8 or ISO-8859-1 should be chosen depending on the data,
and not on a traditional or modern style.

[*] http://httpd.apache.org/docs/2.0/mod/core.html section
'AddDefaultCharset Directive', see 'Default: AddDefaultCharset Off'.
[**] E.g. http://padawan.info/2004/07/debugging-chars.html and
http://www.deez.info/sengelha/2003/03/17/apache-content-type-nightmare/
Both report that their Apache recently switched from no charset to
ISO-8859-1.
 
slebetman

Thomas said:
As the `meta' element is rendered irrelevant to a conforming implementation
once a real HTTP Content-Type header says otherwise, that is (still) bad
advice.  The recommended way is to use one encoding, and declare, in the
HTTP Content-Type header that the resource is served with, the same encoding.

Bart said:
True, but I think that this is exactly the problem. Apache doesn't
serve any charset by default [*]. But it appears that some distros
(Red Hat, Ubuntu, Debian) might ship their Apache versions with UTF-8
or ISO-8859-1. One does this, another does that... and often Apache is
blamed as the culprit, not the distro which actually made the change
[**]. Imagine the army of web developers hunting for the bug when their
charsets don't work :)

Only those who serve static content. Those who work with dynamic
content can simply print "Content-type: $mimetype; charset=utf-8" and
not have to worry much about how the webserver is configured. In
really restrictive environments (no htaccess etc.) I even serve static
content by proxying via CGI scripts.
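
The same idea in JavaScript, as a rough sketch: the handler below assumes a Node.js-style response object with writeHead and end, and contentType and handler are illustrative names. It declares the charset in the real HTTP header and sends bytes in that same encoding, whatever the web server's default is.

```javascript
// Hypothetical helper: build a Content-Type value that names the charset.
function contentType(mimeType, charset) {
  return mimeType + "; charset=" + charset;
}

// Request handler: declare UTF-8 in the HTTP header and send UTF-8 bytes.
function handler(req, res) {
  var body = Buffer.from("<p>Hell\u00F6 W\u00FCrld</p>", "utf8");
  res.writeHead(200, {
    "Content-Type": contentType("text/html", "utf-8"),
    "Content-Length": body.length
  });
  res.end(body);
}

// To try it under Node.js:
// require("http").createServer(handler).listen(8080);
```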

Bart said:
<snip> UTF-8 or ISO-8859-1 should be chosen depending on the data,
and not on a traditional or modern style.

see http://www.joelonsoftware.com/articles/Unicode.html
 
slebetman

Stevo said:
I'd always been under the misimpression that JavaScript strings were
7-bit ASCII like C strings and the issue had never come up before. It
seems that I'm wrong (happily).

(Warning, OT:)
Actually, you're also wrong about C strings. Strings in C are simply
arrays of bytes. (What a 'byte' actually is depends on the platform,
according to the C spec; for most of us it is 8 bits.)

Try the following:

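#include <stdio.h>
int main (void) {
    /* Assuming ISO-8859-1 encoded terminal/shell */
    /* \366 and \374 are the octal byte values 0xF6 and 0xFC */
    const char *str = "Hell\366 W\374rld\n";
    fputs(str, stdout); /* do not pass data as a printf format string */
    return 0;
}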
 
Thomas 'PointedEars' Lahn

Thomas said:
As the `meta' element is rendered irrelevant to a conforming implementation
once a real HTTP Content-Type header says otherwise, that is (still) bad
advice. The recommended way is to use one encoding, and declare, in the
HTTP Content-Type header that the resource is served with, the same encoding.

Bart said:
True, but I think that this is exactly the problem. Apache doesn't
serve any charset by default [*].

True. However, you seem to have overlooked that it did once and that this
was *removed* afterwards (in version 2.1, on 2004-12-11) because it
complicated matters more than it helped.

Bart said:
But it appears that some distros (Red Hat, Ubuntu, Debian) might ship
their Apache versions with UTF-8 or ISO-8859-1.

Default configurations of older Apache packages do; Debian GNU/Linux (which
I am using and working with) certainly hasn't for quite a while. It is
the mistake of the webmasters if they use outdated packages or leave this
setting in spite of a diverse runtime environment.

Bart said:
One does this, another does that... and often Apache is
blamed as the culprit, not the distro which actually made the change
[**].

In this case, the problem was within the Apache default configuration,
regardless of the (GNU/)Linux distribution.

Bart said:
Imagine the army of web developers searching for bugs why their
charsets don't work :)

BTDT. There's server-side scripting to work around it when necessary, or
you can try to LART your webmaster. The `meta' element declaration
should only be used as a last resort and for validating local files.

Bart said:
Well, ISO-8859-1 is older and used to be the default charset on
internet (and still is for HTTP/1.1);

ACK (cf. RFC 1945, section 3.6.1, and RFC 2616, section 3.7.1.)

Bart said:
UTF-8 is more recent

"Recent" as in "18 years old"? Are you a winemaker?

Bart said:
and attempts to take over this role. I understand your point of view of
course; UTF-8 or ISO-8859-1 should be chosen depending on the data,
and not on a traditional or modern style.

ACK

[*] http://httpd.apache.org/docs/2.0/mod/core.html section
'AddDefaultCharset Directive', see 'Default: AddDefaultCharset Off'.

That's a Good Thing, believe it or not.

Bart said:
[**] E.g. http://padawan.info/2004/07/debugging-chars.html and
http://www.deez.info/sengelha/2003/03/17/apache-content-type-nightmare/
Both report that their Apache recently switched from no charset to
ISO-8859-1.

You really have a strange understanding of "recent", haven't you; those
postings are, as their URIs already indicate, from 2003/2004. You can read at

<https://issues.apache.org/bugzilla/show_bug.cgi?id=23421>

(which is broken-linked from the first posting) that and why it was
removed/commented out again.


PointedEars
 
Bart Van der Donck

Thomas said:
Bart said:
Apache doesn't serve any charset by default [*].

True.  However, you seem to have overlooked that it did once and that this
was *removed* afterwards (in version 2.1, on 2004-12-11) because it
complicated matters more than it helped.

I don't understand why Apache added such a thing in the first place.

Thomas said:
"Recent" as in "18 years old"?  Are you a winemaker?

I'm better at drinking it. 'more recent' was a factual observation in
comparison to ISO-8859-1. Not a statement on its own.

Thomas said:
That's a Good Thing, believe it or not.

I certainly believe that. Hence the not-so-irrelevant <meta> in most
cases with static html. Many web builders don't bother to look in conf
files, are not confident with such matters, or simply do not know the
importance of a charset in the header. You will now probably argue
that they should, but in practice it's only a small percentage.

Bart said:
[**] E.g. http://padawan.info/2004/07/debugging-chars.html and
http://www.deez.info/sengelha/2003/03/17/apache-content-type-nightmare/
Both report that their Apache recently switched from no charset to
ISO-8859-1.

Thomas said:
You really have a strange understanding of "recent", haven't you; those
postings are, as their URIs already indicate, from 2003/2004.  You can read at

https://issues.apache.org/bugzilla/show_bug.cgi?id=23421

That was the link I was looking for. All clear now.
 
