unprintable characters in a javascript produced msgbox

emrefan

I am wondering a bit about what I should see in a message box (or in a
webpage, for that matter) when I include an unprintable ASCII
character, say ASCII 255, in there. I experimented a bit on my PC
running Traditional Chinese Windows 98SE and found that the following
javascript code produced a message that seemed to have ASCII
represented as "y".

alert( 'the following char is ASCII FF: \xff. So what does it look like to you?' );

I had this line in the <HEAD> section of the relevant HTML file where
I put that javascript code:

<meta http-equiv='Content-Type' content='text/html; charset=Big5-
HKSCS'>

But even if I try to figure that into the picture, I still can't see
why it should come out as "y".

Can anybody please enlighten this thick mind?
 
Thomas 'PointedEars' Lahn

emrefan said:
I am wondering a bit about what I should see in a message box (or in a
webpage, for that matter) when I include an unprintable ASCII character,
say ASCII 255, in there.

The (7-bit US-)ASCII character set ranges from code points 0 (0x00) to 127
(0x7F). Everything else is _not_ part of (US-)ASCII.

I experimented a bit on my PC running Traditional Chinese Windows 98SE
and found that the following javascript code produced a message that
seemed to have ASCII represented as "y".

You are getting the LATIN SMALL LETTER Y WITH DIAERESIS character ("ÿ"; note
that there are two dots in the ascent) because this is the character at code
point U+00FF in the Unicode character set as defined in the Unicode
Standard, versions 2.1 and later (a conforming implementation of ECMAScript
Edition 3 must implement the latter), and at code point 255 (0xFF) of
several other character sets, most notably ISO/IEC 8859-1 and Windows-1252:

<http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Related_character_maps>
<http://unicode.org/>
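
A quick way to check this for oneself, as a minimal sketch that assumes an ECMAScript implementation with Unicode support (any current browser or Node.js); the variable names are only for illustration:

```javascript
// '\xff' is a hex escape for the code unit 0xFF (decimal 255).
const ch = '\xff';

// In a Unicode-aware engine this is the same string as '\u00ff',
// i.e. LATIN SMALL LETTER Y WITH DIAERESIS.
const sameAsUnicodeEscape = ch === '\u00ff';
const codeUnit = ch.charCodeAt(0);

// 255 lies outside the 7-bit US-ASCII range 0..127.
const isAscii = codeUnit <= 0x7f;

console.log(codeUnit, sameAsUnicodeEscape, isAscii); // 255 true false
```

This only inspects the string value; how a message box renders it is up to the user agent.
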
alert( 'the following char is ASCII FF: \xff. So what does it look like
to you?' );

Should be window.alert(...) so as to rely less on the UA's scope chain.
I had this line in the <HEAD> section of the relevant HTML file where I
put that javascript code:

<meta http-equiv='Content-Type' content='text/html; charset=Big5- HKSCS'>

But even if I try to figure that into the picture, I still can't see why
it should come out as "y".

The display behavior for the code point 0xFF of the *proposed* character
encoding Big5-HKSCS (which uses the Big5 Character Set with Hong Kong
Supplementary Character Set), even if written properly, is undefined:

<http://en.wikipedia.org/wiki/Big5#HKSCS>
<http://www.iana.org/assignments/charset-reg/>

You should also check the HTTP response message's headers for a
`Content-Type' header that says differently, for it takes precedence then:

<http://www.w3.org/TR/1999/REC-html401-19991224/charset.html#h-5.2.2>
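
The precedence described in that section (HTTP header first, then the meta declaration) could be sketched roughly as below. The function names `charsetFromContentType` and `effectiveCharset` are made up for this illustration, the parsing is deliberately simplistic, and real user agents apply further rules and heuristics:

```javascript
// Extract the charset parameter from a Content-Type value, or null if absent.
function charsetFromContentType(value) {
  if (!value) return null;
  const match = /;\s*charset\s*=\s*"?([^\s;"]+)"?/i.exec(value);
  return match ? match[1] : null;
}

// A charset in the HTTP header takes precedence over the one in the
// meta element; the meta value is used only if the header names none.
function effectiveCharset(httpContentType, metaContentType) {
  return charsetFromContentType(httpContentType)
      || charsetFromContentType(metaContentType);
}

// Header names a charset: the meta declaration is ignored.
console.log(effectiveCharset('text/html; charset=ISO-8859-1',
                             'text/html; charset=Big5-HKSCS')); // ISO-8859-1

// Header names none: the meta declaration is used.
console.log(effectiveCharset('text/html',
                             'text/html; charset=Big5-HKSCS')); // Big5-HKSCS
```
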


HTH

PointedEars
 
Bart Van der Donck

emrefan said:
I am wondering a bit about what I should see in a message box (or in a
webpage, for that matter) ...

Character encoding in message boxes or web pages are two totally
different things.
... when I include an unprintable ASCII character, say ASCII 255,
in there.  

Code points above 127 are not ASCII anymore. And why would it be
unprintable ?
I experimented a bit on my PC running Traditional Chinese Windows
98SE and found that the following javascript code produced a
message that seemed to have ASCII represented as "y".

Google Groups probably replaced your "y-umlaut" by "y".
     alert( 'the following char is ASCII FF: \xff. So what does it
look like to you?' );

This always looks the same for everyone, namely a y with an umlaut on.
No other display is possible here.
I had this line in the <HEAD> section of the relevant HTML file where
I put that javascript code:

     <meta http-equiv='Content-Type' content='text/html; charset=Big5-
HKSCS'>

That line does not affect javascript's internal code point table (like
eg. \xff). It defines which character set must be used on the web
page. For displaying y-umlaut on a web page, you probably want:

<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-1">

If you want both ISO-8859-1 and Chinese on the same page, I would
definitely go for UTF-8.
But even if I try to figure that into the picture, I still can't see
why it should come out as "y".

Because you get what you define :) If you say ISO-8859-1, then the
browser ties code point 255 to y-umlaut. If you say ISO-8859-2, then
you get a dot above (˙), etc.
http://en.wikipedia.org/wiki/ISO_8859-1
http://en.wikipedia.org/wiki/ISO_8859-2
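
To make the difference concrete, here is a small sketch with a hand-written table of what the single byte 0xFF stands for in a few ISO 8859 parts. The table `byteFF` and the helper `byteFFAs` are invented for this example and cover only these three entries; this is about bytes in a document, not about the \xff escape in script source:

```javascript
// Unicode code point assigned to byte 0xFF in some ISO 8859 parts.
const byteFF = {
  'ISO-8859-1': 0x00ff, // LATIN SMALL LETTER Y WITH DIAERESIS
  'ISO-8859-2': 0x02d9, // DOT ABOVE
  'ISO-8859-5': 0x045f, // CYRILLIC SMALL LETTER DZHE
};

// Return the character that byte 0xFF represents in the given charset,
// or null if the charset is not in the small table above.
function byteFFAs(charset) {
  const codePoint = byteFF[charset.toUpperCase()];
  return codePoint === undefined ? null : String.fromCharCode(codePoint);
}

for (const name of Object.keys(byteFF)) {
  console.log(name, byteFFAs(name));
}
```
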

Hope this helps,
 
Thomas 'PointedEars' Lahn

Bart said:
Character encoding in message boxes or web pages are two totally
different things.

Not true.
This always looks the same for everyone, namely a y with an umlaut on.
No other display is possible here.

You are mistaken. The \x string literal escape sequence may or may not
specify a Unicode character, depending on the ECMAScript implementation.
That line does not affect javascript's internal code point table (like
eg. \xff).

It could affect it if there was no corresponding HTTP header present that
says otherwise. There is no "javascript", BTW.
It defines which character set must be used on the web page.

Unless a corresponding HTTP header is present that says otherwise. There
are no "web pages", BTW.


PointedEars
 
Bart Van der Donck

Thomas said:
Bart Van der Donck wrote:

Not true.

It is true, because the character encoding is done at a different
level. Message boxes -like in this example- are actually much easier.
There can only be one possible representation. But when trying to
write y-umlaut in a web page, you have a bunch of possibilities, off
the top of my head, at least 10 - of which of course some are more
preferred than others.
You are mistaken.  The \x string literal escape sequence may or may not
specify a Unicode character, depending on the ECMAScript implementation.

But I was only saying that alert('\xff') always shows y-umlaut in any
browser. y-umlaut is the character that is tied to code point 255 in
any ECMAScript implementation.
That line does not affect javascript's internal code point table (like

It could affect it if there was no corresponding HTTP header present that
says otherwise.  

Untrue. The display of \x.. (and \u....) can never be influenced by
any HTTP-header. The notation is ASCII-safe, and is passed to the
javascript engine to tie it to a fixed character. I think you're
mixing up the character set of a web page with javascript's consistent
internal code point table.
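
What "ASCII-safe notation" means here can be shown with a short sketch, assuming a current engine that supports template literals and `String.raw`; the variable names are only illustrative:

```javascript
// The four source characters backslash, x, f, f, kept unprocessed by String.raw.
const sourceText = String.raw`\xff`;

// Every character of the notation itself lies within ASCII (0..127) ...
const notationIsAscii = [...sourceText].every((c) => c.charCodeAt(0) <= 0x7f);

// ... while the string value the engine produces from it is a single
// code unit with the value 255.
const value = '\xff';

console.log(sourceText.length, notationIsAscii, value.length, value.charCodeAt(0));
// 4 true 1 255
```

Because the source text contains only ASCII bytes, it reads the same under any ASCII-compatible document encoding.
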
There is no "javascript", BTW.

Is that so.
Unless a corresponding HTTP header is present that says otherwise.

That is far from sure, and could easily vary from browser to browser.
Anyway - it would be unwise to specify a charset on the web page that
contradicts the HTTP header (coder's fault, not browser's fault).
There are no "web pages", BTW.

Is that so :)
 
Thomas 'PointedEars' Lahn

Bart said:
It is true, because the character encoding is done at a different level.
Message boxes -like in this example- are actually much easier. There can
only be one possible representation.

You are mistaken. It depends on the user agent which characters are
supported in a message box. However, it has been observed that message
boxes use the character set of their document, regardless of the encoding
that the ECMAScript implementation supports. We have discussed this here
before.
But when trying to write y-umlaut in a web page, you have a bunch of
possibilities, off the top of my head, at least 10 - of which of course
some are more preferred than others.

I don't think the OP wanted to write "y-umlaut" at all.
But I was only saying that alert('\xff') always shows y-umlaut in any
browser.

But you are dead wrong.
y-umlaut is the character that is tied to code point 255 in any
ECMAScript implementation.

However, there are implementations that do not support Unicode.
Untrue. The display of \x.. (and \u....) can never be influenced by any
HTTP-header.

\x definitely can. Obviously, \u cannot.
The notation is ASCII-safe,

\x cannot be ASCII-safe, as it allows characters to be represented that
are outside the range of the ASCII character set.
Is that so.

Yes, there are different ECMAScript implementations (some of which don't
even deserve that designation), and versions thereof.
That is far from sure, and could easily vary from browser to browser.

It has been observed that user agents honor the Specification in that
regard. This was the reason why AddDefaultCharset was disabled in newer
Apache versions.
Anyway - it would be unwise to specify a charset on the web page that
contradicts the HTTP header (coder's fault, not browser's fault).

Nowadays, no argument there.


PointedEars
 
Bart Van der Donck

Thomas said:
You are mistaken.  It depends on the user agent which characters are
supported in a message box.  However, it has been observed that message
boxes use the character set of their document, regardless of the encoding
that the ECMAScript implementation supports.  We have discussed this here
before.

That is not the point here. It is clear that the original poster was
talking about alert('\xff') versus the encoding of y-umlaut in an HTML-
document. In that regard the representation of \xff has nothing to do
with the representation of y-umlaut outside javascript.

[...]
But you are dead wrong.

Well, let's see then. Could you show a case where alert('\xff') does
not show y-umlaut ?
However, there are implementations that do not support Unicode.

Irrelevant. y-umlaut does not need Unicode at all.
\x definitely can.  Obviously, \u cannot.

Let's see. Could you show an example where \x.. is displayed
differently depending on a varying HTTP-header ?
\x cannot be ASCII-safe, as it allows characters to be represented that
are outside the range of the ASCII character set.

That's why I said the *notation* is ASCII-safe. What is *represented*
by that notation, is a different job; that is decided by the
javascript engine.
Yes, there are different ECMAScript implementations (some of which don't
even deserve that designation), and versions thereof.

That's like saying that cars don't exist, but only implementations of
fuel engines.
 
Thomas 'PointedEars' Lahn

Bart said:
That is not the point here. It is clear that the original poster was
talking about alert('\xff') versus the encoding of y-umlaut in an HTML-
document. In that regard the representation of \xff has nothing to do
with the representation of y-umlaut outside javascript.

Yes, it has.
[...]
But you are dead wrong.

Well, let's see then. Could you show a case where alert('\xff') does
not show y-umlaut ?

Wasting my time supporting your logical fallacy? I don't think so.

Ask someone living in Bosnia, Croatia, Czech Republic, Hungary, Poland,
Romania, Serbia, Slovakia, Slovenia, Malta, Estonia, Latvia, Lithuania,
Greenland, Bulgaria, Belarus, Russia, Macedonia, Greece, Israel, or any
other country where the character set designed for their main language does
not have "y-umlaut", as you put it (you really don't know what an umlaut
is), at decimal code point 255 (*except* with Unicode support), instead.
Irrelevant.

Not at all.
y-umlaut does not need Unicode at all.

True, it is also contained in ISO-8859-1. However, as ASCII does not
provide this character, if the \x string escape sequence is used and Unicode
support is not present, the locale encoding (or the encoding of the
document/file) must be used to determine which character to display for
decimal code points beyond 127. (If Unicode is not supported, "\uhhhh" is
interpreted as "uhhhh".)
That's why I said the *notation* is ASCII-safe.

It would seem whether that is true depends on how one defines "ASCII-safe".
What is *represented* by that notation, is a different job; that is
decided by the javascript engine.
See?


That's like saying that cars don't exist, but only implementations of
fuel engines.

As a matter of fact, there are JavaScript and JScript versions that are not
fully ECMAScript-compliant, and therefore do not provide Unicode support.


PointedEars
 
Bart Van der Donck

Thomas said:
Bart Van der Donck wrote:

Wasting my time supporting your logical fallacy?  I don't think so.

Ask someone living in Bosnia, Croatia, Czech Republic, Hungary, Poland,
Romania, Serbia, Slovakia, Slovenia, Malta, Estonia, Latvia, Lithuania,
Greenland, Bulgaria, Belarus, Russia, Macedonia, Greece, Israel, or any
other country where the character set designed for their main language does
not have "y-umlaut", as you put it (you really don't know what an umlaut
is), at decimal code point 255 (*except* with Unicode support), instead.

You are simply wrong; all of those will display y-umlaut with
alert('\xff'). You keep talking about Unicode but it has nothing to do
with it. As I said, just give me one example, and I'll be immediately
convinced of your point. But there is no such example.
Not at all.


True, it is also contained in ISO-8859-1.  However, as ASCII does not
provide this character, if the \x string escape sequence is used and Unicode
support is not present, the locale encoding (or the encoding of the
document/file) must be used to determine which character to display for
decimal code points beyond 127.  

You just wrote the core of your misconception. In the (nowadays highly
unlikely) case that Unicode support would not be present in the
browser's script engine, the locale is NOT used as lookup-table for
\x. It's always the internal lookup table of the script engine. It has
nothing to do with the document or its encoding !

[...]
It would seem whether that is true depends on how one defines "ASCII-safe".

You have the nasty habit of giving a silly twist to a position that you
can no longer hold. ASCII-safe is code point 0 to 127, as you
know perfectly well. There is no room for other interpretations.

See what then ?
As a matter of fact, there are JavaScript and JScript versions that are not
fully ECMAScript-compliant, and therefore do not provide Unicode support.

I'm not going to reply to your arguments like "there is no
javascript", "you don't know what an umlaut is", "web pages don't
exist", etc. I made my point clear enough. You already conveniently
snipped my question "Could you show an example where \x.. is displayed
differently depending on a varying HTTP-header", which concerned one of
your basic points.
 
Bart Van der Donck

Thomas said:
if the \x string escape sequence is used and Unicode support is
not present, the locale encoding (or the encoding of the
document/file) must be used to determine which character to
display for decimal code points beyond 127.

I believe many of your arguments in this thread were based on this
false assumption. A small investigation:

| \xXX: The character with the Latin-1 encoding specified by the
| two hexadecimal digits XX between 00 and FF. For example, \xA9
| is the hexadecimal sequence for the copyright symbol. [*]

In your view, this \xA9 would then become Latin S with Caron [**]
under a Latin-2 [***] locale. This is not true, as \x uses its own
independent lookup-table, namely Latin-1 [*].
\x originates from the same time as other pre-Unicode instructions
like escape() or unescape(), which also use Latin-1.

| It [ISO-8859-1] is less formally referred to as Latin-1. [****]

[*] http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Literals
[**] http://en.wikipedia.org/wiki/Š
[***] http://en.wikipedia.org/wiki/ISO_8859-2
[****] http://en.wikipedia.org/wiki/ISO_8859-1
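
The documented behavior quoted above can be checked with a few lines. This is only a sketch for a current engine; it assumes the legacy global functions `escape()` and `unescape()` are available, which is the case in browsers and Node.js but is not guaranteed everywhere:

```javascript
// \xA9 denotes code unit 0xA9, the copyright sign, same as \u00A9.
const copyright = '\xA9';
console.log(copyright === '\u00A9'); // true

// escape() writes code units below 256 as %XX and higher ones as %uXXXX.
console.log(escape(copyright)); // %A9
console.log(escape('\u0160'));  // %u0160 (Latin capital S with caron)

// unescape() maps %FF back to code unit 0xFF, not to a locale-specific character.
console.log(unescape('%FF') === '\xff'); // true
```
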
 
Thomas 'PointedEars' Lahn

Bart said:
You are simply wrong; all of those will display y-umlaut with
alert('\xff').

No, they won't. ISO-8859-2, for example, does not have y with diaeresis at
code point 255. Neither has ISO-8859-3 or any other ISO-8859-x family
encoding but ISO-8859-11. And I am not even mentioning more exotic
character sets and encodings.
You keep talking about Unicode but it has nothing to do with it.

You are mistaken, and I'm tired explaining to you why. There is *nothing*
in the ECMAScript Specification that specifies what should happen with \x
escape sequences if Unicode support is not there, because ECMAScript Ed. 1
already introduced Unicode support. However, as we know that there are
JavaScript and JScript versions that are not ECMAScript-compliant, that
therefore don't have Unicode support or the operating system's API they are
running on is not Unicode-compliant, it is locale/encoding-dependent what
happens with \x80 to \xFF then.
As I said, just give me one example, and I'll be immediately
convinced of your point. But there is no such example.

As I indicated, you are trying to shift the burden of proof and I will not
support that.
You just wrote the core of your misconception. In the (nowadays highly
unlikely) case that Unicode support would not be present in the
browser's script engine, the locale is NOT used as lookup-table for
\x. It's always the internal lookup table of the script engine.

There is no "internal lookup table of the script engine", that is a fantasy
of yours. window.alert() especially, is a method of a host object whose
behavior is defined by the UA's API.
It has nothing to do with the document or its encoding !

If that were so, it would be *you* who would have to prove *that*, not
vice-versa.


PointedEars
 
Bart Van der Donck

Thomas said:
Bart Van der Donck wrote:

No, they won't.  ISO-8859-2, for example, does not have y with diaeresis at
code point 255.  Neither has ISO-8859-3 or any other ISO-8859-x family
encoding but ISO-8859-11.  And I am not even mentioning more exotic
character sets and encodings.

The character set doesn't matter. \x always works with Latin-1 (=ISO
8859-1), regardless of the character set of the web page.
You are mistaken, and I'm tired explaining to you why.  There is *nothing*
in the ECMAScript Specification that specifies what should happen with \x
escape sequences if Unicode support is not there, because ECMAScript Ed. 1
already introduced Unicode support.  However, as we know that there are
JavaScript and JScript versions that are not ECMAScript-compliant, that
therefore don't have Unicode support or the operating system's API they are
running on is not Unicode-compliant, it is locale/encoding-dependent what
happens with \x80 to \xFF then.

What you write in this paragraph is correct, except your last
conclusion (after your last comma). When the ECMAScript Specification
says nothing about \x outside of Unicode, you should obviously look at
the javascript docs themselves.
As I indicated, you are trying to shift the burden of proof and I will not
support that.

Here is the proof.

(1) The documentation says:

| \xXX: The character with the Latin-1 encoding specified by the
| two hexadecimal digits XX between 00 and FF.

http://developer.mozilla.org/en/docs/Core_JavaScript_1.5_Guide:Literals

(2) Verify in browser without Unicode support:

I have installed Netscape 2.0 from:
http://netscape.1command.com/client_archive20x.php

Give it the following code:

<html>
<head>
<meta http-equiv="Content-Type"
content="text/html; charset=ISO-8859-2">
</head>
<body>
<script type="text/javascript">
alert('\xff')
</script>
</body>
</html>

The outcome is expected and documented as in (1). Screenshot:
http://www.dotinternet.be/temp/demo.jpg

I gave you proof from the docs (developer.mozilla.org) plus a
demonstration with Netscape 2. Hopefully this will convince you now.
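
Anyone who wants to repeat the test with other charsets could generate the pages with a small helper like the one below. This is just a sketch: `buildTestPage` is a made-up name, it is meant to be run as a standalone script (for example with Node.js), and the resulting files should be opened locally or served without a charset in the HTTP header so that only the meta declaration varies:

```javascript
// Build a test page that declares the given charset and alerts '\xff'.
function buildTestPage(charset) {
  return [
    '<html>',
    '<head>',
    '<meta http-equiv="Content-Type"',
    'content="text/html; charset=' + charset + '">',
    '</head>',
    '<body>',
    '<script type="text/javascript">',
    "alert('\\xff')", // the page source contains the escape, not the character
    '</script>',
    '</body>',
    '</html>',
  ].join('\n');
}

console.log(buildTestPage('ISO-8859-2'));
```
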
 
