whitespaces

  • Thread starter Andreas Bergmaier

Andreas Bergmaier

Hello,
is there a definition list for which characters match the RegExp /\s/
(and followed by all implementations)? I'm currently looking for the
non-breaking-space \u00A0. I found some scripts that use [\s\uA0] in
RegExps, so I wondered whether this is really needed.

A quick (and maybe dirty) test script:

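function test(x) {
  // Collect every 16-bit code unit whose one-character string matches x.
  var r = [];
  for (var i = 0; i < 0x10000; i++) // 65536
    if (x.exec(String.fromCharCode(i)))
      r.push(i);

  // Label each hit ("bzw" is a German abbreviation for "respectively").
  // Note: the number printed after "\x" is decimal, not hexadecimal.
  r = r.map(function (i) {
    return "\\u" + i.toString(16) + " bzw \\x" + i.toString(10) +
      ": '" + String.fromCharCode(i) + "'";
  });

  return r.join("\n");
}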

Called with test(/\s/); it returned in Firefox
\u9 bzw \x9: ' '
\uA bzw \x10: '
'
\uB bzw \x11: ' '
\uC bzw \x12: ' '
\uD bzw \x13: '
'
\u20 bzw \x32: ' '
\uA0 bzw \x160: ' '
\u1680 bzw \x5760: ' '
\u180E bzw \x6158: '᠎'
\u2000 bzw \x8192: ' '
\u2001 bzw \x8193: ' '
\u2002 bzw \x8194: ' '
\u2003 bzw \x8195: ' '
\u2004 bzw \x8196: ' '
\u2005 bzw \x8197: ' '
\u2006 bzw \x8198: ' '
\u2007 bzw \x8199: ' '
\u2008 bzw \x8200: ' '
\u2009 bzw \x8201: ' '
\u200A bzw \x8202: ' '
\u2028 bzw \x8232: '
'
\u2029 bzw \x8233: '
'
\u202F bzw \x8239: ' '
\u205F bzw \x8287: ' '
\u3000 bzw \x12288: ' '

and in Opera it added
\u200B bzw \x8203: '​'

But the non-breaking space was in both lists. So I guess [\s\uA0] is
senseless? Are there any implementations that do something else? I'm sorry
I couldn't test IE.

Bergi
 
Jukka K. Korpela

Andreas said:
is there a definition list for which characters match the RegExp /\s/

ECMA-262 clause 15.10.2.12 defines \s in terms of WhiteSpace (clause 7.2)
and LineTerminator (7.3). The former is somewhat implicitly defined, since
it lists some specific characters and then adds a row with the generic
name "Any other Unicode 'space separator'" and formal name "<USP>", but the
significant part is really in the Code Unit Value: Other category "Zs". So
ECMA-262 really defines "WhiteSpace" characters as those with Unicode
general category Zs ("Separator, space") in Unicode version 3. This
includes, for example, EN SPACE, which is not mentioned explicitly.

There is ambiguity, or "freedom", concerning Zs characters added to Unicode
after version 3: implementations are not required to treat them as
WhiteSpace, but they are allowed to.
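
The relationship between \s and category Zs can be checked directly in engines that support Unicode property escapes (\p{Zs} with the u flag). That feature is from ES2018 and postdates this thread, so its availability is an assumption here. A minimal sketch that collects the Zs code units known to the engine's own Unicode version and compares them with what \s matches (the helper name collectMatches is made up):

```javascript
// Sketch: compare \s with Unicode general category Zs ("Separator, space").
// Assumes an engine with ES2018 property escapes (\p{...} with the u flag).
function collectMatches(re) {
  var hits = [];
  for (var i = 0; i < 0x10000; i++) {
    if (re.test(String.fromCharCode(i))) hits.push(i);
  }
  return hits;
}

var zs = collectMatches(/\p{Zs}/u);
var ws = collectMatches(/\s/);

// Zs code units that \s fails to match (should be empty in a conforming engine).
var zsNotInS = zs.filter(function (c) { return ws.indexOf(c) === -1; });
// Code units matched by \s that are not Zs (tab, line terminators, ...).
var sNotInZs = ws.filter(function (c) { return zs.indexOf(c) === -1; });

console.log("Zs but not \\s: " + zsNotInS.length);
console.log("\\s but not Zs: " + sNotInZs.length);
```

Which characters count as Zs depends on the Unicode version the engine ships with, which is exactly the "freedom" described above.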
(and followed by all implementations)?

It's conceivable that implementations might get this wrong for rarely used
characters like \u2029 (paragraph separator), but on the whole, I don't see
much reason to be suspicious about the basic characters.
I'm currently looking for the
non-breaking-space \u00A0.

It's defined as being a WhiteSpace character, somewhat oddly, since many
whitespace concepts in computer languages treat it as yet another graphic
character.
I found some scripts that use [\s\uA0] in
RegExps, so I wondered whether this is really needed.

I don't see there's much problem in using it
A quick (and maybe dirty) test script:

Well a _quick_ test to check whether the no-break space matches \s would be
to test that alone...
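
Such a single-character check might look like the following minimal sketch. The variable names are illustrative only, and the result depends on the engine being tested:

```javascript
// Test the no-break space alone, written with the four-digit \u escape.
var NBSP = "\u00A0";
var nbspMatchesS = /\s/.test(NBSP);

console.log("U+00A0 matches \\s: " + nbspMatchesS);
```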
But the non-breaking space was in both lists. So I guess [\s\uA0] is
senseless?

No, it just has some redundancy.
Are there any implementations that do something else? I'm
sorry I couldn't test IE.

IE 9 treats no \uA0 as matching \s.
 
Lasse Reichstein Nielsen

Andreas Bergmaier said:
is there a definition list for which characters match the RegExp /\s/

Only the ECMAScript specification.
(and followed by all implementations)?

Nope.

The general problem is that the ECMAScript specification states what
should be matched in terms of the Unicode specification, without
saying which version of Unicode to follow.
A more down-to-earth problem is that some browsers just get it wrong.
I'm currently looking for the
non-breaking-space \u00A0. I found some scripts that use [\s\uA0] in
RegExps, so I wondered whether this is really needed.

I believe some older IEs didn't match any non-ASCII whitespace (but that's
from memory - I only have IE 9 here, so I can't check).
A quick (and maybe dirty) test script:

function test(x) {
var r = [];
for (var i=0; i<0x10000; i++) // 65536
if (x.exec(String.fromCharCode(i)))
r.push(i);

r = r.map(function(i){
return "\\u"+i.toString(16)+" bzw \\x"+i.toString(10) +
": '"+String.fromCharCode(i)+"'"
});

return r.join("\n");
}

I've written code similar to that in the past to find differences.
Of course I didn't save a log of my results :(.

....
But the non-breaking space was in both lists. So I guess [\s\uA0] is
senseless? Are there any implementations that do something else? I'm
sorry I couldn't test IE.

Try IE 6 if you get a chance.

My current browsers:
IE9     Chrome 12  Safari 5  Firefox 4  Opera 11
U+0009  U+0009     U+0009    U+0009     U+0009
U+000a  U+000a     U+000a    U+000a     U+000a
U+000b  U+000b     U+000b    U+000b     U+000b
U+000c  U+000c     U+000c    U+000c     U+000c
U+000d  U+000d     U+000d    U+000d     U+000d
U+0020  U+0020     U+0020    U+0020     U+0020
U+00a0  U+00a0     U+00a0    U+00a0     U+00a0
U+1680  U+1680     U+1680    U+1680     U+1680
U+180e  U+180e     U+180e    U+180e     U+180e
U+2000  U+2000     U+2000    U+2000     U+2000
U+2001  U+2001     U+2001    U+2001     U+2001
U+2002  U+2002     U+2002    U+2002     U+2002
U+2003  U+2003     U+2003    U+2003     U+2003
U+2004  U+2004     U+2004    U+2004     U+2004
U+2005  U+2005     U+2005    U+2005     U+2005
U+2006  U+2006     U+2006    U+2006     U+2006
U+2007  U+2007     U+2007    U+2007     U+2007
U+2008  U+2008     U+2008    U+2008     U+2008
U+2009  U+2009     U+2009    U+2009     U+2009
U+200a  U+200a     U+200a    U+200a     U+200a
                                        U+200b
U+2028  U+2028     U+2028    U+2028     U+2028
U+2029  U+2029     U+2029    U+2029     U+2029
U+202f  U+202f     U+202f    U+202f     U+202f
U+205f  U+205f     U+205f    U+205f     U+205f
U+3000  U+3000     U+3000    U+3000     U+3000
U+feff

The U+200b recognized by Opera is probably for historical reasons. It
used to be whitespace in Unicode up to version 3.2 (category Zs), but
then changed to be formatting (category Cf).
The U+FEFF of IE is probably just them trying to be helpful, so that
trimming a Unicode text with a BOM will work. U+FEFF was always category
Cf.
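
Differences like these can be found mechanically by comparing an engine's \s against a baseline list. The sketch below takes as its baseline the code points that all five browsers listed above agree on. The names diffAgainstBaseline and toUPlus are made up for illustration.

```javascript
// Baseline: code points matched by \s in all five browsers listed above.
var baseline = [
  0x0009, 0x000a, 0x000b, 0x000c, 0x000d, 0x0020, 0x00a0, 0x1680, 0x180e,
  0x2000, 0x2001, 0x2002, 0x2003, 0x2004, 0x2005, 0x2006, 0x2007, 0x2008,
  0x2009, 0x200a, 0x2028, 0x2029, 0x202f, 0x205f, 0x3000
];

// Returns the code units the pattern matches beyond the expected list
// ("extra") and the expected entries it fails to match ("missing").
function diffAgainstBaseline(re, expected) {
  var extra = [], missing = [];
  for (var i = 0; i < 0x10000; i++) {
    var matched = re.test(String.fromCharCode(i));
    var listed = expected.indexOf(i) !== -1;
    if (matched && !listed) extra.push(i);
    if (!matched && listed) missing.push(i);
  }
  return { extra: extra, missing: missing };
}

// Format a code unit as U+xxxx with four lower-case hex digits.
function toUPlus(c) {
  return "U+" + ("0000" + c.toString(16)).slice(-4);
}

var diff = diffAgainstBaseline(/\s/, baseline);
console.log("extra:   " + diff.extra.map(toUPlus).join(" "));
console.log("missing: " + diff.missing.map(toUPlus).join(" "));
```

An engine that follows a newer Unicode version may well report entries in both lists, so the output is a snapshot of one implementation rather than a verdict on the specification.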

/L
 
Thomas 'PointedEars' Lahn

Jukka said:

When talking about ECMA-262 you should name the Edition(s) you are referring
to. Currently there are 5 of them, with number 4 not having reached Final
status until the Netscape browser division, its proponent, was dissolved,
but having been implemented as Microsoft JScript.NET and Macromedia/Adobe
ActionScript 2.0+ nevertheless, and numbers 3 and 5 being (partially)
implemented by client-side script engines.
clause 15.10.2.12 defines \s in terms of WhiteSpace (clause 7.2)
and LineTerminator (7.3). The former is somewhat implicitly defined, since
it lists some specific characters and then adds a row with the
generic name "Any other Unicode 'space separator'" and formal name
"<USP>",

There is nothing implicit about this. The terminals that can be produced
through either production of the ECMAScript grammar are explicitly defined,
though partially through references.

However, implementations vary, so a correct answer to this question must
begin with "Yes, but …".
but the significant part is really in the Code Unit Value: Other
category "Zs".

"Zs" is _not_ a "Code Unit Value". It is an abbreviation for the Unicode
character category "Separator, Space", or simplified "spaces". What is
meant by this unfortunate presentation is that Unicode characters in that
category need be produced by the /WhiteSpace/ production.

See the Unicode Standard, Version 6.0, p. 26, for a definition of "code unit
value", and an explanation as to why it is correct, though somewhat
misleading, that it was used in that way in Edition 5 of the ECMAScript
Language Specification ("code point value" would have been equally correct,
but less misleading), except for "Other category 'Zs'".
It's defined as being a WhiteSpace character,

In prose, it's "whitespace", "white-space" or "white space" (ES5); never
WhiteSpace. /WhiteSpace/ is only the name of the goal symbol in the
ECMAScript grammar, and there are several characters that can be produced
from it.
somewhat oddly, since many whitespace concepts in computer languages treat
it as yet another graphic character.
Rubbish.
I found some scripts that use [\s\uA0] in
RegExps, so I wondered whether this is really needed.

I don't see there's much problem in using it

I do see the problem that it is syntactically invalid, and ought to throw a
SyntaxError exception. `\u' must be followed by exactly four hexadecimal
digits (ES3/5, 7.8.4).
But the non-breaking space was in both lists. So I guess [\s\uA0] is
senseless?

No, it just has some redundancy.

No, since implementations vary, assuming the syntactically correct form was
used, it would complete the character class for implementations that are not
conforming in that regard.
IE 9 treats no \uA0 as matching \s.

The JScript engine (not: IE 9) ought to treat it as a syntax error.
However, in a string literal it might treat it as if it was the four-
character string value "\\uA0", which would explain why that does not match
/\s/.

Proper tests need be performed using the syntactically correct form, \u00A0
or \u00a0, or even \xA0 or \xa0. I will not be able to test the JScript
engine in MSHTML 9 before tomorrow; however, tests with JScript 5.6.6626 in
MSHTML 6.0.2800.1106 show that it, surprisingly, treats missing hex digits
in a Unicode escape sequence in a string literal as a syntax error, but
/\ua0/ as I described for string literals before, actually matching "\\ua0".
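
The difference between the complete and the truncated escape can be probed as in the following sketch. It assumes an engine that implements the web-compatibility regular expression grammar (ECMA-262 Annex B) and the later u flag, neither of which was tested in this thread. In such engines an incomplete \u in a non-Unicode pattern is read as a literal "u".

```javascript
// Complete escape: matches the no-break space itself.
var fullEscape = /\u00A0/.test("\u00A0");

// Truncated escape in a regular expression literal: web-compatible engines
// read \u as the letter "u", so the pattern matches the text "uA0" instead.
var truncated = /\uA0/;
var matchesNbsp = truncated.test("\u00A0");
var matchesText = truncated.test("uA0");

// In Unicode mode (u flag) the truncated form is rejected outright.
var rejectedInUnicodeMode = false;
try {
  new RegExp("\\uA0", "u");
} catch (e) {
  rejectedInUnicodeMode = e instanceof SyntaxError;
}

console.log(fullEscape, matchesNbsp, matchesText, rejectedInUnicodeMode);
```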


PointedEars
 
Andreas Bergmaier

Dr said:
I found some scripts that use [\s\uA0] in
RegExps, so I wondered whether this is really needed.
Sorry, it was [\s\xa0]. See below.


Thank you, http://www.merlyn.demon.co.uk/js-valid.htm#Fred is what I was
looking for. According to that, older browsers don't include the
non-breaking space.

My next question is about the notation. As Thomas already complained,
I've messed up \u and \x. It should be \u plus four hexadecimal digits,
so \u00a0, not \uA0 or something. But what I couldn't find on the web
(especially not in ECMA's PDF) was the \x notation. How can it be used?
Is it decimal or hexadecimal?

Bergi
 
Martin Honnen

Andreas said:
My next question is about the notation. As Thomas already complained,
I've messed up \u and \x. It should be \u plus four hexadecimal digits,
so \u00a0, not \uA0 or something. But what I couldn't find on the web
(especially not in ECMA's PDF) was the \x notation. How can it be used?
Is it decimal or hexadecimal?

Mozilla's JavaScript guide in
https://developer.mozilla.org/en/JavaScript/Guide/Values,_Variables,_and_Literals#String_Literals
defines

\xXX The character with the Latin-1 encoding specified by the two
hexadecimal digits XX between 00 and FF. For example, \xA9 is the
hexadecimal sequence for the copyright symbol.
 
Jukka K. Korpela

Andreas said:
But what I couldn't find on the web
(especially not in ECMA's PDF) was the \x notation. How can it be used?
Is it decimal or hexadecimal?

It's defined in clause 7.8.4 (String Literals) of ECMA-262. It can be hard
to find, since the obvious way of searching for "\x" won't work (the
standard does not give any example of the construct).

The notation is \x followed by exactly two hexadecimal digits, e.g. \xA0,
and it can be used inside a string literal. As there are two hex digits in
it, it can be used for characters in the range from U+0000 to U+00FF. Some
books say that the notation is valid for ASCII characters (range U+0000 to
U+007F) only; this may or may not reflect some old implementations.
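
A small illustration of the two escape forms side by side (a sketch; the variable names are arbitrary):

```javascript
// \x takes exactly two hexadecimal digits, \u exactly four;
// for U+0000 to U+00FF both notations denote the same character.
var viaX = "\xA0";
var viaU = "\u00A0";

console.log(viaX === viaU);                   // true
console.log(viaX.charCodeAt(0).toString(16)); // "a0"
console.log("\xA9" === "\u00A9");             // true (copyright sign)

// The digits are hexadecimal, not decimal: \x41 is "A" (decimal 65).
console.log("\x41" === String.fromCharCode(65)); // true
```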
 
Thomas 'PointedEars' Lahn

Martin said:
Mozilla's JavaScript guide in
https://developer.mozilla.org/en/JavaScript/Guide/Values,_Variables,_and_Literals#String_Literals
defines

\xXX The character with the Latin-1 encoding specified by the two
hexadecimal digits XX between 00 and FF. For example, \xA9 is the
hexadecimal sequence for the copyright symbol.

However, this is a now-outdated statement that originates in the Netscape
Client-Side JavaScript 1.2 Guide, from a time when Netscape JavaScript did
not yet support Unicode.

In current implementations (of ECMAScript Edition 3 and above), like
Netscape/Mozilla.org JavaScript 1.5 and above, it is the representation of
the Unicode 3.0 (and above) character with that hexadecimal code point. This
just happens to be the same character as in ISO-8859-1, not Latin-1 (ISO/IEC
8859-1), with that code point, because Unicode 3.0 (and above) is a superset
of ISO-8859-1.


PointedEars
 
Thomas 'PointedEars' Lahn

Jukka said:
It's defined in clause 7.8.4 (String Literals) of ECMA-262. It can be hard
to find, since the obvious way of searching for "\x" won't work (the
standard does not give any example of the construct).

The notation is \x followed by exactly two hexadecimal digits, e.g. \xA0,
and it can be used inside a string literal. As there are two hex digits in
it, it can be used for characters in the range from U+0000 to U+00FF. Some
books say that the notation is valid for ASCII characters (range U+0000 to
U+007F) only; this may or may not reflect some old implementations.

It does, living proof again that many, if not most, books and even FAQs
about the topic are hopelessly outdated (and erroneous). Nevertheless, for
full backwards compatibility it would be required to use \x… only for
characters that can be represented with US-ASCII. We have discussed this
recently.


PointedEars
 
Jukka K. Korpela

Thomas said:
In current implementations (of ECMAScript Edition 3 and above), like
Netscape/Mozilla.org JavaScript 1.5 and above, it is the
representation of the Unicode 3.0 (and above) character with that
hexadecimal code point. This just happens to be the same character as
in ISO-8859-1, not Latin-1 (ISO/IEC 8859-1), with that code point,
because Unicode 3.0 (and above) is a superset of ISO-8859-1.

If you wish to nitpick, you should prepare to get outnitpicked if you don't
have a solid ground.

ISO-8859-1, which is commonly called ISO Latin-1, does _not_ coincide with
Unicode in the entire range under discussion. Part of the range is reserved
for control characters in Unicode but left undefined in ISO-8859-1.

So "\x80" refers to a control character, though its meaning and effect is
undefined. If ISO-8859-1 were used as the basis, "\x80" as such would be
undefined.
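
In code, the point is that \x80 yields the code unit 0x80, i.e. a Unicode C1 control character, rather than whatever a legacy 8-bit code page puts at that position. The comparison with windows-1252, where byte 0x80 is the euro sign, is an added illustration and not part of the argument above:

```javascript
var c = "\x80";

console.log(c.charCodeAt(0) === 0x80); // true: U+0080, a C1 control character
console.log(c === "\u0080");           // true: same character via \u
console.log(c === "\u20AC");           // false: not the euro sign
```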
 
Thomas 'PointedEars' Lahn

Jukka said:
If you wish to nitpick, you should prepare to get outnitpicked if you
don't have a solid ground.

In German(y) we have a saying: "Wenn nur das Wörtchen 'wenn' nicht wäre …"
which roughly translates to "If only there were not the word 'if' …".
ISO-8859-1, which is commonly called ISO Latin-1, does _not_ coincide with
Unicode in the entire range under discussion. Part of the range is
reserved for control characters in Unicode but left undefined in
ISO-8859-1.

Control characters from code points 0x00 to 0x1F, and 0x7F to 0x9F, are not
part of _ISO/IEC 8859-1_, which also has the name Latin-1 (short for "Latin
alphabet No. 1" from ISO/IEC 8859-1:1998, Part 1). However, they *are* part
of ISO-8859-1, commonly but mistakenly called Latin-1, and with the
exception of U+0080 and U+0081 they are defined in Unicode (6.0) accordingly
(and therefore in conforming implementations of ECMAScript [any edition]).
U+0080 and U+0081 are defined in Unicode (6.0) as control characters without
attached meaning. See also the Unicode (6.0) Code Charts.
So "\x80" refers to a control character, though its meaning and effect is
undefined. If ISO-8859-1 were used as the basis, "\x80" as such would be
undefined.

Your evidence is sound, but your logic is flawed.


PointedEars
 
Dr J R Stockton

Fri said:
My next question is about the notation. As Thomas already complained,
I've messed up \u and \x. It should be \u plus four hexadecimal digits,
so \u00a0, not \uA0 or something. But what I couldn't find on the web
(especially not in ECMA's PDF) was the \x notation. How can it be used?
Is it decimal or hexadecimal?

In practice, within a string, and not preceded by a \, the sequence \x
is an abbreviation for \u00. Two hex digits are expected to follow.

It is probably safe to assume that \S, which should mean the opposite of
\s, does mean the opposite of \s. But, when testing \s, it would be
wrong, though probably not misleading, to test it only against the
characters that it is supposed to match. That is illustrated by the case
of \w in (IIRC) IE 6, 7 and 8, but not 9: in those, although \W is indeed
the opposite of \w, \w matches not only A-Z a-z 0-9 _ but also a dotless
lower-case i, a matter of importance if working in Turkish.
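
Both points can be checked by brute force over all 16-bit code units rather than only the expected ones. The following is a sketch, and the helper name complementMismatches is made up here:

```javascript
// Code units for which the two patterns are NOT exact opposites
// (both match, or neither matches). Empty if one is the complement of the other.
function complementMismatches(lower, upper) {
  var bad = [];
  for (var i = 0; i < 0x10000; i++) {
    var ch = String.fromCharCode(i);
    if (lower.test(ch) === upper.test(ch)) bad.push(i);
  }
  return bad;
}

var sMismatches = complementMismatches(/\s/, /\S/);
var wMismatches = complementMismatches(/\w/, /\W/);

// Everything \w matches outside A-Z a-z 0-9 _ ; a dotless i (U+0131)
// would show up here in an engine with the quirk described above.
var unexpectedWord = [];
for (var i = 0; i < 0x10000; i++) {
  var ch = String.fromCharCode(i);
  if (/\w/.test(ch) && !/[A-Za-z0-9_]/.test(ch)) unexpectedWord.push(i);
}

console.log(sMismatches.length, wMismatches.length, unexpectedWord);
```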

ASIDE : My ISP has been unwell, which has affected
news access and sometimes updating merlyn.
 
RobG

Andreas Bergmaier said:
is there a definition list for which characters match the RegExp /\s/

Only the ECMAScript specification.
(and followed by all implementations)?

Nope.

The general problem is that the ECMAScript specification states what
should be matched in terms of the Unicode specification, without
saying which version of Unicode to follow.
A more down-to-earth problem is that some browsers just get it wrong.
I'm currently looking for the
non-breaking-space \u00A0. I found some scripts that use [\s\uA0] in
RegExps, so I wondered whether this is really needed.

I believe some older IEs didn't match any non-ASCII whitespace (but that's
from memory - I only have IE 9 here, so I can't check).
A quick (and maybe dirty) test script:
function test(x) {
  var r = [];
  for (var i=0; i<0x10000; i++) // 65536
   if (x.exec(String.fromCharCode(i)))
    r.push(i);
  r = r.map(function(i){
   return "\\u"+i.toString(16)+" bzw \\x"+i.toString(10) +
    ": '"+String.fromCharCode(i)+"'"
  });
  return r.join("\n");
}

I've written code similar to that in the past to find differences.
Of course I didn't save a log of my results :(.

...
But the non-breaking space was in both lists. So I guess [\s\uA0] is
senseless? Are there any implementations that do something else? I'm
sorry I couldn't test IE.

Try IE 6 if you get a chance.

In IE 6, \s does not match non-breaking space. To run the test script
I commented out the r.map(...) part since IE doesn't support map. The
(decimal) list is:

9
10
11
12
13
32
 
RobG


I played a little more and replaced the map function:

\u0009 bzw \x9: ' '
\u000a bzw \x10: ' '
\u000b bzw \x11: ' '
\u000c bzw \x12: ' '
\u000d bzw \x13: ' '
\u0020 bzw \x32: ' '

Which agrees with what J R Stockton has.
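
A replacement along those lines might look like the sketch below, which builds the output in a plain loop instead of calling Array.prototype.map and zero-pads the hexadecimal value to four digits. It is not the code actually used for the output above. The name listMatches and the output format are made up for illustration.

```javascript
// List every 16-bit code unit matched by re, one per line,
// as a four-digit \u escape followed by the decimal value.
function listMatches(re) {
  var lines = [];
  for (var i = 0; i < 0x10000; i++) { // all 65536 UTF-16 code units
    if (re.test(String.fromCharCode(i))) {
      var hex = ("0000" + i.toString(16)).slice(-4);
      lines.push("\\u" + hex + " (decimal " + i + ")");
    }
  }
  return lines.join("\n");
}

console.log(listMatches(/\s/));
```

Labelling the second number as decimal avoids the earlier confusion between \x and base-10 values.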

IE 6 doesn't match lots of characters that other browsers do (such as
en, em and thin space), so maybe a full list should be used if
consistency is what is needed.

// List of characters matched as white space
var whiteSpace = '\u0009\u000a\u000b\u000c\u000d\u0020\u00a0' +
'\u1680\u180e\u2000\u2001\u2002\u2003\u2004' +
'\u2005\u2006\u2007\u2008\u2009\u200a\u200b' +
'\u2028\u2029\u202f\u205f\u3000';

var re = new RegExp('[' + whiteSpace + ']');

alert(re.test(' ')); // true
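
Building on the same idea, an explicit character class could also be used for trimming, so that the result does not depend on what a given engine's \s happens to match. This is a minimal sketch: the name trimWhiteSpace is made up here, and the list simply repeats the one above.

```javascript
var whiteSpaceChars = '\u0009\u000a\u000b\u000c\u000d\u0020\u00a0' +
                      '\u1680\u180e\u2000\u2001\u2002\u2003\u2004' +
                      '\u2005\u2006\u2007\u2008\u2009\u200a\u200b' +
                      '\u2028\u2029\u202f\u205f\u3000';

// Leading or trailing runs of the listed characters.
var trimRe = new RegExp(
  '^[' + whiteSpaceChars + ']+|[' + whiteSpaceChars + ']+$', 'g');

function trimWhiteSpace(s) {
  return String(s).replace(trimRe, '');
}

console.log('[' + trimWhiteSpace('\u00a0\u2003 foo bar \u3000\t') + ']'); // [foo bar]
```

Whether U+200B belongs in such a list is a judgment call, since only one of the tested browsers treated it as white space.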


I'm sure this has been discussed many times before.
 
Dr J R Stockton

In comp.lang.javascript message <4bfe6c41-b2ce-4cf5-81f9-0ebf3fc468f9@j9
g2000prj.googlegroups.com>, Wed, 13 Apr 2011 00:29:12, RobG
// List of characters matched as white space
var whiteSpace = '\u0009\u000a\u000b\u000c\u000d\u0020\u00a0' +
'\u1680\u180e\u2000\u2001\u2002\u2003\u2004' +
'\u2005\u2006\u2007\u2008\u2009\u200a\u200b' +
'\u2028\u2029\u202f\u205f\u3000';

That does not include
0085 next line
feff zero-width no-break space
which I was told are matched in IE 9 beta.

Final Draft ECMA 5 apparently required 12 or more :
{ 0009 000a 000b 000c 000d 0020
0085 00a0 200b 2028 2029 feff }
 
