Regular Expressions: Inconsistent behaviours with non breaking space

Markus

Hello

While working on some white space trimming methods I found that the "Non
breaking space" character (ASCII 160, nbsp, \u00A0) is treated
differently across browsers. A short test shows: MSIE and Safari
treat it as non-whitespace, FF and Opera treat it as whitespace.

This is the test code (the space between "hel" and "lo" consists of
three nbsp characters, typed on Windows with ALT + 160):

var teststr = "hel lo";
alert (teststr.replace(/\s+/g, ""));

- MSIE and Safari alert "hel lo"
- Firefox and Opera alert "hello"

I found some discussions about this. My personal opinion is that nbsp
should be treated as non-whitespace, for 2 reasons:
- In HTML rendering it is treated like this (every single occurrence of
the character is rendered, unlike other whitespace, which is collapsed to
one space); treating it differently in JavaScript regular expressions is
not consistent
- There are easy ways to have nbsp treated along with whitespace (as
shown in the trim function provided in this group's FAQ,
http://www.jibbering.com/faq/index.html#FAQ4_16), but if nbsp is treated
as whitespace by default, it is less trivial to get it preserved, if
this behaviour is desired.

Now, this is actually my question - the expected behaviour in my case is
to preserve non breaking spaces. Is there a possibility to exclude the
white space character explicitly from the \s class, or do I have to
replace \s by an enumeration of all whitespace characters? Such as:

alert (teststr.replace(/[ \f\n\r\t\v]+/g, ""));

Thanks for your comments!
 
Markus

Markus said:
Now, this is actually my question - the expected behaviour in my case is
to preserve non breaking spaces. Is there a possibility to exclude the
white space character explicitly from the \s class, or do I have to

The non breaking space character of course, not the white space
character; sorry for the typo.
 
Bart Van der Donck

Markus said:
While working on some white space trimming methods I found that the "Non
breaking space" character (ASCII 160, nbsp, \u00A0) is treated
differently across browsers. A short test shows: MSIE and Safari
treat it as non-whitespace, FF and Opera treat it as whitespace.

\u00A0 is not ASCII anymore; it stops at code point 128. I would
generally avoid non-ASCII space characters, except maybe as record/
line/... separators in data files. But ASCII already has many control
characters of its own.
This is the test code (the space between "hel" and "lo" consists of
three nbsp characters, typed on Windows with ALT + 160):

That should be ALT+0160; ALT+160 gives something else.
var teststr = "hel   lo";
alert (teststr.replace(/\s+/g, ""));

- MSIE and Safari alert "hel   lo"
- Firefox and Opera alert "hello"

I found some discussions about this. My personal opinion is that nbsp
should be treated as non-whitespace, for 2 reasons:
- In HTML rendering it is treated like this (every single occurrence of
the character is rendered, unlike other whitespace, which is collapsed to
one space); treating it differently in JavaScript regular expressions is
not consistent
- There are easy ways to have nbsp treated along with whitespace (as
shown in the trim function provided in this group's FAQ,
http://www.jibbering.com/faq/index.html#FAQ4_16), but if nbsp is treated
as whitespace by default, it is less trivial to get it preserved, if
this behaviour is desired.

Now, this is actually my question - the expected behaviour in my case is
to preserve non breaking spaces. Is there a possibility to exclude the
white space character explicitly from the \s class, or do I have to
replace \s by an enumeration of all whitespace characters? Such as:

alert (teststr.replace(/[ \f\n\r\t\v]+/g, ""));

'[ \f\n\r\t\v]' is identical to '\s'; so this expression would
*normally* show the same result as \s+. Browsers cannot be trusted
in how they categorize \u00a0; the same goes for other 'exotic' space
and newline characters. I would do something like this:

var teststr = 'hel\u00a0\u00a0\u00a0lo';
// Swap nbsp for a placeholder, strip whitespace, then restore nbsp.
teststr = ((teststr.replace(/\u00a0/g, '\u48ef'))
.replace(/\s/g, ''))
.replace(/\u48ef/g, '\u00a0');

Cheers,
 
RobG

Hello

While working on some white space trimming methods I found that the "Non
breaking space" character (ASCII 160, nbsp, \u00A0) is treated
differently across browsers. A short test shows: MSIE and Safari
treat it as non-whitespace, FF and Opera treat it as whitespace.

According to the ECMAScript spec, Firefox and Opera are correct, but
according to the HTML 4 spec, IE and Safari are correct (see below).
This is the test code (the space between "hel" and "lo" consists of
three nbsp characters, typed on Windows with ALT + 160):

var teststr = "hel lo";
alert (teststr.replace(/\s+/g, ""));

- MSIE and Safari alert "hel lo"
- Firefox and Opera alert "hello"

I found some discussions about this. My personal opinion is that nbsp
should be treated as non-whitespace, for 2 reasons:
- In HTML rendering it is treated like this (every single occurrence of
the character is rendered, unlike other whitespace, which is collapsed to
one space); treating it differently in JavaScript regular expressions is
not consistent
- There are easy ways to have nbsp treated along with whitespace (as
shown in the trim function provided in this group's FAQ,
http://www.jibbering.com/faq/index.html#FAQ4_16), but if nbsp is treated
as whitespace by default, it is less trivial to get it preserved, if
this behaviour is desired.

Now, this is actually my question - the expected behaviour in my case is
to preserve non breaking spaces. Is there a possibility to exclude the
white space character explicitly from the \s class, or do I have to
replace \s by an enumeration of all whitespace characters? Such as:

alert (teststr.replace(/[ \f\n\r\t\v]+/g, ""));

If you want it treated consistently, then you will have to do
something like that.

According to the W3C HTML 4 spec, there are 5 white space characters:

ASCII space (&#x0020;)
ASCII tab (&#x0009;)
ASCII form feed (&#x000C;)
Zero-width space (&#x200B;)
Linebreaks

<URL: http://www.w3.org/TR/html4/struct/text.html#whitespace >

which excludes non-breaking and fixed-width spaces (and probably
others). However, the ECMAScript language spec 7.2 says whitespace
is:

\u0009                Tab                                  <TAB>
\u000B                Vertical Tab                         <VT>
\u000C                Form Feed                            <FF>
\u0020                Space                                <SP>
\u00A0                No-break space                       <NBSP>
Other category "Zs"   Any other Unicode "space separator"  <USP>


That last one creates a much broader class of characters than the HTML
spec.
 
Thomas 'PointedEars' Lahn

RobG said:
According to the ECMAScript spec, Firefox and Opera are correct, but
according to the HTML 4 spec, IE and Safari are correct (see below).

Guess which one matters here ...
[...]
According to the W3C HTML 4 spec, there are 5 white space characters:

The HTML 4.01 Specification is irrelevant regarding ECMAScript-defined
Regular Expressions.


PointedEars
 
RobG

Guess which one matters here ...

The OP mentioned white space in HTML and in JavaScript. Supplying
information which points out that the two specifications have different
opinions on what constitutes white space is relevant, though it may
not matter to you.
[...]
According to the W3C HTML 4 spec, there are 5 white space characters:

The HTML 4.01 Specification is irrelevant regarding ECMAScript-defined
Regular Expressions.

ECMAScript is irrelevant without a host environment. How it interacts
with, and differs from, that environment is relevant.
 
Thomas 'PointedEars' Lahn

RobG said:
The OP mentioned white space in HTML and in JavaScript. Supplying
information which points out that the two specifications have different
opinions on what constitutes white space is relevant, though it may
not matter to you.

Nonsense. The OP used an ECMAScript-defined RegExp to match string input.
Whether that string input comes from an HTML DOM or from somewhere else is
entirely irrelevant regarding the behavior of the matcher. What matters is
only whether the script engine used is ECMAScript-compliant (AIUI,
section 2 does not allow JScript's and JavaScriptCore's deviation) or not.
[...]
According to the W3C HTML 4 spec, there are 5 white space characters:
The HTML 4.01 Specification is irrelevant regarding ECMAScript-defined
Regular Expressions.

ECMAScript is irrelevant without a host environment. How it interacts
with, and differs from, that environment is relevant.

You must be kidding.


PointedEars
 
Markus

Bart said:
That should be ALT+0160; ALT+160 gives something else.

Yes, that is a typo in my posting.
alert (teststr.replace(/[ \f\n\r\t\v]+/g, ""));

'[ \f\n\r\t\v]' is identical to '\s'

That's what I read, too... but it contradicts what RobG states
about the ECMAScript spec.
I would do something like this:

var teststr = 'hel\u00a0\u00a0\u00a0lo';
teststr = ((teststr.replace(/\u00a0/g, '\u48ef'))
.replace(/\s/g, ''))
.replace(/\u48ef/g, '\u00a0');

Thank you! I assume that, to be safe, it will be necessary to check
for actual occurrences of \u48ef beforehand, in case somebody uses the
application with Chinese.
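
A variant without any placeholder would sidestep that check entirely. The following is only a sketch, assuming a negated character class is acceptable: `[^\S\u00A0]` matches every character that `\s` matches except U+00A0. The function name `stripWhitespaceKeepNbsp` is made up for illustration.

```javascript
// Hypothetical helper: remove everything \s matches, except U+00A0.
// [^\S\u00A0] reads as "neither non-whitespace nor nbsp",
// which is the \s set with the no-break space taken out.
function stripWhitespaceKeepNbsp(str) {
  return str.replace(/[^\S\u00A0]+/g, '');
}

var teststr = 'hel \t\u00A0\u00A0\u00A0\n lo';
var result = stripWhitespaceKeepNbsp(teststr);
// The three no-break spaces survive; space, tab and line feed are removed.
```

Because U+00A0 is excluded explicitly, the outcome for nbsp should not depend on whether the engine's `\s` includes it.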
 
Markus

RobG said:
According to the ECMAScript spec, Firefox and Opera are correct, but
according to the HTML 4 spec, IE and Safari are correct (see below).

This is really interesting! I was aware that ECMAScript and HTML are
different languages with different specs, but I am surprised by the
fact that the specs are contradictory in an overlapping field. Thank you
for your clarifications.
 
Markus

Thomas said:
[...]
According to the W3C HTML 4 spec, there are 5 white space characters:
The HTML 4.01 Specification is irrelevant regarding ECMAScript-defined
Regular Expressions.
ECMAScript is irrelevant without a host environment. How it interacts
with, and differs from, that environment is relevant.

You must be kidding.

If I understand you correctly, you state that ECMAScript is a
stand-alone scripting language, and its use with (X)HTML files is
secondary. This surprises me, as all I can read about ECMAScript in this
group's FAQ is about this type of use. Can you give some examples of
other fields where ECMAScript is used, and where the actual definition
of whitespace makes sense?

Anyway, as a scripting language should serve its purpose and not
vice versa (I assume you agree with that), it might be a good idea for the
further development of ECMAScript to introduce various modes according
to the environments of use. That way, the definitions of things such as
whitespace could be adapted to the environment by setting the mode, for
example with:

document.setEnvironmentMode("HTML4");

(Of course browsers would likely do this by default if an HTML4 doctype
is present.) After that, whitespace definition matches the HTML4
whitespace definition.
 
Thomas 'PointedEars' Lahn

Markus said:
Thomas said:
[...] According to the W3C HTML 4 spec, there are 5 white space
characters:
The HTML 4.01 Specification is irrelevant regarding
ECMAScript-defined Regular Expressions.
ECMAScript is irrelevant without a host environment. How it
interacts with, and differs from, that environment is relevant.
You must be kidding.

If I understand you correctly, you state that ECMAScript is a stand-alone
scripting language,

You misunderstand; to state such a thing would be nonsense. Considering
Rob's previous contributions here, it is not his statement that I have to
consider as a bad joke but his using it in this context, as an argument to
justify the observed behavior in Microsoft JScript and Apple JavaScriptCore,
which is clearly non-compliant behavior or IOW an implementation bug.
and its use with (X)HTML files is secondary.

No, I would think it to be self-evident that the most important field of
application for ECMAScript implementations is the scripting of (X)HTML user
agents.
This surprises me, as all I can read about ECMAScript in this group's FAQ
is about this type of use.

So it should not come as a surprise that the most frequently asked questions
are about this field, and that the FAQ even asserts that if a question is
asked here without mentioning a specific environment, it is/should be
assumed by regulars that such a user agent is the target platform.
Can you give some examples of other fields where ECMAScript is used,

There are a number of ECMAScript implementations that few people know
exist, or know to be such implementations. This includes
plain ECMAScript as supported when scripting SVG, the implementation for
scripting Adobe PDF, and of course Macromedia/Adobe Flash ActionScript and
JScript (.NET) in ASP (.NET) on IIS and compatibles.
and where the actual definition of whitespace makes sense?

The actual definition of whitespace as found in the ECMAScript
Language Specification, Edition 3 Final, usually makes sense.
Anyway, as a scripting language should serve its purpose and not
vice versa (I assume you agree with that), it might be a good idea for the
further development of ECMAScript to introduce various modes according
to the environments of use. That way, the definitions of things such as
whitespace could be adapted to the environment by setting the mode, for
example with:

document.setEnvironmentMode("HTML4");

(Of course browsers would likely do this by default if an HTML4 doctype
is present.) After that, whitespace definition matches the HTML4
whitespace definition.

Since the `document' object reference is not a language feature it would be
up to the implementors of the respective DOM to implement the
setEnvironmentMode() method. However, I doubt this is ever going to be
considered, for three important reasons:

1. What if there is an HTML5 standard that defines something different?
What about XHTML, XML, SVG, RSS, MathML, CML, and all the other languages
that may at some point define whitespace differently?

2. DOM features should not modify the workings of any programming language.
A DOM should be, and all DOM implementations currently are, a
language-independent API.

3. That would be breaking a butterfly on a wheel. ECMAScript Regular
Expressions allow you to define a character class that suits your needs:

var myWhitespace = /[ \t]/;
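
For the trimming case from the original post, such a class could be used roughly as in the following sketch. The function name `trimKeepNbsp` and the exact set of characters are assumptions chosen for illustration; the class lists ordinary ASCII whitespace and leaves U+00A0 out on purpose.

```javascript
// Illustrative class source: space, tab, line feed, carriage return, form feed.
// U+00A0 is intentionally not listed, so no-break spaces are preserved.
var myWhitespace = '[ \\t\\n\\r\\f]';
var leading = new RegExp('^' + myWhitespace + '+');
var trailing = new RegExp(myWhitespace + '+$');

// Hypothetical helper: trim only the characters listed in myWhitespace.
function trimKeepNbsp(str) {
  return str.replace(leading, '').replace(trailing, '');
}

var trimmed = trimKeepNbsp(' \t\u00A0hel lo\u00A0\n ');
// Trimming stops at the first and last no-break space.
```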


HTH

PointedEars
 
Dr J R Stockton

Markus said:
I found some discussions about this. My personal opinion is that nbsp
should be treated as non-whitespace, for 2 reasons:

No; nbsp should be treated as whitespace, because it is white and it is
space.

Therefore, another name should be used for the subsets of whitespace
which cannot be reduced, without change of meaning, to a single space or
a single newline.

There is already the semantic problem that in some languages, such as
Delphi, newline and space are (outside strings) equivalent, but in
others, such as JavaScript, they are not.
 
RobG

Markus said:
Thomas said:
[...] According to the W3C HTML 4 spec, there are 5 white space
characters:
The HTML 4.01 Specification is irrelevant regarding
ECMAScript-defined Regular Expressions.
ECMAScript is irrelevant without a host environment. How it
interacts with, and differs from, that environment is relevant.
You must be kidding.

You must be trolling.

You misunderstand; to state such a thing would be nonsense.

So it is neither stand-alone nor dependent on a host environment -
interesting.

Considering
Rob's previous contributions here, it is not his statement that I have to
consider as a bad joke but his using it in this context, as an argument to
justify the observed behavior in Microsoft JScript and Apple JavaScriptCore,
which is clearly non-compliant behavior or IOW an implementation bug.

I did not at any point attempt to justify the behaviour of any
browser. The point is that different environments have different
concepts of what is or isn't white space; therefore, if consistent
behaviour is required, it is best to define it explicitly and deal with
the differences where they occur.

You are developing a habit of misrepresenting what is posted purely
for the sake of argument. Please stop doing that.
 
Lasse Reichstein Nielsen

RobG said:
However, the ECMAScript language spec 7.2 says whitespace
is:

\u0009                Tab                                  <TAB>
\u000B                Vertical Tab                         <VT>
\u000C                Form Feed                            <FF>
\u0020                Space                                <SP>
\u00A0                No-break space                       <NBSP>
Other category "Zs"   Any other Unicode "space separator"  <USP>

That is whitespace wrt. the script grammar.
The relevant part of the specification here is in the definition of
regular expressions, in particular the semantics of the CharacterClassEscape
token:
---
15.10.2.12 CharacterClassEscape
....
The production CharacterClassEscape :: s evaluates by returning the
set of characters containing the characters that are on the right-hand
side of the WhiteSpace (section 7.2) or LineTerminator (section 7.3)
productions.
---
I.e., a compliant ECMAScript implementation should consider all
of the above white-space characters *and* the line terminators of
section 7.3:

Code Point Value   Name                  Formal Name
\u000A             Line Feed             <LF>
\u000D             Carriage Return       <CR>
\u2028             Line separator        <LS>
\u2029             Paragraph separator   <PS>

And I agree that the HTML specification is irrelevant to an
implementation of the ECMAScript specification. This is a point where
that specification is absolutely clear, and does not give room for a
conformant implementation to differ.

/L
 
Thomas 'PointedEars' Lahn

RobG said:
Markus said:
Thomas 'PointedEars' Lahn wrote:
[...] According to the W3C HTML 4 spec, there are 5 white
space characters:
The HTML 4.01 Specification is irrelevant regarding
ECMAScript-defined Regular Expressions.
ECMAScript is irrelevant without a host environment. How it
interacts with, and differs from, that environment is relevant.
You must be kidding.

You must be trolling.

Again, just for you: The HTML 4.01 Specification is entirely irrelevant
regarding a character class in an ECMAScript Regular Expression. You have
stated or implied that it is, because that Regular Expression is used in a
host environment that uses HTML. That reasoning is utterly wrong; I have
corrected that and explained why it is wrong. Period.
So it is neither stand-alone nor dependent on a host environment -
interesting.

You miss the point.
I did not at any point attempt to justify the behaviour of any browser.
The point is that different environments have different concepts of what
is or isn't white space; therefore, if consistent behaviour is required,
it is best to define it explicitly and deal with the differences where
they occur.

Sadly, you miss the point again.
You are developing a habit of misrepresenting what is posted purely for
the sake of argument. Please stop doing that.

If you were reasonable, you would see that an apology from you is in
order now.


Score adjusted

PointedEars
 
Thomas 'PointedEars' Lahn

Thomas said:
RobG said:
Markus wrote:
Thomas 'PointedEars' Lahn wrote:
[...] According to the W3C HTML 4 spec, there are 5 white
space characters:
The HTML 4.01 Specification is irrelevant regarding
ECMAScript-defined Regular Expressions.
ECMAScript is irrelevant without a host environment. How it
interacts with, and differs from, that environment is relevant.
You must be kidding.
You must be trolling.

Again, just for you: The HTML 4.01 Specification is entirely irrelevant
regarding a character class in an ECMAScript Regular Expression. You have
stated or implied that it is, because that Regular Expression is used in a
                            ^relevant
host environment that uses HTML. That reasoning is utterly wrong; I have
corrected that and explained why it is wrong. Period.
 
Richard Cornford

Bart Van der Donck wrote:
\u00A0 is not ASCII anymore; it stops at code point 128.
<snip>

127, as anything bigger would require 8 bits (so maybe 'stops before
128').
That should be ALT+0160; ALT+160 gives something else.

I would have been happiest to have seen the test performed with hex or
Unicode escape sequences in the string (as that removes any character
set/encoding issues from the picture).
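
Such a test could look like the following sketch, which writes the string with Unicode escapes so that neither ALT codes nor the source file's encoding can interfere. The variable names are illustrative only, and the comments state what each outcome would mean rather than predicting one.

```javascript
// Three no-break spaces written as escapes instead of typed characters.
var teststr = 'hel\u00A0\u00A0\u00A0lo';
var stripped = teststr.replace(/\s+/g, '');

// true  -> the engine's \s includes U+00A0 (what ECMAScript specifies)
// false -> the engine leaves U+00A0 alone (the behaviour reported for MSIE and Safari)
var nbspIsWhitespace = (stripped === 'hello');
```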

'[ \f\n\r\t\v]' is identical to '\s';
<snip>

Internet Explorer/JScript has a bug where the ECMAScript vertical tab
escape sequence '\v' is interpreted as a literal 'v' character. It is
safer to use hex or Unicode escape sequences when vertical tab is the
intended meaning.
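
Following that advice, the enumerated class from the original question could be written with a Unicode escape for vertical tab. This is a sketch only; `asciiWhitespace` is an illustrative name, and the class deliberately covers just the six ASCII characters, not the full `\s` set.

```javascript
// \u000B stands in for \v, so an engine that misreads \v as the
// letter "v" cannot affect the result.
var asciiWhitespace = /[ \f\n\r\t\u000B]+/g;

var sample = 'very \u000B\t vivid';
var cleaned = sample.replace(asciiWhitespace, '');
// Every letter "v" is kept; only space, vertical tab and tab are removed.
```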

The official set of characters used in the interpretation of '\s' in a
regular expression consists of all the language's white space characters
and all of its line terminators. The last time I was up to date with Unicode
that meant all of this set:-

[white space]
cp = 9 : \u0009 <control>[ASCII Tab] Cc
cp = 11 : \u000B <control>[ASCII Vertical Tab] Cc
cp = 12 : \u000C <control>[ASCII Form Feed] Cc
cp = 32 : \u0020 SPACE Zs
cp = 160 : \u00A0 NO-BREAK SPACE Zs
cp = 5760 : \u1680 OGHAM SPACE MARK Zs
cp = 6158 : \u180E MONGOLIAN VOWEL SEPARATOR Zs
cp = 8192 : \u2000 EN QUAD Zs
cp = 8193 : \u2001 EM QUAD Zs
cp = 8194 : \u2002 EN SPACE Zs
cp = 8195 : \u2003 EM SPACE Zs
cp = 8196 : \u2004 THREE-PER-EM SPACE Zs
cp = 8197 : \u2005 FOUR-PER-EM SPACE Zs
cp = 8198 : \u2006 SIX-PER-EM SPACE Zs
cp = 8199 : \u2007 FIGURE SPACE Zs
cp = 8200 : \u2008 PUNCTUATION SPACE Zs
cp = 8201 : \u2009 THIN SPACE Zs
cp = 8202 : \u200A HAIR SPACE Zs
cp = 8239 : \u202F NARROW NO-BREAK SPACE Zs
cp = 8287 : \u205F MEDIUM MATHEMATICAL SPACE Zs
cp = 12288 : \u3000 IDEOGRAPHIC SPACE Zs

[line terminators]
cp = 10 : \u000A <control>[ASCII Line Feed] Cc
cp = 13 : \u000D <control>[ASCII Carriage Return] Cc
cp = 8232 : \u2028 LINE SEPARATOR Zl
cp = 8233 : \u2029 PARAGRAPH SEPARATOR Zp
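
If consistent behaviour across engines matters more than brevity, the list above could be turned into an explicit character class instead of relying on `\s`. The following is a sketch under that assumption; the arrays and the helper names `esc` and `buildClass` are invented for illustration, and the code points are simply copied from the list, which reflects the Unicode version current when it was compiled.

```javascript
// Code points copied from the list above; a range is written as [first, last].
var whiteSpace = [0x0009, 0x000B, 0x000C, 0x0020, 0x00A0, 0x1680, 0x180E,
                  [0x2000, 0x200A], 0x202F, 0x205F, 0x3000];
var lineTerminators = [0x000A, 0x000D, 0x2028, 0x2029];

// Format one code point as a \uXXXX escape for a RegExp source string.
function esc(cp) {
  var hex = cp.toString(16).toUpperCase();
  while (hex.length < 4) { hex = '0' + hex; }
  return '\\u' + hex;
}

// Hypothetical helper: build a global "one or more" RegExp from the entries.
function buildClass(entries) {
  var parts = [];
  for (var i = 0; i < entries.length; i++) {
    var e = entries[i];
    parts.push(typeof e === 'number' ? esc(e) : esc(e[0]) + '-' + esc(e[1]));
  }
  return new RegExp('[' + parts.join('') + ']+', 'g');
}

var explicitWhitespace = buildClass(whiteSpace.concat(lineTerminators));
var out = 'a\u00A0b\u2003c\u2028d\te'.replace(explicitWhitespace, '');
```

To preserve no-break spaces with this approach, `0x00A0` would simply be left out of the `whiteSpace` array.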

Richard.
 
dhtml

Bart Van der Donck wrote:

Markus wrote:

[white space]
cp = 9 : \u0009 <control>[ASCII Tab] Cc
cp = 11 : \u000B <control>[ASCII Vertical Tab] Cc
cp = 12 : \u000C <control>[ASCII Form Feed] Cc
cp = 32 : \u0020 SPACE Zs
cp = 160 : \u00A0 NO-BREAK SPACE Zs
cp = 5760 : \u1680 OGHAM SPACE MARK Zs
cp = 6158 : \u180E MONGOLIAN VOWEL SEPARATOR Zs
cp = 8192 : \u2000 EN QUAD Zs
cp = 8193 : \u2001 EM QUAD Zs
cp = 8194 : \u2002 EN SPACE Zs
cp = 8195 : \u2003 EM SPACE Zs
cp = 8196 : \u2004 THREE-PER-EM SPACE Zs
cp = 8197 : \u2005 FOUR-PER-EM SPACE Zs
cp = 8198 : \u2006 SIX-PER-EM SPACE Zs
cp = 8199 : \u2007 FIGURE SPACE Zs
cp = 8200 : \u2008 PUNCTUATION SPACE Zs
cp = 8201 : \u2009 THIN SPACE Zs
cp = 8202 : \u200A HAIR SPACE Zs
cp = 8239 : \u202F NARROW NO-BREAK SPACE Zs
cp = 8287 : \u205F MEDIUM MATHEMATICAL SPACE Zs
cp = 12288 : \u3000 IDEOGRAPHIC SPACE Zs

[line terminators]
cp = 10 : \u000A <control>[ASCII Line Feed] Cc
cp = 13 : \u000D <control>[ASCII Carriage Return] Cc
cp = 8232 : \u2028 LINE SEPARATOR Zl
cp = 8233 : \u2029 PARAGRAPH SEPARATOR Zp

That's an impressive list to memorize!
 
