"i" character regexp madness

V

VK

Honestly, regular expressions beyond everyday trivia have always been
kind of mysterious to me, but this one really drove me nuts with
false positives.
Basically I want to filter out any strings containing Unicode
characters from \u0100 up to \uFFFF. I thought that

new RegExp('[\u0100-\uFFFF]+','i')

would do it, but it kept giving false positives for Basic Latin
strings, so I found out that

window.alert( (new RegExp('[\u0100-\uFFFF]+','i')).test('a') ) // false
....
window.alert( (new RegExp('[\u0100-\uFFFF]+','i')).test('h') ) // false
window.alert( (new RegExp('[\u0100-\uFFFF]+','i')).test('i') ) // ! TRUE
window.alert( (new RegExp('[\u0100-\uFFFF]+','i')).test('j') ) // false
....

WTF? AFAIK "Latin Small Letter I" is Unicode \u0069 and browsers seem
to agree on that overall:
window.alert( ('i'.charCodeAt(0)) ) // 105 (dec 105 = hex 69)
Yet in a regexp context both FF3.5 and IE8, which I tested on,
attribute this char to some unknown, much higher code range. Any
insights (even if they come with insults :) ?
 
V

VK

ggg... Stupid RegExp wants \ escaped and stupid VK didn't provide it:

window.alert( (new RegExp('[\\u0100-\\uFFFF]+','i')).test('i') ) // false
 
M

Mike Duffy

... Any insights (even if coming with insults :) ?

ggg... Stupid RegExp wants \ escaped and stupid VK didn't provide it:

window.alert( (new RegExp('[\\u0100-\\uFFFF]+','i')).test('i') ) // false
Are you not insulting your prior investigation with this insight?

BTW, thanks for showing me this. Of course, I will not remember it in
the normal sense that we use the word, but the day will come when I will
fall into the same trap, and eventually succumb to the inexplicable urge to
double up the backslashes in my RegExp.
 
V

VK

ggg... Stupid RegExp wants \ escaped and stupid VK didn't provide it:

window.alert( (new RegExp('[\\u0100-\\uFFFF]+','i')).test('i') ) // false

Nope, false idea. Escaping the escape doesn't help. Actually I see
they finally fixed that, so the RegExp constructor doesn't care
whether it is double escaped or not. Anyway

<html>
<head>
<title>Test</title>
<meta http-equiv="Content-Type"
content="text/html; charset=iso-8859-1">
</head>
<body>
<script type="text/javascript">
var FILTER = new RegExp('[\\u0100-\\uFFFF]+','i');

var c = ['a','b','c','d','e','f','g',
'h','i','j','k','l','m','n',
'o','p','q','r','s','t','u',
'v','w','x','y','z'];

for (var i=0; i<c.length; i++) {
document.write(FILTER.test(c[i]) + '<br>');
}
// Still "i" is attributed to some
// unknowingly high code range
</script>
</body>
</html>

The madness is exhibited by (out of the UAs I have at hand right now):
NN 4.7 (possibly the origin of it)
FF 3.5.3
IE6, IE8 in any mode

UAs acting adequately, with
(new RegExp('[\\u0100-\\uFFFF]+','i')).test('i') == false
Google Chrome 3.0
Safari 4.0
Opera 10.0

In other words, it seems to come from the Browser Wars time, and the
engines affected are those of the wartime or based on wartime
algorithms.
 
V

VK

This way a cross-browser regexp sorting out strings by char code
ranges above Basic Latin should be like (going to use right now):
new RegExp('[[i[\u0100-\uFFFF]]+','i')

or more universally:
new RegExp('[[i[\uSTART-\uSTOP]]+','i')

It does look utterly silly but otherwise it fails on any IE and on
Gecko platforms.
 
R

Richard Cornford

On Oct 13, 3:47 pm, VK wrote:
var FILTER = new RegExp('[\\u0100-\\uFFFF]+','i');
// Still "i" is attributed to some
// unknowingly high code range

To quote the Unicode standard (versions 3 and 5 being the ones I
checked):-

| FFFF <not a character>
| the value FFFF is guaranteed not to be
| a Unicode character at all

And ECMA 262 says that if either 'CharSet' that defines the endpoints
of a range does not contain exactly one character then the result is a
syntax error (15.10.2.15, first step in the algorithm for
'CharacterRange'). A code point that is guaranteed not to be a
character could easily be regarded as not being "exactly one
character", so the above regular expression should not be expected to
be successful, and observing that it is not is just a detail.

Other words it seems to come from the Browser Wars time and
the engine affected are those of the wartime or based on
wartime algorithms.

Back to your usual game; making it all up off the top of your head.

Richard.
 
V

VK

A good try but no cigar :) First of all, it makes no sense that a code
range with an invalid upper border (let's take it as a hypothesis)
would work for all chars but "i". Secondly, at
http://javascript.myplus.org/bugs/i_regexp_2.html you see the range
moved down to "real chars" (to FFFD) and it doesn't change anything.
And last but not least, this type of regexp works with code
values and not with chars, so it doesn't care whether there are chars
with code FFFF or not. Claiming the opposite is the same nonsense as
declaring that code like
if (val < 1000) {...}
is invalid as such if val is guaranteed to be within, say, the 0 - 100
range.

So I applaud your knowledge of the specs (as I have many times
before), but that is not it.
 
V

VK

To save you the time of fighting over that one by moving the range up
and down: at
http://javascript.myplus.org/bugs/i_regexp_3.html the range is
narrowed to 0100-FF00 with no effect, and I guess no one disputes the
validity of this range.
 
V

VK

To save everyone's time: I filed a bug with Mozilla (#522021) and it
was just marked as a dup of last year's bug 440926
https://bugzilla.mozilla.org/show_bug.cgi?id=440926
So the twist is caused by 'LATIN CAPITAL LETTER I WITH DOT ABOVE'
(U+0130) and by the inability of the Mozilla and Microsoft teams to
resolve it in some reasonable way. I am now wondering how Safari,
Chrome and Opera solve it, as the #440926 discussion lists cases of
why this behavior cannot be changed.

So again: [\u0100-\uFFFF] fails on Gecko+IE, [i[\u0100-\uFFFF]] fixes
it; in other words, above Basic Latin any regexp filter has to be
whatever_you_actually_need + "i"

It's time for another useful FYI I guess.
 
R

Richard Cornford

A good try but no cigar :)

A good try at what? I was pointing out that your test is invalid and
so its results meaningless.
First of all it has no sense that a code
range with an invalid upper border (let's take it as a
hypotheses) would be working for all chars but "i".

It makes perfect sense for theoretically nonsensical code to produce
inconsistent and unexpected results.
Secondly at http://javascript.myplus.org/bugs/i_regexp_2.html you
see the range moved down to "real chars" (to FFFD)

| FFFD REPLACEMENT CHARACTER
| * used to replace an incoming character whose value is
| unknown or unrepresentable in Unicode.
| * compare the use of 001A as a control character to
| indicate the substitute function.

Does not sound much like a "real chars".
and it doesn't change anything.

No it wouldn't, but if you did narrow the test range you might narrow
in on the character that is case-insensitively matched with lower case
i. It is:-

| 0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
| ...
| * lowercase is 0069
And the last but not least this type of regexp is working
with code values and not with chars, so it doesn't care if
there are chars with code FFFF or not.
The opposite is the same nonsense as to declare that
code like
if (val < 1000) {...}
is invalid as such if val is guaranteed to be within say 0 -
100 range

So I am applauding to your specs knowledge (as I did
many times before), but it is not it.

What was "it" was trivial for anyone to work out. My comment was on
an aspect of your methodology. You have already suffered from failing
to minimise the complexity of your test code; using the RegExp
constructor where a regular expression literal would have done the job
without confusing you with the javascript string escaping issue. Did
you try your test without the - i - flag, and so observe that case
sensitivity was significant? Did you constrain the range to see how
narrowly the effect could be observed?
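
The narrowing steps asked about above might be sketched like this. It is only an illustration: the variable names are made up for the sketch, and the two case-insensitive results depend on the engine (that is the whole subject of this thread), so only the case-sensitive result is stated.

```javascript
// Regular expression literals avoid the string-escaping layer entirely.
// All names below are hypothetical, chosen for this sketch only.
var wide = /[\u0100-\uFFFF]/;            // no i flag
var wideIgnoreCase = /[\u0100-\uFFFF]/i; // the pattern from the original post
var narrowIgnoreCase = /\u0130/i;        // range constrained to one candidate

var results = {
  // 'i' is U+0069, outside the range, so this is false in any engine.
  caseSensitive: wide.test('i'),
  // Engine dependent: true in the engines reported as affected.
  ignoreCase: wideIgnoreCase.test('i'),
  // Engine dependent: shows whether U+0130 alone reproduces the effect.
  narrowed: narrowIgnoreCase.test('i')
};
```

Comparing the three values side by side is enough to tell whether the i flag is what makes the difference.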

It is all very simple, but instead you would rather get lost in
needless complexity and then make up stories about browser wars.

Richard.
 
V

VK

Richard said:
It is all very simple, but instead you would rather get lost in
needless complexity and then make up stories about browser wars.

Well, I am not a Mozilla or Microsoft employee who can spend the
entire day narrowing down their bug. I already did a lot of work
narrowing the false positive matches from a bunch of strings down to
a single character - the strings were topic titles of this group as
viewed from Google Groups, with ads using all kinds of Asian chars
mixed with Latin letters - so if you think that it was a simple task
then you are mistaken. So I narrowed it to a single char, and you
narrowed my results to a single code point, which still took you a
good while. So this is no case for accusing anyone of "laziness".


And about the Browser Wars I was right. It is actually a rule of
thumb: if both Gecko and IE act in the same way, then look back
to 1996-1998

https://bugzilla.mozilla.org/show_bug.cgi?id=440926
Comment #7 From Sebastian Helm 2008-12-03
Ah, the sins of the past! It seems this is still a remnant from the
old hack for Turkish, which has given localization all around the
world a lot of trouble, not just at Mozilla.
 
R

Richard Cornford

Well, I am not a Mozilla or Microsoft worker to spend the
entire day to narrow their bug.

There is no bug. As I have already pointed out:-

| 0130 LATIN CAPITAL LETTER I WITH DOT ABOVE
| ...
| * lowercase is 0069

- if the lower case representation of LATIN CAPITAL LETTER I WITH DOT
ABOVE is code point 0069, which is "i", then a case-insensitive match
for lower case "I" should match code point 0130.
I already did a big work by narrowing false positive
matches from a bunch of strings to a single character

By which you mean that you have already wasted a great deal of time on
a non-issue that you would not have had to face at all if you had been
rational enough to design a regular expression that was appropriate
for the task. It has never made any sense to attempt a case
insensitive match, by using the - i - flag, when you are defining
ranges of characters that include both upper and lower case characters
in the range (and in the set excluded with the range). Stop doing that
pointless and wasteful thing and there is no longer any issue at all.

... . So it is not a
case to accuse anyone in some "laziness".

No, it is stupidity that you are suffering from, or rather, the
inability to reason about the problem you are trying to solve.
And about the Browser Wars I was right.

Making things up does not make you right.
It is actually a rule of thumb: if both Gecko and IE are acting
in the same way, then look back to 1996-1998
<snip>

Somewhat circular reasoning there.

Richard.
 
V

VK

In an attempt to be nasty with me you are making yourself look like a
complete moron - which you are not, at least definitely not in
programming. The code point of LATIN CAPITAL LETTER I is \u0049 (dec
73) and it has nothing to do with LATIN CAPITAL LETTER I WITH DOT. It
is obvious to 3rd party programmers, it is obvious to Mozilla
programmers, it is obvious to the Mozilla programmers from the old
Mosaic-Netscape team who actually introduced this bug to speed up some
localizations in the pre-Unicode era. It is really silly to claim the
opposite when anyone can follow the provided link
https://bugzilla.mozilla.org/show_bug.cgi?id=440926 to read the
discussion and to test the provided samples.
Though you may stay bizarre to the end and report wrong behavior to
Safari, Chrome and Opera
 
L

Lasse Reichstein Nielsen

VK said:
Basically I want to sort out any strings, containing Unicode
characters from \u0100 and up to \uFFFF. I though that
new RegExp('[\u0100-\uFFFF]+','i')
would make it but it was keep giving false positives for Basic Latin
....
window.alert( (new RegExp('[\u0100-\uFFFF]+','i')).test('i') ) // !
TRUE
....
ggg... Stupid RegExp wants \ escaped and stupid VK didn't provide it:

window.alert( (new RegExp('[\\u0100-\\uFFFF]+','i')).test('i') ) //
false

Noop, false idea. Escaped escape doesn't help. Actually I see they
finally fixed that so RegExp constructor doesn't care if it double
escaped or not.

Nope, they didn't. The difference is that
'[\u0100-\uffff]+'
is a six character string (with .charCodeAt(1) == 256 and
.charCodeAt(3) == 0xffff), whereas
'[\\u0100-\\uffff]+'
is a 16 character string.
The regular expressions you get from RegExp(string, "i") will
work the same, because they define the same character range.
In one, the escape is resolved by the JS string parser, in
the other it's resolved by the RegExp parser.
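
A small sketch of that difference, for anyone who wants to check it in a console. The variable names are made up for the illustration; the i flag is left out so that the engine differences discussed in this thread play no part.

```javascript
// Escape resolved by the JS string parser: the string already holds
// the two raw characters U+0100 and U+FFFF.
var singleEscaped = '[\u0100-\uffff]+';
// Escape left for the RegExp parser: the string holds backslash, u, digits.
var doubleEscaped = '[\\u0100-\\uffff]+';

var lengths = [singleEscaped.length, doubleEscaped.length]; // [6, 16]
var lowerBound = singleEscaped.charCodeAt(1);               // 256
var upperBound = singleEscaped.charCodeAt(3);               // 65535

// Both strings compile to the same character range.
var fromSingle = new RegExp(singleEscaped);
var fromDouble = new RegExp(doubleEscaped);
var bothMatchHigh = fromSingle.test('\u0100') && fromDouble.test('\u0100'); // true
var eitherMatchesA = fromSingle.test('a') || fromDouble.test('a');          // false
```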

The madness is is exposed by (out of UAs I'm having right now):
NN 4.7 (possibly the origin of it)
FF 3.5.3
IE6, IE8 in any mode

Adequately acting UAs with (new RegExp('[\\u0100-\\uFFFF]+','i')).test
('i') == false
Google Chrome 3.0
Safari 4.0
Opera 10.0

Other words it seems to come from the Browser Wars time and the engine
affected are those of the wartime or based on wartime algorithms.

The source of the problem is that Unicode U+0131 (aka Latin Small
Letter Dotless I) has as "simple uppercase mapping" exactly the
upper-case latin letter "I".

The case folding used by the "ignore-case" flag on regexps converts
both pattern character and input character to upper case to see if it
matches. That means that U+0131 matches "i", because both convert to
"I" using toUpperCase().

However, there is an additional requirement in the ECMAScript 3rd
edition specification that says that a non-ASCII letter cannot convert
to an ASCII letter for this comparison (15.10.2.8, the Canonicalize
function), so the latter three browsers are following the
specification in not considering U+0131 equivalent to U+0069.
The former three probably (intellectually) predate the specification,
and haven't bothered fixing this.

(There are ~15 non-ASCII characters that upper-case to an
ASCII letter).
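
To make the rule concrete, here is a rough sketch of the Canonicalize steps as I read them in 15.10.2.8. The function names canonicalize and sameUnderIgnoreCase are made up for this illustration, and this is a paraphrase of the specification text, not code taken from any engine.

```javascript
// Hypothetical helper mirroring the Canonicalize steps of 15.10.2.8.
function canonicalize(ch, ignoreCase) {
  if (!ignoreCase) return ch;            // without the i flag nothing changes
  var u = ch.toUpperCase();              // upper-case the single character
  if (u.length !== 1) return ch;         // e.g. a result of two characters
  // A non-ASCII character must not be turned into an ASCII one.
  if (ch.charCodeAt(0) >= 128 && u.charCodeAt(0) < 128) return ch;
  return u;
}

// Two characters match under the i flag when their canonical forms agree.
function sameUnderIgnoreCase(a, b) {
  return canonicalize(a, true) === canonicalize(b, true);
}

var dotlessVsI = sameUnderIgnoreCase('\u0131', 'i'); // false: U+0131 stays as is
var dottedVsI = sameUnderIgnoreCase('\u0130', 'i');  // false: U+0130 is its own upper case
var plainIvsI = sameUnderIgnoreCase('I', 'i');       // true
```

Under this reading neither U+0130 nor U+0131 is equivalent to "i", which agrees with the three browsers reported above as returning false.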
/L
 
V

VK

If anyone wonders why I initially tried a case insensitive match at
all when whole ranges are being excluded:
First, it just comes as an automatic keystroke after years of matching
searchstring-newstring in Basic Latin. Secondly, and most importantly,
it was a combined spam filter [ranges]+[keywords] such as
var FILTER = new RegExp('[[i[\\u0100-\\uFFFD]]|paypal payment|wholesale|~{3,}','i');
not just range filtering.
 
V

VK

Lasse said:
The source of the problem is that Unicode U+0131 (aka Latin Small
Letter Dotless I) has as "simple uppercase mapping" exactly the
upper-case latin letter "I".

<snip>

Again - I would never have thought of such a caveat unless it was
explained. Thank you for a great summary!

P.S. Just in case I am reminding that Mozilla bug is already filed
last year with an interesting discussion followed but so far no move
on it: https://bugzilla.mozilla.org/show_bug.cgi?id=440926
 
R

Richard Cornford

In attempt to be nasty with me you are making yourself looking like
a complete moron - which you are not, at least definitely not in
programming. The code point of LATIN CAPITAL LETTER I is \u0049
(dec 73) and it has nothing to do with LATIN CAPITAL LETTER I WITH
DOT.

Who said it did? The Unicode spec says that the lower-case
representation of LATIN CAPITAL LETTER I WITH DOT ABOVE is code point
0069, which is _lower_case_ "I" (as I said). It is case-insensitive
matching of lower case "I" that is provoking the positive match.
It is obvious to 3rd party programmers, it is obvious to Mozilla
programmers, it is obvious to Mozilla programmers from the old
Mosaic-Netscape team who actually introduced this bug to speed
up some localizations at pre-Unicode era. It is really silly to
claim opposites when anyone can follow the provided link
https://bugzilla.mozilla.org/show_bug.cgi?id=440926
<snip>

If anything is obvious from reading that it is that the people
involved were very unsure about what should be the 'correct' behaviour
and why. Indeed, at no point is anyone able to state what it is about
the ECMA spec that makes it incorrect. It is actually step 2 in the
algorithm for Canonicalize (not step 5 as suggested by Lasse and
hinted at by some comments on the bug). Step 2 requires that "i" be
transformed to uppercase 'as if by calling
String.prototype.toUpperCase on' it so it should not matter whether
there are multiple uppercase representations of "i", because -
toUpperCase - can only return one.
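
A quick check of what toUpperCase returns for the three characters involved. The variable names are made up for the illustration; toUpperCase is the locale-independent method, so the results should not depend on the user's language settings.

```javascript
// Each call returns exactly one result, whatever other mappings Unicode lists.
var upperOfI = 'i'.toUpperCase();             // 'I' (U+0049)
var upperOfDotless = '\u0131'.toUpperCase();  // 'I' (U+0049)
var upperOfDotted = '\u0130'.toUpperCase();   // '\u0130', already upper case

// Code points, to make the comparison explicit.
var codes = [
  upperOfI.charCodeAt(0),        // 0x49
  upperOfDotless.charCodeAt(0),  // 0x49
  upperOfDotted.charCodeAt(0)    // 0x130
];
```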

Richard.
 
L

Lasse Reichstein Nielsen

VK said:
var FILTER = new RegExp('[[i[\\u0100-\\uFFFD]]|paypal payment|wholesale|~{3,}','i');

I don't think that means what you think it means.

The above regexp matches (case insensitively)
any of "i", "[", or a character in the range U+0100 .. U+FFFD
followed by the character "]"
or
the string "paypal payment"
or
the string "wholesale"
or
3 or more times the "~" character

In particular, having two instances of "[" in the first range can't
be necessary.
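
A sketch that shows the first alternative behaving as described. The variable name firstAlternative is made up, a regular expression literal is used instead of the constructor, and the i flag is left out so that the engine differences discussed in this thread do not affect the results.

```javascript
// The character class ends at the first "]"; the second "]" is a literal
// character that has to follow whatever the class matched.
var firstAlternative = /[[i[\u0100-\uFFFD]]/;

var checks = {
  iAlone: firstAlternative.test('i'),                 // false: no "]" follows
  iThenBracket: firstAlternative.test('i]'),          // true
  bracketPair: firstAlternative.test('[]'),           // true: "[" is in the class
  highAlone: firstAlternative.test('\u4E2D'),         // false: no "]" follows
  highThenBracket: firstAlternative.test('\u4E2D]')   // true
};
```

So a title containing high characters but no "]" slips past this alternative, which is probably not what the filter was meant to do.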
/L
 
V

VK

Eah, I always knew that my regexp skills are... whatever... :)

I actually need to sort out all strings that contain:

1) any char in \u0100 - \uFFFD
The upper border was lowered from \uFFFF to \uFFFD following
Mr. Cornford's remark that FFFE and FFFF are reserved in Unicode and
no char can have such a code.

2) ~~ and longer ~ strings like "~~~~~ SUPER OFFER ~~~~" etc.

3) "discount", "hot sale", "wholesale" words in any case or in MixED
case

In ggNoSpam 0.2.4 I used two consecutive filters, the stable one for
(1) and variable one for (2-3) :

var FILTER_NO_EX = new RegExp('[\\u0100-\\uFFFD]','');
// ~{2,} is a quantifier (two or more "~"); ~(2,) would only match the literal text "~2,"
var FILTER = new RegExp('~{2,}|discount|hot sale|wholesale','i');

spam = FILTER_NO_EX.test(tmp);
if (!spam) {
  spam = FILTER.test(tmp);
}
if (spam) {
  // do action
}

I am open for bug corrections and more effective structures. I only
want to note that FILTER part will have to change regularly so
readability and modularity is preferred over compactness or even
effectiveness, because the filters will have to deal with a small
(20-100) set of strings 10-100 char length.
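
One possible shape for the two filters, offered as a sketch rather than a definitive version. KEYWORDS and isSpam are names made up for this illustration; the keyword list is the one from the post, and it is assumed that the keywords contain no regexp metacharacters.

```javascript
// Stable filter: any character in U+0100..U+FFFD, no i flag needed.
var FILTER_NO_EX = /[\u0100-\uFFFD]/;

// Variable part kept as a plain list so it can be edited without
// touching any regexp syntax (hypothetical name).
var KEYWORDS = ['discount', 'hot sale', 'wholesale'];

// Two or more "~" in a row, or any keyword, in any letter case.
var FILTER = new RegExp('~{2,}|' + KEYWORDS.join('|'), 'i');

// Hypothetical helper: true when a title should be treated as spam.
function isSpam(title) {
  return FILTER_NO_EX.test(title) || FILTER.test(title);
}

var samples = [
  isSpam('~~~~~ SUPER OFFER ~~~~'),       // true: run of "~"
  isSpam('WholeSale shoes'),              // true: keyword in mixed case
  isSpam('\u4E2D\u6587 title'),           // true: characters above U+00FF
  isSpam('"i" character regexp madness')  // false
];
```

Keeping the range test case-sensitive means the i flag only ever applies to the ASCII keywords, so the U+0130 issue cannot come into play.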
 
L

Lasse Reichstein Nielsen

Richard Cornford said:
If anything is obvious from reading that it is that the people
involved were very unsure about what should be the 'correct' behaviour
and why. Indeed, at no point is anyone able to state what it is about
the ECMA spec that makes it incorrect. It is actually step 2 in the
algorithm for Canonicalize (not step 5 as suggested by Lasse and
hinted at by some comments on the bug). Step 2 requires that "i" be
transformed to uppercase 'as if by calling
String.prototype.toUpperCase on' it so it should not matter whether
there are multiple uppercase representations of "i", because -
toUpperCase - can only return one.

Yes, you are correct. I misidentified the problem. IE and Firefox
are actually failing already at step 2 (*and* at step five, if it
makes sense to say that after we already diverged from the algorithm)

In IE, it actually is U+0130, Dotted Capital I, that is matching "i".
I.e., /\u0130/i.test("i") is true, and /i/i.test("\u0130")
is true as well.

It seems IE canonicalizes by lower-casing (and doesn't mind
canonicalizing non-ASCII to ASCII). That's not ECMAScript compliant,
but the behavior probably predates ECMAScript anyway.

Firefox is more interesting.
It does indeed return true for /[\u0100-\uffff]/i.test("i"), but
when you try the reverse test, by matching /i/i against a string
of all the chars in the U+0100..U+FFFF range, *it doesn't match*.
I.e., there is some value in U+0100 through U+FFFF that is case
insensitively equal to "i" in one direction, but not in the other.
That's just .. wrong.
Well, the culprit is 304 (U+0130) again (binary search is your friend).
I.e., /\u0130/i.test("i") is true in Firefox as well, ...
but /i/i.test("\u0130") is false!
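
For anyone who wants to repeat the hunt, a sketch of the reverse test. It uses a plain linear scan rather than the binary search mentioned above, since the range is small enough, and the function name findMatchingCodes is made up for the illustration. What the scan finds above U+00FF depends on the engine, so no result is claimed for that part.

```javascript
// Returns the code unit values in [from, to] whose one-character string
// is matched by re. The regexp must not carry the g flag, otherwise
// lastIndex would leak state between successive test() calls.
function findMatchingCodes(re, from, to) {
  var hits = [];
  for (var code = from; code <= to; code++) {
    if (re.test(String.fromCharCode(code))) {
      hits.push(code);
    }
  }
  return hits;
}

// Within ASCII letters only "I" (0x49) and "i" (0x69) match /i/i.
var asciiHits = findMatchingCodes(/i/i, 0x41, 0x7A);

// Engine dependent: lists any characters above U+00FF treated as equal to "i".
var highHits = findMatchingCodes(/i/i, 0x0100, 0xFFFF);
```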

It's impossible to say which step in the ECMAScript Canonicalize
function that matters here. It doesn't even appear to be close.

/L '/\u03a3\u03c3\u03c2/i.test("\u03c2\u03a3\u03c3") === true'
 
