"i" character regexp madness

Evertjan. · Oct 13, 2009

VK wrote on 13 Oct 2009 in comp.lang.javascript:

var FILTER_NO_EX = new RegExp('[\\u0100-\\uFFFD]','');
var FILTER = new RegExp('~(2,)|discount|hot sale|wholesale','i');

I think you mean ~{2,} not ~(2,)

no need for the , in ~{2,}, as more than two ~ are matched by two ~.
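A small sketch of the point about the quantifier, assuming the filter is only ever used with test() (the sample strings are made up for illustration):

```javascript
// ~(2,) is a "~" followed by a group containing the literal text "2,".
// ~{2,} is two or more "~". For test(), ~~ finds the same strings as ~{2,}.
var group = /~(2,)/;
var atLeastTwo = /~{2,}/;
var pair = /~~/;

// Hypothetical sample subjects, not taken from the thread.
var samples = ['a~b', 'a~~b', 'a~~~~b', 'a~2,b'];
var results = [];
for (var i = 0; i < samples.length; i++) {
  results.push({
    text: samples[i],
    group: group.test(samples[i]),
    atLeastTwo: atLeastTwo.test(samples[i]),
    pair: pair.test(samples[i])
  });
}
console.log(results);
// Only 'a~2,b' matches /~(2,)/; 'a~~b' and 'a~~~~b' match both other patterns.
```

The two forms only differ when the length of the match matters (for example with replace()), which is not the case in this filter.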

spam = FILTER_NO_EX.test(tmp);
if (!spam) {
spam = FILTER.test(tmp)
}
if (spam) {
// do action
}

if (/[^\x00-\xff]|~~|discount|(hot |whole)sale/i.test(tmp)) {
// do action
}
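To compare the two-step filter with the one-liner, here is a rough sketch. The helper names isSpamTwoStep and isSpamCombined and the sample strings are assumptions made for this illustration only, and the first filter is written with the corrected ~{2,}:

```javascript
// Two-step version, with the quantifier fixed to ~{2,}.
var FILTER_NO_EX = /[\u0100-\uFFFD]/;
var FILTER = /~{2,}|discount|hot sale|wholesale/i;

// Single combined pattern as suggested above.
var COMBINED = /[^\x00-\xff]|~~|discount|(hot |whole)sale/i;

function isSpamTwoStep(text) {
  return FILTER_NO_EX.test(text) || FILTER.test(text);
}

function isSpamCombined(text) {
  return COMBINED.test(text);
}

// Hypothetical subjects for illustration.
var subjects = ['Hot Sale on shoes', 'WHOLESALE', '~~~ cheap ~~~',
                'regexp question', 'caf\u00e9', '\u4e2d\u6587'];
for (var i = 0; i < subjects.length; i++) {
  console.log(subjects[i], isSpamTwoStep(subjects[i]), isSpamCombined(subjects[i]));
}
```

One difference worth knowing: [^\x00-\xff] also covers \uFFFE and \uFFFF, which the original range \u0100-\uFFFD leaves out.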

VK · Oct 13, 2009

VK wrote on 13 Oct 2009 in comp.lang.javascript:

var FILTER_NO_EX = new RegExp('[\\u0100-\\uFFFD]','');
var FILTER = new RegExp('~(2,)|discount|hot sale|wholesale','i');

I think you mean ~{2,} not ~(2,)

no need for the , in ~{2,}, as more than two ~ are matched by two ~.

spam = FILTER_NO_EX.test(tmp);
if (!spam) {
spam = FILTER.test(tmp)
}
if (spam) {
// do action
}

if (/[^\x00-\xff]|~~|discount|(hot |whole)sale/i.test(tmp)) {
// do action
}

Fantastic! Thank you so much.
The code has been updated and moved to 0.2.5, both at
http://userscripts.org/scripts/show/59377 (Greasemonkey)
and
http://www.iescripts.org/view-scripts-676p1.htm (IE7Pro)

Please note that the actual filter code is
var FILTER = new RegExp('[^\\x00-\\xff]|~~|discount|(hot |whole)sale','i');
while it shows in the "Source" window at the site as
var FILTER = new RegExp('[^\x00-\xff]|~~|discount|(hot |whole)sale','i');
That's a bug in their viewer.

Thank you again.

Thomas 'PointedEars' Lahn · Oct 14, 2009

VK said:
The code has been updated and moved to 0.2.5, both at
http://userscripts.org/scripts/show/59377 (Greasemonkey)
and
http://www.iescripts.org/view-scripts-676p1.htm (IE7Pro)

Please note that the actual filter code is
var FILTER = new RegExp('[^\\x00-\\xff]|~~|discount|(hot |whole)sale','i');

while competent people would have used

var filter = /[^\u0000-\u00FF]|~~|discount|(hot |whole)sale/i;

considering the target implementations (and this RegExp initializer
is supported since JavaScript 1.2/NS 4.0, JScript 3.1/IE 4.01 anyway).

<http://PointedEars.de/es-matrix/#features>

However, it is rather curious that you would want to exclude postings
that contained any Unicode character beyond the Latin-1 Supplement range.

PointedEars
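To illustrate the remark about characters beyond the Latin-1 Supplement range, here is a minimal sketch of what the character class rejects. The sample subjects are invented for the example:

```javascript
// Matches any UTF-16 code unit above U+00FF.
var beyondLatin1 = /[^\u0000-\u00FF]/;

// Hypothetical subject lines, not from the thread.
var subjects = [
  'Caf\u00e9 menu',         // e-acute is U+00E9, inside Latin-1
  'Price in \u20ac',        // euro sign, U+20AC
  '\u201cquoted\u201d',     // typographic quotes, U+201C and U+201D
  'plain ASCII'
];
var flagged = [];
for (var i = 0; i < subjects.length; i++) {
  flagged.push(beyondLatin1.test(subjects[i]));
}
console.log(flagged); // [false, true, true, false]
```

So a posting with a euro sign or curly quotes in its subject would be filtered along with the spam, which is presumably why the check looks too broad.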

Dr J R Stockton · Oct 14, 2009

In comp.lang.javascript message <40d061ae-951a-4c08-9801-c9fcb56d9924@l2g2000yqd.googlegroups.com>, Tue, 13 Oct 2009 04:47:04, VK

window.alert( (new RegExp('[\\u0100-\\uFFFF]+','i')).test('i') ) // false

Independently of what your actual problems are, ISTM unwise to use "new
RegExp" in cases where a RegExp Literal could be used. RegExp Literals
require less typing and provide less room for error.
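A short sketch of where the room for error comes from: with "new RegExp" the pattern first passes through string-literal escaping, so every backslash has to be doubled. The pattern below is an arbitrary example, not one from the filter:

```javascript
// The same pattern written three ways.
var literal = /\d+\.\d+/;
var escaped = new RegExp('\\d+\\.\\d+');   // backslashes doubled for the string
var unescaped = new RegExp('\d+\.\d+');    // the string is really 'd+.d+'

console.log(literal.source);    // \d+\.\d+
console.log(escaped.source);    // \d+\.\d+
console.log(unescaped.source);  // d+.d+

console.log(literal.test('3.14'));    // true
console.log(unescaped.test('3.14'));  // false
console.log(unescaped.test('ddxdd')); // true
```

In the same way, a single-backslash '\x00' or '\xff' inside a string is converted to the character itself before the RegExp constructor ever sees it.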

Also, generally, a comma acting as a list separator should be followed
by whitespace. That does not help the computer, but it helps many of
those reading and parsing /via/ eyeballs. Semi-colon likewise.

VK · Oct 15, 2009

Thomas said:
VK said:

Please note that the actual filter code is
var FILTER = new RegExp('[^\\x00-\\xff]|~~|discount|(hot |whole)sale','i');

while competent people would have used

var filter = /[^\u0000-\u00FF]|~~|discount|(hot |whole)sale/i;

considering the target implementations (and this RegExp initializer
is supported since JavaScript 1.2/NS 4.0, JScript 3.1/IE 4.01 anyway).
OK

However, it is rather curious that you would want to exclude postings
that contained any Unicode character beyond the Latin-1 Supplement range.

Yes, later I decided that it's too aggressive for a public release,
though it was satisfactory for myself, so in 0.4.0 this part of the regex
is removed.

As a theoretical question, I am still wondering: what if one needed to
exclude any characters in the said range \u0100 - \uFFFD or any other
range encompassing code point \u0130?

With the previously described Gecko madness and IE silliness it must
be something like
(NOT I AND NOT i) AND (within \u0100 - \uFFFD range)
but RegExp doesn't seem to have a relevant syntax for "double negations"
of this kind, as such situations are simply not possible in regular
expression theory.

Am I wrong?

VK · Oct 15, 2009

As a theoretical question, I am still wondering: what if one needed to
exclude any characters in the said range \u0100 - \uFFFD or any other
range encompassing code point \u0130?

With the previously described Gecko madness and IE silliness it must
be something like
(NOT I AND NOT i) AND (within \u0100 - \uFFFD range)
but RegExp doesn't seem to have a relevant syntax for "double negations"
of this kind, as such situations are simply not possible in regular
expression theory.

Am I wrong?

I mean a situation where we want a one-liner with other regexp
parts requiring case-insensitive matches, such as
/[code range]|this|that/i
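One way the "(NOT I AND NOT i) AND (within range)" condition might be spelled in a single case-insensitive pattern is a negative lookahead, which is part of ECMAScript 3. This is only a sketch: the name oneLiner and the sample strings are assumptions, and it has not been checked against the Gecko and IE versions discussed in this thread.

```javascript
// (?![Ii]) rejects the position if the next character is I or i;
// only then is the range \u0100-\uFFFD tried. The other alternatives
// stay case-insensitive as usual.
var oneLiner = /(?![Ii])[\u0100-\uFFFD]|this|that/i;

// Hypothetical inputs for illustration.
var inputs = ['i', 'I', '\u0130', '\u0416', 'THAT', 'hello'];
var matches = [];
for (var n = 0; n < inputs.length; n++) {
  matches.push(oneLiner.test(inputs[n]));
}
console.log(matches);
```

In an engine that follows the specification, plain "i" and "I" never match the range in the first place, so the lookahead only matters where the case folding misbehaves.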

Thomas 'PointedEars' Lahn · Oct 16, 2009

VK said:
As a theoretical question, I am still wondering: what if one needed to
exclude any characters in the said range \u0100 - \uFFFD or any other
range encompassing code point \u0130?

With the previously described Gecko madness and IE silliness it must
be something like
(NOT I AND NOT i) AND (within \u0100 - \uFFFD range)
but RegExp doesn't seem to have a relevant syntax for "double negations"
of this kind, as such situations are simply not possible in regular
expression theory.

Am I wrong?

I mean a situation where we want a one-liner with other regexp
parts requiring case-insensitive matches, such as
/[code range]|this|that/i

For a public release you would probably only exclude parsed Subject headers
containing certain words and Unicode symbol characters as the spammers
appear to use those. The Unicode Character Database can help you with that.

Of course, by now a smarter person would have gotten themselves informed how
Usenet really works¹, and used a locally installed newsreader, like
Thunderbird/Icedove or KNode instead of the buggy and apparently
unmaintained Google Groups archive (search is frequently borken/inaccurate,
abuse complaints do not appear to have any effect, Control messages by
automated anti-spam bots people have set up are ignored²), because it would
provide much finer filter capabilities. Indeed, by doing that since I can
remember (with few exceptions), and accessing a not-too-bad maintained
newsserver, I am seeing only a fraction of the spam that can be seen at
Google Groups.

PointedEars
___________
¹ <http://en.wikipedia.org/wiki/Usenet>
² Insofar we might have arrived at a point where we should question whether
the Usenet saying "Google is your friend" still holds true.

VK · Oct 16, 2009

VK said:
I mean a situation where we want a one-liner with other regexp
parts requiring case-insensitive matches, such as
/[code range]|this|that/i

Thomas 'PointedEars' Lahn wrote
For a public release you would probably only exclude parsed Subject headers
containing certain words and Unicode symbol characters as the spammers
appear to use those. The Unicode Character Database can help you with that.

So am I right with my RegExp deadlock assumption or not?

Of course, by now a smarter person would have gotten themselves informed how
Usenet really works¹, and used a locally installed newsreader, like
Thunderbird/Icedove or KNode instead of the buggy and apparently
unmaintained Google Groups archive

In the good ol' Usenet there is such a thing as server retention
time, so DejaNews -> Google Groups is the only reason why Usenet
luckily didn't follow Gopher, Veronica and the like, and so didn't become
a network of a handful of enthusiasts. I am not pushing everyone to
switch to GG, God forbid!

But it always irritates me when
different "Usenet gurus" spit into the river they are drinking
from. But we already had a very similar situation back in 2007 about
the "anti-MI5 filter". That time you remained silent on my proposal to
show the real advantages of a real newsreader:
http://groups.google.com/group/comp.lang.javascript/msg/6f741e7fcd16a914
Do you want to add something in the year 2009?

Thomas 'PointedEars' Lahn · Oct 16, 2009

VK said:
VK said:

As a theoretical question, I am still wondering: what if one needed to
exclude any characters in the said range \u0100 - \uFFFD or any other
range encompassing code point \u0130?

With the previously described Gecko madness and IE silliness it must
be something like
(NOT I AND NOT i) AND (within \u0100 - \uFFFD range)
but RegExp doesn't seem to have a relevant syntax for "double negations"
of this kind, as such situations are simply not possible in regular
expression theory.

Am I wrong?

I mean a situation where we want a one-liner with other regexp
parts requiring case-insensitive matches, such as
/[code range]|this|that/i

Thomas 'PointedEars' Lahn wrote

Learn to quote.

So am I right with my RegExp deadlock assumption or not?
Mu.

In the good ol' Usenet there is such a thing as server retention
time, so DejaNews -> Google Groups is the only reason why Usenet
luckily didn't follow Gopher, Veronica and the like, and so didn't become
a network of a handful of enthusiasts.

OMG. More fairytales from VK land. Given people like you, I wish for Usenet
to become that "network of a handful of enthusiasts". Since Google Groups,
the S/N ratio has steadily declined.

I am not pushing everyone to switch to GG,

Good.

PointedEars
