"i" character regexp madness


Evertjan.

VK wrote on 13 Oct 2009 in comp.lang.javascript:
var FILTER_NO_EX = new RegExp('[\\u0100-\\uFFFD]','');
var FILTER = new RegExp('~(2,)|discount|hot sale|wholesale','i');

I think you mean ~{2,} not ~(2,)

no need for the , in ~{2,}, as more than two ~ are matched by two ~.
spam = FILTER_NO_EX.test(tmp);
if (!spam) {
spam = FILTER.test(tmp)
}
if (spam) {
// do action
}

if (/[^\x00-\xff]|~~|discount|(hot |whole)sale/i.test(temp)) {
// do action
};
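
A minimal sketch of the point about ~{2,} above, assuming only standard RegExp behaviour; the sample strings are made up for illustration. Since test() only asks whether a match exists somewhere in the string, ~~ and ~{2,} accept the same inputs, while ~(2,) looks for a tilde followed by the literal text "2,".

```javascript
// Sample strings are hypothetical, chosen only to exercise the three patterns.
var samples = ['a ~~~ b', 'a ~~ b', 'a ~ b', 'a ~2, b'];

var typo = /~(2,)/;      // a tilde followed by the literal text "2,"
var atLeast2 = /~{2,}/;  // a run of two or more tildes
var exactly2 = /~~/;     // two tildes; any longer run contains them anyway

// One row per sample: [typo, atLeast2, exactly2]
var results = samples.map(function (s) {
  return [typo.test(s), atLeast2.test(s), exactly2.test(s)];
});
// 'a ~~~ b' -> [false, true, true]
// 'a ~~ b'  -> [false, true, true]
// 'a ~ b'   -> [false, false, false]
// 'a ~2, b' -> [true, false, false]
```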
 

VK

VK wrote on 13 Oct 2009 in comp.lang.javascript:
  var FILTER_NO_EX = new RegExp('[\\u0100-\\uFFFD]','');
  var FILTER = new RegExp('~(2,)|discount|hot sale|wholesale','i');

I think you mean ~{2,} not ~(2,)

no need for the , in ~{2,}, as more than two ~ are matched by two ~.


  spam = FILTER_NO_EX.test(tmp);
  if (!spam) {
   spam = FILTER.test(tmp)
  }
  if (spam) {
   // do action
  }

if (/[^\x00-\xff]|~~|discount|(hot |whole)sale/i.test(temp)) {
   // do action

};

Fantastic! Thank you so much.
The code updated and moved to 0.2.5 both at
http://userscripts.org/scripts/show/59377 (Greasemonkey)
and
http://www.iescripts.org/view-scripts-676p1.htm (IE7Pro)

Please note that the actual filter code is
var FILTER = new RegExp('[^\\x00-\\xff]|~~|discount|(hot |whole)sale','i');
while it shows in the "Source" window at the site as
var FILTER = new RegExp('[^\x00-\xff]|~~|discount|(hot |whole)sale','i');
That's a bug in their viewer.
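
A small sketch of why the doubled backslashes matter when the pattern is passed as a string, assuming standard string-literal and RegExp parsing. For \x escapes the two spellings happen to build equivalent character classes, so the viewer's version would probably still work; with an escape such as \d, which the string parser does not know, the single-backslash form silently changes the pattern. The variable names are only for illustration.

```javascript
// The string parser consumes backslashes before RegExp ever sees the pattern.
var single = '[^\x00-\xff]';     // holds the two raw characters U+0000 and U+00FF
var doubled = '[^\\x00-\\xff]';  // holds the escape text \x00 and \xff

var reSingle = new RegExp(single, 'i');
var reDoubled = new RegExp(doubled, 'i');

// Both classes describe the same range, so they agree on these inputs.
var sameHere = reSingle.test('\u4e2d') === reDoubled.test('\u4e2d') &&
               reSingle.test('abc') === reDoubled.test('abc');

// An escape the string parser does not recognise just loses its backslash.
var lostEscape = new RegExp('\d+');   // pattern text is "d+": matches the letter d
var keptEscape = new RegExp('\\d+');  // pattern text is "\d+": matches digits
```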

Thank you again.
 

Thomas 'PointedEars' Lahn

VK said:
The code updated and moved to 0.2.5 both at
http://userscripts.org/scripts/show/59377 (Greasemonkey)
and
http://www.iescripts.org/view-scripts-676p1.htm (IE7Pro)

Please note that the actual filter code is
var FILTER = new RegExp('[^\\x00-\\xff]|~~|discount|(hot |whole)sale','i');

while competent people would have used

var filter = /[^\u0000-\u00FF]|~~|discount|(hot |whole)sale/i;

considering the target implementations (and this RegExp initializer
is supported since JavaScript 1.2/NS 4.0, JScript 3.1/IE 4.01 anyway).

<http://PointedEars.de/es-matrix/#features>

However, it is rather curious that you would want to exclude postings
that contained any Unicode character beyond the Latin-1 Supplement range.


PointedEars
 

Dr J R Stockton

In comp.lang.javascript message <40d061ae-951a-4c08-9801-c9fcb56d9924@l2g2000yqd.googlegroups.com>, Tue, 13 Oct 2009 04:47:04, VK wrote:
window.alert( (new RegExp('[\\u0100-\\uFFFF]+','i')).test('i') ) // false

Independently of what your actual problems are, ISTM unwise to use "new
RegExp" in cases where a RegExp Literal could be used. RegExp Literals
require less typing and provide less room for error.
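
As a rough illustration of that advice, the sketch below builds the pattern from the quoted alert both ways and compares the results. The helper wordFilter is hypothetical and not part of the script under discussion; it only shows the kind of case where the constructor is still the natural choice, namely a pattern assembled at run time.

```javascript
// The same pattern, once through the constructor and once as a literal.
var viaConstructor = new RegExp('[\\u0100-\\uFFFF]+', 'i');
var viaLiteral = /[\u0100-\uFFFF]+/i;

var samePattern = viaConstructor.source === viaLiteral.source;
var sameFlags = viaConstructor.ignoreCase === viaLiteral.ignoreCase;

// Hypothetical helper: the word list is only known at run time.
// Assumes the words contain no regular expression metacharacters.
function wordFilter(words) {
  return new RegExp(words.join('|'), 'i');
}
```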

Also, generally, a comma acting as a list separator should be followed
by whitespace. That does not help the computer, but it helps many of
those reading and parsing /via/ eyeballs. Semi-colon likewise.
 

VK

Thomas said:
VK said:
Please note that the actual filter code is
 var FILTER = new RegExp('[^\\x00-\\xff]|~~|discount|(hot |whole)sale','i');

while competent people would have used

  var filter = /[^\u0000-\u00FF]|~~|discount|(hot |whole)sale/i;

considering the target implementations (and this RegExp initializer
is supported since JavaScript 1.2/NS 4.0, JScript 3.1/IE 4.01 anyway).
OK

However, it is rather curious that you would want to exclude postings
that contained any Unicode character beyond the Latin-1 Supplement range.

Yes, later I decided that it's too aggressive for a public release,
though it was satisfactory for myself, so in 0.4.0 this part of the regex
has been removed.

As a theoretical question I am still wondering what if one needed to
exclude any characters in the said range \u0100 - \uFFFD or any other
range encompassing code point \u0130 ?

With the previously described Gecko madness and IE silliness it must
be something like
(NOT I AND NOT i) AND (within \u0100 - \uFFFD range)
but RegExp doesn't seem to have a relevant syntax for "double negations"
of that kind, as such situations are simply not possible in regular
expression theory.

Am I wrong?
 

VK

As a theoretical question I am still wondering what if one needed to
exclude any characters in the said range \u0100 - \uFFFD or any other
range encompassing code point \u0130 ?

With the previously described Gecko madness and IE silliness it must
be something like
 (NOT I AND NOT i) AND (within \u0100 - \uFFFD range)
but RegExp doesn't seem to have a relevant syntax for "double negations"
of that kind, as such situations are simply not possible in regular
expression theory.

Am I wrong?

I mean a situation where we want a one-liner with other regexp
parts that require case-insensitive matching, such as
/[code range]|this|that/i
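
One possible way around the apparent deadlock, offered as a sketch rather than a tested fix for the engines discussed: a negative lookahead can express "within this range, but not I or i" in the same one-liner. This assumes the engine supports (?!...), which has been part of the language since ECMAScript 3. On an engine that follows the specification the lookahead should be redundant, because the case-insensitive comparison is not supposed to map a non-ASCII character onto an ASCII one. The words this and that are the placeholders from the question.

```javascript
// (?![Ii]) rejects the position if the next character is I or i;
// only then is the range tried. The alternatives stay case insensitive.
var oneLiner = /(?![Ii])[\u0100-\uFFFD]|this|that/i;

var plainI = oneLiner.test('i');            // false: the lookahead blocks it
var dottedI = oneLiner.test('\u0130');      // true: U+0130 is inside the range
var upperWord = oneLiner.test('say THIS');  // true: matched via the /i flag
```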
 

Thomas 'PointedEars' Lahn

VK said:
As a theoretical question I am still wondering what if one needed to
exclude any characters in the said range \u0100 - \uFFFD or any other
range encompassing code point \u0130 ?

With the previously described Gecko madness and IE silliness it must
be something like
(NOT I AND NOT i) AND (within \u0100 - \uFFFD range)
but RegExp doesn't seem to have a relevant syntax for "double negations"
of that kind, as such situations are simply not possible in regular
expression theory.

Am I wrong?

I mean a situation where we want a one-liner with other regexp
parts that require case-insensitive matching, such as
/[code range]|this|that/i

For a public release you would probably only exclude parsed Subject headers
containing certain words and Unicode symbol characters as the spammers
appear to use those. The Unicode Character Database can help you with that.
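
A sketch of what such a narrower filter might look like today, under a loud assumption: it relies on Unicode property escapes (\p{...} with the u flag), which arrived with ES2018 and were not available in the browsers of 2009. The names nonAsciiSymbol, spamWords and looksLikeSpam are hypothetical, and the word list is simply the one from earlier in the thread.

```javascript
// \p{S} is the Unicode general category "Symbol". The lookahead skips
// ASCII symbols such as + or =, which are common in ordinary subjects.
var nonAsciiSymbol = /(?![\x00-\x7F])\p{S}/u;
var spamWords = /discount|(hot |whole)sale/i;

// Hypothetical helper: takes an already parsed Subject header.
function looksLikeSpam(subject) {
  return nonAsciiSymbol.test(subject) || spamWords.test(subject);
}
```

For example, looksLikeSpam('\u2605 Best offer \u2605') should be true because of the star symbols, while a subject such as 'a + b = c' passes, since its only symbols are ASCII.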

Of course, by now a smarter person would have gotten themselves informed how
Usenet really works¹, and used a locally installed newsreader, like
Thunderbird/Icedove or KNode instead of the buggy and apparently
unmaintained Google Groups archive (search is frequently borken/inaccurate,
abuse complaints do not appear to have any effect, Control messages by
automated anti-spam bots people have set up are ignored²), because it would
provide much finer filter capabilities. Indeed, by doing that for as long as I can
remember (with few exceptions), and accessing a reasonably well-maintained
news server, I am seeing only a fraction of the spam that can be seen at
Google Groups.


PointedEars
___________
¹ <http://en.wikipedia.org/wiki/Usenet>
² Insofar we might have arrived at a point where we should question whether
the Usenet saying "Google is your friend" still holds true.
 

VK

VK said:
I mean a situation where we want a one-liner with other regexp
parts that require case-insensitive matching, such as
/[code range]|this|that/i

Thomas 'PointedEars' Lahn wrote
For a public release you would probably only exclude parsed Subject headers
containing certain words and Unicode symbol characters as the spammers
appear to use those.  The Unicode Character Database can help you with that.

So am I right with my RegExp deadlock assumption or not?
Of course, by now a smarter person would have gotten themselves informed how
Usenet really works¹, and used a locally installed newsreader, like
Thunderbird/Icedove or KNode instead of the buggy and apparently
unmaintained Google Groups archive

In the good ol' Usenet there is such a thing as server retention
time, so DejaNews -> Google Groups is the only reason why Usenet
luckily didn't follow Gopher, Veronica and the like, and didn't become
a network of a handful of enthusiasts. I am not pushing everyone to
switch to GG, God forbid! :) But it always irritates me when
various "Usenet gurus" spit into the river they are drinking
from. But we already had a very similar situation back in 2007 about
the "anti-MI5 filter". That time you remained silent on my proposal to
show the real advantages of a real newsreader:
http://groups.google.com/group/comp.lang.javascript/msg/6f741e7fcd16a914
Do you want to add something in the year 2009?
 

Thomas 'PointedEars' Lahn

VK said:
VK said:
As a theoretical question I am still wondering what if one needed to
exclude any characters in the said range \u0100 - \uFFFD or any other
range encompassing code point \u0130 ?
With the previously described Gecko madness and IE silliness it must
be something like
(NOT I AND NOT i) AND (within \u0100 - \uFFFD range)
but RegExp doesn't seem to have a relevant syntax for "double negations"
of that kind, as such situations are simply not possible in regular
expression theory.
Am I wrong?
I mean a situation where we want a one-liner with other regexp
parts that require case-insensitive matching, such as
/[code range]|this|that/i

Thomas 'PointedEars' Lahn wrote

Learn to quote.
So am I right with my RegExp deadlock assumption or not?
Mu.


In the good ol' Usenet there is such a thing as server retention
time, so DejaNews -> Google Groups is the only reason why Usenet
luckily didn't follow Gopher, Veronica and the like, and didn't become
a network of a handful of enthusiasts.

OMG. More fairytales from VK land. Given people like you, I wish for Usenet
to become that "network of a handful of enthusiasts". Since Google Groups,
the S/N ratio has steadily declined.
I am not pushing everyone to switch to GG,

Good.


PointedEars
 
