I'm allowing only a-z, A-Z, 0-9, underscore, point, @, "
and ' - but I need to allow ALL UNICODE LETTERS, from all
languages. I would like other languages (e.g. Hebrew) to be
supported in the search box as well, but not all
existing special characters. What should my regexp be?
The only useful security validation takes place on the server side,
as others have already explained in this thread. But you could have
other reasons to perform such a check. First, you'll need to study
the Unicode tables to decide what you want to allow and
what not. A good starting point:
http://unicode.coeurlumiere.com/.
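You should not put literal characters with code points above 255 in
JavaScript source, because the file's encoding may not survive every
editor and server intact. List them by hexadecimal code point instead.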
Let's say you want to allow a-z, A-Z, 0-9, underscore, point, @-sign,
double quote, single quote, Hebrew alphabet and Russian alphabet
(Cyrillic):
var okay = [
// Latin uppercase
'0041','0042','0043','0044','0045','0046','0047','0048',
'0049','004A','004B','004C','004D','004E','004F','0050',
'0051','0052','0053','0054','0055','0056','0057','0058',
'0059','005A',
// Latin lowercase
'0061','0062','0063','0064','0065','0066','0067','0068',
'0069','006A','006B','006C','006D','006E','006F','0070',
'0071','0072','0073','0074','0075','0076','0077','0078',
'0079','007A',
// Digits 0-9
'0030','0031','0032','0033','0034','0035','0036','0037',
'0038','0039',
// underscore, point, double quote, @-sign, single quote
'005F','002E','0022','0040','0027',
// Russian uppercase
'0410','0411','0412','0413','0414','0415','0416','0417',
'0418','0419','041A','041B','041C','041D','041E','041F',
'0420','0421','0422','0423','0424','0425','0426','0427',
'0428','0429','042A','042B','042C','042D','042E','042F',
// Russian lowercase
'0430','0431','0432','0433','0434','0435','0436','0437',
'0438','0439','043A','043B','043C','043D','043E','043F',
'0440','0441','0442','0443','0444','0445','0446','0447',
'0448','0449','044A','044B','044C','044D','044E','044F',
// Hebrew
'05D0','05D1','05D2','05D3','05D4','05D5','05D6','05D7',
'05D8','05D9','05DA','05DB','05DC','05DD','05DE','05DF',
'05E0','05E1','05E2','05E3','05E4','05E5','05E6','05E7',
'05E8','05E9','05EA'
];
It should be pretty straightforward to build a regex from
this array so that every character of a string must be in it. But
you can see that this could easily become a heavy CPU consumer, depending
on how much you want to allow. Unicode is... big.
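As a rough sketch of that idea, the array can be turned into a single character class, compiled once and reused. The names buildAllowRegex, sample and allowed are made up for this illustration, and the short sample list only stands in for the full array:

```javascript
// Hypothetical helper: turn an array of 4-digit hex code points
// (same format as the `okay` array) into one anchored regex.
function buildAllowRegex(hexCodes) {
  // Each entry becomes a \uXXXX escape, so the pattern stays pure ASCII
  // and no character needs special escaping inside the class.
  var body = hexCodes.map(function (hex) {
    return '\\u' + hex;
  }).join('');
  // ^...$ forces every character of the input to be in the class.
  // Note that * also accepts the empty string.
  return new RegExp('^[' + body + ']*$');
}

// Sample list for illustration only: A, b, underscore, Hebrew alef, Cyrillic Ya
var sample = ['0041', '0062', '005F', '05D0', '042F'];
var allowed = buildAllowRegex(sample);

console.log(allowed.test('Ab_\u05D0\u042F')); // true
console.log(allowed.test('Ab#'));             // false
```

Building the regex once and keeping it in a variable avoids walking the array for every keystroke. If the target engines support ES2018 Unicode property escapes, a pattern such as /^[\p{L}\p{N}_.@"']*$/u covers letters and digits from every script without a hand-written list, but that is a different technique from the array and depends on engine support.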
For this reason, an alternative approach is to write a regex that
states which characters are NOT allowed. But even though
common programming languages use only ASCII in their syntax, some mechanisms might be
triggered that do unexpected things when they receive unknown characters
as input. This is extremely dependent on language, application and
environment.
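A minimal sketch of the deny-list variant follows. The set of forbidden characters is an assumption picked for illustration (control characters plus a few symbols that often matter in markup), and the names forbidden and isAcceptable are hypothetical; what really needs blocking depends on your application:

```javascript
// Hypothetical deny list: ASCII control characters, DEL,
// angle brackets, curly braces and the backslash.
var forbidden = /[\u0000-\u001F\u007F<>{}\\]/;

// A string is acceptable when it contains none of the forbidden characters.
function isAcceptable(input) {
  return !forbidden.test(input);
}

console.log(isAcceptable('\u05E9\u05DC\u05D5\u05DD')); // true (Hebrew letters)
console.log(isAcceptable('a<b'));                      // false
```

This keeps the pattern short no matter how many scripts you accept, at the price of letting through anything you did not think of.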
Hope this helps,