Validating a UTF-8 string from a search textbox with a javascript regular expression

frohlinger

Hi,

I have a search textbox on my website.
I validate the search string with a "white list" of allowed
characters:

if (!/^[a-zA-Z0-9_.@ \"']+$/.test(theSearchWord))
{
    return;
}

I'm allowing only a-z, A-Z, 0-9, underscore, point, @, " and ' - but I
need to allow ALL UNICODE LETTERS, from all languages. I would like
other languages (e.g. Hebrew) to be supported in the search box as
well, but not every existing special character.
What should my regexp be?

Thanks,
Gabi
 
d d

I'm allowing only a-z, A-Z, 0-9, underscore, point, @, " and ' - but I
need to allow ALL UNICODE LETTERS, from all languages.

I'll leave someone who knows regular expressions to address that part,
but can I ask why you need to block certain characters from the search?
Is it not just acceptable to let them put any characters they want into
the search? OK, they would get either no results or garbage results, but
then that should be what they expect (garbage in garbage out).

If you have certain characters that are really causing you problems,
maybe it would be best to invert your logic and block them (black list)
instead of trying to create a potentially ever-expanding white list.

~dd
 
frohlinger

but can I ask why you need to block certain characters from the search?

Simply because I want to protect my website and DB from hacks.
If I allow any possible character to be entered, hackers can damage my
DB.
You should always use a "white list" for textboxes that query your DB.

G
 
frohlinger

OK, but I still want to double-check, on the client side and on the
server side.
If it's not possible on the client side, I'll do it only on the server
side, but if it is, then why not?
 
Richard Cornford

Simply because I want to protect my website and DB from hacks.
If I allow any possible character to be entered, hackers can damage my
DB.
You should always use a "white list" for textboxes that query your DB.

Client-side code offers no protection of anything from anything. Your
'hackers' can knock-out, replace or re-define your client-side
validation with trivial effort (and if they want to inject something
they probably will do so as a matter of course).

For any real protection you _need_ server-side validation, and if you
have server-side validation that will do the job, the client-side
validation (which is only for user convenience, to cut down on requests
that you know will be rejected by the server if they are actually made)
can be much more tolerant.
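
A minimal sketch of that division of labour, assuming a strict check that runs on the server (for instance under a server-side javascript runtime) and a lenient one in the page. The function names and the 100-character limit are illustrative assumptions, not anything taken from the original site:

```javascript
// Hypothetical strict check, meant to run on the server for every request.
// It reuses the ASCII white list from the original question.
function serverAcceptsSearch(term) {
  return typeof term === 'string' &&
    term.length <= 100 &&                       // assumed length limit
    /^[a-zA-Z0-9_.@ "']+$/.test(term);
}

// Hypothetical lenient check, run in the browser purely as a convenience:
// it only stops requests that are obviously pointless (empty or too long).
function clientAcceptsSearch(term) {
  var trimmed = term.replace(/^\s+|\s+$/g, '');
  return trimmed.length > 0 && trimmed.length <= 100;
}
```

The page-side check may let through input that the server then rejects; that is fine, because only the server's answer counts.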

Richard.
 
getsanjay.sharma

Richard Cornford wrote:

Client-side code offers no protection of anything from anything. Your
'hackers' can knock-out, replace or re-define your client-side
validation with trivial effort (and if they want to inject something
they probably will do so as a matter of course).

I can understand knocking out(disabling it), but how can someone
replace or redefine my own script?
 
Richard Cornford

I can understand knocking out(disabling it), but how can someone
replace or redefine my own script?

By just doing it. Javascript, being very dynamic, allows object
properties to be assigned new values at any time and web browsers
facilitate the execution of arbitrary code by the user (at its simplest,
by entering javascript pseudo-protocol URLs in the address/location
bar). If you have a function called, say, 'validate' and I want to
re-define it I just have to write - javascript:void(validate =
function(x, y){ /* new function body */ }); - into the address/location
bar and hit return, and now 'validate' is the function I defined. And if
you try masking your code inside closures (like Google maps tried) I can
still re-define it by getting the source code as a string (from an
exposed property, from the SCRIPT element via the DOM and/or from an
external file with (in a worst case) an XML HTTP request), replacing the
sections of that string I want changed, and then using - eval - on the
script in
the global context to have the changed code replace the original. Or I
can write a dynamic script loading javascript URL that will import any
arbitrary script from any source, providing the option to replace your
code with mine or any (scripted) tools I may need to step outside the
limitations imposed upon the javascript pseudo-protocol, and play with
the script environment you define to my heart's content.

And all of that is without the options of using dedicated
content-inserting proxies, 'greasemonkey' scripts in Firefox/Mozilla/Gecko
browsers, or scripting an IE browser instance from Windows Script Host
(and so side-stepping most security and bringing the 'big guns' of
ActiveX that are too dangerous to be allowed to run in normal IE into
the game).

The javascript executing on the client is entirely at the mercy of the
person sitting in front of the computer on which it is executing, and so
client-side code provides precisely zero security.
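
The redefinition described above can be sketched as follows. The names `validate` and `submitSearch` are made up for the illustration, and the plain reassignment stands in for what a visitor could type into the address bar or a script console:

```javascript
// The site's validation, defined as an ordinary global function.
var validate = function (term) {
  return /^[a-zA-Z0-9_.@ "']+$/.test(term);
};

// Hypothetical submit routine: it looks up `validate` each time it runs,
// so it uses whatever function the name refers to at that moment.
function submitSearch(term) {
  if (!validate(term)) {
    return 'blocked';
  }
  return 'sent: ' + term;
}

var before = submitSearch("x'; DROP TABLE t; --");

// What the visitor does: replace the function with one that accepts anything.
validate = function () { return true; };

var after = submitSearch("x'; DROP TABLE t; --");
```

Here `before` is 'blocked', while `after` is the 'sent: ...' string: the same input goes through once the check has been swapped out.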

Richard.
 
Bart Van der Donck

I'm allowing only a-z, A-Z, 0-9, underscore, point, @, "
and ' - but I need to allow ALL UNICODE LETTERS, from all
languages. I would like other languages (e.g. Hebrew) to be
supported in the search box as well, but not every existing
special character. What should my regexp be?

The only useful security validation takes place at the server side,
as others have already explained in this thread. But you could have
other reasons to perform such a regex. At first you'll need some study
of the Unicode tables in order to define what you want to allow and
what not. A good starting point: http://unicode.coeurlumiere.com/.
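
In engines that implement Unicode property escapes (part of ECMAScript 2018, so this is an assumption about which browsers you have to support), the "letter" category from those tables can be named directly rather than listed by hand. A sketch, with `allowedSearch` and `isAllowedSearch` as illustrative names:

```javascript
// \p{L} matches any code point in the Unicode "Letter" category;
// the `u` flag is required for \p{...} to be recognised.
// Digits, underscore, point, @, space and both quotes are kept from the
// original white list.
var allowedSearch = /^[\p{L}0-9_.@ "']+$/u;

function isAllowedSearch(term) {
  return allowedSearch.test(term);
}

// Non-ASCII test data is written with \u escapes so the source stays ASCII.
var hebrew = '\u05E9\u05DC\u05D5\u05DD';   // Hebrew word "shalom"
var russian = '\u043C\u0438\u0440';        // Russian word "mir"
```

`isAllowedSearch(hebrew)` and `isAllowedSearch(russian)` return true, while a string containing `<` or `;` is rejected. Two caveats: older engines fail to parse this pattern at all, and \p{L} does not cover combining marks (category M), which some scripts need, e.g. Hebrew vowel points.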

You should not put characters with code points above 255 directly into
javascript source; refer to them by their hexadecimal code point
instead, as below.

Let's say you want to allow a-z, A-Z, 0-9, underscore, point, @-sign,
double quote, single quote, Hebrew alphabet and Russian alphabet
(Cyrillic):

var okay = [
// Latin uppercase
'0041','0042','0043','0044','0045','0046','0047','0048',
'0049','004A','004B','004C','004D','004E','004F','0050',
'0051','0052','0053','0054','0055','0056','0057','0058',
'0059','005A',

// Latin lowercase
'0061','0062','0063','0064','0065','0066','0067','0068',
'0069','006A','006B','006C','006D','006E','006F','0070',
'0071','0072','0073','0074','0075','0076','0077','0078',
'0079','007A',

// Digits 0-9
'0030','0031','0032','0033','0034','0035','0036','0037',
'0038','0039',

// underscore, point, double quote, @-sign, single quote
'005F','002E','0022','0040','0027',

// Russian uppercase
'0410','0411','0412','0413','0414','0415','0416','0417',
'0418','0419','041A','041B','041C','041D','041E','041F',
'0420','0421','0422','0423','0424','0425','0426','0427',
'0428','0429','042A','042B','042C','042D','042E','042F',

// Russian lowercase
'0430','0431','0432','0433','0434','0435','0436','0437',
'0438','0439','043A','043B','043C','043D','043E','043F',
'0440','0441','0442','0443','0444','0445','0446','0447',
'0448','0449','044A','044B','044C','044D','044E','044F',

// Hebrew
'05D0','05D1','05D2','05D3','05D4','05D5','05D6','05D7',
'05D8','05D9','05DA','05DB','05DC','05DD','05DE','05DF',
'05E0','05E1','05E2','05E3','05E4','05E5','05E6','05E7',
'05E8','05E9','05EA'
];

It should be pretty straightforward to write a regex to walk through
this array so that every character from a string must be in it. But
you see that this could easily become a heavy CPU consumer, depending
on how much you want to allow. Unicode is... big :)
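
One way to do that walk, as a sketch: turn each four-digit hex string into a \uXXXX escape and join them into a single character class, so the regex engine does the membership test rather than a loop. The `okay` array below is a shortened stand-in for the full list above, and `buildWhitelist` is a name made up here:

```javascript
// Shortened stand-in for the full `okay` array (a few entries per group).
var okay = [
  '0041', '0042', '0043',        // Latin A, B, C
  '0061', '0062', '0063',        // Latin a, b, c
  '005F', '002E', '0040',        // underscore, point, @-sign
  '0410', '0430',                // Cyrillic capital A, small a
  '05D0', '05D1', '05D2'         // Hebrew alef, bet, gimel
];

// Build /^[\u0041\u0042...]+$/ from the list of hex code points.
// Every entry must be exactly four hex digits for \uXXXX to be valid.
function buildWhitelist(codes) {
  var body = '';
  for (var i = 0; i < codes.length; i++) {
    body += '\\u' + codes[i];
  }
  return new RegExp('^[' + body + ']+$');
}

var whitelist = buildWhitelist(okay);
```

With this shortened list, `whitelist.test('abc')` is true and `whitelist.test('abd')` is false, because 'd' is not in it. In practice a single character class tends to be cheap to match, so the larger cost is likely to be maintaining the list rather than CPU time.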

For this reason, an alternative approach is to just write a regex that
states which characters are NOT allowed. But despite the fact that any
common language uses ASCII instructions only, some mechanisms might be
triggered that do unexpected things when receiving unknown characters
as input. This is extremely dependent on language, application and
environment.
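
A sketch of that inverted check. Which characters to forbid depends entirely on the application; the set below (ASCII control characters plus `<`, `>`, `;` and backslash) is an arbitrary assumption for illustration, and `hasForbidden` is a made-up name:

```javascript
// Matches if the string contains at least one forbidden character:
// ASCII control characters (U+0000-U+001F, U+007F) or one of < > ; \
var forbidden = /[\u0000-\u001F\u007F<>;\\]/;

function hasForbidden(term) {
  return forbidden.test(term);
}
```

Letters from any script pass this check, so a Hebrew or Russian search term is accepted, while `hasForbidden('a;b')` is true. Anything not explicitly listed gets through, which is exactly the weakness described above.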

Hope this helps,
 
