Sanitizing user inputs in multiple languages

Sam

An application I am developing needs to accept input in any of the 15
languages I've opted for, from a single, common HTML form.

I generally sanitize user inputs from an HTML form by specifying a
list of allowed characters. How can I do a similar sanitization for
inputs that can be in any of the 15 languages if the encoding is
UTF-8? Is there a better method?
 
Jürgen Exner

Sam said:
An application I am developing needs to accept input in any of the 15
languages I've opted for, from a single, common HTML form.

I generally sanitize user inputs from an HTML form by specifying a
list of allowed characters. How can I do a similar sanitization for
inputs that can be in any of the 15 languages if the encoding is
UTF-8?

Quite simple actually. Just take the superset of the white lists of
characters for each language.
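A minimal Python sketch of the "superset of white lists" idea, for illustration only. The per-language character sets below are hypothetical and deliberately incomplete, and the names `ALLOWED`, `COMMON`, `build_validator` and `is_clean` are made up for this example; a real list would have to be reviewed for each of the 15 languages.

```python
import re

# Hypothetical per-language allow-lists, written as regex character-class
# fragments. They are illustrative and far from complete.
ALLOWED = {
    "en": "A-Za-z",
    "de": "A-Za-zÄÖÜäöüß",
    "ru": "\u0400-\u04FF",  # Cyrillic block
    "ar": "\u0621-\u064A",  # basic Arabic letters
}

# Characters accepted in every language (assumption: digits, space and a
# little punctuation). The hyphen comes last so it is taken literally.
COMMON = "0-9 .,'-"


def build_validator(languages):
    """Build one regex whose character class is the union of all lists."""
    char_class = "".join(ALLOWED[lang] for lang in languages) + COMMON
    return re.compile("[" + char_class + "]*")


VALID = build_validator(ALLOWED)


def is_clean(text):
    """Return True if every character of text is in the combined list."""
    return VALID.fullmatch(text) is not None
```

With these lists, `is_clean("Jürgen Exner")` passes and `is_clean("<script>")` does not, because `<` and `>` appear in none of the lists.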

Sam said:
Is there a better method?

Depends on your definition of "better".
More secure? Probably not; white lists are much more secure than black lists.
Easier? Well, depends on your definition of "sanitize". If you want to e.g.
eliminate x-site scripting, then you can simply remove those few characters
that are known to cause x-site scripting. There are modules to do that.
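As a hedged illustration of that last point: instead of removing the risky characters, a common alternative is to escape them when the text is written back into an HTML page. The sketch below uses Python's standard `html.escape` rather than any Perl module (none is named in this thread), and the `render_comment` helper is hypothetical.

```python
import html


def render_comment(text):
    """Wrap user-supplied text in a paragraph, escaping HTML metacharacters.

    html.escape replaces &, <, > and, by default, both quote characters,
    so the text cannot open a tag or break out of an attribute value.
    """
    return "<p>" + html.escape(text) + "</p>"
```

Non-ASCII letters pass through untouched, so the same call works for input in any of the languages: `render_comment("Jürgen <b>")` returns `<p>Jürgen &lt;b&gt;</p>`.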

jue
 
Sam

(Discussion thread in Google Group:
http://groups.google.com/group/perl...8e68b039cd2/95f19485ac944239#95f19485ac944239)
How can I do a similar sanitization for
inputs that can be in any of the 15 languages if the encoding is
UTF-8?

Jürgen Exner said:
Quite simple actually. Just take the superset of the white lists of
characters for each language.

Yes, I had something similar in mind too, but I am stuck on how to
actually implement it. Here's what I've picked up so far:

1. First make sure that input is UTF-8 encoded (
http://www.w3.org/International/questions/qa-forms-utf-8 )
2. Select the allowed characters in each language (all letters and
digits) and use them in a regex.
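The two steps above could be sketched in Python roughly as follows. This is one possible reading, not a tested recipe: step 1 is done with a strict decode, and step 2 uses Unicode general categories instead of hand-built lists, which is a swapped-in technique. It accepts letters, combining marks and numbers from any script, not only the 15 chosen languages. The function names and the `extra` parameter are hypothetical.

```python
import unicodedata


def decode_utf8(raw):
    """Step 1: return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def is_letters_and_numbers(text, extra=" "):
    """Step 2: accept only letters, marks, numbers and characters in extra.

    General categories starting with L are letters, M are combining marks
    (needed for scripts such as Devanagari or vocalised Arabic), and N are
    numbers. Everything else, including < > " and &, is rejected.
    """
    for ch in text:
        if unicodedata.category(ch)[0] in "LMN" or ch in extra:
            continue
        return False
    return True
```

A form handler would call `decode_utf8` on the raw request bytes first and only run the character check on text that decoded successfully.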

For implementing the second step, I found a useful UTF-8 encoding
table which has the Unicode and hex codes for the characters in each
language ( http://www.utf8-chartable.de/ ).

Here's the problem: how do I identify the important characters of
a language I don't know? For example, I know some of the letters of
the Arabic alphabet, but don't really know whether a character like
the ARABIC POETIC VERSE SIGN is necessary. Second, how do I use the
UTF-8 hex codes for the characters in a regex?
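On the second question, a small Python sketch may help. It assumes the input has already been decoded from UTF-8, in which case the regex works on code points and the ranges are written with `\uXXXX` escapes rather than UTF-8 hex bytes. The range U+0621 to U+064A is only an illustrative choice of basic Arabic letters, and `describe` is a hypothetical helper for inspecting an unfamiliar character.

```python
import re
import unicodedata

# Ranges are given as code points (\uXXXX), not as UTF-8 byte values.
ARABIC_LETTERS = re.compile(r"[\u0621-\u064A]+")


def describe(ch):
    """Show code point, official Unicode name and general category."""
    name = unicodedata.name(ch, "<unnamed>")
    return "U+%04X %s (%s)" % (ord(ch), name, unicodedata.category(ch))


# UTF-8 hex bytes from a chart can be turned back into a character:
seen = bytes.fromhex("d8b3").decode("utf-8")  # same character as "\u0633"

word = "\u0633\u0644\u0627\u0645"  # seen, lam, alef, meem
verse_sign = unicodedata.lookup("ARABIC POETIC VERSE SIGN")

print(describe(seen))
print(describe(verse_sign))
print(bool(ARABIC_LETTERS.fullmatch(word)))
print(bool(ARABIC_LETTERS.fullmatch(verse_sign)))
```

The category printed by `describe` is one way to judge a character in an unfamiliar script: letters have categories beginning with `L`, while signs and symbols fall under `S` or `P` and can usually be left off a list meant for names and plain text.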

Somebody must have a better solution. I do get the feeling that
this approach isn't great.

Jürgen Exner said:
Well, depends on your definition of "sanitize". If you want to e.g.
eliminate x-site scripting, ...

Yes, the final intention is to prevent x-site scripting, but I am not
aware of any widely used, popular modules for this.
 
