Sanitizing user inputs in multiple languages

Sam · Apr 5, 2007

An application I am developing needs to accept input in any of the 15
languages I've opted for, from a single, common HTML form.

I generally sanitize user inputs from an HTML form by specifying a
list of allowed characters. How can I do a similar sanitization for
inputs that can be in any of the 15 languages if the encoding is
UTF-8? Is there a better method?

Jürgen Exner · Apr 5, 2007

Sam said:
An application I am developing needs to accept input in any of the 15
languages I've opted for, from a single, common HTML form.

I generally sanitize user inputs from an HTML form by specifying a
list of allowed characters. How can I do a similar sanitization for
inputs that can be in any of the 15 languages if the encoding is
UTF-8?

Quite simple actually. Just take the superset of the white lists of
characters for each language.
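
A minimal sketch of that idea, written in Python purely for illustration (the thread itself names no language for the implementation). The `WHITELISTS` table, its code point ranges, and the helper names are hypothetical; the ranges are examples only, not a vetted list for any language.

```python
import re

# Hypothetical per-language whitelists, given as inclusive code point ranges.
# These ranges are illustrative assumptions, not complete alphabets.
WHITELISTS = {
    "english": [(0x30, 0x39), (0x41, 0x5A), (0x61, 0x7A)],
    "german": [(0x30, 0x39), (0x41, 0x5A), (0x61, 0x7A),
               (0xC4, 0xC4), (0xD6, 0xD6), (0xDC, 0xDC), (0xDF, 0xDF),
               (0xE4, 0xE4), (0xF6, 0xF6), (0xFC, 0xFC)],
    "russian": [(0x30, 0x39), (0x0410, 0x044F), (0x0401, 0x0401), (0x0451, 0x0451)],
}

def build_pattern(whitelists):
    """Merge all per-language ranges into one regex character class."""
    # A set removes ranges that several languages share (e.g. the digits).
    ranges = sorted({r for lang_ranges in whitelists.values() for r in lang_ranges})
    parts = []
    for lo, hi in ranges:
        if lo == hi:
            parts.append(re.escape(chr(lo)))
        else:
            parts.append(re.escape(chr(lo)) + "-" + re.escape(chr(hi)))
    return re.compile("[" + "".join(parts) + "]+")

ALLOWED = build_pattern(WHITELISTS)

def is_allowed(text):
    """True if every character of text is in the combined whitelist."""
    return ALLOWED.fullmatch(text) is not None
```

With this table, `is_allowed("Jürgen")` and `is_allowed("Привет")` pass, while `is_allowed("<script>")` fails because `<` and `>` appear in no list.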

Is there a better method?

Depends on your definition of "better".
More secure? Probably not; white lists are much more secure than black lists.
Easier? Well, that depends on your definition of "sanitize". If you want to, e.g.,
eliminate x-site scripting, then you can simply remove those few characters
that are known to cause x-site scripting. There are modules to do that.
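
As a rough illustration of the "remove those few characters" approach, here is a Python sketch. The set in `MARKUP_CHARS` is an assumption about which characters matter for HTML output, and `strip_markup_chars` is a hypothetical helper. The second function shows a commonly used alternative that was not proposed above: escaping the characters on output with the standard library's `html.escape` instead of removing them.

```python
import html

# Assumed set of characters that are significant in HTML markup.
MARKUP_CHARS = "<>&\"'"

def strip_markup_chars(text):
    """Remove the assumed HTML-significant characters from text."""
    return text.translate({ord(c): None for c in MARKUP_CHARS})

def escape_for_html(text):
    """Alternative: keep the characters but escape them for HTML output."""
    return html.escape(text, quote=True)
```

Neither function depends on the language of the input, so text in any of the 15 languages passes through unchanged apart from those five characters.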

jue

Sam · Apr 5, 2007

( Discussion thread in Google Group -
http://groups.google.com/group/perl...8e68b039cd2/95f19485ac944239#95f19485ac944239
)

How can I do a similar sanitization for
inputs that can be in any of the 15 languages if the encoding is
UTF-8?

Jürgen Exner said:
Quite simple actually. Just take the superset of the white lists of
characters for each language.

Yes, I had something similar in mind too, but I am stuck on how to
actually implement it. Here's what I've picked up so far:

1. First make sure that input is UTF-8 encoded (
http://www.w3.org/International/questions/qa-forms-utf-8 )
2. Select the allowed characters in each language (all letters and
digits) and use them in a regex.
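
The two steps above could be sketched as follows in Python (chosen only for illustration). The function names are hypothetical, and the pattern `[^\W_]+` is a stand-in assumption for "letters and digits in any script"; it is not a per-language whitelist.

```python
import re

# Assumption: "allowed" means Unicode letters and digits in any script.
# In Python 3, \w is Unicode-aware for str patterns; the class below is
# \w minus the underscore.
ALLOWED = re.compile(r"[^\W_]+")

def decode_form_value(raw):
    """Step 1: accept the raw bytes only if they are valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None

def is_clean(raw):
    """Step 2: check the decoded text against the allowed characters."""
    text = decode_form_value(raw)
    return text is not None and ALLOWED.fullmatch(text) is not None
```

Decoding first means the regex works on characters rather than on individual UTF-8 bytes, so a multi-byte letter is tested as one unit.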

For implementing the second step, I found a useful UTF-8 encoding
table which has the UTF and Hex code for the characters in each
language ( http://www.utf8-chartable.de/ ).

Here's the problem: how do I identify the important characters of
a language I don't know? For example, I know some of the letters of
the Arabic alphabet, but don't really know if characters like (for example)
the ARABIC POETIC VERSE SIGN are necessary. Second, how do I use the
UTF-8 hex codes for the characters in a regex?
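
One possible way to approach both questions, sketched in Python as an assumption rather than an answer given in this thread: let the Unicode character database classify each character instead of judging it by hand, and write regex ranges with code point escapes (such as `\u0627`) on decoded text, not with the UTF-8 byte values from the chart. The helper names are hypothetical.

```python
import re
import unicodedata

def is_letter_or_digit(ch):
    """Classify by Unicode general category: any letter (L*) or decimal digit (Nd)."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category == "Nd"

# Code point escapes in a character class: the Arabic block, U+0600 to U+06FF.
ARABIC_BLOCK = re.compile(r"[\u0600-\u06FF]+")

def allowed_arabic(text):
    """True if text lies in the Arabic block and contains only letters and digits."""
    return (ARABIC_BLOCK.fullmatch(text) is not None
            and all(is_letter_or_digit(c) for c in text))
```

Under this rule ARABIC LETTER ALEF (U+0627) is accepted because it is classified as a letter, while ARABIC POETIC VERSE SIGN (U+060E) is rejected because it is classified as a symbol. Note that combining marks such as Arabic vowel signs are also rejected by this sketch, which may or may not be acceptable for the application.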

Somebody must have a better solution ... I do get the feeling that
this approach isn't great.

Jürgen Exner said:
Well, depends on your definition of "sanitize". If you want to e.g.
eliminate x-site scripting, ...

Yes, the final intention is to prevent x-site scripting, but I am not
aware of any widely used, popular modules for this.
