( Discussion thread in Google Group -
http://groups.google.com/group/perl...8e68b039cd2/95f19485ac944239#95f19485ac944239
)
How can I do a similar sanitization for
inputs that can be in any of the 15 languages if the encoding is
UTF-8?
Jürgen Exner said:
Quite simple actually. Just take the superset of the white lists of
characters for each language.
Yes, I had something similar in mind too, but I am stuck on how to
actually implement it. Here's what I've picked up so far:
1. First make sure that input is UTF-8 encoded (
http://www.w3.org/International/questions/qa-forms-utf-8 )
2. Select the allowed characters in each language (all letters and
digits) and use them in a regex.
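The two steps above can be sketched roughly as follows. This is only a minimal illustration in Python, not a tested solution for all 15 languages. The function names are made up for the example. Using Unicode general categories (letters, combining marks, decimal digits) as the whitelist is an assumption that stands in for hand-picked per-language lists.

```python
import unicodedata

def decode_utf8(raw: bytes):
    """Step 1: return the decoded text, or None if raw is not valid UTF-8."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None

def is_allowed(ch: str) -> bool:
    """Step 2: allow letters (L*), combining marks (M*), decimal digits (Nd)
    and the plain space. The choice of categories is an assumption."""
    cat = unicodedata.category(ch)
    return cat[0] in ("L", "M") or cat == "Nd" or ch == " "

def is_clean(text: str) -> bool:
    """True if every character of the decoded text is on the whitelist."""
    return all(is_allowed(ch) for ch in text)
```

Combining marks are included because several scripts write vowels as marks attached to a base letter, so a letters-only list would reject ordinary words.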
For implementing the second step, I found a useful UTF-8 encoding
table which lists the Unicode code point and the UTF-8 hex bytes for
the characters in each script ( http://www.utf8-chartable.de/ ).
Here's the problem: how do I identify the important characters of
a language I don't know? For example, I know some of the letters of
the Arabic alphabet, but don't really know if characters like (for
example) the ARABIC POETIC VERSE SIGN are necessary. Second, how do I
use the UTF-8 hex codes for the characters in a regex?
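On the second question, a regex normally matches decoded characters, so it is the Unicode code points (U+XXXX) from the chart that go into the pattern, not the UTF-8 hex bytes. A small Python sketch follows. The whitelist is hypothetical and covers only ASCII letters and digits, the space, and the Arabic block U+0600..U+06FF.

```python
import re
import unicodedata

# Hypothetical whitelist: ASCII letters and digits, space, Arabic block.
ALLOWED = re.compile(r"[0-9A-Za-z \u0600-\u06FF]+")

def matches_whitelist(text: str) -> bool:
    """True if the whole (already decoded) string uses only whitelisted characters."""
    return ALLOWED.fullmatch(text) is not None

# The chart lists ARABIC LETTER BEH as UTF-8 bytes d8 a8; decoding them
# gives the single code point U+0628, which is what the regex sees.
beh = b"\xd8\xa8".decode("utf-8")

# The general category shows whether a character is a letter or a symbol.
poetic_verse_sign_category = unicodedata.category("\u060e")
```

A whole-block range like this one also lets through symbols such as U+060E ARABIC POETIC VERSE SIGN, whose general category is So (symbol), not a letter. Looking up the category is one way to decide about characters in a script you cannot read.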
Somebody must have a better solution ... I do get the feeling that
this approach isn't great.
Jürgen Exner said:
Well, depends on your definition of "sanitize". If you want to e.g.
eliminate x-site scripting, ...
Yes, the final intention is to prevent x-site scripting, but I am not
aware of any widely used, popular modules for this.
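For the x-site scripting goal specifically, one common approach is to escape the text when it is written into the HTML page, rather than filtering characters on input. The thread does not name a module, so the sketch below is an assumption. It uses `html.escape` from Python's standard library, and `render_comment` is a made-up helper name.

```python
import html

def render_comment(user_text: str) -> str:
    """Wrap user text in a paragraph, escaping &, <, > and both quote characters."""
    return "<p>" + html.escape(user_text, quote=True) + "</p>"
```

Escaping on output leaves non-ASCII letters untouched, so it works the same way for all languages as long as the page is served as UTF-8.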