How to select international text characters in a regular expression

Amos · Jan 27, 2009

Hi,

I'm a perl newbie, so I apologize for this simple question.

I need to modify an existing perl script to handle languages other
than English (specifically Hebrew as a start).
The script has this line:

y/a-z0-9/ /cs; # retain only alpha-numeric entries

which replaces all non-alphanumeric chars with spaces.
I need it to keep Hebrew characters as well, but haven't been able to
find a regex that works.
I tried stuff like this:

y/a-z0-9א-ת/ /cs;
y/a-z0-9[\x{05D0}-\x{05EA}]/ /cs; # the Unicode code points of the first and last letters of the alphabet

But nothing seems to work.

Any ideas?

Thanks!
Amos

Peter Makholm · Jan 27, 2009

Amos said:
I need to modify an existing perl script to handle languages other
than English (specifically Hebrew as a start).
The script has this line:

y/a-z0-9/ /cs; # retain only alpha-numeric entries

Note: This is not a regular expression. Even though y/// looks like the
operators that use regular expressions, it does not use regular
expressions itself.
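To illustrate the point, here is a minimal sketch (the sample string and variable names are made up for illustration). Because y/// (also spelled tr///) takes a plain character list, square brackets are ordinary characters in it, not a character class. The \x{...} escapes and ranges do work, provided the string holds characters rather than raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical sample: Latin text, the Hebrew word "shalom" in brackets,
# digits and punctuation. The \x{...} escapes make this a character string.
my $text = "abc [\x{05E9}\x{05DC}\x{05D5}\x{05DD}] 123!";

# With brackets in the list, "[" and "]" are simply two more characters to keep.
(my $with_brackets = $text) =~ tr/a-z0-9[\x{05D0}-\x{05EA}]/ /cs;

# Without them, only a-z, 0-9 and alef..tav survive; every run of other
# characters is replaced by a single space (that is what /c and /s do).
(my $clean = $text) =~ tr/a-z0-9\x{05D0}-\x{05EA}/ /cs;

# Encode on output so the Hebrew letters print without "Wide character" warnings.
binmode(STDOUT, ':encoding(UTF-8)');
print "$with_brackets\n$clean\n";
```

In the first result the brackets are still present; in the second they have been turned into spaces along with the punctuation.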

which replaces all non-alphanumeric chars with spaces.
I need it to keep Hebrew characters as well, but haven't been able to
find a regex that works.

By using the substitution operator, which does use regular expressions, you
can match on Unicode properties:

s/\P{Letter}/ /g;

or, if you find it more readable, probably also

s/\p{^Letter}/ /g;

But both are untested...

//Makholm
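A small sketch of the property-based approach, assuming the text is already a Perl character string (the sample text and variable names are hypothetical). Note that \P{Letter} on its own would also blank out digits, which the original y/// line kept, so this version adds \p{Number} to a negated character class.

```perl
use strict;
use warnings;

# Hypothetical sample: Latin letters, the Hebrew word "shalom",
# punctuation and digits, written with escapes so it is a character string.
my $text = "abc, \x{05E9}\x{05DC}\x{05D5}\x{05DD}! 42";

# Keep letters and numbers of any script; the "+" collapses each run of
# other characters into one space, similar to the /s flag of y///.
(my $clean = $text) =~ s/[^\p{Letter}\p{Number}]+/ /g;

# A narrower variant that keeps only a-z, 0-9 and the Hebrew script.
(my $hebrew_only = $text) =~ s/[^a-z0-9\p{Hebrew}]+/ /g;

binmode(STDOUT, ':encoding(UTF-8)');
print "$clean\n$hebrew_only\n";
```

Both substitutions give the same result for this sample. They would differ on text containing, say, uppercase Latin or Cyrillic letters, which only the first one keeps.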

Amos · Jan 27, 2009

Thanks for the quick reply Makholm!

I tried both of your recommendations, but in both cases the Hebrew
letters are removed.

In addition, this character â€™ is replaced by this character ×’. Perhaps
I have some encoding issues?
The text is UTF-8 encoded and I'm using ActivePerl on Windows XP.

Maybe I can use a character range ([\x{###}-\x{###}]) instead of the
Unicode property?

Any ideas?

Thanks!

Amos

Peter Makholm · Jan 27, 2009

Amos said:
I tried both of your recommendations, but in both cases the Hebrew
letters are removed.

You might have a byte string holding UTF-8 encoded characters instead of
a real Perl character (utf8) string. Try doing something like:

use Encode 'decode_utf8';
$string = decode_utf8($string);

(from 'perldoc perluniintro')

Otherwise you would have to match on the raw UTF-8 byte sequences,
no matter which of the methods you're using.

//Makholm
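Putting the pieces together, a minimal sketch of the decode, clean, encode round trip. The byte string below is a hypothetical stand-in for data read from a file without an encoding layer, and the variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical input: the raw UTF-8 bytes for the Hebrew word "shalom"
# followed by ", abc!", as they would arrive from an undecoded filehandle.
my $bytes = "\xD7\xA9\xD7\x9C\xD7\x95\xD7\x9D, abc!";

# Decode the bytes into a character string: one element per character,
# so the four Hebrew letters are no longer eight separate bytes.
my $chars = decode_utf8($bytes);

# Unicode properties now see real letters instead of stray bytes.
(my $clean = $chars) =~ s/[^\p{Letter}\p{Number}]+/ /g;

# Encode back to UTF-8 bytes before writing to a handle without a layer.
my $out = encode_utf8($clean);
print "$out\n";
```

An alternative to calling decode_utf8 by hand is to open the file with an encoding layer, for example open(my $fh, '<:encoding(UTF-8)', $path), so that lines are decoded as they are read.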
