How to select international text characters in a regular expression

Amos

Hi,

I'm a perl newbie, so I apologize for this simple question.

I need to modify an existing perl script to handle languages other
than English (specifically Hebrew as a start).
The script has this line:

y/a-z0-9/ /cs; # retain only alpha-numeric entries

which replaces all non-alphanumeric chars with spaces.
I need it to keep Hebrew characters as well, but haven't been able to
find a regex that works.
I tried stuff like this:

y/a-z0-9א-ת/ /cs;
y/a-z0-9[\x{05D0}-\x{05EA}]/ /cs; # the Unicode code points for the first and last letters of the alphabet

But nothing seems to work.

Any ideas?

Thanks!
Amos
 
Peter Makholm

Amos said:
I need to modify an existing perl script to handle languages other
than English (specifically Hebrew as a start).
The script has this line:

y/a-z0-9/ /cs; # retain only alpha-numeric entries

Note: This is not a regular expression. Even though y/// looks like the
operators that use regular expressions, it does not use regular
expressions itself.
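To illustrate the note above, here is a minimal sketch of y/// (also spelled tr///) with a Hebrew range added. It assumes the text is already a decoded Perl character string; the sample string and variable names are made up for the example. Because y/// takes a plain character list rather than a regex, the range is written without [ ] (brackets would simply be treated as two more literal characters to keep).

```perl
use strict;
use warnings;

# Hypothetical sample: "abc", a comma, alef (U+05D0), tav (U+05EA), "!".
my $text = "abc,\x{05D0}\x{05EA}!";

# Keep a-z, 0-9 and the Hebrew letters alef..tav.
# /c complements the list, /s squeezes runs of the replacement character.
(my $kept = $text) =~ tr/a-z0-9\x{05D0}-\x{05EA}/ /cs;

binmode(STDOUT, ':encoding(UTF-8)');
print "$kept\n";
```

The comma and the exclamation mark each become a single space, while the Latin and Hebrew letters are left alone.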
which replaces all non-alphanumeric chars with spaces.
I need it to keep Hebrew characters as well, but haven't been able to
find a regex that works.

By using the substitution operator, which does use regular expressions, you
can match on Unicode properties:

s/\P{Letter}/ /g;

or, if you find it more readable, probably also

s/\p{^Letter}/ /g;

But both are untested...
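One thing to watch: \P{Letter} on its own also blanks out digits, which the original y/a-z0-9/ line kept. A minimal sketch that keeps both letters and numbers, assuming a decoded character string (the sample text and variable names are invented for the example):

```perl
use strict;
use warnings;

# Hypothetical sample: Latin text, digits, punctuation and the Hebrew
# word shin-lamed-vav-final mem.
my $text = "shalom 123, \x{05E9}\x{05DC}\x{05D5}\x{05DD}!";

# Replace every run of characters that are neither letters (\p{L})
# nor numbers (\p{N}) with a single space.
(my $clean = $text) =~ s/[^\p{L}\p{N}]+/ /g;

binmode(STDOUT, ':encoding(UTF-8)');
print "$clean\n";
```

Unlike the original a-z range, \p{L} also keeps uppercase letters and letters from every other script, so this is broader than the line it replaces.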

//Makholm
 
Amos

Thanks for the quick reply Makholm!

I tried both of your recommendations, but in both cases the Hebrew
letters are removed.

In addition, this character ’ is replaced by this character ג. Perhaps
I have some encoding issues?
The text is UTF-8 encoded and I'm using ActivePerl on Windows XP.

Maybe I can use a character range ([\x{###}-\x{###}]) instead of the
Unicode property?

Any ideas?

Thanks!

Amos
 
Peter Makholm

Amos said:
I tried both of your recommendations, but in both cases the Hebrew
letters are removed.

You might have a byte string containing UTF-8-encoded characters instead of
a real Perl character (utf8) string. Try doing something like:

use Encode 'decode_utf8';
$string = decode_utf8($string);

(from 'perldoc perluniintro')

Otherwise you would have to match on the raw UTF-8 byte sequences
instead, no matter which method you use.
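A minimal end-to-end sketch of the decoding step, assuming the input arrives as raw UTF-8 bytes (the sample bytes and variable names are made up for the example). Before decoding, each Hebrew letter is two separate bytes, and the first of them (0xD7) is not a letter when read as a character by itself, which would be consistent with the letters disappearing.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode_utf8);

# Hypothetical input: the raw UTF-8 bytes of alef (U+05D0) and
# tav (U+05EA), followed by "!".
my $bytes = "\xD7\x90\xD7\xAA!";

# Without decoding, Perl sees five separate bytes, not two letters.
printf "bytes: %d\n", length $bytes;

my $string = decode_utf8($bytes);    # bytes -> Perl character string
printf "characters: %d\n", length $string;

$string =~ s/\P{Letter}/ /g;         # Hebrew letters now match \p{Letter}

my $out = encode_utf8($string);      # back to UTF-8 bytes for output
print $out, "\n";
```

Encoding again before printing (or setting an output layer with binmode) keeps the output file in UTF-8 as well.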

//Makholm
 
