How to determine if a word has an extended character?



I have a file which contains just one word. My task is just to find
out if the word has any extended character. That's all.

I can use a regex, but I have not been able to find a pattern that
matches extended characters. Any hints?

For example, if the file content is: sample, then the Perl code prints
false; and if the file content is samplé, then the Perl code prints
true.


Jürgen Exner

[Interpreting 'extended' as non-ASCII]

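You could simply use the POSIX character class [:ascii:]. In Perl it is
written in lowercase and goes inside a bracketed class, so
/[[:^ascii:]]/ matches any non-ASCII character.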
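A minimal sketch of the POSIX-class approach, assuming the word is already in a scalar. The sub name has_non_ascii is made up for illustration, and é is written as \x{E9} so the example does not depend on the encoding of the source file:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains any character
# outside the ASCII range (code point 128 or above).
sub has_non_ascii {
    my ($word) = @_;
    return $word =~ /[[:^ascii:]]/ ? 1 : 0;
}

for my $word ("sample", "sampl\x{E9}") {
    my $answer = has_non_ascii($word) ? "true" : "false";
    print "$answer\n";
}
```

This should print false for the first word and true for the second.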

Another way would be to check each character and see whether its ord()
is less than 128. That should work at least for the most common
encodings such as ISO-Latin-1, Windows-1252, ...
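The ord() check could look roughly like this. It is only a sketch: has_high_char is a hypothetical name, and the string is assumed to hold one character (or one byte) per element, as it would after reading a Latin-1 file:

```perl
use strict;
use warnings;

# Hypothetical helper: walk the string one character at a time and
# report whether any ord() value is 128 or above.
sub has_high_char {
    my ($word) = @_;
    for my $char (split //, $word) {
        return 1 if ord($char) >= 128;
    }
    return 0;
}

for my $word ("sample", "sampl\x{E9}") {
    my $answer = has_high_char($word) ? "true" : "false";
    print "$answer\n";
}
```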

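Or: [untested]
# Note: anything that is not an ASCII letter counts as "extended" here,
# including digits and punctuation.
if (/^[A-Za-z]*$/) {
    print 'false';
} else {
    print 'true';
}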

You could probably also set your locale to EN-US and, with "use locale"
in effect, use
if (/\W/) {
    print 'true';
} else {
    print 'false';
}

All of these do somewhat different things, so you have some options to
choose the one that most closely matches your needs.
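To tie one of these options back to the original task, here is a hedged sketch that reads the word from a file. The sub name file_has_extended and the file name in the usage comment are placeholders, and the file is read as raw bytes on the assumption that any byte of 0x80 or above should count as extended:

```perl
use strict;
use warnings;

# Hypothetical helper: read the first line of the file as raw bytes and
# return 'true' if it holds any byte outside the ASCII range.
sub file_has_extended {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $word = <$fh>;
    close $fh;
    $word = '' unless defined $word;   # empty file
    chomp $word;
    return $word =~ /[^\x00-\x7F]/ ? 'true' : 'false';
}

# Usage: perl check.pl word.txt   (both names are placeholders)
if (@ARGV) {
    my $result = file_has_extended($ARGV[0]);
    print "$result\n";
}
```

Reading raw bytes means the check gives the same answer whether the file is stored as ISO-Latin-1 or UTF-8, since both represent é with bytes above 127.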


Hartmut Camphausen

$string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";

should do the trick.

This prints "has extended" if $string contains any characters other
([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
character class).

If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9].
If you want to include more "valid" characters, expand the [^...]
accordingly (note: if you want to include '-' as a valid character, put
it at the very end of the character list).
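As an illustration of expanding the list, the sketch below treats the hyphen as valid in addition to ASCII letters, digits and '_'. The sub name has_disallowed is hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: anything other than ASCII letters, digits,
# underscore and hyphen counts as "extended". The hyphen sits last in
# the class so it is taken literally rather than as a range operator.
sub has_disallowed {
    my ($word) = @_;
    return $word =~ /[^a-zA-Z0-9_-]/ ? 1 : 0;
}

for my $word ("well-known", "sampl\x{E9}") {
    my $message = has_disallowed($word) ? "has extended." : "OK.";
    print "$message\n";
}
```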

perldoc perlre
perldoc perlrequick
perldoc perlreref
perldoc perlretut

hth, Hartmut

John W. Krahn

Hartmut said:
$string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";

[^\w] is usually written as \W.

should do the trick.

This prints "has extended" if $string contains any characters other
([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
character class).

From perlre.pod:

If "use locale" is in effect, the list of alphabetic characters
generated by "\w" is taken from the current locale. See perllocale.

In other words, if your locale supports it then 'é' will be included in
\w.
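The fact that \w is not fixed to ASCII can be demonstrated without relying on which locales are installed. The sketch below does not use "use locale"; instead it assumes Perl 5.14 or later and uses the /u (Unicode rules) and /a (ASCII rules) regex modifiers to show the same effect:

```perl
use 5.014;   # the /u and /a modifiers need Perl 5.14 or later
use strict;
use warnings;

my $word = "sampl\x{E9}";   # "samplé", with é given by its code point

# /u forces Unicode rules: é is a letter, so \W finds nothing.
my $unicode_rules = $word =~ /\W/u ? "has extended" : "OK";

# /a restricts \w to [A-Za-z0-9_], so é is caught by \W.
my $ascii_rules = $word =~ /\W/a ? "has extended" : "OK";

print "Unicode rules: $unicode_rules\n";
print "ASCII rules:   $ascii_rules\n";
```

Under these assumptions the first line should report OK and the second has extended, which is why an explicit ASCII range or the /a modifier is the safer choice when "extended" means non-ASCII.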

If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]

[^a-zA-Z0-9] means any character that is *not* alphanumeric. You
probably meant [a-zA-Z0-9].

