How to determine if a word has an extended character?

ambarish.mitra

I have a file which contains just one word. My task is just to find
out whether the word has any extended characters. That's all.

I can use a regex, but I have not been able to find a regex pattern
for extended characters. Any hints?


For example, if the file content is: sample, then the Perl code prints
false; and if the file content is samplé, then the Perl code prints
true.

Thanks.
 
Jürgen Exner

I have a file which contains just one word. My task is just to find
out if the word has any extended character. Thats all.

I can use regex, but am not able to find out a regex pattern for
extended character. Any hints?

[Interpreting 'extended' as non-ASCII]

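You could simply use the POSIX character class [:ascii:] (the name is
lowercase and has to sit inside a bracketed class), for example
/[[:^ascii:]]/ to match any character that is not ASCII.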

Another way would be to check, for each character, whether its ord() is
less than 128. That should work at least for the most common single-byte
encodings like ISO-Latin-1, Windows-1252, ...
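
A minimal sketch of these two checks, assuming 'extended' means any
character with a code above 127. The sub names are made up for
illustration, and the test word is written with an escape (\x{e9} for
'é') so the example does not depend on how the source file is encoded:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $word has a character outside 7-bit ASCII, else 0.
# Uses the negated POSIX class inside a bracketed character class.
sub has_non_ascii_class {
    my ($word) = @_;
    return $word =~ /[[:^ascii:]]/ ? 1 : 0;
}

# Same test done by hand: inspect the ord() value of each character.
sub has_non_ascii_ord {
    my ($word) = @_;
    for my $char (split //, $word) {
        return 1 if ord($char) > 127;
    }
    return 0;
}

for my $word ("sample", "sampl\x{e9}") {
    printf "%s %s\n",
        has_non_ascii_class($word) ? 'true' : 'false',
        has_non_ascii_ord($word)   ? 'true' : 'false';
}
```

Both subs agree on these two inputs: 'false' for the plain word and
'true' for the accented one.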

Or: [untested]
    if (/^[A-Za-z]*$/) {
        print 'false';
    } else {
        print 'true';
    }

You could probably also set your locale to en_US, put "use locale" in
effect, and use
    if (/\W/) {
        print 'true';
    } else {
        print 'false';
    }

All of these do somewhat different things, so you have some options to
choose the one that most closely matches your needs.

jue
 
Hartmut Camphausen

ambarish.mitra said:
I have a file which contains just one word. My task is just to find
out if the word has any extended character. Thats all.

I can use regex, but am not able to find out a regex pattern for
extended character. Any hints?


For example, if the file content is: sample, then the Perl code prints
false; and if the file content is samplé, then the Perl code prints
true.


$string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";

should do the trick.

This prints "has extended" if $string contains any characters other
([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
character class).

If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9].
If you want to include more "valid" characters, expand the [^...]
accordingly (note: if you want to include '-' as a valid character, put
it at the very end of the character list).
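
A short sketch of such an expanded class, assuming letters, digits, '_'
and '-' are the characters to be treated as valid. The sub name is
hypothetical, and 'é' is written as \x{e9} so the example does not depend
on the source file's encoding:

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $word contains anything outside the allowed
# set, else 0. The '-' comes last in the class, so it is a literal hyphen
# rather than the start of a range.
sub has_disallowed {
    my ($word) = @_;
    return $word =~ /[^a-zA-Z0-9_-]/ ? 1 : 0;
}

for my $word ("sample", "semi-detached", "sampl\x{e9}") {
    print has_disallowed($word) ? "has extended.\n" : "OK.\n";
}
```

With the hyphen in the class, "semi-detached" passes, while the accented
word is still reported.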

See
perldoc perlre
perldoc perlrequick
perldoc perlreref
perldoc perlretut



hth, Hartmut
 
John W. Krahn

Hartmut said:
ambarish.mitra said:
I have a file which contains just one word. My task is just to find
out if the word has any extended character. Thats all.

I can use regex, but am not able to find out a regex pattern for
extended character. Any hints?


For example, if the file content is: sample, then the Perl code prints
false; and if the file content is samplé, then the Perl code prints
true.


$string =~ m/[^\w]/ ? print "\nhas extended." : print "\nOK.";

[^\w] is usually written as \W.

should do the trick.

This prints "has extended" if $string contains any characters other
([^...]) than 'a' to 'z', 'A' to 'Z', '0' to '9' plus '_' (the \w
character class).

From perlre.pod:

<QUOTE>
If "use locale" is in effect, the list of alphabetic characters
generated by "\w" is taken from the current locale. See perllocale.
</QUOTE>

In other words, if your locale supports it, then 'é' will be included in \w.
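
Locale output varies from system to system, so here is a sketch of the
same effect using the /a and /u regex modifiers instead of "use locale".
It assumes Perl 5.14 or later, and the variable names are only for
illustration. The point is that whether \W flags 'é' depends on which
rules are in effect:

```perl
use strict;
use warnings;
use 5.014;    # the /a and /u regex modifiers need Perl 5.14 or later

my $word = "sampl\x{e9}";    # "samplé", written with an escape

# /a restricts \w to ASCII [A-Za-z0-9_], so 'é' counts as a non-word character.
my $ascii_rules = $word =~ /\W/a ? 1 : 0;

# /u applies Unicode rules, under which 'é' is a word character.
my $unicode_rules = $word =~ /\W/u ? 1 : 0;

print "ASCII rules:   ", ($ascii_rules   ? "has extended." : "OK."), "\n";
print "Unicode rules: ", ($unicode_rules ? "has extended." : "OK."), "\n";
```

Under ASCII rules the word is reported as having an extended character;
under Unicode rules it is not, which is why \W alone is not a reliable
test unless you pin down the rules.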

If you want to exclude the '_' (contained in \w), use [^a-zA-Z0-9]

[^a-zA-Z0-9] means any character that is *not* alphanumeric. You
probably meant [a-zA-Z0-9].



John
 
