Thanks for all the suggestions.
Please don't top-post. Quote the relevant parts of the posting you are
replying to and write your answers below each part.
What I wanted is, for example, given the text piece below:
There is a 中国人 in the park.
So how to scratch the gb2312 word of 中国人 from the text?
There isn't a "gb2312 word" in the text. The whole text is gb2312.
You want to distinguish the Chinese characters from the Latin
characters.
I think in GB2312 this is easy: Just search for pairs of bytes with the
high bit set.
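A minimal sketch of that byte-level search, assuming the input is GB2312 in its usual EUC-CN form, where both bytes of a two-byte character lie in the range 0xA1..0xFE. The sub name gb2312_runs and the sample string are made up for illustration:

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: return the raw bytes of every run of two-byte
# characters in $bytes. Assumes $bytes is valid EUC-CN encoded GB2312.
sub gb2312_runs {
    my ($bytes) = @_;
    # ASCII bytes are below 0x80, so a run of high-bit bytes always
    # starts on a character boundary in valid input.
    return $bytes =~ /((?:[\xA1-\xFE][\xA1-\xFE])+)/g;
}

# "There is a 中国人 in the park." with the Chinese part as GB2312 bytes.
my $text = "There is a \xD6\xD0\xB9\xFA\xC8\xCB in the park.";
for my $run (gb2312_runs($text)) {
    # Print the length and a hex dump of each run found.
    printf "%d bytes: %s\n", length($run), uc unpack("H*", $run);
}
```

Note that this finds every two-byte character, including the punctuation, Greek, Cyrillic and kana that GB2312 also defines. If only Han characters are wanted, narrowing the first byte to 0xB0..0xF7 (rows 16 to 87) should do it.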
But in general I would convert the whole text to Unicode and check the
character properties. This works for *all* encodings, no matter how
complicated they are:
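#!/usr/bin/perl
use warnings;
use strict;

binmode STDIN, ":encoding(GB2312)"; # input is GB2312
binmode STDOUT, ":encoding(UTF-8)"; # my terminal is UTF-8

# With the encoding layer in place, read() returns one decoded
# character at a time, not one byte.
while (read(STDIN, my $char, 1)) {
    my $classes = "";
    # Check the character against each Unicode script of interest.
    for my $class (qw(Han Latin)) {
        if ($char =~ /\p{$class}/) {
            $classes .= " $class";
        }
    }
    print "$char - $classes\n";
}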
__END__
Prints for a file containing "There is a 中国人 in the park." in GB2312:
T - Latin
h - Latin
e - Latin
r - Latin
e - Latin
-
i - Latin
s - Latin
-
a - Latin
-
中 - Han
国 - Han
人 - Han
-
i - Latin
n - Latin
-
t - Latin
h - Latin
e - Latin
-
p - Latin
a - Latin
r - Latin
k - Latin
. -
-
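To actually pull the Chinese words out, not just classify each character, the same idea can be sketched like this: decode to Unicode, then match runs of \p{Han}. This assumes the Encode module (part of core Perl) is available; the sub name han_runs and the sample string are only for illustration:

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Encode qw(decode);

# Hypothetical helper: decode $bytes from $encoding and return every
# run of consecutive Han characters as a Perl character string.
sub han_runs {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);
    return $text =~ /(\p{Han}+)/g;
}

binmode STDOUT, ":encoding(UTF-8)"; # assuming a UTF-8 terminal

# "There is a 中国人 in the park." with the Chinese part as GB2312 bytes.
my $line = "There is a \xD6\xD0\xB9\xFA\xC8\xCB in the park.";
print "$_\n" for han_runs($line, "GB2312");
```

For the sample line this should print 中国人. Since the matching happens after decoding, only the encoding name passed to decode() needs to change for input in another encoding.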
hp