How to identify double bytes language?

sqlcamel

Hello,

I have a text file, there are some double-bytes words in it, like
Chinese, Japanese.
Is there a way to identify them separately with Perl? Thanks.
 
Dr.Ruud

sqlcamel said:
I have a text file, there are some double-bytes words in it, like
Chinese, Japanese.
Is there a way to identify them separately with Perl? Thanks.

See
`perldoc perlopentut`,
`perldoc -f open`,
`perldoc open`,
`perldoc PerlIO`
and look for "layer".
 
Dr.Ruud

Ben said:
Dr.Ruud:

IMHO you should start with perldoc perlunitut and perldoc perlunicode.

I don't understand. Maybe you thought that UTF-16 was meant?

The data in the "double-byte" encoded files (probably Shift-JIS, GB2312
or Big5) will just become normal Perl strings if the right IO-layer is used.

After that, some basic Unicode knowledge will of course help.
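As a minimal sketch of the IO-layer approach, assuming the data is GB2312 in its usual EUC-CN form (the sample text and the in-memory handle are illustrative, not taken from the original file):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Assumption: the input is GB2312 in its usual EUC-CN form. The sample bytes
# are built here only to keep the sketch self-contained.
my $bytes = encode("euc-cn", "There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.\n");

# Open the raw bytes through a decoding layer. An in-memory handle stands in
# for a real file; for a file, pass its name instead of \$bytes.
open(my $in, "<:encoding(euc-cn)", \$bytes) or die "Cannot open: $!";
my $line = <$in>;
close($in);
chomp $line;

# After decoding, length() counts characters rather than bytes.
printf "bytes: %d, characters: %d\n", length($bytes) - 1, length($line);
```

Once the layer has decoded the input, each Chinese character is a single character in the string, so regexes and `length` operate on characters.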
 
Ilya Zakharevich

Hello,

I have a text file, there are some double-bytes words in it, like
Chinese, Japanese.
Is there a way to identify them separately with Perl? Thanks.

As you can see, the posters may be confused about the meaning of your
question.

Myself, I think your question is about "how to guess which encoding it
is?". But please be more specific...

Ilya
 
sqlcamel

Thanks for all the suggestions.
What I wanted is, for example, given the text piece below:

There is a 中国人 in the park.

So how to scratch the gb2312 word of 中国人 from the text?

Thanks again.
 
Peter J. Holzer

Thanks for all the suggestions.

Please don't top-post. Quote the relevant parts of the posting you are
replying to and write your answers below each part.
What I wanted is, for example, given the text piece below:

There is a 中国人 in the park.

So how to scratch the gb2312 word of 中国人 from the text?

There isn't a "gb2312 word" in the text. The whole text is gb2312.

You want to distinguish the Chinese characters from the Latin
characters.

I think in GB2312 this is easy: Just search for pairs of bytes with the
high bit set.
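A rough sketch of that byte-pair search, assuming the data is GB2312 in EUC-CN form, where every double-byte character consists of two bytes in the range 0xA1-0xFE and ASCII bytes stay below 0x80 (the variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Raw EUC-CN bytes, built with Encode so the sketch is self-contained.
my $bytes = encode("euc-cn", "There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.");

# Collect every run of byte pairs in which both bytes have the high bit set.
# This operates on undecoded bytes and assumes well-formed EUC-CN input.
my @runs = $bytes =~ /((?:[\xA1-\xFE]{2})+)/g;

for my $run (@runs) {
    my $hex = join " ", map { sprintf "%02X", ord } split //, $run;
    printf "%d bytes: %s\n", length($run), $hex;
}
```

This only works for encodings where the second byte can never look like ASCII; in Shift-JIS or Big5 the trailing byte may fall in the ASCII range, so the pattern would need to change.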

But in general I would convert the whole text to Unicode and check the
character properties. This works for *all* encodings, no matter how
complicated they are:

#!/usr/bin/perl
use warnings;
use strict;

binmode STDIN, ":encoding(GB2312)"; # input is GB2312
binmode STDOUT, ":encoding(UTF-8)"; # my terminal is UTF-8

# With the encoding layer in place, read() returns one character
# (not one byte) per call.
while (read(STDIN, my $char, 1)) {
    my $classes = "";
    # Record every script property the character matches.
    for my $class (qw(Han Latin)) {
        if ($char =~ /\p{$class}/) {
            $classes .= " $class";
        }
    }
    print "$char - $classes\n";
}
__END__

Prints for a file containing "There is a 中国人 in the park." in GB2312:


T - Latin
h - Latin
e - Latin
r - Latin
e - Latin
-
i - Latin
s - Latin
-
a - Latin
-
中 - Han
国 - Han
人 - Han
-
i - Latin
n - Latin
-
t - Latin
h - Latin
e - Latin
-
p - Latin
a - Latin
r - Latin
k - Latin
. -

-


hp
 
Jürgen Exner

[Please no TOFU, trying to repair]
sqlcamel said:
What I wanted is, for example, given the text piece below:

There is a 中国人 in the park.

So how to scratch the gb2312 word of 中国人 from the text?

gb2312 is a character set; it includes at least Chinese as well as Latin
characters. Therefore all of your text is gb2312, not just that word.

Now, having said that, your real task seems to be to distinguish between
Latin/ASCII/... and non-Latin/ASCII/... characters.
There are several POSIX classes in the regular expressions that will
help you with that, please check 'perldoc perlre' for what is most
suitable for you.
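One possible sketch using the POSIX class `[[:ascii:]]` and its negation `[[:^ascii:]]`, assuming the text has already been decoded into a Perl character string (the sample string is illustrative):

```perl
use strict;
use warnings;

binmode STDOUT, ":encoding(UTF-8)"; # assumption: the terminal is UTF-8

# Already-decoded sample text; the escapes are the three Han characters.
my $text = "There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.";

# Runs of characters outside ASCII, and runs of ASCII characters.
my @non_ascii = $text =~ /([[:^ascii:]]+)/g;
my @ascii     = $text =~ /([[:ascii:]]+)/g;

print "non-ASCII: [$_]\n" for @non_ascii;
print "ASCII:     [$_]\n" for @ascii;
```

Note that "non-ASCII" is broader than "Chinese": accented Latin letters would also land in the first list.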

jue
 
Dr.Ruud

Ben said:
Dr.Ruud:

No, they will become SvUTF8 strings, which (shouldn't, but do) behave
differently from byte strings under some circumstances.

Please Ben, stop messing things up. I said Perl strings, not byte
strings. The unit of Perl strings is characters, not bytes.
 
Peter J. Holzer

But in general I would convert the whole text to Unicode and check the
character properties. This works for *all* encodings, no matter how
complicated they are: [...]
    for my $class (qw(Han Latin)) {
        if ($char =~ /\p{$class}/) {

Forgot to add: The full list of properties can be found in
perldoc perlunicode.
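To pull out whole words rather than classify single characters, the same property can be used with a quantifier. A small sketch, assuming EUC-CN input; the helper name `han_runs` is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

binmode STDOUT, ":encoding(UTF-8)"; # assumption: the terminal is UTF-8

# Hypothetical helper: decode EUC-CN (GB2312) bytes and return every
# contiguous run of Han characters.
sub han_runs {
    my ($bytes) = @_;
    my $text = decode("euc-cn", $bytes);
    return $text =~ /(\p{Han}+)/g;
}

# Sample bytes, built here only to keep the sketch self-contained.
my $sample = encode("euc-cn", "There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.");
my @words  = han_runs($sample);

print "$_\n" for @words;
```

Swapping `\p{Han}` for another script property such as `\p{Hiragana}` or `\p{Katakana}` would select Japanese kana instead.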

hp
 
