How to identify double bytes language?

sqlcamel · Nov 13, 2009

Hello,

I have a text file, there are some double-bytes words in it, like
Chinese, Japanese.
Is there a way to identify them separately with Perl? Thanks.

Dr.Ruud · Nov 13, 2009

sqlcamel said:
I have a text file, there are some double-bytes words in it, like
Chinese, Japanese.
Is there a way to identify them separately with Perl? Thanks.

See
`perldoc perlopentut`,
`perldoc -f open`,
`perldoc open`,
`perldoc PerlIO`
and look for "layer".
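To make the "layer" idea concrete, here is a minimal sketch, assuming the data is GB2312 encoded. It builds the bytes in memory and opens them as an in-memory file so it runs on its own; the helper name `read_decoded` is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: read a byte buffer through a decoding layer
# and return the decoded character string.
sub read_decoded {
    my ($bytes, $encoding) = @_;
    open(my $in, "<:encoding($encoding)", \$bytes) or die "open: $!";
    local $/;                       # slurp the whole buffer
    my $text = <$in>;
    close $in;
    return $text;
}

# GB2312 bytes for "a " followed by three Chinese characters.
my $bytes = encode('GB2312', "a \x{4E2D}\x{56FD}\x{4EBA}");
my $text  = read_decoded($bytes, 'GB2312');

# Print each character's code point; the Chinese ones are above U+00FF.
printf "U+%04X\n", ord($_) for split //, $text;
```

For a real file, the same layer should work with a file name in place of the scalar reference, e.g. `open(my $in, '<:encoding(GB2312)', $filename)`.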

Dr.Ruud · Nov 13, 2009

Ben said:
Dr.Ruud:

IMHO you should start with perldoc perlunitut and perldoc perlunicode.

I don't understand. Maybe you thought that UTF-16 was meant?

The data in the "double-byte" encoded files (probably Shift-JIS, GB2312
or Big5) will just become normal Perl strings if the right IO-layer is used.

After that, some basic Unicode knowledge will of course help.

Ilya Zakharevich · Nov 13, 2009

Hello,

I have a text file, there are some double-bytes words in it, like
Chinese, Japanese.
Is there a way to identify them separately with Perl? Thanks.

As you can see, the posters may be confused about the meaning of your
question.

Myself, I think your question is about "how to guess which encoding it
is?". But please be more specific...

Ilya

sqlcamel · Nov 14, 2009

Thanks for all the suggestions.
What I wanted is, for example, given the text piece below:

There is a 中国人 in the park.

So how to scratch the gb2312 word of 中国人 from the text?

Thanks again.

Peter J. Holzer · Nov 14, 2009

Thanks for all the suggestions.

Please don't top-post. Quote the relevant parts of the posting you are
replying to and write your answers below each part.

What I wanted is, for example, given the text piece below:

There is a 中国人 in the park.

So how to scratch the gb2312 word of 中国人 from the text?

There isn't a "gb2312 word" in the text. The whole text is gb2312.

You want to distinguish the Chinese characters from the Latin
characters.

I think in GB2312 this is easy: Just search for pairs of bytes with the
high bit set.
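A rough sketch of that byte-level search, working on the raw, undecoded bytes. The helper name `high_bit_runs` is invented here, and the sample bytes are the GB2312 encoding of the three Chinese characters from the question. It assumes plain GB2312 where every non-ASCII character is exactly two high-bit bytes.

```perl
use strict;
use warnings;

# Return every run of byte pairs in which both bytes have the high bit set.
# Operates on raw bytes, so the input must NOT be decoded first.
sub high_bit_runs {
    my ($bytes) = @_;
    return $bytes =~ /((?:[\x80-\xFF]{2})+)/g;
}

# "There is a <three Chinese characters> in the park." as GB2312 bytes.
my $line = "There is a \xD6\xD0\xB9\xFA\xC8\xCB in the park.";

for my $run (high_bit_runs($line)) {
    printf "%d bytes: %s\n", length($run), uc(unpack('H*', $run));
}
# prints "6 bytes: D6D0B9FAC8CB"
```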

But in general I would convert the whole text to Unicode and check the
character properties. This works for *all* encodings, no matter how
complicated they are:

#!/usr/bin/perl
use warnings;
use strict;

binmode STDIN, ":encoding(GB2312)";   # input is GB2312
binmode STDOUT, ":encoding(UTF-8)";   # my terminal is UTF-8

while (read(STDIN, my $char, 1)) {
    my $classes = "";
    for my $class (qw(Han Latin)) {
        if ($char =~ /\p{$class}/) {
            $classes .= " $class";
        }
    }
    print "$char - $classes\n";
}
__END__

Prints for a file containing "There is a 中国人 in the park." in GB2312:

T - Latin
h - Latin
e - Latin
r - Latin
e - Latin
-
i - Latin
s - Latin
-
a - Latin
-
中 - Han
国 - Han
人 - Han
-
i - Latin
n - Latin
-
t - Latin
h - Latin
e - Latin
-
p - Latin
a - Latin
r - Latin
k - Latin
. -

-
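To pull the Chinese word out as a whole rather than classify one character at a time, the same property can be matched as a run. This is only a sketch: `han_runs` is a made-up helper name, and it decodes a byte string with `Encode::decode` instead of using an I/O layer so that it runs without an input file.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: decode the bytes, then return each run of Han characters.
sub han_runs {
    my ($bytes, $encoding) = @_;
    my $text = decode($encoding, $bytes);
    return $text =~ /(\p{Han}+)/g;
}

binmode STDOUT, ':encoding(UTF-8)';   # assumes a UTF-8 terminal

# "There is a <three Chinese characters> in the park." as GB2312 bytes.
my $line = "There is a \xD6\xD0\xB9\xFA\xC8\xCB in the park.";
print "$_\n" for han_runs($line, 'GB2312');
```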

hp

Jürgen Exner · Nov 14, 2009

[Please no TOFU, trying to repair]

sqlcamel said:
What I wanted is, for example, given the text piece below:

There is a 中国人 in the park.

So how to scratch the gb2312 word of 中国人 from the text?

gb2312 is a character set; it includes at least Chinese as well as Latin
characters. Therefore all of your text is gb2312, not just that word.

Now, having said that, your real task seems to be to distinguish between
Latin/ASCII/... and non-Latin/ASCII/... characters.
There are several POSIX classes in the regular expressions that will
help you with that, please check 'perldoc perlre' for what is most
suitable for you.
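As one possible illustration, the POSIX class `[[:ascii:]]` and its negation can split decoded text into ASCII and non-ASCII runs. The helper name `ascii_and_other_runs` is invented for this sketch, and it assumes the text has already been decoded from GB2312 into a Perl character string.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Split decoded text into runs that are entirely ASCII or entirely non-ASCII.
sub ascii_and_other_runs {
    my ($text) = @_;
    return $text =~ /([[:ascii:]]+|[^[:ascii:]]+)/g;
}

binmode STDOUT, ':encoding(UTF-8)';   # assumes a UTF-8 terminal

# Sample sentence from the question, given as GB2312 bytes and decoded.
my $text = decode('GB2312', "There is a \xD6\xD0\xB9\xFA\xC8\xCB in the park.");

for my $run (ascii_and_other_runs($text)) {
    my $kind = $run =~ /^[[:ascii:]]/ ? 'ASCII' : 'other';
    print "$kind: [$run]\n";
}
```

This only separates ASCII from everything else; it does not say which script the non-ASCII characters belong to.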

jue

Dr.Ruud · Nov 14, 2009

Ben said:
Dr.Ruud:

No, they will become SvUTF8 strings, which (shouldn't, but do) behave
differently from byte strings under some circumstances.

Please Ben, stop messing things up. I said Perl strings, not byte
strings. The unit of Perl strings is characters, not bytes.
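A small sketch of the character/byte distinction, assuming GB2312 as the encoding: the same word has a different `length` depending on whether it is held as a decoded character string or as encoded bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $chars = "\x{4E2D}\x{56FD}\x{4EBA}";   # three characters, given as code points
my $bytes = encode('GB2312', $chars);     # the same word as encoded bytes

print length($chars), " characters\n";    # 3 characters
print length($bytes), " bytes\n";         # 6 bytes

# Decoding the bytes gives back the original character string.
print decode('GB2312', $bytes) eq $chars ? "round trip ok\n" : "mismatch\n";
```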

Peter J. Holzer · Nov 14, 2009

But in general I would convert the whole text to Unicode and check the
character properties. This works for *all* encodings, no matter how
complicated they are: [...]
for my $class (qw(Han Latin)) {
if ($char =~ /\p{$class}/) {

Forgot to add: The full list of properties can be found in
perldoc perlunicode.
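For instance, other script properties from that list can be checked the same way. The sketch below counts characters per script; the helper name `script_counts` and the choice of four scripts are illustrative assumptions, and the sample text is given as code points so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Count characters per script; the script list is an illustrative choice.
sub script_counts {
    my ($text) = @_;
    my %count;
    for my $char (split //, $text) {
        for my $script (qw(Han Hiragana Katakana Latin)) {
            $count{$script}++ if $char =~ /\p{$script}/;
        }
    }
    return \%count;
}

# "Tokyo", a space, two kanji, one hiragana and three katakana characters.
my $text   = "Tokyo \x{6771}\x{4EAC}\x{3067}\x{30C6}\x{30B9}\x{30C8}";
my $counts = script_counts($text);
print "$_: $counts->{$_}\n" for sort keys %$counts;
# prints "Han: 2", "Hiragana: 1", "Katakana: 3", "Latin: 5"
```

Han characters are shared between Chinese and Japanese text, so Han alone does not tell the two apart; the presence of Hiragana or Katakana is a reasonable hint that the text is Japanese.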

hp
