How to identify double bytes language?

Discussion in 'Perl Misc' started by sqlcamel, Nov 13, 2009.

  1. sqlcamel

    sqlcamel Guest

    Hello,

    I have a text file, there are some double-bytes words in it, like
    Chinese, Japanese.
    Is there a way to identify them separately with Perl? Thanks.
    sqlcamel, Nov 13, 2009
    #1

  2. Dr.Ruud

    Dr.Ruud Guest

    sqlcamel wrote:

    > I have a text file, there are some double-bytes words in it, like
    > Chinese, Japanese.
    > Is there a way to identify them separately with Perl? Thanks.


    See
    `perldoc perlopentut`,
    `perldoc -f open`,
    `perldoc open`,
    `perldoc PerlIO`
    and look for "layer".
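
    A minimal sketch of what an encoding layer does, assuming the file is
    GB2312 (which Encode calls "euc-cn"). The temporary file and its contents
    are made up for illustration and stand in for the real text file:

    ```perl
    use strict;
    use warnings;
    use File::Temp qw(tempfile);

    # Write a small GB2312 sample to a temporary file (stand-in for the real file).
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;    # raw bytes, no layer
    print {$fh} "There is a \xD6\xD0\xB9\xFA\xC8\xCB in the park.\n";
    close $fh or die "close: $!";

    # Reopen with an encoding layer: lines now arrive as decoded character strings.
    open my $in, '<:encoding(euc-cn)', $path or die "open $path: $!";
    my $line = <$in>;
    close $in;
    chomp $line;

    printf "%d characters read\n", length $line;
    ```

    The six bytes of the Chinese word come back as three characters, so
    length, substr and regexes all count characters from here on.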

    --
    Ruud
    Dr.Ruud, Nov 13, 2009
    #2

  3. Dr.Ruud

    Dr.Ruud Guest

    Ben Morrow wrote:
    > Dr.Ruud:


    >>> I have a text file, there are some double-bytes words in it, like
    >>> Chinese, Japanese.
    >>> Is there a way to identify them separately with Perl? Thanks.

    >> See
    >> `perldoc perlopentut`,
    >> `perldoc -f open`,
    >> `perldoc open`,
    >> `perldoc PerlIO`
    >> and look for "layer".

    >
    > IMHO you should start with perldoc perlunitut and perldoc perlunicode.


    I don't understand. Maybe you thought that UTF-16 was meant?

    The data in the "double-byte" encoded files (probably Shift-JIS, GB2312
    or Big5) will just become normal Perl strings if the right IO-layer is used.

    After that, some basic Unicode knowledge will of course help.

    --
    Ruud
    Dr.Ruud, Nov 13, 2009
    #3
  4. On 2009-11-13, sqlcamel <> wrote:
    > Hello,
    >
    > I have a text file, there are some double-bytes words in it, like
    > Chinese, Japanese.
    > Is there a way to identify them separately with Perl? Thanks.


    As you can see, the posters may be confused about the meaning of your
    question.

    Myself, I think your question is about "how to guess which encoding it
    is?". But please be more specific...
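
    If the question really is about guessing, one rough approach is to try a
    strict decode under each candidate encoding and keep the ones that do not
    fail. This is only a sketch: plausible_encodings is a hypothetical helper,
    the candidate list is an assumption, and pure-ASCII input will pass under
    every candidate. The core Encode::Guess module offers a similar heuristic.

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Hypothetical helper: return the candidate encodings under which
    # $bytes decodes without error.
    sub plausible_encodings {
        my ($bytes, @candidates) = @_;
        my @ok;
        for my $name (@candidates) {
            my $copy = $bytes;    # decode() with a CHECK value may modify its argument
            my $text = eval { decode($name, $copy, Encode::FB_CROAK()) };
            push @ok, $name if defined $text;
        }
        return @ok;
    }

    my @names = plausible_encodings("\xD6\xD0\xB9\xFA\xC8\xCB", 'UTF-8', 'euc-cn');
    print "plausible: @names\n";
    ```

    Several double-byte encodings often accept the same bytes, so the result
    narrows the choice down rather than settling it.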

    Ilya
    Ilya Zakharevich, Nov 13, 2009
    #4
  5. sqlcamel

    sqlcamel Guest

    Thanks for all the suggestions.
    What I wanted is, for example, given the text piece below:

    There is a 中国人 in the park.

    So how to scratch the gb2312 word of 中国人 from the text?

    Thanks again.


    On Nov 14, 5:58 am, Ilya Zakharevich <> wrote:
    > On 2009-11-13, sqlcamel <> wrote:
    >
    > > Hello,

    >
    > > I have a text file, there are some double-bytes words in it, like
    > > Chinese, Japanese.
    > > Is there a way to identify them separately with Perl? Thanks.

    >
    > As you can see, the posters may be confused about the meaning of your
    > question.
    >
    > Myself, I think your question is about "how to guess which encoding it
    > is?". But please be more specific...
    >
    > Ilya
    sqlcamel, Nov 14, 2009
    #5
  6. On 2009-11-14 03:31, sqlcamel <> wrote:
    > Thanks for all the suggestions.


    Please don't top-post. Quote the relevant parts of the posting you are
    replying to and write your answers below each part.

    > What I wanted is, for example, given the text piece below:
    >
    > There is a 中国人 in the park.
    >
    > So how to scratch the gb2312 word of 中国人 from the text?


    There isn't a "gb2312 word" in the text. The whole text is gb2312.

    You want to distinguish the Chinese characters from the Latin
    characters.

    I think in GB2312 this is easy: Just search for pairs of bytes with the
    high bit set.
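
    A rough sketch of that byte-level search, assuming the input is valid
    GB2312 (EUC-CN) read as raw bytes with no encoding layer.
    double_byte_runs is a hypothetical helper name:

    ```perl
    use strict;
    use warnings;

    # Return each run of double-byte characters in a raw GB2312 byte string.
    # Every double-byte character is a pair of bytes with the high bit set.
    sub double_byte_runs {
        my ($bytes) = @_;
        return $bytes =~ /((?:[\x80-\xFF]{2})+)/g;
    }

    my @runs = double_byte_runs("There is a \xD6\xD0\xB9\xFA\xC8\xCB in the park.");
    print unpack('H*', $_), "\n" for @runs;    # prints d6d0b9fac8cb
    ```

    This relies on ASCII bytes never having the high bit set, and it breaks
    as soon as the data is in a different encoding.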

    But in general I would convert the whole text to Unicode and check the
    character properties. This works for *all* encodings, no matter how
    complicated they are:

    #!/usr/bin/perl
    use warnings;
    use strict;

    binmode STDIN, ":encoding(GB2312)"; # input is GB2312
    binmode STDOUT, ":encoding(UTF-8)"; # my terminal is UTF-8

    # Read one decoded character at a time and list the scripts it matches.
    while (read(STDIN, my $char, 1)) {
        my $classes = "";
        for my $class (qw(Han Latin)) {
            if ($char =~ /\p{$class}/) {
                $classes .= " $class";
            }
        }
        print "$char - $classes\n";
    }
    __END__

    Prints for a file containing "There is a 中国人 in the park." in GB2312:


    T - Latin
    h - Latin
    e - Latin
    r - Latin
    e - Latin
    -
    i - Latin
    s - Latin
    -
    a - Latin
    -
    中 - Han
    国 - Han
    人 - Han
    -
    i - Latin
    n - Latin
    -
    t - Latin
    h - Latin
    e - Latin
    -
    p - Latin
    a - Latin
    r - Latin
    k - Latin
    . -

    -
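
    To pull out whole words rather than classify single characters, the same
    property can be matched as a run. A minimal sketch, assuming the input
    bytes are GB2312. han_runs is a hypothetical helper name:

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    # Decode GB2312 bytes and return every run of consecutive Han characters.
    sub han_runs {
        my ($bytes) = @_;
        my $text = decode('euc-cn', $bytes);    # "euc-cn" is Encode's name for GB2312
        return $text =~ /(\p{Han}+)/g;
    }

    binmode STDOUT, ':encoding(UTF-8)';
    print "$_\n" for han_runs("There is a \xD6\xD0\xB9\xFA\xC8\xCB in the park.");
    ```

    Only the decode step depends on the encoding. The match itself works on
    characters, so it stays the same for Big5 or Shift-JIS input.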


    hp
    Peter J. Holzer, Nov 14, 2009
    #6
  7. [Please no TOFU, trying to repair]
    sqlcamel <> wrote:
    >> On 2009-11-13, sqlcamel <> wrote:
    >> > I have a text file, there are some double-bytes words in it, like
    >> > Chinese, Japanese.
    >> > Is there a way to identify them separately with Perl? Thanks.

    >
    >What I wanted is, for example, given the text piece below:
    >
    >There is a 中国人 in the park.
    >
    >So how to scratch the gb2312 word of 中国人 from the text?


    gb2312 is a character set; it includes at least Chinese as well as Latin
    characters. Therefore all of your text is gb2312, not just that word.

    Now, having said that, your real task seems to be to distinguish between
    Latin/ASCII/... and non-Latin/ASCII/... characters.
    There are several POSIX classes in the regular expressions that will
    help you with that, please check 'perldoc perlre' for what is most
    suitable for you.
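
    As one possible illustration, the negated POSIX class [[:^ascii:]] matches
    anything outside ASCII. The sketch assumes the text has already been
    decoded into characters. non_ascii_runs is a hypothetical helper name:

    ```perl
    use strict;
    use warnings;

    # Return each run of non-ASCII characters in an already-decoded string.
    sub non_ascii_runs {
        my ($text) = @_;
        return $text =~ /([[:^ascii:]]+)/g;
    }

    my @runs = non_ascii_runs("There is a \x{4E2D}\x{56FD}\x{4EBA} in the park.");
    printf "%d run(s), first has %d characters\n", scalar @runs, length $runs[0];
    ```

    This separates ASCII from everything else, so accented Latin letters
    would land in the non-ASCII group along with the Chinese characters.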

    jue
    Jürgen Exner, Nov 14, 2009
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    Ben Morrow wrote:
    > Dr.Ruud:


    >> The data in the "double-byte" encoded files (probably Shift-JIS, GB2312
    >> or Big5) will just become normal Perl strings if the right IO-layer is used.

    >
    > No, they will become SvUTF8 strings, which (shouldn't, but do) behave
    > differently from byte strings under some circumstances.


    Please Ben, stop messing things up. I said Perl strings, not byte
    strings. The unit of Perl strings is characters, not bytes.
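
    A small sketch of the difference, using the GB2312 bytes of the word from
    this thread as sample data:

    ```perl
    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = "\xD6\xD0\xB9\xFA\xC8\xCB";    # raw GB2312 octets
    my $chars = decode('euc-cn', $bytes);      # decoded character string

    printf "bytes: %d, characters: %d\n", length($bytes), length($chars);
    # prints "bytes: 6, characters: 3"
    ```

    The decoded string also carries the internal UTF8 flag that
    utf8::is_utf8 reports, which is the SvUTF8 distinction mentioned above.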

    --
    Ruud
    Dr.Ruud, Nov 14, 2009
    #8
  9. On 2009-11-14 10:03, Peter J. Holzer <> wrote:
    > But in general I would convert the whole text to Unicode and check the
    > character properties. This works for *all* encodings, no matter how
    > complicated they are:

    [...]
    > for my $class (qw(Han Latin)) {
    > if ($char =~ /\p{$class}/) {


    Forgot to add: The full list of properties can be found in
    perldoc perlunicode.
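
    For example, a few more script properties would cover Japanese and Korean
    text as well. This is a sketch only: script_of is a hypothetical helper,
    and the list of scripts is an arbitrary selection.

    ```perl
    use strict;
    use warnings;

    # Return the first listed script that a single decoded character belongs to.
    sub script_of {
        my ($char) = @_;
        for my $script (qw(Han Hiragana Katakana Hangul Latin)) {
            return $script if $char =~ /\p{$script}/;
        }
        return 'Other';
    }

    # U+4E2D is Han, U+3042 is Hiragana, U+30AB is Katakana.
    print script_of($_), "\n" for "\x{4E2D}", "\x{3042}", "\x{30AB}", "a", " ";
    ```

    Han characters are shared between Chinese and Japanese, so the script
    alone does not identify the language.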

    hp
    Peter J. Holzer, Nov 14, 2009
    #9