Trouble processing non-English text

Discussion in 'Perl Misc' started by DavidK, Jan 5, 2010.

  1. DavidK

    DavidK Guest

    Hello,

    I am trying to process some Greek text using Perl. Strangely, I can
    print out the text properly but when I try to assign the text to a
    variable or do some processing, it fails.

    The data file is:

    1 και
    2 να

    My program is:

    #!/usr/bin/perl -w
    use strict;
    use encoding "greek";

    my %symbols = ();

    open(FILE, "$file");

    while (my $line = <FILE>) {
        chomp($line);

        my @fields = split(/\s+/, $line);

        my $num_fields = @fields;

        if ($num_fields == 2) {

            my $freq = shift(@fields);
            my $word = shift(@fields);

            print "$word\n";

            my @letters = split(//, $word);

            foreach my $letter (@letters) {
                $symbols{$letter} = 1;

                print "$letter -> $letter_test\n";
            }

            print "\n";
        }
    }

    The output is:

    και
    � ->
    � ->
    � ->
    � ->
    � ->
    � ->

    να
    � ->
    � ->
    � ->
    � ->

    I've done some reading on the web and I still can't figure out what's
    happening.

    I'd appreciate any help. Thanks!
    DavidK, Jan 5, 2010
    #1

  2. Dr.Ruud

    Dr.Ruud Guest

    DavidK wrote:

    > I am trying to process some Greek text using Perl. Strangely, I can
    > print out the text properly but when I try to assign the text to a
    > variable or do some processing, it fails.
    > [...]
    > use encoding "greek";
    > [...]
    > The output is:
    >
    > και
    > � ->
    > [...]


    In what sense does it fail?

    What does `echo $LANG` show you?

    --
    Ruud
    Dr.Ruud, Jan 6, 2010
    #2

  3. DavidK

    DavidK Guest

    Thanks for the responses!

    My $LANG variable is set to en_US.UTF-8.

    The file I thought was in ISO8859-7 is actually UTF-8. I should have
    been opening the file with:

    open(my $FILE, "<:encoding(UTF-8)", $file)
    or die "can't open '$file': $!";

    I also had to format the output with

    binmode STDOUT, ":utf8";

    to view it properly.

    Thanks again. It seems to be working now. Thank you, Ben, for the Perl
    style tips.

    I'm sorry about the confusing source code. I tried to simplify it and
    I removed some lines by mistake.
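
For anyone reading along, here is a minimal sketch of how the fixes discussed in this thread might fit together: a 3-arg open with a UTF-8 decoding layer, a lexical filehandle, a checked open, and an encoding layer on STDOUT. The sub name `collect_symbols` and the temporary sample file are assumptions made for illustration; they are not from the original program.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                        # this source file contains UTF-8 literals
use File::Temp qw(tempfile);

# Collect the distinct characters of the word on each "freq word" line.
# (Hypothetical helper, named here only for the example.)
sub collect_symbols {
    my ($path) = @_;
    my %symbols;

    # Decode UTF-8 bytes into Perl characters while reading.
    open(my $in, '<:encoding(UTF-8)', $path)
        or die "can't open '$path': $!";

    while (my $line = <$in>) {
        chomp $line;
        my @fields = split ' ', $line;
        next unless @fields == 2;   # array in numeric comparison yields its length
        my ($freq, $word) = @fields;
        $symbols{$_} = 1 for split //, $word;
    }
    close $in;
    return \%symbols;
}

# Sample data standing in for the data file shown in the first post.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} "1 και\n2 να\n";
close $fh;

# Encode characters back to UTF-8 bytes on output.
binmode STDOUT, ':encoding(UTF-8)';

my $symbols = collect_symbols($path);
print join(' ', sort keys %$symbols), "\n";
```

The `:encoding(UTF-8)` layer on STDOUT is a stricter spelling of the `:utf8` layer mentioned above; either should display the Greek letters correctly on a UTF-8 terminal.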


    On Jan 5, 7:17 pm, Ben Morrow wrote:
    > Quoth DavidK:
    >
    >
    >
    > > I am trying to process some Greek text using Perl. Strangely, I can
    > > print out the text properly but when I try to assign the text to a
    > > variable or do some processing, it fails.

    >
    > > The data file is:

    >
    > > 1 και
    > > 2 να

    >
    > > My program is:

    >
    > > #!/usr/bin/perl -w

    >
    > 'use warnings' is preferred to -w nowadays.
    >
    > > use strict;
    > > use encoding "greek";

    >
    > Don't do that. In principle 'encoding' specifies the encoding of your
    > *source* file, and also pushes encoding layers onto STD{IN,OUT}; it has
    > no effect on other filehandles. In practice it has never worked properly
    > and should be avoided.
    >
    > I don't know how your data file is encoded, but AFAIK "greek" is not a
    > valid encoding name. You might have meant "iso-8859-7", which I believe
    > is the usual pre-Unicode encoding for Greek, or you might have meant
    > "UTF-8" (or you might have meant something else entirely). You will need
    > to find out which.
    >
    > > my %symbols = ();

    >
    > > open(FILE, "$file");

    >
    > Always check the return value of open.
    > Use 3-arg open instead of magic 2-arg open, unless you've got a good
    > reason not to.
    > Don't quote variables when you don't need to.
    > Use lexical filehandles instead of global barewords.
    > In your case, you want to push an encoding PerlIO layer when you open
    > the file.
    >
    > open(my $FILE, "<:encoding(iso-8859-7)", $file)
    > or die "can't open '$file': $!";
    >
    > You might also consider using the 'autodie' module, which will do the
    > 'or die' check for you.
    >
    >
    >
    > > while (my $line = <FILE>) {
    > > chomp($line);

    >
    > > my @fields = split(/\s+/, $line);

    >
    > > my $num_fields = @fields;

    >
    > > if ($num_fields == 2) {

    >
    > There's no need for this. '==' gives scalar context to both sides, so
    >
    > if (@fields == 2) {
    >
    > will suffice.
    >
    > > my $freq = shift(@fields);
    > > my $word = shift(@fields);

    >
    > > print "$word\n";

    >
    > > my @letters = split(//, $word);

    >
    > > foreach my $letter (@letters) {
    > > $symbols{$letter} = 1;

    >
    > > print "$letter -> $letter_test\n";

    >
    > Where does $letter_test come from? Did you actually run the code you
    > posted?
    >
    > > }

    >
    > > print "\n";
    > > }
    > > }

    >
    > > The output is:

    >
    > > και
    > > ->
    > > ->
    > > ->
    > > ->
    > > ->
    > > ->

    >
    > This suggests your file is not in ISO8859-7, but in some multi-byte
    > encoding like UTF-8 or UTF-16. If you're on a Unix machine it's probably
    > UTF-8.
    >
    > Ben
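
To illustrate Ben's last point, here is a small sketch (not part of the original posts) of why undecoded UTF-8 produces more pieces than there are letters. Each Greek letter occupies two bytes in UTF-8, so splitting the raw bytes of "και" gives six fragments, which matches the six broken lines in the output above. The variable names are chosen for this example only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes of "και", written as escapes so the example does not
# depend on the encoding of this source file.
my $bytes = "\xCE\xBA\xCE\xB1\xCE\xB9";

my @raw   = split //, $bytes;                    # one element per byte
my @chars = split //, decode('UTF-8', $bytes);   # one element per character

printf "undecoded: %d pieces\n", scalar @raw;    # prints "undecoded: 6 pieces"
printf "decoded:   %d pieces\n", scalar @chars;  # prints "decoded:   3 pieces"
```

Opening the file with an `:encoding(...)` layer performs the same decoding step automatically on every read.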
    DavidK, Jan 6, 2010
    #3
