Newbe Unicode question

Discussion in 'Perl Misc' started by Scottie, Feb 18, 2004.

  1. Scottie

    Scottie Guest

    How do I make a Unicode Perl script that uses:
    perl zapotec.pl zapotecUnicode.txt > asdf.txt
    where "zapotecUnicode.txt" is UTF-8 file?

    In the zapotec.pl I have:
    binmode(STDOUT, ":utf8");
    binmode(STDIN, ":utf8");
    use encoding "latin2";
    at the very top.

    Any help would be appreciated.

    Scott
     
    Scottie, Feb 18, 2004
    #1

  2. Scottie

    Ben Morrow Guest

    (Scottie) wrote:
    > How do I make a Unicode Perl script that uses:
    > perl zapotec.pl zapotecUnicode.txt > asdf.txt
    > where "zapotecUnicode.txt" is UTF-8 file?
    >
    > In the zapotec.pl I have:
    > binmode(STDOUT, ":utf8");
    > binmode(STDIN, ":utf8");
    > use encoding "latin2";


    Why? Is your source in latin2?

    > at the very top.
    >
    > Any help would be appreciated.


    Err... what does your script do, and in what ways is it not working?

    Ben

    --
    don't get my sympathy hanging out the 15th floor. you've changed the locks 3
    times, he still comes reeling though the door, and soon he'll get to you, teach
    you how to get to purest hell. you do it to yourself and that's what really
    hurts is you do it to yourself just you, you and noone else *
     
    Ben Morrow, Feb 18, 2004
    #2

  3. Scottie

    Scottie Guest

    Ben,

    > > In the zapotec.pl I have:
    > > binmode(STDOUT, ":utf8");
    > > binmode(STDIN, ":utf8");
    > > use encoding "latin2";

    >
    > Why? Is your source in latin2?


    I'm sorry. The 3rd line is:
    use encoding "latin1";

    > Err... what does your script do, and in what ways is it not working?


    I started with GAWK and used a2p to change it to Perl. I think I know
    that the @Fld line isn't allowing it to be Unicode. I have hunted
    through the Perl docs concerning my problem and I haven't come up with
    an answer. What do you think?

    # Perl - a2p - Combines many changes to the Zapotec-Spanish dictionary.
    # Scott Starker

    binmode(STDOUT, ":utf8");
    binmode(STDIN, ":utf8");
    use encoding "latin1";

    # ${^WIDE_SYSTEM_CALLS} = 1;
    $[ = 1; # set array base to 1
    $, = " "; # set output field separator
    $\ = "\n"; # set output record separator

    $AlreadyGN = 0;
    $notes = 0;
    $gnsgnFirstLine = 0;
    $anyline = 0;
    $position = 0;
    $lxline = '';
    $mldef = '';
    $seline = '';
    $line = '';
    $beg = '';
    $end = '';

    # This program takes out the "lx"'s that are alone on the line ("\k").
    while (<>) {
    chomp; # strip record separator
    @Fld = split("\x{0020}", $_, 9999); # " "
    print "\x{002a}";
    # if ($Fld[1] eq " \\ l x") {
    # if ($Fld[1] eq "\x{005c}\x{006c}\x{0078}") { # "\\lx"
    if ($Fld[1] eq "\x{005c}\x{005c}\x{006c}\x{0078}") { # "\\lx"
    print "\x{002a}\x{002a}";
    $s = "\x{002d}", s/$s/\^\x{007e}/g; # "-"
    # Make "tone" un-bolded
    $Fld[2] = "\x{007c}\x{0062}" . $Fld[2]; # "\x{007c}\x{0062}"
    s/\x{005b}/\x{007c}\x{0072}\x{005b}/g; # If "[" or "," exist
    s/\x{005d}/\x{005d}\x{007c}\x{0062}/g;
    s/\x{005d}\x{007c}\x{0062}\x{00b8}\x{0020}/\x{005d}\x{00b8}\x{0020}\x{007c}\x{0062}/g;
    $Fld[$#Fld] = $Fld[$#Fld] . "\x{007c}\x{0072}";
    $position = index($Fld[$#Fld], "\x{005d}");
    $lxline = $_;
    ..
    ..
    ..

    Scott
     
    Scottie, Feb 19, 2004
    #3
  4. Scottie

    Ben Morrow Guest

    (Scottie) wrote:
    > Ben,
    >
    > > > In the zapotec.pl I have:
    > > > binmode(STDOUT, ":utf8");
    > > > binmode(STDIN, ":utf8");
    > > > use encoding "latin2";

    > >
    > > Why? Is your source in latin2?

    >
    > I'm sorry. The 3rd line is:
    > use encoding "latin1";
    >
    > > Err... what does your script do, and in what ways is it not working?

    >
    > I started with GAWK and used a2p to change it to Perl. I think I know
    > that the @Fld line isn't allowing it to be Unicode. I have hunted
    > through the Perl docs concerning my problem and I haven't come up with
    > an answer. What do you think?
    >
    > # Perl - a2p - Combines many changes to the Zapotec-Spanish dictionary.
    > # Scott Starker
    >
    > binmode(STDOUT, ":utf8");
    > binmode(STDIN, ":utf8");
    > use encoding "latin1";


    This is unnecessary because a. latin1 is the default anyway and b. your
    source is all ASCII.

    > # ${^WIDE_SYSTEM_CALLS} = 1;
    > $[ = 1; # set array base to 1


    Aaarg... run away... $[ is highly deprecated and double-plus-ungood.
    Yes, I know it's not your code :).

    > $, = " "; # set output field separator
    > $\ = "\n"; # set output record separator
    >
    > $AlreadyGN = 0;
    > $notes = 0;
    > $gnsgnFirstLine = 0;
    > $anyline = 0;
    > $position = 0;
    > $lxline = '';
    > $mldef = '';
    > $seline = '';
    > $line = '';
    > $beg = '';
    > $end = '';
    >
    > # This program takes out the "lx"'s that are alone on the line ("\k").
    > while (<>) {
    > chomp; # strip record separator
    > @Fld = split("\x{0020}", $_, 9999); # " "
    > print "\x{002a}";
    > # if ($Fld[1] eq " \\ l x") {
    > # if ($Fld[1] eq "\x{005c}\x{006c}\x{0078}") { # "\\lx"
    > if ($Fld[1] eq "\x{005c}\x{005c}\x{006c}\x{0078}") { # "\\lx"
    > print "\x{002a}\x{002a}";
    > $s = "\x{002d}", s/$s/\^\x{007e}/g; # "-"
    > # Make "tone" un-bolded
    > $Fld[2] = "\x{007c}\x{0062}" . $Fld[2]; # "\x{007c}\x{0062}"
    > s/\x{005b}/\x{007c}\x{0072}\x{005b}/g; # If "[" or "," exist
    > s/\x{005d}/\x{005d}\x{007c}\x{0062}/g;
    > s/\x{005d}\x{007c}\x{0062}\x{00b8}\x{0020}/\x{005d}\x{00b8}\x{0020}\x{007c}\x{0062}/g;
    > $Fld[$#Fld] = $Fld[$#Fld] . "\x{007c}\x{0072}";
    > $position = index($Fld[$#Fld], "\x{005d}");
    > $lxline = $_;


    Right, let's attempt to translate that into Perl... (untested)

    #!/usr/bin/perl

    use strict;
    use warnings;

    $, = " ";
    $\ = "\n";

    binmode STDIN, ':encoding(utf8)';
    binmode STDOUT, ':encoding(utf8)';
    # this is better as you get fallback if the input is invalid

    my $ced = "\xb8";

    while (<>) {
    chomp;
    my ($a, $b, $c) = split " ";
    if ($a eq '\\\lx') { # this comes out as two \
    print '**';
    s/-/^~/g;
    $b = "|b$b";
    s/\[/|r[/g;
    s/]/]|b/g;
    s/]\|b$ced /]$ced |b/g;

    ....etc. (Bog, that code's making my eyes hurt!) You can carry on, and
    finish it (what you posted wasn't complete, right?).
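
    One thing worth checking here: binmode on STDIN only affects the STDIN
    handle, while "perl zapotec.pl zapotecUnicode.txt" reads the file through
    <>, which opens it on the ARGV handle. A minimal sketch of one way to get
    that input decoded, using the open pragma. The temp file, the sample line
    and the slurp_via_argv helper are made up for illustration and are not
    part of the original script:

```perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';   # default layer for opens in this file (including the one behind <>), plus STDIN/STDOUT/STDERR
use File::Temp qw(tempfile);

# Hypothetical sample data: one Shoebox-style line whose last character,
# U+00E9, takes two bytes in UTF-8.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';
print {$tmp} "\\lx caf\xC3\xA9\n";
close $tmp or die "close: $!";

# Read the file the way "perl zapotec.pl file" would: through <>.
sub slurp_via_argv {
    local @ARGV = @_;
    my @lines;
    while (my $line = <>) {
        chomp $line;
        push @lines, $line;
    }
    return @lines;
}

my ($line) = slurp_via_argv($path);

# If the input was decoded, length() counts characters rather than bytes.
printf "%d characters: %s\n", length($line), $line;
```

    If the reported length equals the byte count instead, the input is not
    being decoded, and opening the file explicitly with
    '<:encoding(UTF-8)' is the more direct fallback.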

    Now, I can't really see what this is supposed to do, so what do you want
    it to do, and what is it in fact doing?

    Ben

    --
    $.=1;*g=sub{print@_};sub r($$\$){my($w,$x,$y)=@_;for(keys%$x){/main/&&next;*p=$
    $x{$_};/(\w)::$/&&(r($w.$1,$x.$_,$y),next);$y eq\$p&&&g("$w$_")}};sub t{for(@_)
    {$f&&($_||&g(" "));$f=1;r"","::",$_;$_&&&g(chr(0012))}};t #
    $J::u::s::t, $a::n::eek:::t::h::e::r, $P::e::r::l, $h::a::c::k::e::r, $.
     
    Ben Morrow, Feb 19, 2004
    #4
  5. Scottie

    Scottie Guest

    Ben,

    > ... (what you posted wasn't complete, right?).


    It wasn't nearly all of it!

    > Now, I can't really see what this is supposed to do, so what do you want
    > it to do, and what is it in fact doing?


    Well, zapotecUnicode.txt is a file that contains a "dictionary" of
    Zapotec words (Zapotec is spoken in Mexico) with Spanish words as their
    definitions. It's almost a database type of thing. The program is
    called Shoebox. There are different lines for each record. They all
    start with "\lx" (lexicon). Then the definition(s) (\gn) follow.
    There might be at least one subentry (\se) along with its definition(s)
    (\sgn). There are more fields than these. (The Perl line print "**";
    was for testing purposes.) Therefore I need a @Fld = split(" ",
    $_, 9999); that splits a line like this into an array. Can you help me
    out? I need to know how to get the line into @Fld.

    Scott
     
    Scottie, Feb 20, 2004
    #5
  6. Scottie

    Ben Morrow Guest

    (Scottie) wrote:
    >
    > Well, zapotecUnicode.txt is a file that contains a "dictionary" of
    > Zapotec words (Zapotec is spoken in Mexico) with Spanish words as their
    > definitions. It's almost a database type of thing. The program is
    > called Shoebox. There are different lines for each record. They all
    > start with "\lx" (lexicon). Then the definition(s) (\gn) follow.
    > There might be at least one subentry (\se) along with its definition(s)
    > (\sgn). There are more fields than these. (The Perl line print "**";
    > was for testing purposes.) Therefore I need a @Fld = split(" ",
    > $_, 9999); that splits a line like this into an array. Can you help me
    > out? I need to know how to get the line into @Fld.


    Well, that's easy:

    my @F = split ' ';

    if the fields on each line are space-separated. Alternatively,

    my @F = split /\\/;

    may work better, as it will split the line on the backslashes. There are
    two 'unfortunately's here: firstly, you'll get an initial empty field,
    before the first backslash; secondly, the actual backslashes themselves
    will be removed, so you'll have to remember to put them back in.
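
    The two splits and both caveats can be seen on a small sample. This is
    only a sketch: the record below is a made-up line, not real data from
    zapotecUnicode.txt.

```perl
use strict;
use warnings;

# Hypothetical one-line record in single quotes, so each backslash is literal.
my $record = '\lx gu \gn agua';

my @on_space     = split ' ', $record;    # ('\lx', 'gu', '\gn', 'agua')
my @on_backslash = split /\\/, $record;   # ('', 'lx gu ', 'gn agua')

# Caveat 1: drop the empty field that precedes the first backslash.
shift @on_backslash if @on_backslash && $on_backslash[0] eq '';

# Caveat 2: split consumed the backslashes, so put them back.
my @restored = map { "\\$_" } @on_backslash;

print join('|', @on_space), "\n";
print join('|', @restored), "\n";
```

    Splitting on the backslash keeps each marker together with its text,
    which is usually what you want when a field's content contains spaces.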

    It's probably easiest if you then iterate over the fields, and do
    whatever you need to based on the field type:

    #!/usr/bin/perl -lanF\\
    # see perldoc perlrun for the above: it automagically iterates over all
    # lines and splits them into @F

    BEGIN {
        $\ = '';
        binmode STDIN, ':encoding(utf8)';
        binmode STDOUT, ':encoding(utf8)';
    }

    for (@F) {
        /^lx/ and next;

        /^gn/ and do {
            s/\xB8/|b/; # or whatever it is you want to do
            next;
        };
    }
    continue {
        # this makes sure each entry gets printed, with its backslash,
        # when you're done with it. $_ is the current field here.

        print '\\' . $_ . $\;
    }
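
    The same per-field idea can also be written as an ordinary function
    without the -n/-a/-F switches, which makes it easier to test on a single
    line. A rough sketch only: the %handler table, the process_line name and
    the sample line are hypothetical, and the edits inside the handlers are
    placeholders for whatever each field really needs.

```perl
use strict;
use warnings;

# Hypothetical per-field handlers, keyed by the Shoebox marker.
my %handler = (
    lx => sub { $_[0] },                               # leave headwords alone
    gn => sub { (my $f = $_[0]) =~ s/\xB8/|b/; $f },   # placeholder edit
);

sub process_line {
    my ($line) = @_;

    # The first element is whatever precedes the first backslash (usually
    # empty); the limit of -1 keeps trailing empty fields.
    my ($lead, @fields) = split /\\/, $line, -1;
    $lead = '' unless defined $lead;

    my @out;
    for my $field (@fields) {
        my ($marker) = $field =~ /^(\w+)/;
        my $fix = defined $marker ? $handler{$marker} : undef;
        # Restore the backslash that split removed.
        push @out, '\\' . ($fix ? $fix->($field) : $field);
    }
    return $lead . join '', @out;
}

print process_line('\lx gu \gn agua'), "\n";
```

    Fields with no handler pass through unchanged, so markers such as \se
    and \sgn can be added to the table one at a time.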

    Ben

    --
    For the last month, a large number of PSNs in the Arpa[Inter-]net have been
    reporting symptoms of congestion ... These reports have been accompanied by an
    increasing number of user complaints ... As of June,... the Arpanet contained
    47 nodes and 63 links. [ftp://rtfm.mit.edu/pub/arpaprob.txt] *
     
    Ben Morrow, Feb 21, 2004
    #6
