Newbe Unicode question

Scottie · Feb 18, 2004

How do I make a Unicode Perl script that uses:
perl zapotec.pl zapotecUnicode.txt > asdf.txt
where "zapotecUnicode.txt" is a UTF-8 file?

In the zapotec.pl I have:
binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");
use encoding "latin2";
at the very top.

Any help would be appreciated.

Scott

Ben Morrow · Feb 18, 2004

> How do I make a Unicode Perl script that uses:
> perl zapotec.pl zapotecUnicode.txt > asdf.txt
> where "zapotecUnicode.txt" is a UTF-8 file?
>
> In the zapotec.pl I have:
> binmode(STDOUT, ":utf8");
> binmode(STDIN, ":utf8");
> use encoding "latin2";

Why? Is your source in latin2?

> at the very top.
>
> Any help would be appreciated.

Err... what does your script do, and in what ways is it not working?

Ben

Scottie · Feb 19, 2004

Ben,

> Why? Is your source in latin2?

I'm sorry. The 3rd line is:
use encoding "latin1";

> Err... what does your script do, and in what ways is it not working?

I started with GAWK and used a2p to change it to Perl. I think I know
that the @Fld line isn't allowing it to be Unicode. I have hunted
through the Perl docs concerning my problem and I haven't come up with
an answer. What do you think?

# Perl - a2p - Combines many changes to the Zapotec-Spanish dictionary.
# Scott Starker

binmode(STDOUT, ":utf8");
binmode(STDIN, ":utf8");
use encoding "latin1";

# ${^WIDE_SYSTEM_CALLS} = 1;
$[ = 1; # set array base to 1
$, = " "; # set output field separator
$\ = "\n"; # set output record separator

$AlreadyGN = 0;
$notes = 0;
$gnsgnFirstLine = 0;
$anyline = 0;
$position = 0;
$lxline = '';
$mldef = '';
$seline = '';
$line = '';
$beg = '';
$end = '';

# This program takes out the "lx"'s that are alone on the line ("\k").
while (<>) {
chomp; # strip record separator
@Fld = split("\x{0020}", $_, 9999); # " "
print "\x{002a}";
# if ($Fld[1] eq " \\ l x") {
# if ($Fld[1] eq "\x{005c}\x{006c}\x{0078}") { # "\\lx"
if ($Fld[1] eq "\x{005c}\x{005c}\x{006c}\x{0078}") { # "\\lx"
print "\x{002a}\x{002a}";
$s = "\x{002d}", s/$s/\^\x{007e}/g; # "-"
# Make "tone" un-bolded
$Fld[2] = "\x{007c}\x{0062}" . $Fld[2]; # "\x{007c}\x{0062}"
s/\x{005b}/\x{007c}\x{0072}\x{005b}/g; # If "[" or "," exist
s/\x{005d}/\x{005d}\x{007c}\x{0062}/g;
s/\x{005d}\x{007c}\x{0062}\x{00b8}\x{0020}/\x{005d}\x{00b8}\x{0020}\x{007c}\x{0062}/g;
$Fld[$#Fld] = $Fld[$#Fld] . "\x{007c}\x{0072}";
$position = index($Fld[$#Fld], "\x{005d}");
$lxline = $_;
..
..
..

Scott

Ben Morrow · Feb 19, 2004

> Ben,
>
> I'm sorry. The 3rd line is:
> use encoding "latin1";
>
> I started with GAWK and used a2p to change it to Perl. I think I know
> that the @Fld line isn't allowing it to be Unicode. I have hunted
> through the Perl docs concerning my problem and I haven't come up with
> an answer. What do you think?
>
> # Perl - a2p - Combines many changes to the Zapotec-Spanish dictionary.
> # Scott Starker
>
> binmode(STDOUT, ":utf8");
> binmode(STDIN, ":utf8");
> use encoding "latin1";

This is unnecessary because a. latin1 is the default anyway and b. your
source is all ASCII.

> # ${^WIDE_SYSTEM_CALLS} = 1;
> $[ = 1; # set array base to 1

Aaarg... run away... $[ is highly deprecated and double-plus-ungood.
Yes, I know it's not your code.

> $, = " "; # set output field separator
> $\ = "\n"; # set output record separator
>
> $AlreadyGN = 0;
> $notes = 0;
> $gnsgnFirstLine = 0;
> $anyline = 0;
> $position = 0;
> $lxline = '';
> $mldef = '';
> $seline = '';
> $line = '';
> $beg = '';
> $end = '';
>
> # This program takes out the "lx"'s that are alone on the line ("\k").
> while (<>) {
> chomp; # strip record separator
> @Fld = split("\x{0020}", $_, 9999); # " "
> print "\x{002a}";
> # if ($Fld[1] eq " \\ l x") {
> # if ($Fld[1] eq "\x{005c}\x{006c}\x{0078}") { # "\\lx"
> if ($Fld[1] eq "\x{005c}\x{005c}\x{006c}\x{0078}") { # "\\lx"
> print "\x{002a}\x{002a}";
> $s = "\x{002d}", s/$s/\^\x{007e}/g; # "-"
> # Make "tone" un-bolded
> $Fld[2] = "\x{007c}\x{0062}" . $Fld[2]; # "\x{007c}\x{0062}"
> s/\x{005b}/\x{007c}\x{0072}\x{005b}/g; # If "[" or "," exist
> s/\x{005d}/\x{005d}\x{007c}\x{0062}/g;
> s/\x{005d}\x{007c}\x{0062}\x{00b8}\x{0020}/\x{005d}\x{00b8}\x{0020}\x{007c}\x{0062}/g;
> $Fld[$#Fld] = $Fld[$#Fld] . "\x{007c}\x{0072}";
> $position = index($Fld[$#Fld], "\x{005d}");
> $lxline = $_;

Right, let's attempt to translate that into Perl... (untested)

#!/usr/bin/perl

use strict;
use warnings;

$, = " ";
$\ = "\n";

binmode STDIN, ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';
# this is better as you get fallback if the input is invalid

my $ced = "\xb8";

while (<>) {
chomp;
my ($a, $b, $c) = split " ";
if ($a eq '\\\lx') { # this comes out as two \
print '**';
s/-/^~/g;
$b = "|b$b";
s/\[/|r[/g;
s/]/]|b/g;
s/]\|b$ced /]$ced |b/g;

....etc. (Bog, that code's making my eyes hurt!) You can carry on, and
finish it (what you posted wasn't complete, right?).
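
A side note on the encoding lines in the sketch above: while (<>) reads the files named on the command line through the ARGV handle, not through STDIN, so binmode STDIN does not touch zapotecUnicode.txt. One way around that, as a minimal sketch and not the only option, is to open each file yourself with an explicit layer. The helper name read_utf8_lines is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file with an explicit UTF-8 decoding layer
# and return its lines as character strings, newline removed.
sub read_utf8_lines {
    my ($path) = @_;
    open my $fh, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

# Encode characters back to UTF-8 bytes on the way out.
binmode STDOUT, ':encoding(UTF-8)';

# Echo every file named on the command line, e.g.
#   perl zapotec.pl zapotecUnicode.txt > asdf.txt
for my $path (@ARGV) {
    print "$_\n" for read_utf8_lines($path);
}
```

The per-line edits would then go inside the loop over the returned lines instead of inside while (<>).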

Now, I can't really see what this is supposed to do, so what do you want
it to do, and what is it in fact doing?

Ben

Scottie · Feb 20, 2004

Ben,

> ... (what you posted wasn't complete, right?).

It wasn't nearly all of it!

> Now, I can't really see what this is supposed to do, so what do you want
> it to do, and what is it in fact doing?

Well, zapotecUnicode.txt is a file that contains a "dictionary" of
Zapotec words (Zapotec is spoken in Mexico) with Spanish words as their
definitions. It's almost a database type of thing. The program is
called Shoebox. There are different lines for each record. They all
start with "\lx" (lexicon). Then the definition(s) (\gn) follow.
There might be at least one subentry (\se) along with its definition(s)
(\sgn). There are more fields than these. (The Perl line print "**";
was for testing purposes.) Therefore I need a @Fld = split(" ",
$_, 9999); that splits the line into an array like this. Can you help
me out? I need to know how to get the line into @Fld.

Scott

Ben Morrow · Feb 21, 2004

> Well, zapotecUnicode.txt is a file that contains a "dictionary" of
> Zapotec words (Zapotec is spoken in Mexico) with Spanish words as their
> definitions. It's almost a database type of thing. The program is
> called Shoebox. There are different lines for each record. They all
> start with "\lx" (lexicon). Then the definition(s) (\gn) follow.
> There might be at least one subentry (\se) along with its definition(s)
> (\sgn). There are more fields than these. (The Perl line print "**";
> was for testing purposes.) Therefore I need a @Fld = split(" ",
> $_, 9999); that splits the line into an array like this. Can you help
> me out? I need to know how to get the line into @Fld.

Well, that's easy:

my @F = split ' ';

if the records on each line are space-separated. Alternatively,

my @F = split /\\/;

may work better, as it will split the line on the backslashes. There are
two 'unfortunately's here: firstly, you'll get an initial empty field,
before the first backslash; secondly, the actual backslashes themselves
will be removed, so you'll have to remember to put them back in.
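
To make the difference concrete, here is a small sketch of both splits applied to one line. The sample line and the sub names are invented for illustration; real Shoebox records may be laid out differently.

```perl
use strict;
use warnings;

# Split on runs of whitespace; leading whitespace is skipped.
sub split_on_space {
    my ($line) = @_;
    return split ' ', $line;
}

# Split on backslashes; the markers lose their backslash, and anything
# before the first backslash (here: nothing) becomes the first field.
sub split_on_backslash {
    my ($line) = @_;
    return split /\\/, $line;
}

my $sample = '\lx bee \gn agua';    # hypothetical record

my @by_space     = split_on_space($sample);
my @by_backslash = split_on_backslash($sample);

# @by_space     is ('\lx', 'bee', '\gn', 'agua')
# @by_backslash is ('', 'lx bee ', 'gn agua')
print scalar(@by_space), " fields by space, ",
      scalar(@by_backslash), " fields by backslash\n";
```

Note that the backslash split keeps each marker together with its text, which is usually what you want for dispatching on the field type.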

It's probably easiest if you then iterate over the fields, and do
whatever you need to based on the field type:

#!/usr/bin/perl -lanF\\
# see perldoc perlrun for the above: it automagically iterates over all
# lines and splits them into @F

BEGIN {
$\ = '';
binmode STDIN, ':encoding(utf8)';
binmode STDOUT, ':encoding(utf8)';
}

for (@F) {
/^lx/ and next;

/^gn/ and do {
s/\xB8/|b/; # or whatever it is you want to do
next;
};
}
continue {
# this makes sure each entry gets printed, with its backslash,
# when you're done with it.

print '\\' . $_ . $\;
}
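
If the command-line switches are hard to follow, the same idea can be written out longhand. This is only a sketch: the handler table, the name process_line and the edits inside the handlers are placeholders, not what your program actually needs to do.

```perl
use strict;
use warnings;

# Hypothetical per-marker handlers; each takes the field text and
# returns the edited text. Markers without a handler pass through.
my %handler = (
    lx => sub { my ($text) = @_; return $text },
    gn => sub { my ($text) = @_; $text =~ s/\xB8/|b/; return $text },
);

sub process_line {
    my ($line) = @_;
    my @out;
    for my $field (split /\\/, $line) {
        next if $field eq '';    # whatever came before the first backslash
        # Separate the marker from its text, trimming surrounding spaces.
        my ($marker, $text) = $field =~ /^(\S+)\s*(.*?)\s*$/s or next;
        my $edit = $handler{$marker};
        $text = $edit->($text) if $edit;
        # Put the backslash back that split removed.
        push @out, length $text ? "\\$marker $text" : "\\$marker";
    }
    return join ' ', @out;
}

print process_line('\lx bee \gn agua'), "\n";
```

Adding a new field type then only means adding one entry to %handler.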

Ben
