perl 5.6 multi byte


Sulla

Hey guys, I need to do some parsing on a file that includes Japanese
Shift JIS and Chinese GB1312 and was wondering if someone could help
me with some errors I'm getting. Basically, I want to open the file,
split the line by tabs, and then place the substrings in different
files. I am not entirely sure what pragmas I need to use, or really
how to open a wide character file properly (is GB1312 and Japanese
Shift JIS wide chars? Is that different from utf8?) I have been
trying to do research on multilingual support for perl 5.6, but it is
highly confusing and I am positive I am missing something. My program
is exiting early without having read the entire file (at least, it is
only getting through about 10K of a 20K line file). I've included a
code snippet and stripped out any attempts at multi-byte compatibility
I've attempted in the hopes that someone will spot what is obviously
wrong with it. Thanks so much in advance!

my %g_hMsds;
keys %g_hMsds = 60160;
open IN, "<$g_strPrimaryFile" or die "Error opening file\n"
$i = 0;
while (<IN>) {

my @aSplit = split /\t/, $_;
my @aTemp = ();

# insert into array
$aTemp[0] = $aSplit[3];
$aTemp[1] = $g_hLang{$aSplit[0]};
$aTemp[2] = $aSplit[1];
$aTemp[3] = $aSplit[4];
$aTemp[4] = "";
$aTemp[5] = $aSplit[7];
$aTemp[6] = $aSplit[8];

#attach the array
$g_hMsds{$aSplit[3]} = \@aTemp;

$i++;

if ($i >= $g_nMaxFiles) {
logResult("EXIT LOOP: ".$i." rows run");
last;
}

}
close IN;
 

James E Keenan

Sulla said:
[snip] My program
is exiting early without having read the entire file (at least, it is
only getting through about 10K of a 20K line file).

Is it possible that $i has reached $g_nMaxFiles and the loop has therefore
been terminated?

I can't comment about your attempts at multi-byte compatibility. But the
code snippet you posted doesn't include assignments for all the variables
you've used, so a reader has to guess as to your intent. It's also evident
that you didn't test your code snippet under 'use strict;' and 'use
warnings;' before posting.

That being said, I'll provide a few comments on the code and then post what
I think is a cleaned-up version of what you intend. You can take it from
there. Note: your code did not entail use of any Perl module. Hence, no
need to post to comp.lang.perl.modules; comp.lang.perl.misc would have
sufficed.

my %g_hMsds;
keys %g_hMsds = 60160;

What's the purpose of the above? In Perl, you don't need to pre-allocate
the number of keys in a hash.

open IN, "<$g_strPrimaryFile" or die "Error opening file\n"
$i = 0;
while (<IN>) {

my @aSplit = split /\t/, $_;
my @aTemp = ();

# insert into array
$aTemp[0] = $aSplit[3];
$aTemp[1] = $g_hLang{$aSplit[0]};

%g_hLang was not previously declared.

$aTemp[2] = $aSplit[1];
$aTemp[3] = $aSplit[4];
$aTemp[4] = "";
$aTemp[5] = $aSplit[7];
$aTemp[6] = $aSplit[8];

#attach the array
$g_hMsds{$aSplit[3]} = \@aTemp;

You're using @aTemp only to assign to %g_hMsds. See below how to eliminate
it.

$i++;

if ($i >= $g_nMaxFiles) {
logResult("EXIT LOOP: ".$i." rows run");

sub logResult not provided. See my guess at a substitution below and note
simpler code.

last;
}

}
close IN;

use strict;
use warnings;
use Data::Dumper;

my (%g_hMsds, $i, $g_nMaxFiles);
$i = 0;
$g_nMaxFiles = 3;

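# Note: the sample rows under __DATA__ must be tab-separated.
while (<DATA>) {
    chomp;    # keep the trailing newline out of the last field
    my @aSplit = split /\t/, $_;
    $g_hMsds{$aSplit[3]} =
        [ $aSplit[3], 'arbitrary', $aSplit[1], $aSplit[4],
          '', $aSplit[7], $aSplit[8] ];
    $i++;
    last if $i >= $g_nMaxFiles;
}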

print "EXIT LOOP: $i rows run\n";
print Dumper(\%g_hMsds);

__DATA__
alpha beta gamma delta epsilon zeta eta theta iota
kappa lambda mu nu xi omicron pi rho sigma tau
1 2 3 4 5 6 7 8 9
q w e r t y u i o
a s d f g h j k l
 

Mihai N.

Sulla said:
Hey guys, I need to do some parsing on a file that includes Japanese
Shift JIS and Chinese GB1312 and was wondering if someone could help
me with some errors I'm getting.

Nobody answered her, so I will give it a try :)

Sulla said:
I am not entirely sure what pragmas I need to use, or really
how to open a wide character file properly (is GB1312 and Japanese
Shift JIS wide chars? Is that different from utf8?)

There is nothing special for this in Perl 5.6.
GB1312 is in fact GB2312, which is used for Simplified Chinese.
Both GB2312 and Shift JIS are double-byte character sets (DBCS).
That does not mean they are wide-char encodings: some characters take
one byte, some take two.
This is why searching or search-and-replace on raw bytes is often a
problem: a matched byte can be half of a character.
For instance, the backslash (0x5C) can be the second byte of several
Japanese characters in Shift JIS. The same goes for other ASCII characters,
since a Shift JIS second byte can be almost anything from 0x40 upward.
And yes, they are very different from UTF-8.
A DBCS character takes 1 or 2 bytes; a UTF-8 character takes 1 to 4
(the original definition allowed up to 6).
A DBCS covers one character set only (Simplified Chinese or Japanese, in
this case); UTF-8 covers the whole of Unicode.
For a DBCS it is not possible to tell which bytes are lead or trailing
bytes without help from the OS or without hard-coded tables, and the tables
differ from one DBCS charset to another. UTF-8 is unambiguous and needs
no tables.
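
To make the lead-byte/trailing-byte point concrete, here is a minimal sketch that works on raw bytes and so should also run on Perl 5.6. The helper names is_sjis_lead and sjis_chars are made up for this illustration, and the hard-coded lead-byte ranges are an assumption that holds for plain Shift JIS only; GB2312 would need its own ranges.

```perl
use strict;
use warnings;

# Hard-coded Shift JIS lead-byte ranges (assumption: plain Shift JIS;
# other DBCS encodings need different ranges).
sub is_sjis_lead {
    my ($byte) = @_;
    return ($byte >= 0x81 && $byte <= 0x9F) || ($byte >= 0xE0 && $byte <= 0xFC);
}

# Split a byte string into characters of one or two bytes each.
sub sjis_chars {
    my ($bytes) = @_;
    my @chars;
    my $pos = 0;
    my $len = length $bytes;
    while ($pos < $len) {
        my $lead  = is_sjis_lead(ord(substr($bytes, $pos, 1)));
        my $width = ($lead && $pos + 1 < $len) ? 2 : 1;
        push @chars, substr($bytes, $pos, $width);
        $pos += $width;
    }
    return @chars;
}

# Katakana "so" is 0x83 0x5C in Shift JIS: its second byte is an ASCII backslash.
my $text = "a\x83\x5Cb";

my $naive_hits = () = $text =~ /\\/g;                      # byte-wise search
my $real_hits  = grep { $_ eq "\\" } sjis_chars($text);    # character-wise search

print "naive: $naive_hits, character-aware: $real_hits\n";
```

The byte-wise search reports a backslash that is not really there, while the character-aware walk does not. Splitting on tab is safer than this example suggests, because 0x09 is below 0x40 and so cannot be a Shift JIS second byte.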
Sulla said:
I have been
trying to do research on multilingual support for perl 5.6, but it is
highly confusing and I am positive I am missing something.

Main question: why 5.6? 5.8 has been out for a long time already, and it is
much better at handling this kind of problem.
It supports UTF-8, regular expressions on UTF-8, etc.
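
As a rough idea of what 5.8 buys you, the sketch below reads a Shift JIS file through an :encoding layer so that split works on characters instead of bytes. It needs Perl 5.8 or later (PerlIO layers and the Encode module). The temporary file and its contents are made up for the demonstration; for the Chinese data the layer name would presumably be euc-cn, assuming the file uses the usual EUC encoding of GB2312.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample: one tab-separated line whose second field is
# katakana "so" (bytes 0x83 0x5C in Shift JIS).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "ja\t\x83\x5C\tend\n";
close $tmp or die "close failed: $!";

# The :encoding layer decodes bytes into Perl characters as lines are read.
open my $in, '<:encoding(shiftjis)', $path or die "Cannot open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

my @fields = split /\t/, $line;
printf "%d fields, field 2 is U+%04X (%d character)\n",
    scalar @fields, ord $fields[1], length $fields[1];
```

Once the data is decoded, the two Shift JIS bytes count as a single character, so regular expressions and split no longer see a stray backslash byte.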
Sulla said:
My program
is exiting early without having read the entire file (at least, it is
only getting through about 10K of a 20K line file).

There is no reason for it to stop reading, whatever the encoding.
I suspect something else.
Tell us more about the OS and the data file (is there a risk that it
contains control characters?).
Does it always stop in the same place? Did you try deleting some lines from
the beginning of the file to see where it stops after that? Maybe there is
a certain line that stops it.
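
One way to check for such a line is to scan the file in binary mode and report any control bytes. This is only a sketch: find_control_bytes is a name invented here, and the suspicion behind it (that on some platforms, Windows text mode being the usual suspect, a Ctrl-Z byte can be taken as end of file) is an assumption about your setup, not something visible in the post. The demo uses an in-memory filehandle, which needs Perl 5.8 or later; on 5.6 you would open the real file and call binmode on it.

```perl
use strict;
use warnings;

# Return [line_number, byte_value] pairs for control bytes other than
# tab, CR and LF. $fh is assumed to be a handle already in binmode.
sub find_control_bytes {
    my ($fh) = @_;
    my @found;
    while (my $line = <$fh>) {
        while ($line =~ /([\x00-\x08\x0B\x0C\x0E-\x1F])/g) {
            push @found, [ $., ord $1 ];
        }
    }
    return @found;
}

# Demo data (hypothetical): line 2 contains a Ctrl-Z byte (0x1A).
my $sample = "a\tb\nc\x1Ad\ne\tf\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
binmode $fh;
my @hits = find_control_bytes($fh);
close $fh;

printf "line %d: byte 0x%02X\n", @$_ for @hits;
```

If the reported line number matches the place where the real program stops, the data file is the culprit and not the loop.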
Sulla said:
I've included a
code snippet and stripped out any attempts at multi-byte compatibility
I've attempted in the hopes that someone will spot what is obviously
wrong with it.

Nothing obviously wrong,
except that there is no ; after "open IN, ..."
and %g_hLang is used but not defined.

You also increment $i for each line you read, then compare it
against $g_nMaxFiles (again undefined) and exit.
Is this what you want? To exit after $g_nMaxFiles lines?
Maybe this is the problem, and it has nothing to do with the encoding.
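
To rule the row limit in or out, it may help to make the loop say why it ended. The sketch below is a guess at a shape for that, not a drop-in replacement: read_rows and its arguments are invented names, the limit is applied only when it is defined, and the sample rows are made up. The in-memory filehandle again needs Perl 5.8 or later.

```perl
use strict;
use warnings;

# Read tab-separated rows; stop early only when $max_rows is defined.
# Returns the number of rows read and the reason the loop ended.
sub read_rows {
    my ($fh, $max_rows) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line;    # real code would store @fields here
        $count++;
        return ($count, 'limit reached')
            if defined $max_rows && $count >= $max_rows;
    }
    return ($count, 'end of file');
}

# Hypothetical sample data: five short rows.
my $data = join '', map { "row$_\tx\n" } 1 .. 5;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my ($rows, $why) = read_rows($fh, 3);
close $fh;

print "$rows rows read ($why)\n";
```

If the real program reports "limit reached" at around 10K rows, the early exit comes from $g_nMaxFiles and the encoding is not involved.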
 
