perl 5.6 multi byte

Discussion in 'Perl Misc' started by Sulla, Nov 25, 2003.

  1. Sulla

    Sulla Guest

    Hey guys, I need to do some parsing on a file that includes Japanese
    Shift JIS and Chinese GB1312 and was wondering if someone could help
    me with some errors I'm getting. Basically, I want to open the file,
    split the line by tabs, and then place the substrings in different
    files. I am not entirely sure what pragmas I need to use, or really
    how to open a wide character file properly (are GB1312 and Japanese
    Shift JIS wide chars? Is that different from utf8?) I have been
    trying to do research on multilingual support for perl 5.6, but it is
    highly confusing and I am positive I am missing something. My program
    is exiting early without having read the entire file (at least, it is
    only getting through about 10K of a 20K line file). I've included a
    code snippet and stripped out any attempts at multi-byte compatibility
    I've attempted in the hopes that someone will spot what is obviously
    wrong with it. Thanks so much in advance!

    my %g_hMsds;
    keys %g_hMsds = 60160;
    open IN, "<$g_strPrimaryFile" or die "Error opening file\n"
    $i = 0;
    while (<IN>) {

        my @aSplit = split /\t/, $_;
        my @aTemp = ();

        # insert into array
        $aTemp[0] = $aSplit[3];
        $aTemp[1] = $g_hLang{$aSplit[0]};
        $aTemp[2] = $aSplit[1];
        $aTemp[3] = $aSplit[4];
        $aTemp[4] = "";
        $aTemp[5] = $aSplit[7];
        $aTemp[6] = $aSplit[8];

        # attach the array
        $g_hMsds{$aSplit[3]} = \@aTemp;

        $i++;

        if ($i >= $g_nMaxFiles) {
            logResult("EXIT LOOP: ".$i." rows run");
            last;
        }

    }
    close IN;
     
    Sulla, Nov 25, 2003
    #1

  2. "Sulla" <> wrote in message
    news:...
    > [snip] My program
    > is exiting early without having read the entire file (at least, it is
    > only getting through about 10K of a 20K line file).


    Is it possible that $i has reached $g_nMaxFiles and the loop has therefore
    been terminated?

    I can't comment about your attempts at multi-byte compatibility. But the
    code snippet you posted doesn't include assignments for all the variables
    you've used, so a reader has to guess as to your intent. It's also evident
    that you didn't test your code snippet under 'use strict;' and 'use
    warnings;' before posting.

    That being said, I'll provide a few comments on the code and then post what
    I think is a cleaned-up version of what you intend. You can take it from
    there. Note: your code did not entail use of any Perl module. Hence, no
    need to post to comp.lang.perl.modules; comp.lang.perl.misc would have
    sufficed.

    >
    > my %g_hMsds;
    > keys %g_hMsds = 60160;


    What's the purpose of the above? In Perl, you don't need to pre-allocate
    the number of keys in a hash.

    > open IN, "<$g_strPrimaryFile" or die "Error opening file\n"
    > $i = 0;
    > while (<IN>) {
    >
    > my @aSplit = split /\t/, $_;
    > my @aTemp = ();
    >
    > # insert into array
    > $aTemp[0] = $aSplit[3];
    > $aTemp[1] = $g_hLang{$aSplit[0]};


    %g_hLang was not previously declared.

    > $aTemp[2] = $aSplit[1];
    > $aTemp[3] = $aSplit[4];
    > $aTemp[4] = "";
    > $aTemp[5] = $aSplit[7];
    > $aTemp[6] = $aSplit[8];
    >
    > #attach the array
    > $g_hMsds{$aSplit[3]} = \@aTemp;


    You're using @aTemp only to assign to %g_hMsds. See below how to eliminate
    it.

    >
    > $i++;
    >
    > if ($i >= $g_nMaxFiles) {
    > logResult("EXIT LOOP: ".$i." rows run");


    sub logResult not provided. See my guess at a substitution below and note
    simpler code.
    > last;
    > }
    >
    > }
    > close IN;


    use strict;
    use warnings;
    use Data::Dumper;

    my (%g_hMsds, $i, $g_nMaxFiles);
    $i = 0;
    $g_nMaxFiles = 3;

    # Fields in the __DATA__ section are separated by single tab characters.
    while (<DATA>) {
        chomp;    # keep the trailing newline out of the last field
        my @aSplit = split /\t/, $_;
        $g_hMsds{$aSplit[3]} =
            [ $aSplit[3], 'arbitrary', $aSplit[1], $aSplit[4],
              '', $aSplit[7], $aSplit[8] ];
        $i++;
        last if $i >= $g_nMaxFiles;
    }

    print "EXIT LOOP: $i rows run\n";
    print Dumper(\%g_hMsds);

    __DATA__
    alpha	beta	gamma	delta	epsilon	zeta	eta	theta	iota
    kappa	lambda	mu	nu	xi	omicron	pi	rho	sigma	tau
    1	2	3	4	5	6	7	8	9
    q	w	e	r	t	y	u	i	o
    a	s	d	f	g	h	j	k	l
     
    James E Keenan, Nov 26, 2003
    #2

  3. Mihai N.

    Mihai N. Guest


    > Hey guys, I need to do some parsing on a file that includes Japanese
    > Shift JIS and Chinese GB1312 and was wondering if someone could help
    > me with some errors I'm getting.

    Nobody answered her, so I will give it a try :)

    > I am not entirely sure what pragmas I need to use, or really
    > how to open a wide character file properly (are GB1312 and Japanese
    > Shift JIS wide chars? Is that different from utf8?)

    Nothing special with Perl 5.6.
    "GB1312" is in fact GB2312, which is used for Simplified Chinese.
    Both GB2312 and Shift JIS are double-byte character sets (DBCS).
    That does not mean they are wide-character encodings:
    some characters take one byte, some take two.
    This is why searching or search-and-replace on raw bytes is often a
    problem: a matching byte can be one half of a two-byte character.
    For instance, the backslash (0x5C) can be the second byte of several
    Japanese characters. The same goes for other ASCII characters (the
    second byte of a Shift JIS character can be 0x40 or higher).
    And yes, they are very different from UTF-8.
    A DBCS character takes 1 or 2 bytes; a UTF-8 character takes up to 4
    (the original design allowed up to 6).
    A DBCS covers one character set only (Simplified Chinese or Japanese, in
    this case), while UTF-8 covers the whole of Unicode.
    For a DBCS it is not possible to tell which bytes are lead or trail bytes
    without help from the OS or hard-coded tables, and the tables differ
    from one DBCS to another. UTF-8 is self-describing, so no tables are needed.
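
    To make the backslash point concrete, here is a minimal sketch. It
    assumes Perl 5.8 or later with the core Encode module (not part of a
    stock 5.6), and the helper names are made up for illustration. The
    Shift JIS character 表 is stored as the two bytes 0x95 0x5C, so a
    byte-level search reports a backslash that no longer exists once the
    text is decoded.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: count backslashes by looking at the raw bytes only.
sub count_backslashes_in_bytes {
    my ($bytes) = @_;
    my $count = () = $bytes =~ /\\/g;
    return $count;
}

# Hypothetical helper: decode Shift JIS to characters first, then count.
sub count_backslashes_in_text {
    my ($bytes) = @_;
    my $text  = decode('shiftjis', $bytes);
    my $count = () = $text =~ /\\/g;
    return $count;
}

# One Shift JIS character (U+8868) whose trail byte is 0x5C.
my $one_char = "\x95\x5C";

printf "byte-level matches: %d, character-level matches: %d\n",
    count_backslashes_in_bytes($one_char),
    count_backslashes_in_text($one_char);
```

    The byte-level count is 1 and the character-level count is 0, which is
    the false match described above.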

    > I have been
    > trying to do research on multilingual support for perl 5.6, but it is
    > highly confusing and I am positive I am missing something.

    Main question: why 5.6? 5.8 has been out for a long time already, and it
    is much better at handling this kind of problem.
    It supports UTF-8, regular expressions on UTF-8, etc.
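
    As a rough sketch of what 5.8 makes possible (again assuming the core
    Encode module): the tab (0x09) is below 0x40, so it can never be the
    second byte of a Shift JIS character, which means splitting the raw
    line on tabs is safe and each field can be decoded afterwards. The
    helper name is hypothetical. 'shiftjis' and 'euc-cn' are Encode's
    names for the two encodings, and the sketch assumes the GB2312 data is
    stored in its usual EUC-CN form.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: split a raw (undecoded) line on tabs, then decode
# every field from the given encoding into Perl character strings.
sub split_and_decode {
    my ($raw_line, $encoding) = @_;
    $raw_line =~ s/\r?\n\z//;                  # drop the line ending
    my @raw_fields = split /\t/, $raw_line, -1; # -1 keeps empty trailing fields
    return map { decode($encoding, $_) } @raw_fields;
}

# Example: one Shift JIS character, a tab, then plain ASCII.
my @fields = split_and_decode("\x95\x5C\tabc\n", 'shiftjis');
printf "%d fields, first field is %d character(s) long\n",
    scalar(@fields), length $fields[0];
```

    If each row carries its own language code, as the original snippet
    suggests, the encoding name could be looked up per row before calling
    the helper.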

    > My program
    > is exiting early without having read the entire file (at least, it is
    > only getting through about 10K of a 20K line file).

    There is no reason for the read to stop, whatever the encoding.
    I suspect something else.
    Tell us more about the OS and the data file (is there a risk that it
    contains control characters?).
    Does it always stop in the same place? Did you try deleting some lines from
    the beginning of the file to see where it stops then? Maybe there is
    a certain line that stops it.
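
    One way to check for stray control characters is sketched below. The
    helper name is made up, and whether this is the actual cause is only a
    guess: on Windows, a file read without binmode can treat the byte 0x1A
    (Ctrl-Z) as end of file. The sketch reads in binmode and lists every
    control byte other than tab, CR and LF together with its line number.
    It uses only core features that should also work on 5.6.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [line_number, byte_value] pairs,
# one for each control byte found (tab, CR and LF are ignored).
sub find_control_bytes {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot open $path: $!\n";
    binmode $in;    # raw bytes: no CRLF translation, no special EOF byte
    my @hits;
    while (my $line = <$in>) {
        while ($line =~ /([\x00-\x08\x0B\x0C\x0E-\x1F])/g) {
            push @hits, [ $., ord $1 ];
        }
    }
    close $in;
    return @hits;
}

# Usage: perl scan.pl datafile.txt
if (@ARGV) {
    printf "line %d: byte 0x%02X\n", @$_ for find_control_bytes($ARGV[0]);
}
```

    If the first reported line number is close to where the program stops,
    adding binmode to the original read loop would be the next thing to try.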

    > I've included a
    > code snippet and stripped out any attempts at multi-byte compatibility
    > I've attempted in the hopes that someone will spot what is obviously
    > wrong with it.

    Nothing obviously wrong,
    except that there is no ; after "open IN, ..."
    and %g_hLang is used but not defined.

    You also increment $i for each line you read, then compare it
    against $g_nMaxFiles (again undefined) and exit the loop.
    Is this what you want? To exit after $g_nMaxFiles lines?
    Maybe this is the problem, and it has nothing to do with the encoding.

    --
    Mihai
    -------------------------
    Replace _year_ with _ to get the real email
     
    Mihai N., Nov 30, 2003
    #3
