Could a large hash leak?

Discussion in 'Perl Misc' started by Kimia, Jul 27, 2007.

  1. Kimia

    Kimia Guest

    hi, girls and dudes,

    ...I suspect that a hash might leak when it holds a large
    number of pairs.
    Recently I was asked to do some statistics work on large
    files. All I wanted was to find the duplicated lines of a file,
    so I wrote the snippet below:
    code:
    mysort.pl
    ------------------------
    #!/usr/bin/perl

    use strict;
    use warnings;
    my %in;
    my $cnt = 0;
    while (<>) {
        chomp;
        $_ or ++$cnt, next;
        ++$in{$_};
    }
    foreach (sort keys %in) {
        $cnt += $in{$_};
        print "$_*$in{$_}\n";
    }

    ------------------------


    When the input file contains only a few lines, it works perfectly well.

    data file:
    in1.dat
    ------------------------
    1aa
    2bbbbb
    3cc
    1aa
    5dd
    ------------------------

    $ ./mysort.pl in1.dat
    Then I got:
    ------------------------
    1aa*2
    2bbbbb*1
    3cc*1
    5dd*1
    ------------------------

    However, when I used it for a large file, which contains 10M lines, it
    failed.

    $ ./mysort <TenLinesInput.dat >out
    $ echo $?
    0
    $ tail out -n 5
    ------------------------
    ??????????????*2
    ????????????????*1
    ??????????????????*1
    ?????????????????????????????*2834
    ????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
    *1
    ------------------------
    Here '?' is \0xff when the file is viewed as binary.
    I'm sure that the input contains no such char as \0xff. Most lines
    are tens of chars long, few exceed 100 and none exceeds 1000.
    All the other output lines, except the last 10, are as expected.

    Then I tried it on an input file comprising one million lines,
    and it failed with the same error; I tried it on an input file of 100k
    lines and it worked fine.
    I am not sure whether this is a bug. If anyone knows the reason,
    would you please tell us?

    thank you for your attention.

    --
    uita uinum est.
     
    Kimia, Jul 27, 2007
    #1

  2. Kimia

    Mirco Wahab Guest

    Kimia wrote:
    > Recently I was asked to do some statistics work on large
    > files. All I wanted was to find the duplicated lines of a file,
    > so I wrote the snippet below:
    > code:
    > ...
    > However, when I used it for a large file, which contains 10M lines, it
    > failed.
    >
    > $ ./mysort <TenLinesInput.dat >out
    > $ echo $?
    > 0
    > $ tail out -n 5
    > ------------------------
    > ??????????????*2
    > ????????????????*1
    > ??????????????????*1
    > ?????????????????????????????*2834
    > ------------------------
    > Here '?' is \0xff when the file is viewed as binary.
    > I'm sure that the input contains no such char as \0xff. Most lines
    > are tens of chars long, few exceed 100 and none exceeds 1000.


    This might depend on the properties of the input file.
    Which encoding does it use: UTF-8/16 or plain ASCII?

    What system are you working on, and which Perl version is installed?

    Regards

    M.
     
    Mirco Wahab, Jul 27, 2007
    #2

  3. Kimia

    Guest

    Kimia <> wrote:
    > hi, girls and dudes,
    >
    > ...I suspect that a hash might leak when it holds a large
    > number of pairs.
    > Recently I was asked to do some statistics work on large
    > files. All I wanted was to find the duplicated lines of a file,
    > so I wrote the snippet below:
    > code:
    > mysort.pl
    > ------------------------
    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    > my %in;
    > my $cnt = 0;
    > while(<>){
    > chomp;
    > $_ or ++$cnt, next;
    > ++$in{$_};
    > }
    > foreach(sort keys %in){
    > $cnt += $in{$_};
    > print "$_*$in{$_}\n";
    > }


    What is the stuff with $cnt?

    >
    > However, when I used it for a large file, which contains 10M lines, it
    > failed.


    It doesn't fail. It gives you output you didn't expect.

    >
    > $ ./mysort <TenLinesInput.dat >out
    > $ echo $?
    > 0
    > $ tail out -n 5
    > ------------------------
    > ??????????????*2
    > ????????????????*1
    > ??????????????????*1
    > ?????????????????????????????*2834
    > ?????????????????????????????????????????????????????????????????????????
    > ?????????????????????????????????????????????????????????????????????????
    > ?????????????????????????????????????????????????????????????????????????
    > ????????????????????? *1
    > ------------------------
    > Here '?' is \0xff when the file is viewed as binary.
    > I'm sure that the input contains no such char as \0xff.


    I am not sure of that. Try this and see what it gives, and if
    it consistently gives the same thing:

    perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat
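
    For anyone who prefers a script to the one-liner above, here is a
    minimal sketch of the same check. The sub name lines_with_ff and the
    sample data are made up for illustration; it reads raw bytes and
    returns the numbers of the lines that contain a 0xFF byte.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the 1-based numbers of the lines that contain the byte 0xFF.
# Takes an already opened filehandle, so it can be tried on in-memory data.
# (lines_with_ff is a hypothetical helper name, not part of any module.)
sub lines_with_ff {
    my ($fh) = @_;
    binmode $fh;                      # read raw bytes, no encoding layer
    my @hits;
    my $lineno = 0;
    while (my $line = <$fh>) {
        $lineno++;
        push @hits, $lineno if index($line, "\xff") >= 0;
    }
    return @hits;
}

# Made-up sample data standing in for the real input file.
my $sample = "1aa\n\xff\xff\xff\n3cc\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
print join(',', lines_with_ff($fh)), "\n";
close $fh;
```

    To run it on a real file, open that file instead of the in-memory
    string and pass the handle to the sub.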


    > Most lines
    > are tens of chars long, few exceed 100 and none exceeds 1000.
    > All the other output lines, except the last 10, are as expected.
    >
    > Then I tried it on an input file comprising one million lines,
    > and it failed with the same error;


    It didn't fail with an error. The value of $? shows that. (And I don't
    see anything suggestive of a "leak", either.) It seems like what it comes
    down to is that you and Perl disagree over what is in your file.


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jul 27, 2007
    #3
  4. Kimia

    J. Gleixner Guest

    Kimia wrote:
    > hi, girls and dudes,
    >
    > ...I suspect that a hash might leak when it holds a large
    > number of pairs.


    You could also try using uniq with the -d and -c options (on sorted input): man uniq
     
    J. Gleixner, Jul 27, 2007
    #4
  5. Kimia

    Kimia Guest

    On Jul 27, 15:45, Mirco Wahab <> wrote:
    > Kimia wrote:


    > > ?????????????????????????????*2834
    > > ------------------------
    > > Here '?' is \0xff when the file is viewed as binary.
    > > I'm sure that the input contains no such char as \0xff. Most lines
    > > are tens of chars long, few exceed 100 and none exceeds 1000.

    >
    > This might depend on the properties of the input file.
    > Which encoding does it use: UTF-8/16 or plain ASCII?
    >
    > What system are you working on, and which Perl version is installed?
    >
    > Regards
    >
    > M.


    The file is encoded in GB2312, which is ASCII-compatible and is
    used in P.R. China.
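
    A rough way to check whether a line is plausible GB2312 is to look
    at its byte structure. The sketch below assumes the file uses the
    EUC-CN form of GB2312 (plain ASCII bytes, or a lead byte 0xA1-0xF7
    followed by a trail byte 0xA1-0xFE). looks_like_gb2312 is a
    hypothetical helper name, and it checks structure only, not whether
    every code point is actually assigned. A 0xFF byte fits neither
    range, so lines containing it are flagged.

```perl
use strict;
use warnings;

# True if the byte string consists only of ASCII bytes and two-byte
# sequences in the EUC-CN ranges (assumed encoding form of GB2312).
# Structural check only; unassigned code points are not detected.
sub looks_like_gb2312 {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:[\x00-\x7f]|[\xa1-\xf7][\xa1-\xfe])*\z/ ? 1 : 0;
}

# Made-up samples: ASCII plus two GB2312 characters, then a run of 0xFF.
for my $line ("abc\xd6\xd0\xce\xc4", "\xff\xff\xff") {
    printf "%s: %s\n", unpack('H*', $line),
        looks_like_gb2312($line) ? 'ok' : 'suspect';
}
```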
     
    Kimia, Jul 28, 2007
    #5
  6. Kimia

    Kimia Guest

    >On Jul 28, 03:12, wrote:
    > Kimia <> wrote:
    > > hi, girls and dudes,

    >
    > > ...I suspect that a hash might leak when it holds a large
    > > number of pairs.
    > > Recently I was asked to do some statistics work on large
    > > files. All I wanted was to find the duplicated lines of a file,
    > > so I wrote the snippet below:
    > > code:
    > > mysort.pl
    > > ------------------------
    > > #!/usr/bin/perl

    >
    > > use strict;
    > > use warnings;
    > > my %in;
    > > my $cnt = 0;
    > > while(<>){
    > > chomp;
    > > $_ or ++$cnt, next;
    > > ++$in{$_};
    > > }
    > > foreach(sort keys %in){
    > > $cnt += $in{$_};
    > > print "$_*$in{$_}\n";
    > > }

    >
    > What is the stuff with $cnt?
    >
    >
    >
    > > However, when I used it for a large file, which contains 10M lines, it
    > > failed.

    >
    > It doesn't fail. It gives you output you didn't expect.
    >
    >
    >
    >
    >
    > > $ ./mysort <TenLinesInput.dat >out
    > > $ echo $?
    > > 0
    > > $ tail out -n 5
    > > ------------------------
    > > ??????????????*2
    > > ????????????????*1
    > > ??????????????????*1
    > > ?????????????????????????????*2834
    > > ?????????????????????????????????????????????????????????????????????????
    > > ?????????????????????????????????????????????????????????????????????????
    > > ?????????????????????????????????????????????????????????????????????????
    > > ????????????????????? *1
    > > ------------------------
    > > Here '?' is \0xff when the file is viewed as binary.
    > > I'm sure that the input contains no such char as \0xff.

    >
    > I am not sure of that. Try this and see what it gives, and if
    > it consistently gives the same thing:
    >
    > perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat
    >
    > > Most lines
    > > are tens of chars long, few exceed 100 and none exceeds 1000.
    > > All the other output lines, except the last 10, are as expected.

    >
    > > Then I tried it on an input file comprising one million lines,
    > > and it failed with the same error;

    >
    > It didn't fail with an error. The value of $? shows that. (And I don't
    > see anything suggestive of a "leak", either.) It seems like what it comes
    > down to is that you and Perl disagree over what is in your file.
    >
    > Xho
    >
    > --
    > --------------------http://NewsReader.Com/--------------------
    > Usenet Newsgroup Service $9.95/Month 30GB


    Thanks, Xho. I've found the bug, which, of course, I made myself.
    The output file is perfectly correct. The input file does contain
    lines of ????.
    Before debugging, I had tried:
    $ perl -lne 'print if /^\0xff/'
    and it printed nothing, which convinced me that my assumption was right.
    However, the regex should have been: /^\xff/
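
    To see why the first regex found nothing: in a Perl regex, \0 is an
    octal escape for the NUL byte, so /^\0xff/ looks for a NUL followed
    by the literal letters x, f, f, while \xff is the hex escape for the
    byte 0xFF. A small sketch with made-up sample strings:

```perl
use strict;
use warnings;

my $ff_line  = "\xff\xff\xff";   # three bytes of value 0xFF
my $nul_line = "\0xff";          # a NUL byte, then the letters x, f, f

for my $line ($ff_line, $nul_line) {
    my $hex = unpack 'H*', $line;    # show the raw bytes as hex
    printf "%-8s  /^\\0xff/: %-8s  /^\\xff/: %s\n",
        $hex,
        ($line =~ /^\0xff/ ? 'match' : 'no match'),
        ($line =~ /^\xff/  ? 'match' : 'no match');
}
```

    Only the second string matches /^\0xff/, and only the first matches
    /^\xff/, which is why the wrong pattern stayed silent on the real data.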

    It was part of the voluminous log-file processing that I was asked
    to do.
    The byte 0xff should not exist in a normal encoding and must have been
    generated in some uncertain situation.
    The code that I posted was written for debugging when I found
    exceptions in other processing. However, I did not succeed with it,
    and it was so stupid~
    Before debugging could expel the error, it brought in stupidity :)
    Thanks for all your help.

    ps:
    > perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat


    I tried this line and it did help me.

    --
    fous, c'est un mot qu'on dirait invent'e pour nous.
     
    Kimia, Jul 28, 2007
    #6