Kimia
hi, girls and dudes,
I suspect that a hash might leak when it holds a large number of pairs.
Recently I was asked to do some statistics work on large files. All I
wanted to do was find the duplicated lines of a file, so I wrote the
snippet below:
code:
mysort.pl
------------------------
#!/usr/bin/perl
use strict;
use warnings;

my %in;
my $cnt = 0;
while (<>) {
    chomp;
    $_ or ++$cnt, next;   # count blank lines and skip them
    ++$in{$_};            # tally each distinct line
}
foreach (sort keys %in) {
    $cnt += $in{$_};
    print "$_*$in{$_}\n";
}
------------------------
When the input file contains only a few lines, it works perfectly.
data file:
in1.dat
------------------------
1aa
2bbbbb
3cc
1aa
5dd
------------------------
$ ./mysort.pl in1.dat
then I got:
------------------------
1aa*2
2bbbbb*1
3cc*1
5dd*1
------------------------
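As a cross-check (my addition, not part of the original script), the same counts can be reproduced with standard Unix tools; `sort` spills to temporary files on disk, so it does not need to hold every distinct line in memory the way the hash does. This sketch assumes lines contain no whitespace, as in the sample:

```shell
# sort brings duplicate lines together; uniq -c prefixes each with its count;
# awk reshapes "      2 1aa" into "1aa*2" (only safe for lines without spaces).
sort in1.dat | uniq -c | awk '{print $2 "*" $1}'
# -> 1aa*2
#    2bbbbb*1
#    3cc*1
#    5dd*1
```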
However, when I used it on a large file containing 10M lines, it
failed.
$ ./mysort <TenLinesInput.dat >out
$ echo $?
0
$ tail out -n 5
------------------------
??????????????*2
????????????????*1
??????????????????*1
?????????????????????????????*2834
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
*1
------------------------
Each '?' above is the byte 0xFF when the output is viewed as binary.
I'm sure the input contains no 0xFF bytes. Most lines are tens of
characters long, a few exceed 100, and none exceeds 1000.
The other output lines, all except the last 10, are as expected.
Then I tried an input file of one million lines and it failed the
same way; an input file of 100k lines worked fine.
I'm not sure whether this is a bug. If anyone knows the reason,
would you please tell us?
thank you for your attention.