Can a large hash leak?

Kimia · Jul 27, 2007

hi, girls and dudes,

I wonder whether a hash might leak when it holds a large number
of pairs.
Recently I was asked to do some statistical work on large
files. All I wanted was to find the duplicated lines of a file,
so I wrote the snippet below:
code:
mysort.pl
------------------------
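#!/usr/bin/perl

use strict;
use warnings;

my %in;        # line text => number of occurrences
my $cnt = 0;   # total line count (computed but never printed)
while (<>) {
    chomp;
    # Count empty lines but keep them out of the hash.
    # (A bare "$_ or ..." test would also skip a line that is just "0".)
    if ($_ eq '') { ++$cnt; next; }
    ++$in{$_};
}
# Print each distinct line with its count, in sorted order.
foreach (sort keys %in) {
    $cnt += $in{$_};
    print "$_*$in{$_}\n";
}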

------------------------

When the input file contains only a few lines, it works perfectly well.

data file:
in1.dat
------------------------
1aa
2bbbbb
3cc
1aa
5dd
------------------------

$ ./mysort.pl in1.dat
Then I got:
------------------------
1aa*2
2bbbbb*1
3cc*1
5dd*1
------------------------

However, when I used it on a large file containing 10M lines, it
failed.

$ ./mysort <TenLinesInput.dat >out
$ echo $?
0
$ tail out -n 5
------------------------
??????????????*2
????????????????*1
??????????????????*1
?????????????????????????????*2834
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
*1
------------------------
Here '?' is the byte 0xff when the output is viewed as a binary file.
I'm sure that the input contains no 0xff byte. Most of the lines
are tens of characters long, a few exceed 100 and none exceeds 1000.
All the other output lines, except the last 10, are as expected.

Then I tried it on an input file of one million lines
and it failed in the same way; I tried it on an input file of 100k
lines and it worked fine.
I am not sure whether this is a bug. If anyone knows the reason,
would you please tell us?

Thank you for your attention.

Mirco Wahab · Jul 27, 2007

Kimia said:
Recently I was asked to do some statistical work on large
files. All I wanted was to find the duplicated lines of a file,
so I wrote the snippet below:
code:
...
However, when I used it on a large file containing 10M lines, it
failed.

$ ./mysort <TenLinesInput.dat >out
$ echo $?
0
$ tail out -n 5
------------------------
??????????????*2
????????????????*1
??????????????????*1
?????????????????????????????*2834

This might depend on the properties of the input file.
Which encoding does it use: UTF-8/16 or plain ASCII?

Which system are you working on, and which Perl version is installed?

Regards

M.

xhoster · Jul 27, 2007

Kimia said:
hi, girls and dudes,

I wonder whether a hash might leak when it holds a large number
of pairs.
Recently I was asked to do some statistical work on large
files. All I wanted was to find the duplicated lines of a file,
so I wrote the snippet below:
code:
mysort.pl
------------------------
#!/usr/bin/perl

use strict;
use warnings;
my %in;
my $cnt = 0;
while(<>){
chomp;
$_ or ++$cnt, next;
++$in{$_};
}
foreach(sort keys %in){
$cnt += $in{$_};
print "$_*$in{$_}\n";
}

What is the stuff with $cnt?

However, when I used it on a large file containing 10M lines, it
failed.

It doesn't fail. It gives you output you didn't expect.

$ ./mysort <TenLinesInput.dat >out
$ echo $?
0
$ tail out -n 5
------------------------
??????????????*2
????????????????*1
??????????????????*1
?????????????????????????????*2834
?????????????????????????????????????????????????????????????????????????
?????????????????????????????????????????????????????????????????????????
?????????????????????????????????????????????????????????????????????????
????????????????????? *1

I am not so sure that the input contains no 0xff bytes. Try this and
see what it prints, and whether it consistently prints the same thing:

perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat

Most of the lines
are tens of characters long, a few exceed 100 and none exceeds 1000.
All the other output lines, except the last 10, are as expected.

Then I tried it on an input file of one million lines
and it failed in the same way;

It didn't fail with an error. The value of $? shows that. (And I don't
see anything suggestive of a "leak", either.) It seems like what it comes
down to is that you and Perl disagree over what is in your file.

Xho
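
For reference, the one-liner above can be spelled out as a small
script. This is only a sketch: the sub name lines_with_byte is made up
for illustration, and it reads an in-memory sample instead of
TenLinesInput.dat so that it runs on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the numbers of the lines read from $fh
# that contain the given byte (0xff by default).
sub lines_with_byte {
    my ($fh, $byte) = @_;
    $byte = chr(0xff) unless defined $byte;
    binmode $fh;                      # read raw bytes, no decoding layer
    my @hits;
    while (my $line = <$fh>) {
        # index() returns -1 when the byte does not occur in the line
        push @hits, $. if index($line, $byte) != -1;
    }
    return @hits;
}

# Small in-memory sample standing in for the real input file.
my $sample = "1aa\n\xff\xff\xff\n3cc\n5dd\xff\n";
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my @hits = lines_with_byte($fh);
close $fh;

print "lines containing 0xff: @hits\n";
```

If the real file reports any line numbers at all, the 0xff bytes are
in the input and the hash is only echoing them back.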

J. Gleixner · Jul 27, 2007

Kimia said:
hi, girls and dudes,

I wonder whether a hash might leak when it holds a large number
of pairs.

You could also try uniq with the -d and -c options (on sorted input,
since uniq only compares adjacent lines): see man uniq

Kimia · Jul 28, 2007

Mirco Wahab said:

This might depend on the properties of the input file.
Which encoding does it use: UTF-8/16 or plain ASCII?

Which system are you working on, and which Perl version is installed?

Regards

M.

The file is encoded in gb2312, which is ASCII-compatible and is
used in P.R. China.
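
A side note on the encoding: assuming the file uses the usual EUC-CN
byte layout of gb2312, every byte is either plain ASCII (0x00-0x7f) or
lies in the range 0xa1-0xfe, so a byte in 0x80-0xa0 or the byte 0xff
cannot belong to valid text. A minimal sketch of a check built on that
assumption (the sub name has_non_gb2312_byte is hypothetical):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the line holds a byte that can never
# appear in EUC-CN encoded gb2312 text (0x80-0xa0 or 0xff).
# It does not check that two-byte characters are paired correctly.
sub has_non_gb2312_byte {
    my ($line) = @_;
    return $line =~ /[\x80-\xa0\xff]/ ? 1 : 0;
}

my $ascii   = "1aa";                 # plain ASCII
my $chinese = "\xc4\xe3\xba\xc3";    # two gb2312 characters as raw bytes
my $garbage = "\xff\xff\xff";        # the kind of line seen in the output

printf "%-8s %d\n", 'ascii',   has_non_gb2312_byte($ascii);
printf "%-8s %d\n", 'chinese', has_non_gb2312_byte($chinese);
printf "%-8s %d\n", 'garbage', has_non_gb2312_byte($garbage);
```

Run over the raw bytes of each input line, a check like this would
flag the ???? lines without relying on how the terminal displays them.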

Kimia · Jul 28, 2007

xhoster said:

What is the stuff with $cnt?

It doesn't fail. It gives you output you didn't expect.

I am not sure of that. Try this and see what it gives, and if
it consistently gives the same thing:

perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat

It didn't fail with an error. The value of $? shows that. (And I don't
see anything suggestive of a "leak", either.) It seems like what it comes
down to is that you and Perl disagree over what is in your file.

Xho

Thanks, Xho. I've found the bug, which, of course, was my own.
The output file is perfectly correct. The input file really does
contain lines of ????.
Before debugging, I had tried:
$ perl -lne 'print if /^\0xff/'
and it printed nothing, which convinced me that my assumption was right.
However, the regex should have been: /^\xff/
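
To spell out the difference between the two patterns: in a Perl regex,
\0 is an octal escape for the NUL byte, so /^\0xff/ looks for NUL
followed by the literal text "xff", while \xff is the hex escape for
the single byte 0xff. A small sketch with made-up sample strings:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $bad_line = "\xff\xff\xff";   # a line made of 0xff bytes
my $odd_line = "\0xff";          # NUL byte followed by the text "xff"

# /^\0xff/ means: NUL, then the literal characters x, f, f.
my $wrong_on_bad = $bad_line =~ /^\0xff/ ? 1 : 0;
my $wrong_on_odd = $odd_line =~ /^\0xff/ ? 1 : 0;

# /^\xff/ means: the single byte 0xff.
my $right_on_bad = $bad_line =~ /^\xff/ ? 1 : 0;

print "/^\\0xff/ on 0xff line:     $wrong_on_bad\n";
print "/^\\0xff/ on NUL+'xff' line: $wrong_on_odd\n";
print "/^\\xff/  on 0xff line:     $right_on_bad\n";
```

So the first one-liner printing nothing said nothing about whether
0xff bytes were present.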

This was part of some voluminous log-file processing that I was asked
to do.
The byte 0xff should not occur in a normal encoding, so it must have
been generated in some unusual situation.
I wrote the code I posted as a debugging aid after I hit exceptions in
other processing. However, it did not get me anywhere at first, and
that was so stupid of me~
Before debugging drives out errors, it brings in stupidity.

Thanks for all your help.

ps:

perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat

I tried this line and it did help me.
