Could a large hash leak?

Kimia

hi, girls and dudes,

I suspect that a hash might leak when it holds a large
number of pairs.
Recently I was asked to do some statistics work over large
files. All I wanted to do was find the duplicated lines of a file,
so I wrote the snippet below:
code:
mysort.pl
------------------------
#!/usr/bin/perl

use strict;
use warnings;
my %in;
my $cnt = 0;
while (<>) {
    chomp;
    $_ or ++$cnt, next;
    ++$in{$_};
}
foreach (sort keys %in) {
    $cnt += $in{$_};
    print "$_*$in{$_}\n";
}

------------------------


When the input file contains only a few lines, it works perfectly well.

data file:
in1.dat
------------------------
1aa
2bbbbb
3cc
1aa
5dd
------------------------

$ ./mysort.pl in1.dat
Then I got:
------------------------
1aa*2
2bbbbb*1
3cc*1
5dd*1
------------------------

However, when I used it on a large file containing 10M lines, it
failed.

$ ./mysort <TenLinesInput.dat >out
$ echo $?
0
$ tail out -n 5
------------------------
??????????????*2
????????????????*1
??????????????????*1
?????????????????????????????*2834
????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????????
*1
------------------------
Here '?' is \0xff when the output is viewed as a binary file.
I'm sure that the input contains no \0xff characters. Most lines
are tens of characters long; few exceed 100 and none exceeds 1000.
All the other output lines, except the last 10, are as expected.

Then I tried it on an input file of one million lines
and it failed in the same way; I tried it on an input file of 100k
lines and it worked fine.
I am not sure whether this is a bug. If anyone knows the reason,
would you please tell us?

thank you for your attention.
 
Mirco Wahab

Kimia said:
Recently I was asked to do some statistics work over large
files. All I wanted to do was find the duplicated lines of a file,
so I wrote the snippet below:
code:
...
However, when I used it on a large file containing 10M lines, it
failed.

$ ./mysort <TenLinesInput.dat >out
$ echo $?
0
$ tail out -n 5
------------------------
??????????????*2
????????????????*1
??????????????????*1
?????????????????????????????*2834

This might depend on the properties of the input file.
Which encoding does it use: UTF-8/16 or plain ASCII?

What system are you working on, and what Perl version is installed?

Regards

M.
 
xhoster

Kimia said:
hi, girls and dudes,

I suspect that a hash might leak when it holds a large
number of pairs.
Recently I was asked to do some statistics work over large
files. All I wanted to do was find the duplicated lines of a file,
so I wrote the snippet below:
code:
mysort.pl
------------------------
#!/usr/bin/perl

use strict;
use warnings;
my %in;
my $cnt = 0;
while (<>) {
    chomp;
    $_ or ++$cnt, next;
    ++$in{$_};
}
foreach (sort keys %in) {
    $cnt += $in{$_};
    print "$_*$in{$_}\n";
}

What is the stuff with $cnt?
However, when I used it on a large file containing 10M lines, it
failed.

It doesn't fail. It gives you output you didn't expect.
$ ./mysort <TenLinesInput.dat >out
$ echo $?
0
$ tail out -n 5
------------------------
??????????????*2
????????????????*1
??????????????????*1
?????????????????????????????*2834
?????????????????????????????????????????????????????????????????????????
?????????????????????????????????????????????????????????????????????????
?????????????????????????????????????????????????????????????????????????
????????????????????? *1

I am not sure of that. Try this and see what it gives, and if
it consistently gives the same thing:

perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat
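For anyone who prefers a script to the one-liner above, here is a minimal sketch of the same check. The helper name lines_with_byte and the script name find_ff.pl are made up for illustration, and the sketch assumes the file should be compared as raw bytes with no decoding.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the line numbers of every line read from $fh
# that contains the raw byte $byte (0xff if no byte is given).
sub lines_with_byte {
    my ($fh, $byte) = @_;
    $byte = chr(0xff) unless defined $byte;
    binmode $fh;                      # compare raw bytes, no decoding layer
    my @hits;
    my $n = 0;
    while (my $line = <$fh>) {
        $n++;
        push @hits, $n if index($line, $byte) != -1;
    }
    return @hits;
}

# Usage (assumed script name): perl find_ff.pl TenLinesInput.dat
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    my @hits = lines_with_byte($fh);
    print scalar(@hits), " line(s) contain 0xff\n";
    print "$_\n" for @hits;
}
```

If this reports any line numbers, the 0xff bytes are already present in the input rather than being produced by the hash.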

Most lines
are tens of characters long; few exceed 100 and none exceeds 1000.
All the other output lines, except the last 10, are as expected.

Then I tried it on an input file of one million lines
and it failed in the same way;

It didn't fail with an error. The value of $? shows that. (And I don't
see anything suggestive of a "leak", either.) It seems like what it comes
down to is that you and Perl disagree over what is in your file.


Xho
 
J. Gleixner

Kimia said:
hi, girls and dudes,

I suspect that a hash might leak when it holds a large
number of pairs.

You could also try uniq with the -d and -c options on the sorted file (uniq only compares adjacent lines): see man uniq
 
Kimia

Mirco Wahab wrote:

This might depend on the properties of the input file.
Which encoding does it use: UTF-8/16 or plain ASCII?

What system are you working on, and what Perl version is installed?

Regards

M.

The file is encoded in GB2312, which is ASCII-compatible and is
used in the P.R. China.
 
Kimia

What is the stuff with $cnt?

It doesn't fail. It gives you output you didn't expect.

I am not sure of that. Try this and see what it gives, and if
it consistently gives the same thing:

perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat

It didn't fail with an error. The value of $? shows that. (And I don't
see anything suggestive of a "leak", either.) It seems like what it comes
down to is that you and Perl disagree over what is in your file.

Xho

Thanks, Xho. I've found the bug, which, of course, was my own.
The output file is perfectly correct. The input file does contain
lines of ????.
Before debugging, I had tried:
$ perl -lne 'print if /^\0xff/'
and it printed nothing, so I took my assumption as confirmed.
However, the regex should have been: /^\xff/
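
To spell out why the first pattern printed nothing: in a Perl regex, \0 is an octal escape for the NUL byte, so /^\0xff/ looks for NUL followed by the literal characters "xff", while \xff is the hexadecimal escape for the single byte 0xff. A small sketch, using a made-up sample line, shows the difference:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Made-up sample: a line that starts with the byte 0xff.
my $line = "\xff\xff\xff*1";

# \0 is the NUL byte, so this pattern means NUL followed by "xff".
my $wrong = ($line =~ /^\0xff/) ? 1 : 0;

# \xff is the single byte 0xff.
my $right = ($line =~ /^\xff/) ? 1 : 0;

# The mistaken pattern does match a NUL byte followed by "xff".
my $nul = ("\0xff" =~ /^\0xff/) ? 1 : 0;

print "wrong pattern matched the 0xff line: $wrong\n";
print "right pattern matched the 0xff line: $right\n";
print "wrong pattern matched NUL + 'xff':   $nul\n";
```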

This was part of the voluminous log-file processing that I was asked
to do.
The byte \xff should not appear in the normal encoding and must have
been generated in some unusual situation.
The code that I posted was written for debugging after I found
exceptions in other processing. However, I did not succeed with it,
and it was so stupid~
Debugging is meant to drive out errors, but it can bring in stupidity too :)
Thanks for all your help.

ps:
perl -lne 'print $. unless -1==index $_, chr(0xff)' TenLinesInput.dat

I tried this line and it did help me.
 
