Ying Hu said:
My perl script is as the following, but I do know that is not good one.
The two data sets are in text files and very huge.
a bit more info might still help to optimize
how big datasets?
are the A-intervals overlapping?
what are the value ranges? (for example integers 0-1000)
foreach $b (sort {$B_point{a}<=>$B_point{$b} keys %B_point){
you are sorting the dataset B , but nothing in the following requires it
unless the output needs to be sorted. in that case it might be more
efficient to sort
the output if it is a much smaller set than B
foreach $a (sort {$A_from{$a} <=> $A_from{$b} keys %A_from){
here you sort unnecessarily A for EACH element of B (bad if B is large)
better to sort it before (if you want to sort it at all)
if ($B_point{$b} >= $A_from{$a} and $B_point{$b} <= $A_to){
print "A:$a B:$b\n";
}
}
}
try to move the sort to before the loops and see if you have improvements.
@Aset=sort {$A_from{$a} <=> $A_from{$b} keys %A_from;
@Bset=sort {$B_point{$a}<=>$B_point{$b} keys %B_point;
foreach $b (@Bset) {
foreach $a (@Aset) {
...
if this improvement is not enough, you might make the innerloop slightly
more complex and make use of the fact that (if A and B are sorted)
for any given $Bset[x] >=$Aset[y], you do not need to check $Aset[0..y-1]
when processing $Bset[x+1]. in other words, for each element of Bset,
you can start the inner loop at the first @Aset element that passed the
$B_point{$b} >= $A_from{$a} test in the previous B
also if A is sorted, you can exit from the inner loop with last as soon as
$A_from{$a}>$B_point{$b},
as all following intervals will just be more distant from the B point
another problem is memory usage. if set B is huge and much larger than set A
you might want/need to only read A into memory, and process B line by line.
if both are too huge to read into memory, you might need to use another
method altogether
good luck
gnari