search interval

Y

Ying Hu

Hi,
There are two data sets,
data set 1:
A1 123 125
A2 129 200
A3 400 420
....
....
data set 2:
B1 126
B2 130
B3 202
....
....

My question is how to get B2 and A2 { 130 (B2) is in [from]129 [to]200
(A2)} .

thanks
Ying
 
G

Gunnar Hjalmarsson

Ying said:
There are two data sets,
data set 1:
A1 123 125
A2 129 200
A3 400 420
...
...
data set 2:
B1 126
B2 130
B3 202
...
...

My question is how to get B2 and A2 { 130 (B2) is in [from]129
[to]200 (A2)} .

I would suggest that you write a script that does what you want. Perl
is probably a suitable programming language, btw.

If you would encounter difficulties, that you after having made
serious attempts to resolve them yourself can't find the solution to,
please post your program here. If you do so, don't forget to comply
with the posting guidelines:
http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

Good luck!
 
R

Ragnar Hafstað

Ying Hu said:
Hi,
There are two data sets,
data set 1:
A1 123 125
A2 129 200
A3 400 420
...
...
data set 2:
B1 126
B2 130
B3 202
...
...

My question is how to get B2 and A2 { 130 (B2) is in [from]129 [to]200
(A2)} .

SELECT b.key,a.key from tableb,tablea where b.val between a.minval and
a.maxval

seriously, we need a bit of context here to fathom what your real question
is.

if your data sets are small, simple iteration might be called for
otherwise, more details could help.
for instance what ranges for A and B values are we talking about here?
what are the sizes of the datasets?
are the A intervals non-overlapping? are the B values unique?

what have you tried so far, and how did that fail?

gnari
 
M

Matt Garrish

Ying Hu said:
Hi,
There are two data sets,
data set 1:
A1 123 125
A2 129 200
A3 400 420
...
...
data set 2:
B1 126
B2 130
B3 202
...
...

My question is how to get B2 and A2 { 130 (B2) is in [from]129 [to]200
(A2)} .

Feels like homework, you don't even seem to have tried, AND you didn't even
make it very clear what you're trying to do (are the identifier values of
any use whatsoever???), but since it's the holidays here's something to get
you started: (and please don't post back asking for modifications to the
code if it doesn't do what you want!)

use strict;
use warnings;

my %test;
my %range;

while (my $line = <DATA>) {

next if $line =~ /^\s*$/;

my @vals = split(/\s+/, $line);

if (scalar(@vals) == 2) {

$test{$vals[0]} = [@vals];

}

elsif (scalar(@vals) == 3) {

$range{$vals[0]} = [@vals];

}

else {

die "Unknown pattern found in data line $.\n";

}

}

# because life is all about pleasing Uri...

my @print;

foreach my $key (sort keys %test) {

my $chk = isbetween($test{$key}[1], \%range);

push @print, $chk if $chk;

}

print @print;



sub isbetween {

my ($v, $href) = @_;

foreach my $r (sort keys %$href) {

# just for clarity I've assigned more comprehensible variable names

my $min = $$href{$r}[1];
my $max = $$href{$r}[2];

if (($v > $min) && ($v < $max)) {
return "The value $v is between $min and $max\n";
}

}

}


__DATA__
A1 123 125
A2 129 200
A3 400 420

B1 126
B2 130
B3 202
 
Y

Ying Hu

My perl script is as the following, but I do know that is not good one.
The two data sets are in text files and very huge.

$infile = "data_sets1.txt";
open IN, $infile or die "$infile $!\n";
while (<IN>){
chomp;
@F = split;
$A_from{$F[0]} = $F[1];
$A_to{$F[0] = $F[2];
}
$infile = "data_sets2.txt";
open IN, $infile or die "$infile $!\n";
while (<IN>){
chomp;
@F = split;
$B_point{$F[0]} = $F[1];
}

foreach $b (sort {$B_point{a}<=>$B_point{$b} keys %B_point){
foreach $a (sort {$A_from{$a} <=> $A_from{$b} keys %A_from){
if ($B_point{$b} >= $A_from{$a} and $B_point{$b} <= $A_to){
print "A:$a B:$b\n";
}
}
}
 
R

Ragnar Hafstað

Ying Hu said:
My perl script is as the following, but I do know that is not good one.
The two data sets are in text files and very huge.
a bit more info might still help to optimize
how big datasets?
are the A-intervals overlapping?
what are the value ranges? (for example integers 0-1000)
foreach $b (sort {$B_point{a}<=>$B_point{$b} keys %B_point){
you are sorting the dataset B , but nothing in the following requires it
unless the output needs to be sorted. in that case it might be more
efficient to sort
the output if it is a much smaller set than B
foreach $a (sort {$A_from{$a} <=> $A_from{$b} keys %A_from){
here you sort unnecessarily A for EACH element of B (bad if B is large)
better to sort it before (if you want to sort it at all)
if ($B_point{$b} >= $A_from{$a} and $B_point{$b} <= $A_to){
print "A:$a B:$b\n";
}
}
}

try to move the sort to before the loops and see if you have improvements.

@Aset=sort {$A_from{$a} <=> $A_from{$b} keys %A_from;
@Bset=sort {$B_point{$a}<=>$B_point{$b} keys %B_point;
foreach $b (@Bset) {
foreach $a (@Aset) {
...

if this improvement is not enough, you might make the innerloop slightly
more complex and make use of the fact that (if A and B are sorted)
for any given $Bset[x] >=$Aset[y], you do not need to check $Aset[0..y-1]
when processing $Bset[x+1]. in other words, for each element of Bset,
you can start the inner loop at the first @Aset element that passed the
$B_point{$b} >= $A_from{$a} test in the previous B

also if A is sorted, you can exit from the inner loop with last as soon as
$A_from{$a}>$B_point{$b},
as all following intervals will just be more distant from the B point

another problem is memory usage. if set B is huge and much larger than set A
you might want/need to only read A into memory, and process B line by line.
if both are too huge to read into memory, you might need to use another
method altogether

good luck
gnari
 
G

Gunnar Hjalmarsson

Ying said:
My perl script is as the following, but I do know that is not good
one.

I agree. It's not good at all.

1) It does not compile.

2) You are not using strictures (the script you posted contains errors
which are easily debugged by using strict).

3) It's not a good idea to use the special variables $a and $b outside
the sort() function, especially in a program that uses sort().

It's good that you posted code this time, but you have obviously not
studied the posting guidelines as suggested in my previous post in
this thread (or you decided to ignore them). Please give it a new try,
complying with the guidelines. It's inconsiderate to expect the people
who you ask for help to start possible efforts to assist you with
correcting typos.
 
A

Anno Siegel

Ragnar Hafstað said:
a bit more info might still help to optimize
how big datasets?
are the A-intervals overlapping?
what are the value ranges? (for example integers 0-1000)

If we assume the ranges are non-overlapping and delimited by smallish
integers, the ranges can be found in constant time without sorting and
searching. First build an array @rangemap that holds, for index $i,
the name of the range $i belongs to, or undef if there is no interval
for $i. Then each interval can be found with a single array lookup:


my @rangemap;

while ( <DATA> ) {
last if /^$/;
my ( $range, $from, $to) = split;
@rangemap[ $from .. $to] = ( $range) x ( $to - $from + 1);
}

while ( <DATA> ) {
my ( $item, $val) = split;
defined and print "$item and $_\n" for $rangemap[ $val];
}

__DATA__
A1 123 125
A2 129 200
A3 400 420

B1 126
B2 130
B3 202

Anno
 
R

Ragnar Hafstað

Anno Siegel said:
If we assume the ranges are non-overlapping and delimited by smallish
integers, the ranges can be found in constant time without sorting and
searching. First build an array @rangemap that holds, for index $i,
the name of the range $i belongs to, or undef if there is no interval
for $i. Then each interval can be found with a single array lookup:

this possibility, was one of the things i had in mind when I asked my
questions to the OP.

Usually I wait until the premises are met, before I code complete
optimisations based on them :)

....snipped nice implmentation...

I hope the OP does not use this on a data set like
A1 0 1000000000
A2 1000000001 2000000000
....


gnari
 
Y

Ying Hu

If the A and B data sets are huge (more than 100,000 for each), A-interval
is overlapped and the value ranges are big (0-100000). Is the following code
good one? Maybe there is the best algorithm to solve the problem. Happy new
year!
my %a_from = ();
my %a_to = ();
my %b_position = ();
my %all_position;
while ( <DATA> ) {
last if /^$/;
my ( $range, $from, $to) = split;
$a_from{$from} = $range;
$a_to{$to} = $range;
$all_position{$from} = 1;
$all_position{$to} = 1;
}
while ( <DATA> ) {
my ( $item, $val) = split;
$b_position{$val} = $item;
$all_position{$val} = 1;
}
my %from = ();
foreach my $p (sort {$a <=> $b} keys %all_position){
if ($a_from{$p}){
$from{$a_from{$p}} = $p;
}
if ($a_to{$p} and !$b_position{$p}){
$from{$a_to{$p}} = ();
}
if (%from and $b_position{$p}){
print "B: $b_position{$p} A:";
foreach my $k (sort {$from{$a}<=>$from{$b}} keys %from){
print " $k";
}
print "\n";
}
if ($a_to{$p}){
delete $from{$a_to{$p}};
}
}


__DATA__
A1 123 125
A2 129 200
A3 400 420
A4 415 425

B1 126
B2 130
B3 202
B4 419
 
R

Ragnar Hafstað

please do not top-post.
(admittedly, in this case, it did not hurt much, as you not really refering
to the
quoted text, but in that case you should not include it.)
If the A and B data sets are huge (more than 100,000 for each), A-interval
is overlapped and the value ranges are big (0-100000). Is the following code
good one?
no

Maybe there is the best algorithm to solve the problem.

no

snipped most of badly indented code ...
foreach my $p (sort {$a <=> $b} keys %all_position){
...
if (%from and $b_position{$p}){
print "B: $b_position{$p} A:";
foreach my $k (sort {$from{$a}<=>$from{$b}} keys %from){

here you are sorting the keys of %from once for each $p (more that 100000
times)

what's more , i cannot say that I understand your code. does it really work?

gnari
 
R

Ragnar Hafstað

Ying Hu said:
If the A and B data sets are huge (more than 100,000 for each), A-interval
is overlapped and the value ranges are big (0-100000). Is the following code
good one?
snipped the code

in the following, assuming
a) all points are integers
b) most intervals are relatively small (71001-71167 but not 12-98765)
c) B points are not repeated

if you have memory for an array of all B points (100000 items),
then i suggest something like:
(note: untested, not even syntax checked)

our @Bpoints=();

open B,"< "Bset" or die "could not open Bset: $!";
while (<B>) {
next unless my ($B,$bp)=/^(\S+)\s+(\S+)/;
$Bpoints[$bp]=$B;
}
close B;

open A,"< "Aset" or die "could not open Aset: $!";
while (<A>) {
next unless my ($A,$ap1,$ap2)=/^(\S+)\s+(\S+)\s/;
for ($ap1..$ap2) {
print "B: Bpoints[$_] ($_) A: $A ($ap1 - $ap2)\n" if $Bpoints[$_];
}
}
close A;


the output order will be A intervals in original order, B points increasing

if assumption C) is false , we let @Bpoints contain array of B points,
and wrap the print in a for loop of those.
 
A

Anno Siegel

Ying Hu said:
If the A and B data sets are huge (more than 100,000 for each), A-interval
is overlapped

Overlap can be dealt with in many ways. One is to resolve overlapping
parts into sub-intervals that don't overlap. Some algorithms may not
even care.
and the value ranges are big (0-100000).

You'd need a bigger value range to accommodate 100_000 intervals, unless
most of them overlap. On the other hand, an array of 100_000 scalars
isn't out of the question, even in Perl, so the mapping method could
still be feasible.
Is the following code
good one? Maybe there is the best algorithm to solve the problem. Happy new
year!

If you want to talk about algorithms you should describe what your code
is doing. It isn't all that clear by itself. What are the four hashes
keeping track of?

In any case, the frequent sorts that gnari noted in another followup
will kill it, run-time wise.

I don't know what the best algorithm might be for this problem, but
a related problem has been dealt with in _Introduction to Algorithms_
by Cormen e.a. under the topic "Interval trees". Like all good algorithms,
it is centered around a data structure, one that allows queries about sets
of intervals. If you are seriously interested, take a look there.

[code snipped]

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Similar Threads


Members online

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top