search interval

Discussion in 'Perl Misc' started by Ying Hu, Dec 23, 2003.

  1. Ying Hu

    Ying Hu Guest

    Hi,
    There are two data sets,
    data set 1:
    A1 123 125
    A2 129 200
    A3 400 420
    ....
    ....
    data set 2:
    B1 126
    B2 130
    B3 202
    ....
    ....

    My question is how to get B2 and A2 { 130 (B2) is in [from]129 [to]200
    (A2)} .

    thanks
    Ying
     
    Ying Hu, Dec 23, 2003
    #1
    1. Advertising

  2. Ying Hu wrote:
    > There are two data sets,
    > data set 1:
    > A1 123 125
    > A2 129 200
    > A3 400 420
    > ...
    > ...
    > data set 2:
    > B1 126
    > B2 130
    > B3 202
    > ...
    > ...
    >
    > My question is how to get B2 and A2 { 130 (B2) is in [from]129
    > [to]200 (A2)} .


    I would suggest that you write a script that does what you want. Perl
    is probably a suitable programming language, btw.

    If you would encounter difficulties, that you after having made
    serious attempts to resolve them yourself can't find the solution to,
    please post your program here. If you do so, don't forget to comply
    with the posting guidelines:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html

    Good luck!

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 23, 2003
    #2
    1. Advertising

  3. "Ying Hu" <> wrote in message
    news:...
    > Hi,
    > There are two data sets,
    > data set 1:
    > A1 123 125
    > A2 129 200
    > A3 400 420
    > ...
    > ...
    > data set 2:
    > B1 126
    > B2 130
    > B3 202
    > ...
    > ...
    >
    > My question is how to get B2 and A2 { 130 (B2) is in [from]129 [to]200
    > (A2)} .


    SELECT b.key,a.key from tableb,tablea where b.val between a.minval and
    a.maxval

    seriously, we need a bit of context here to fathom what your real question
    is.

    if your data sets are small, simple iteration might be called for
    otherwise, more details could help.
    for instance what ranges for A and B values are we talking about here?
    what are the sizes of the datasets?
    are the A intervals non-overlapping? are the B values unique?

    what have you tried so far, and how did that fail?

    gnari
     
    Ragnar Hafstað, Dec 23, 2003
    #3
  4. Ying Hu

    Matt Garrish Guest

    "Ying Hu" <> wrote in message
    news:...
    > Hi,
    > There are two data sets,
    > data set 1:
    > A1 123 125
    > A2 129 200
    > A3 400 420
    > ...
    > ...
    > data set 2:
    > B1 126
    > B2 130
    > B3 202
    > ...
    > ...
    >
    > My question is how to get B2 and A2 { 130 (B2) is in [from]129 [to]200
    > (A2)} .
    >


    Feels like homework, you don't even seem to have tried, AND you didn't even
    make it very clear what you're trying to do (are the identifier values of
    any use whatsoever???), but since it's the holidays here's something to get
    you started: (and please don't post back asking for modifications to the
    code if it doesn't do what you want!)

    use strict;
    use warnings;

    my %test;
    my %range;

    while (my $line = <DATA>) {

    next if $line =~ /^\s*$/;

    my @vals = split(/\s+/, $line);

    if (scalar(@vals) == 2) {

    $test{$vals[0]} = [@vals];

    }

    elsif (scalar(@vals) == 3) {

    $range{$vals[0]} = [@vals];

    }

    else {

    die "Unknown pattern found in data line $.\n";

    }

    }

    # because life is all about pleasing Uri...

    my @print;

    foreach my $key (sort keys %test) {

    my $chk = isbetween($test{$key}[1], \%range);

    push @print, $chk if $chk;

    }

    print @print;



    sub isbetween {

    my ($v, $href) = @_;

    foreach my $r (sort keys %$href) {

    # just for clarity I've assigned more comprehensible variable names

    my $min = $$href{$r}[1];
    my $max = $$href{$r}[2];

    if (($v > $min) && ($v < $max)) {
    return "The value $v is between $min and $max\n";
    }

    }

    }


    __DATA__
    A1 123 125
    A2 129 200
    A3 400 420

    B1 126
    B2 130
    B3 202
     
    Matt Garrish, Dec 24, 2003
    #4
  5. Ying Hu

    Sara Guest

    Ying Hu <> wrote in message news:<>...
    > Hi,
    > There are two data sets,
    > data set 1:
    > A1 123 125
    > A2 129 200
    > A3 400 420
    > ...
    > ...
    > data set 2:
    > B1 126
    > B2 130
    > B3 202
    > ...
    > ...
    >
    > My question is how to get B2 and A2 { 130 (B2) is in [from]129 [to]200
    > (A2)} .
    >
    > thanks
    > Ying


    TRY:

    http://groups.google.com/groups?hl=en&lr=&ie=UTF-8&oe=UTF-8&group=comp.lang.apl

    Its better suited to this kind of application.

    Cheers!
     
    Sara, Dec 24, 2003
    #5
  6. Ying Hu

    Ying Hu Guest

    My perl script is as the following, but I do know that is not good one.
    The two data sets are in text files and very huge.

    $infile = "data_sets1.txt";
    open IN, $infile or die "$infile $!\n";
    while (<IN>){
    chomp;
    @F = split;
    $A_from{$F[0]} = $F[1];
    $A_to{$F[0] = $F[2];
    }
    $infile = "data_sets2.txt";
    open IN, $infile or die "$infile $!\n";
    while (<IN>){
    chomp;
    @F = split;
    $B_point{$F[0]} = $F[1];
    }

    foreach $b (sort {$B_point{a}<=>$B_point{$b} keys %B_point){
    foreach $a (sort {$A_from{$a} <=> $A_from{$b} keys %A_from){
    if ($B_point{$b} >= $A_from{$a} and $B_point{$b} <= $A_to){
    print "A:$a B:$b\n";
    }
    }
    }


    Ying Hu wrote:

    > Hi,
    > There are two data sets,
    > data set 1:
    > A1 123 125
    > A2 129 200
    > A3 400 420
    > ...
    > ...
    > data set 2:
    > B1 126
    > B2 130
    > B3 202
    > ...
    > ...
    >
    > My question is how to get B2 and A2 { 130 (B2) is in [from]129 [to]200
    > (A2)} .
    >
    > thanks
    > Ying
     
    Ying Hu, Dec 24, 2003
    #6
  7. "Ying Hu" <> wrote in message
    news:...
    > My perl script is as the following, but I do know that is not good one.
    > The two data sets are in text files and very huge.
    >

    a bit more info might still help to optimize
    how big datasets?
    are the A-intervals overlapping?
    what are the value ranges? (for example integers 0-1000)

    >
    > foreach $b (sort {$B_point{a}<=>$B_point{$b} keys %B_point){

    you are sorting the dataset B , but nothing in the following requires it
    unless the output needs to be sorted. in that case it might be more
    efficient to sort
    the output if it is a much smaller set than B

    > foreach $a (sort {$A_from{$a} <=> $A_from{$b} keys %A_from){

    here you sort unnecessarily A for EACH element of B (bad if B is large)
    better to sort it before (if you want to sort it at all)
    > if ($B_point{$b} >= $A_from{$a} and $B_point{$b} <= $A_to){
    > print "A:$a B:$b\n";
    > }
    > }
    > }


    try to move the sort to before the loops and see if you have improvements.

    @Aset=sort {$A_from{$a} <=> $A_from{$b} keys %A_from;
    @Bset=sort {$B_point{$a}<=>$B_point{$b} keys %B_point;
    foreach $b (@Bset) {
    foreach $a (@Aset) {
    ...

    if this improvement is not enough, you might make the innerloop slightly
    more complex and make use of the fact that (if A and B are sorted)
    for any given $Bset[x] >=$Aset[y], you do not need to check $Aset[0..y-1]
    when processing $Bset[x+1]. in other words, for each element of Bset,
    you can start the inner loop at the first @Aset element that passed the
    $B_point{$b} >= $A_from{$a} test in the previous B

    also if A is sorted, you can exit from the inner loop with last as soon as
    $A_from{$a}>$B_point{$b},
    as all following intervals will just be more distant from the B point

    another problem is memory usage. if set B is huge and much larger than set A
    you might want/need to only read A into memory, and process B line by line.
    if both are too huge to read into memory, you might need to use another
    method altogether

    good luck
    gnari
     
    Ragnar Hafstað, Dec 25, 2003
    #7
  8. Ying Hu wrote:
    > My perl script is as the following, but I do know that is not good
    > one.


    I agree. It's not good at all.

    1) It does not compile.

    2) You are not using strictures (the script you posted contains errors
    which are easily debugged by using strict).

    3) It's not a good idea to use the special variables $a and $b outside
    the sort() function, especially in a program that uses sort().

    It's good that you posted code this time, but you have obviously not
    studied the posting guidelines as suggested in my previous post in
    this thread (or you decided to ignore them). Please give it a new try,
    complying with the guidelines. It's inconsiderate to expect the people
    who you ask for help to start possible efforts to assist you with
    correcting typos.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Dec 25, 2003
    #8
  9. Ying Hu

    Anno Siegel Guest

    Ragnar Hafstað <> wrote in comp.lang.perl.misc:
    > "Ying Hu" <> wrote in message
    > news:...
    > > My perl script is as the following, but I do know that is not good one.
    > > The two data sets are in text files and very huge.
    > >

    > a bit more info might still help to optimize
    > how big datasets?
    > are the A-intervals overlapping?
    > what are the value ranges? (for example integers 0-1000)


    If we assume the ranges are non-overlapping and delimited by smallish
    integers, the ranges can be found in constant time without sorting and
    searching. First build an array @rangemap that holds, for index $i,
    the name of the range $i belongs to, or undef if there is no interval
    for $i. Then each interval can be found with a single array lookup:


    my @rangemap;

    while ( <DATA> ) {
    last if /^$/;
    my ( $range, $from, $to) = split;
    @rangemap[ $from .. $to] = ( $range) x ( $to - $from + 1);
    }

    while ( <DATA> ) {
    my ( $item, $val) = split;
    defined and print "$item and $_\n" for $rangemap[ $val];
    }

    __DATA__
    A1 123 125
    A2 129 200
    A3 400 420

    B1 126
    B2 130
    B3 202

    Anno
     
    Anno Siegel, Dec 29, 2003
    #9
  10. "Anno Siegel" <-berlin.de> wrote in message
    news:bspe60$9gl$-Berlin.DE...
    > Ragnar Hafstað <> wrote in comp.lang.perl.misc:
    > > "Ying Hu" <> wrote in message
    > > news:...
    > > > My perl script is as the following, but I do know that is not good

    one.
    > > > The two data sets are in text files and very huge.
    > > >

    > > a bit more info might still help to optimize
    > > how big datasets?
    > > are the A-intervals overlapping?
    > > what are the value ranges? (for example integers 0-1000)

    >
    > If we assume the ranges are non-overlapping and delimited by smallish
    > integers, the ranges can be found in constant time without sorting and
    > searching. First build an array @rangemap that holds, for index $i,
    > the name of the range $i belongs to, or undef if there is no interval
    > for $i. Then each interval can be found with a single array lookup:
    >


    this possibility, was one of the things i had in mind when I asked my
    questions to the OP.

    Usually I wait until the premises are met, before I code complete
    optimisations based on them :)

    ....snipped nice implmentation...

    I hope the OP does not use this on a data set like
    A1 0 1000000000
    A2 1000000001 2000000000
    ....


    gnari
     
    Ragnar Hafstað, Dec 29, 2003
    #10
  11. Ying Hu

    Ying Hu Guest

    If the A and B data sets are huge (more than 100,000 for each), A-interval
    is overlapped and the value ranges are big (0-100000). Is the following code
    good one? Maybe there is the best algorithm to solve the problem. Happy new
    year!
    my %a_from = ();
    my %a_to = ();
    my %b_position = ();
    my %all_position;
    while ( <DATA> ) {
    last if /^$/;
    my ( $range, $from, $to) = split;
    $a_from{$from} = $range;
    $a_to{$to} = $range;
    $all_position{$from} = 1;
    $all_position{$to} = 1;
    }
    while ( <DATA> ) {
    my ( $item, $val) = split;
    $b_position{$val} = $item;
    $all_position{$val} = 1;
    }
    my %from = ();
    foreach my $p (sort {$a <=> $b} keys %all_position){
    if ($a_from{$p}){
    $from{$a_from{$p}} = $p;
    }
    if ($a_to{$p} and !$b_position{$p}){
    $from{$a_to{$p}} = ();
    }
    if (%from and $b_position{$p}){
    print "B: $b_position{$p} A:";
    foreach my $k (sort {$from{$a}<=>$from{$b}} keys %from){
    print " $k";
    }
    print "\n";
    }
    if ($a_to{$p}){
    delete $from{$a_to{$p}};
    }
    }


    __DATA__
    A1 123 125
    A2 129 200
    A3 400 420
    A4 415 425

    B1 126
    B2 130
    B3 202
    B4 419



    "Ragnar Hafstað" wrote:

    > "Anno Siegel" <-berlin.de> wrote in message
    > news:bspe60$9gl$-Berlin.DE...
    > > Ragnar Hafstað <> wrote in comp.lang.perl.misc:
    > > > "Ying Hu" <> wrote in message
    > > > news:...
    > > > > My perl script is as the following, but I do know that is not good

    > one.
    > > > > The two data sets are in text files and very huge.
    > > > >
    > > > a bit more info might still help to optimize
    > > > how big datasets?
    > > > are the A-intervals overlapping?
    > > > what are the value ranges? (for example integers 0-1000)

    > >
    > > If we assume the ranges are non-overlapping and delimited by smallish
    > > integers, the ranges can be found in constant time without sorting and
    > > searching. First build an array @rangemap that holds, for index $i,
    > > the name of the range $i belongs to, or undef if there is no interval
    > > for $i. Then each interval can be found with a single array lookup:
    > >

    >
    > this possibility, was one of the things i had in mind when I asked my
    > questions to the OP.
    >
    > Usually I wait until the premises are met, before I code complete
    > optimisations based on them :)
    >
    > ...snipped nice implmentation...
    >
    > I hope the OP does not use this on a data set like
    > A1 0 1000000000
    > A2 1000000001 2000000000
    > ...
    >
    > gnari
     
    Ying Hu, Dec 30, 2003
    #11
  12. "Ying Hu" <> wrote in message
    news:...

    please do not top-post.
    (admittedly, in this case, it did not hurt much, as you not really refering
    to the
    quoted text, but in that case you should not include it.)

    > If the A and B data sets are huge (more than 100,000 for each),

    A-interval
    > is overlapped and the value ranges are big (0-100000). Is the following

    code
    > good one?


    no

    > Maybe there is the best algorithm to solve the problem.


    no

    snipped most of badly indented code ...

    > foreach my $p (sort {$a <=> $b} keys %all_position){
    > ...
    > if (%from and $b_position{$p}){
    > print "B: $b_position{$p} A:";
    > foreach my $k (sort {$from{$a}<=>$from{$b}} keys %from){


    here you are sorting the keys of %from once for each $p (more that 100000
    times)

    what's more , i cannot say that I understand your code. does it really work?

    gnari
     
    Ragnar Hafstað, Dec 30, 2003
    #12
  13. "Ying Hu" <> wrote in message
    news:...
    > If the A and B data sets are huge (more than 100,000 for each),

    A-interval
    > is overlapped and the value ranges are big (0-100000). Is the following

    code
    > good one?

    snipped the code

    in the following, assuming
    a) all points are integers
    b) most intervals are relatively small (71001-71167 but not 12-98765)
    c) B points are not repeated

    if you have memory for an array of all B points (100000 items),
    then i suggest something like:
    (note: untested, not even syntax checked)

    our @Bpoints=();

    open B,"< "Bset" or die "could not open Bset: $!";
    while (<B>) {
    next unless my ($B,$bp)=/^(\S+)\s+(\S+)/;
    $Bpoints[$bp]=$B;
    }
    close B;

    open A,"< "Aset" or die "could not open Aset: $!";
    while (<A>) {
    next unless my ($A,$ap1,$ap2)=/^(\S+)\s+(\S+)\s/;
    for ($ap1..$ap2) {
    print "B: Bpoints[$_] ($_) A: $A ($ap1 - $ap2)\n" if $Bpoints[$_];
    }
    }
    close A;


    the output order will be A intervals in original order, B points increasing

    if assumption C) is false , we let @Bpoints contain array of B points,
    and wrap the print in a for loop of those.
     
    Ragnar Hafstað, Dec 30, 2003
    #13
  14. Ying Hu

    Anno Siegel Guest

    Ying Hu <> wrote in comp.lang.perl.misc:
    > If the A and B data sets are huge (more than 100,000 for each), A-interval
    > is overlapped


    Overlap can be dealt with in many ways. One is to resolve overlapping
    parts into sub-intervals that don't overlap. Some algorithms may not
    even care.

    > and the value ranges are big (0-100000).


    You'd need a bigger value range to accommodate 100_000 intervals, unless
    most of them overlap. On the other hand, an array of 100_000 scalars
    isn't out of the question, even in Perl, so the mapping method could
    still be feasible.

    > Is the following code
    > good one? Maybe there is the best algorithm to solve the problem. Happy new
    > year!


    If you want to talk about algorithms you should describe what your code
    is doing. It isn't all that clear by itself. What are the four hashes
    keeping track of?

    In any case, the frequent sorts that gnari noted in another followup
    will kill it, run-time wise.

    I don't know what the best algorithm might be for this problem, but
    a related problem has been dealt with in _Introduction to Algorithms_
    by Cormen e.a. under the topic "Interval trees". Like all good algorithms,
    it is centered around a data structure, one that allows queries about sets
    of intervals. If you are seriously interested, take a look there.

    [code snipped]

    Anno
     
    Anno Siegel, Dec 30, 2003
    #14
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Dave Varley via .NET 247

    Informix Interval values into a DataGrid

    Dave Varley via .NET 247, Sep 1, 2004, in forum: ASP .Net
    Replies:
    0
    Views:
    681
    Dave Varley via .NET 247
    Sep 1, 2004
  2. New ASP.NET User

    Redirect User to another page after some interval

    New ASP.NET User, Jan 14, 2004, in forum: ASP .Net
    Replies:
    1
    Views:
    511
    Hans Kesting
    Jan 14, 2004
  3. Roland
    Replies:
    1
    Views:
    518
    Rutger Smit
    Sep 8, 2004
  4. Dica
    Replies:
    0
    Views:
    410
  5. Abby Lee
    Replies:
    5
    Views:
    419
    Abby Lee
    Aug 2, 2004
Loading...

Share This Page