What is the best way to pull out a range of values?

A

Alf McLaughlin

Hello all-
It seems I am running across the following problem a lot recently.
Let's say I have a large collection of values and they are in an array
(perhaps there are 10,000 of them). I would like to pull out all
values between $x and $y without having to loop through the entire
array to figure out where the values I'm interested in are locating.
Is there a simple way to do this in Perl? For example, in SQL I would
write something like this:

SELECT position FROM my_table WHERE position BETWEEN 14000 and 1479;

Perhaps the answer involves storing the data in a different structure
(I can picture this working very quickly if the values I'm interested
in were keys in a hash, but I still don't know how to do it).

Thanks,
Alf
 
P

Paul Lalli

Alf said:
Hello all-
It seems I am running across the following problem a lot recently.
Let's say I have a large collection of values and they are in an array
(perhaps there are 10,000 of them). I would like to pull out all
values between $x and $y without having to loop through the entire
array to figure out where the values I'm interested in are locating.
Is there a simple way to do this in Perl?

I see two possible solutions, other than the explicit loop you said you
want to avoid:

1) sort them all, and then find the ones you want:
my @sorted = sort { $a <=> $b } @values;
my @selected;
my $i = 0;
$i++ while $selected[$i] < $x;
push $selected[$i] until $selected[$i++] > $y;


2) Use grep, which really is the same as your original solution, just
hiding the loop:
my @selected = grep { $_ >= $x and $_ <= $y } @values;

I'm not at all convinced the first method is in any way "better" than
the linear search (you can do some Benchmarks if you're curious.).
Perhaps the answer involves storing the data in a different structure
(I can picture this working very quickly if the values I'm interested
in were keys in a hash, but I still don't know how to do it).

I can't imagine how you would build such a hash without doing a linear
search through the values to begin with, which negates any possible
benefit the hash look up would have.

Paul Lalli
 
D

DJ Stunks

Alf said:
Hello all-
It seems I am running across the following problem a lot recently.
Let's say I have a large collection of values and they are in an array
(perhaps there are 10,000 of them). I would like to pull out all
values between $x and $y without having to loop through the entire
array to figure out where the values I'm interested in are locating.
Is there a simple way to do this in Perl? For example, in SQL I would
write something like this:

SELECT position FROM my_table WHERE position BETWEEN 14000 and 1479;

I'm sure a computer scientist out there will have a faster way, but the
simplest and easiest comprehensible way is:

my @selected_nums = grep { $x <= $_ and $_ < $y } @ten_thousand_nums

sorting would be necessary if you performed some kind of
chop-search-ish routine to identify the indexes associated with the
start and end for @select_nums, then just slice them out of
@ten_thousand_nums, but like I said - I'll leave that for the computer
scientists... the time to sort the big array would also have to be
taken into account then since grep could operate on unsorted data.
Perhaps the answer involves storing the data in a different structure
(I can picture this working very quickly if the values I'm interested
in were keys in a hash, but I still don't know how to do it).

I don't see how this would be any faster in a hash, but I'm all ears...

-jp
 
X

xhoster

Alf McLaughlin said:
Hello all-
It seems I am running across the following problem a lot recently.
Let's say I have a large collection of values and they are in an array
(perhaps there are 10,000 of them). I would like to pull out all
values between $x and $y without having to loop through the entire
array to figure out where the values I'm interested in are locating.

Why do you want to avoid looping through the entire array? Because it is
too slow? Or because it takes too much code to implement?
Is there a simple way to do this in Perl? For example, in SQL I would
write something like this:

SELECT position FROM my_table WHERE position BETWEEN 14000 and 1479;

How do you know the DBMS isn't just looping through the whole table behind
the scenes? Or is that the point, that you want it to happen behind
the scenes?

my @result = grep $_>=14000 && $_<=14790, @my_table;

or maybe:

sub between(\@$$) { grep $_>=$_[1] && $_<=$_[2], @{$_[0]} };
Perhaps the answer involves storing the data in a different structure
(I can picture this working very quickly if the values I'm interested
in were keys in a hash, but I still don't know how to do it).

I don't see how you could make it happen very fast (i.e. substantially
faster than the ordinary array scan, which is already quite fast for an
array as small as 10_000) just by simply using a regular hash.

If the array is sorted, you could do a binary search for the first element
in the range, then linear search from their to the last.

You could store the numbers in range-bins a hash (or an array) and then
limit your inspection to only the necessary bins.

Both of these require special consideration at the time you put things
into the list, as well as when you search it.

Xho
 
U

Uri Guttman

DS> I'm sure a computer scientist out there will have a faster way, but the
DS> simplest and easiest comprehensible way is:

DS> my @selected_nums = grep { $x <= $_ and $_ < $y } @ten_thousand_nums

DS> sorting would be necessary if you performed some kind of
DS> chop-search-ish routine to identify the indexes associated with the
DS> start and end for @select_nums, then just slice them out of
DS> @ten_thousand_nums, but like I said - I'll leave that for the computer
DS> scientists... the time to sort the big array would also have to be
DS> taken into account then since grep could operate on unsorted data.

DS> I don't see how this would be any faster in a hash, but I'm all ears...

a hash would have the same problem as an array, you can't do ordered
sequential access from any given value. the proper way to do this is
with a B tree or similar structure. then you have fairly quick access
O(log N) to any element and very quick access O(1) to the next elements
in sorted order. this is also famously known as ISAM (indexed sequential
access method) and was a core feature in cobol and pl/1.

i wouldn't be surprised if there was some btree stuff on cpan. on disk
you can get it from most any decent db as it is a common type of index.

uri
 
A

Alf McLaughlin

How do you know the DBMS isn't just looping through the whole table behind
the scenes? Or is that the point, that you want it to happen behind
the scenes?

I think maybe the DBMS is using a BTREE once it is properly indexed. I
think this is going on because if I first create the index:

CREATE INDEX position ON my_table(position);

Then I can see the number of rows needed for the search go down
considerably:

EXPLAIN SELECT position FROM my_table WHERE position BETWEEN $value1
and $value2;

So, yes, an unindexed database search isn't any faster than looping
through all the values.

The following module seems to work:

http://search.cpan.org/~rrwo/Tie-RangeHash-1.03/lib/Tie/RangeHash.pm

I'm sure there are BTREE modules; that is a great idea.
 
X

xhoster

Alf McLaughlin said:
I think maybe the DBMS is using a BTREE once it is properly indexed. I
think this is going on because if I first create the index: ....

So, yes, an unindexed database search isn't any faster than looping
through all the values.

So from the above, I gather it is not the neat syntax of SQL but rather
the performance offered by the index you are interested in?

Just using DBI and SQL might be a good way to go.

I can't seem to make that module work for what you want it to do. Or at
least not correctly.

use strict;
use warnings;
use Tie::RangeHash;

my $t=new Tie::RangeHash(Type => Tie::RangeHash::TYPE_NUMBER);
my @array;

for (1..10_000) {
my $x = int rand(2e7);
$t->add("$x,$x",$x);
push @array,$x;
};

print join "\t", "fetch", (sort {$a<=>$b}
$t->fetch_overlap("50000,60000"))[1..10];
print join "\t", "array", sort {$a<=>$b}
grep $_>=50_000 && $_<=60_000, @array;

__END__

fetch 3202 5936 10416 16533 17858 25605 26118 29266 31885 35022
array 50701 50948 51314 51644 56512 56999 58544

I'm sure there are BTREE modules; that is a great idea.

I've played around with Tree::BPTree. You can get cursors into the tree,
but the cursors always start at the beginning or end. As far as I can
tell, you can't ask for a cursor which starts at any given point, which
seems to stultify the whole point of using BTrees. I started trying to add
this feature, but I decided the whole thing was slow enough that I might
as well just use hashes and/or arrays, or a real db with DBI (or a binning
method with DBM::Deep for one recent job.)

Xho
 
A

Alf McLaughlin

I can't seem to make that module work for what you want it to do. Or at
least not correctly.

I think I got Tie::RangeHash to do what I want. It seems to work; let
me know if you notice anything wrong with this. I haven't clocked it
yet, so this may still be inefficient.

#!/usr/bin/perl

BEGIN: {
use Tie::RangeHash;
}

MAIN: {
my $val1 = 1;
my $val2 = 100;
my $val3 = 1000;
my $val4 = 10000;
my $val5 = 40000;
my $val6 = 50000;
my $range1 = "$val1,$val2";
my $range2 = "$val3,$val4";
my $range3 = "$val5,$val6";
tie %hash, 'Tie::RangeHash', {Type => Tie::RangeHash::TYPE_NUMBER};
%hash = ($range1 => 'Element1',
$range2 => 'Element2',
$range3 => 'Element3');
my $result = $hash{$ARGV[0]};
unless($result) { $result = 'Sorry, No Results';}
print "$result\n";
}
 
X

xhoster

Alf McLaughlin said:
I think I got Tie::RangeHash to do what I want. It seems to work; let
me know if you notice anything wrong with this. I haven't clocked it
yet, so this may still be inefficient.

There are only two concerns I have about it (other than efficiency).
One is that it doesn't do what you originally said you wanted. It takes
a single-point query, and returns a single "label" for the range that that
query falls in. Your original specification was that it take a range query
(a two-point query) and return a list of all the points in that range. So
originally you wanted a database of points probed with a range query, but
this is a database of ranges probed with a point query. These are quite
different things. So I don't know if your original description wasn't
accurate and you new realize that this is what you really wanted to do
(hey, it happens), or if it was originally accurate but you lost sight of
that goal (that happens, too).

My other concern is that, when I'm using a module which seems to give
obviously wrong answers under some use-cases, I am always a bit worried
about whether that same underlying problem could also cause less obvious
wrong answers in other use-cases. So if I were going to use this module
for any real work, I'd either trace done the cause of the fetch_overlap bug
to make sure it doesn't also effect the FETCH method, or I'd do extensive
tests of the FETCH method by generating large amounts of random data to
query, and verify that it gives the right answer.

Cheers,

Xho

#!/usr/bin/perl

BEGIN: {
use Tie::RangeHash;
}

MAIN: {
my $val1 = 1;
my $val2 = 100;
my $val3 = 1000;
my $val4 = 10000;
my $val5 = 40000;
my $val6 = 50000;
my $range1 = "$val1,$val2";
my $range2 = "$val3,$val4";
my $range3 = "$val5,$val6";
tie %hash, 'Tie::RangeHash', {Type => Tie::RangeHash::TYPE_NUMBER};
%hash = ($range1 => 'Element1',
$range2 => 'Element2',
$range3 => 'Element3');
my $result = $hash{$ARGV[0]};
unless($result) { $result = 'Sorry, No Results';}
print "$result\n";
}
 
A

Alf McLaughlin

There are only two concerns I have about it (other than efficiency).
One is that it doesn't do what you originally said you wanted.

Hi Xho-
The label can also be a data structure that holds the values of the
low and high ranges. The code below demonstrates this.
My other concern is that, when I'm using a module which seems to give
obviously wrong answers under some use-cases, I am always a bit worried

Me too! :) In addition to evaluating speed, I will compare it to my
subroutine that doesn't use the module to make sure both versions of
the program are giving me the same results. Best, ALF


The revised code:

#!/usr/bin/perl

BEGIN: {
use Tie::RangeHash;
}

MAIN: {
my ($val1, $val2, $val3, $val4, $val5, $val6) = (100, 1000, 10000,
40000, 50000, 100000);
my $range1 = "$val1,$val2";
my $range2 = "$val3,$val4";
my $range3 = "$val5,$val6";
tie %hash, 'Tie::RangeHash', {Type => Tie::RangeHash::TYPE_NUMBER};
%hash = ($range1 => [($val1, $val2)],
$range2 => [($val3, $val4)],
$range3 => [($val5, $val6)]);
my ($low_range, $high_range) = ($hash{$ARGV[0]}->[0],
$hash{$ARGV[0]}->[1]);
my $result;
unless($low_range && $high_range) {
$result = 'Sorry, No Results';
} else {
$result = "low: $low_range, high: $high_range";
}
print "$result\n";
}
 
A

Alf McLaughlin

The results: there are no speed benefits gained by switching to this
module. :-(

back to the drawing board!
 
X

xhoster

Alf McLaughlin said:
Hi Xho-
The label can also be a data structure that holds the values of the
low and high ranges. The code below demonstrates this.

Yep, I understand that you can do this. I just don't understand how
it helps. You are still going from single-point query to a single range
(expressed as either a label or a pair) as the response, and not from a
single range as the query to list of points as the response.

Anyway,

my ($low_range, $high_range) = ($hash{$ARGV[0]}->[0],
$hash{$ARGV[0]}->[1]);

This would be better written as:

my ($low_range, $high_range) = @{$hash{$ARGV[0]}};

In your code you are doing the hash FETCH twice, once to feed to ->[0] and
once to feed to ->[1]. FETCH on a tied hash likely to be the slow step, so
it is better to only do it once rather than twice.

Xho
 
X

xhoster

Alf McLaughlin said:
The results: there are no speed benefits gained by switching to this
module. :-(

back to the drawing board!

I'm still not exactly sure what you're requirements are. Could you post
the code which implements, tests, and times your current (slow) hand-rolled
method and also tests and times your (also slow) Tie::RangeHash method?
With that in hand I might be able to make a better recommendation. (If the
testing/timing code is too long to post here, you can email me.)

Xho
 
A

Anno Siegel

Alf McLaughlin said:
Hello all-
It seems I am running across the following problem a lot recently.
Let's say I have a large collection of values and they are in an array
(perhaps there are 10,000 of them). I would like to pull out all
values between $x and $y without having to loop through the entire
array to figure out where the values I'm interested in are locating.
Is there a simple way to do this in Perl? For example, in SQL I would
write something like this:

SELECT position FROM my_table WHERE position BETWEEN 14000 and 1479;

Perhaps the answer involves storing the data in a different structure

[Ignoring the other thread that has sprouted, I'm going back to the OP]

It seems to me the data structure you are looking for is a sorted list.
Then you can use binary search to find the first element above the lower
end of the interval, and the last element below the upper end. The elements
in between are all the list elements in the interval.

Of course, sorting only pays when you have many intervals to check
against the same set of data. Otherwise, a linear search would be faster.

Anno
 
A

Alf McLaughlin

It seems to me the data structure you are looking for is a sorted list.
Then you can use binary search to find the first element above the lower
end of the interval, and the last element below the upper end. The elements
in between are all the list elements in the interval.

Of course, sorting only pays when you have many intervals to check
against the same set of data. Otherwise, a linear search would be faster.

Anno

Thanks, Anno. I am realizing after posting with Xho that I did not
state my question precisely right, but you did deduce what I am
interested in I think. Basically, I have a list of 500 values and I
would like to take another list of values (which happens to be 80 long
at the moment, but this could be longer as well) and for each of these
80 values I would like to find which 2 of the 500 values is the next
greater value and the next lesser value. I've been doing this (in
pseudocode, please take syntax mistakes and finer points lightly):

my @val1 = (my 80 values); my @val2 = (my 500 values);

foreach my $i (@val1) {
for (my $j = 0; $j < @val2 - 1; $j++) {
my $current = $val2[$j];
my $next = $val2[$j + 1];
if ($i > $current && $i < $next) {
#then this value is between $current and next.
}
}
}

In one example I have, it was necessary to loop through this 8000
times. I may be interested in running this program perhaps 1 million
times on a different set of 80 values; you can start to see how many
times I will have to loop through the 500 values and how much work the
computer will have to do. With one example I have, it took me 16
seconds to compute this 1000 times. Since I am interested in running
this perhaps millions of times, coding the problem this way in Perl may
end up taking my program an additional day or two to run.

I hope I have described the problem more clearly now. Does it still
sound as if a binary search on a sorted array is the way to go?

Also, I plan on re-coding the same program is C or C++ to see if the
code speeds up; I suspect the gain in performance may make the program
run in a reasonable amount of time without having to use a more
sophisticated algorithm.

Thanks for the help!
 
A

Alf McLaughlin

It seems to me the data structure you are looking for is a sorted list.
Then you can use binary search to find the first element above the lower
end of the interval, and the last element below the upper end. The elements
in between are all the list elements in the interval.

Of course, sorting only pays when you have many intervals to check
against the same set of data. Otherwise, a linear search would be faster.

Anno

Thanks, Anno. I am realizing after posting with Xho that I did not
state my question precisely right, but you did deduce what I am
interested in I think. Basically, I have a list of 500 values and I
would like to take another list of values (which happens to be 80 long
at the moment, but this could be longer as well) and for each of these
80 values I would like to find which 2 of the 500 values is the next
greater value and the next lesser value. I've been doing this (in
pseudocode, please take syntax mistakes and finer points lightly):

my @val1 = (my 80 values); my @val2 = (my 500 values);

foreach my $i (@val1) {
for (my $j = 0; $j < @val2 - 1; $j++) {
my $current = $val2[$j];
my $next = $val2[$j + 1];
if ($i > $current && $i < $next) {
#then this value is between $current and next.
}
}
}

In one example I have, it was necessary to loop through this 8000
times. I may be interested in running this program perhaps 1 million
times on a different set of 80 values; you can start to see how many
times I will have to loop through the 500 values and how much work the
computer will have to do. With one example I have, it took me 16
seconds to compute this 1000 times. Since I am interested in running
this perhaps millions of times, coding the problem this way in Perl may
end up taking my program an additional day or two to run.

I hope I have described the problem more clearly now. Does it still
sound as if a binary search on a sorted array is the way to go?

Also, I plan on re-coding the same program is C or C++ to see if the
code speeds up; I suspect the gain in performance may make the program
run in a reasonable amount of time without having to use a more
sophisticated algorithm.

Thanks for the help!
 
A

Anno Siegel

Alf McLaughlin said:
Thanks, Anno. I am realizing after posting with Xho that I did not
state my question precisely right, but you did deduce what I am
interested in I think. Basically, I have a list of 500 values and I
would like to take another list of values (which happens to be 80 long
at the moment, but this could be longer as well) and for each of these
80 values I would like to find which 2 of the 500 values is the next
greater value and the next lesser value.

That's exactly what binary search does, in log n time for lists of
length n.

# return the index $i, such that $a->[$i] <= $x and $x < $a->[$i+1]
# if no such index exists, return empty
sub binsearch {
my ( $x, $a) = @_;
return if $x < $a->[ 0] or $x >= $a->[ -1];
my ( $l, $h) = ( 0, $#$a + 1);
while ( $h - $l > 1 ) {
my $mid = int( ($l + $h)/2);
$a->[ $mid] <= $x ? $l : $h = $mid;
}
return $l;
}

I've been doing this (in
pseudocode, please take syntax mistakes and finer points lightly):

my @val1 = (my 80 values); my @val2 = (my 500 values);

foreach my $i (@val1) {
for (my $j = 0; $j < @val2 - 1; $j++) {
my $current = $val2[$j];
my $next = $val2[$j + 1];
if ($i > $current && $i < $next) {

#then this value is between $current and next.
}
}
}

Apparently you are already supposing that the list @val2 is sorted. Your
search time is linear in the size of @val2. Binary search will greatly
speed it up.

[rest snipped]

Anno
 
A

Alf McLaughlin

Sweet! Thanks a lot everyone who helped out. I implemented this
algorithm into my program and it is 10 times faster now (probably good
enough, but I will re-code to C if I require more speed). I consider
the issue resolved (for me, anyway). Here is a mini program that
demonstrates (more or less) what I implemented:

#!/usr/bin/perl

MAIN: {
my @test_array = qw(60 100 1500 3000);
unless ($ARGV[0]) { die "ERROR: enter query.\n"; }
my $query = $ARGV[0];
my $first_element = $test_array[0];
my $array_length = @test_array;
my $last_element = $test_array[$array_length - 1];
if ($query < $first_element || $query > $last_element) {
die "ERROR: $query is out of range.\n";
}
my $ra_val = bsearch($query, \@test_array);
if (@$ra_val == 1) {
#then we found the exact value:
my $exact = $ra_val->[0];
print "EXACT found: $exact\n";
} else {
my $below_val = $ra_val->[0];
my $above_val = $ra_val->[1];
print "RANGE: $below_val\t$above_val\n";
}
}

sub bsearch {
my ($x, $a) = @_; # search for x in array a
my ($l, $u) = (0, @$a - 1); # lower, upper end of search interval
my $i; # index of probe
while ($l <= $u) {
$i = int(($l + $u)/2);
if ($a->[$i] < $x) {
$l = $i+1;
}
elsif ($a->[$i] > $x) {
$u = $i-1;
}
else {
return [($a->[$i])]; # found
}
}
return [($a->[$u], $a->[$l])];
}
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,056
Latest member
GlycogenSupporthealth

Latest Threads

Top