What is the best way to pull out a range of values?


xhoster

Alf McLaughlin said:
It seems to me the data structure you are looking for is a sorted list.
Then you can use binary search to find the first element above the
lower end of the interval, and the last element below the upper end.
The elements in between are all the list elements in the interval.

Of course, sorting only pays when you have many intervals to check
against the same set of data. Otherwise, a linear search would be
faster.
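In Perl, Anno's idea might look something like this (a sketch; the sub name and variable names are mine, not Anno's): two binary searches find the first element at or above the lower end and the last element at or below the upper end of the interval.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the sorted-list approach: return the index range ($first, $last)
# of all elements of the sorted array @$a that fall in [$lo, $hi].
# Returns an empty list when no element falls in the interval.
sub range_indices {
    my ($a, $lo, $hi) = @_;
    # lower bound: smallest index with $a->[$i] >= $lo
    my ($l, $u) = (0, scalar @$a);
    while ($l < $u) {
        my $m = int(($l + $u) / 2);
        if ($a->[$m] < $lo) { $l = $m + 1 } else { $u = $m }
    }
    my $first = $l;
    # upper bound: smallest index with $a->[$i] > $hi
    ($l, $u) = ($first, scalar @$a);
    while ($l < $u) {
        my $m = int(($l + $u) / 2);
        if ($a->[$m] <= $hi) { $l = $m + 1 } else { $u = $m }
    }
    my $last = $l - 1;
    return $first > $last ? () : ($first, $last);
}
```

With `@sorted = (10, 20, 30, 40, 50)`, `range_indices(\@sorted, 15, 45)` gives the index range covering 20, 30 and 40.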

Anno

Thanks, Anno. I am realizing after posting with Xho that I did not
state my question precisely right, but you did deduce what I am
interested in I think. Basically, I have a list of 500 values and I
would like to take another list of values (which happens to be 80 long
at the moment, but this could be longer as well) and for each of these
80 values I would like to find which 2 of the 500 values are the next
greater value and the next lesser value. I've been doing this (in
pseudocode, please take syntax mistakes and finer points lightly):

my @val1 = (my 80 values);
my @val2 = (my 500 values);

foreach my $i (@val1) {
    for (my $j = 0; $j < @val2 - 1; $j++) {
        my $current = $val2[$j];
        my $next    = $val2[$j + 1];
        if ($i > $current && $i < $next) {
            # then this value is between $current and $next
        }
    }
}

In one example I have, it was necessary to loop through this 8000
times.

With a different 500 database set and different 80 querent set each of
those 8000 times?

I may be interested in running this program perhaps 1 million
times on a different set of 80 values;

Maybe this is a naive question, but how would that be different than
running the problem one time on a set of 80 million querent values?

you can start to see how many
times I will have to loop through the 500 values and how much work the
computer will have to do. With one example I have, it took me 16
seconds to compute this 1000 times. Since I am interested in running
this perhaps millions of times, coding the problem this way in Perl may
end up taking my program an additional day or two to run.

I hope I have described the problem more clearly now. Does it still
sound as if a binary search on a sorted array is the way to go?

It is certainly a good candidate, both for performance and ease. Another
choice would be to sort both the database array (the 500) and the querent
array (the 80), then your linear method might be faster than the binary
search (because your linear search through the 500 can pick up where the
previous one left off, rather than restarting at the beginning each time.)
You would have to test it to see, but I wouldn't bother. The code involved
would be more subtle than the binary search is. As you suggest below, if
the binary search isn't good enough I'd switch languages before resorting
to such desperate measures.
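For what it's worth, the sort-both-then-merge scan described above might look like this in Perl (a sketch only; the sub name and the bracketing convention are mine). Sorting the query *indices* lets each query's scan resume where the previous one stopped, so the database is walked at most once per batch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Merge-style scan: for each query, find the database values that
# bracket it (undef past the ends). $db must already be sorted ascending.
sub bracket_all {
    my ($db, $queries) = @_;
    my @order = sort { $queries->[$a] <=> $queries->[$b] } 0 .. $#$queries;
    my @result;
    my $j = 0;                  # never moves backward across queries
    for my $qi (@order) {
        my $q = $queries->[$qi];
        $j++ while $j < @$db && $db->[$j] <= $q;
        # invariant here: $db->[$j-1] <= $q < $db->[$j]
        $result[$qi] = [
            $j > 0    ? $db->[$j - 1] : undef,
            $j < @$db ? $db->[$j]     : undef,
        ];
    }
    return \@result;            # results in the queries' original order
}
```

Note the results are scattered back by original index, so the caller never sees the sorted order.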
Also, I plan on re-coding the same program is C or C++ to see if the
code speeds up; I suspect the gain in performance may make the program
run in a reasonable amount of time without having to use a more
sophisticated algorithm.

Binary searches aren't *that* sophisticated. With a size of 500, I would
definitely choose binary over linear. I recently benchmarked a finely
tuned linear search and a finely tuned binary search in C, and the
crossover point (where binary started doing better than linear) was at an
array size of more than ~120. (That is with g++ -O3. With g++ -O0, the
crossover was something like 10.) So I think you can safely assume it is
worth doing binary on arrays of 500 or more.
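Xho's crossover numbers are for C; the crossover on a given Perl build is easy to measure directly with the core Benchmark module. A quick harness (mine, not Xho's benchmark) comparing the two searches on the 500-element case:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @data = map { $_ * 2 } 0 .. 499;    # 500 sorted values

# Both subs return the index of $x in the sorted array, or -1.
sub linear_find {
    my ($a, $x) = @_;
    for my $i (0 .. $#$a) { return $i if $a->[$i] == $x }
    return -1;
}

sub binary_find {
    my ($a, $x) = @_;
    my ($l, $u) = (0, $#$a);
    while ($l <= $u) {
        my $m = int(($l + $u) / 2);
        if    ($a->[$m] < $x) { $l = $m + 1 }
        elsif ($a->[$m] > $x) { $u = $m - 1 }
        else                  { return $m }
    }
    return -1;
}

# Compare rates for a target sitting past the middle of the array.
cmpthese(10_000, {
    linear => sub { linear_find(\@data, 642) },
    binary => sub { binary_find(\@data, 642) },
});
```

Varying the size of `@data` would locate the Perl crossover the same way Xho did in C.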


Xho
 

John W. Krahn

Alf said:
Sweet! Thanks a lot everyone who helped out. I implemented this
algorithm into my program and it is 10 times faster now (probably good
enough, but I will re-code to C if I require more speed). I consider
the issue resolved (for me, anyway). Here is a mini program that
demonstrates (more or less) what I implemented:

#!/usr/bin/perl

MAIN: {
    my @test_array = qw(60 100 1500 3000);
    unless ($ARGV[0]) { die "ERROR: enter query.\n"; }
    my $query = $ARGV[0];
    my $first_element = $test_array[0];
    my $array_length  = @test_array;
    my $last_element  = $test_array[$array_length - 1];

You don't have to know the array length to get the last element of the array:

my $last_element = $test_array[ -1 ];

Or if you have an older (heaven forbid) version of Perl:

my $last_element = $test_array[ $#test_array ];

And you _could_ (if you wanted to) assign both values using an array slice:

my ($first_element, $last_element) = @test_array[ 0, -1 ];



John
 

Alf McLaughlin

You don't have to know the array length to get the last element of the array:
my $last_element = $test_array[ -1 ];

Or if you have an older (heaven forbid) version of Perl:

my $last_element = $test_array[ $#test_array ];

Or, you could do this:

my $last_element = $test_array[get_last_element_index(\@test_array)];

sub get_last_element_index {
    my $ra_array = shift;
    return -1 * time()/time();
}
 

Alf McLaughlin

With a different 500 database set and different 80 querent set each of
those 8000 times?

Same 500, different 80.
Maybe this is a naive question, but how would that be different than
running the problem one time on a set of 80 million querent values?

It is the same, but each array of 80 represents a distinct data set to
be processed one at a time.
It is certainly a good candidate, both for performance and ease. Another
choice would be to sort both the database array (the 500) and the querent
array (the 80), then your linear method might be faster than the binary
search (because your linear search through the 500 can pick up where the
previous one left off, rather than restarting at the beginning each time.)

I like this idea, but the order of the 80 is important (it represents a
unique signal generated from real-world data that will probably never
appear sorted, while the order of the 500 is sorted and static).

Thanks again!
 

Dr.Ruud

Alf McLaughlin schreef:
the order of the 80 is important (it represents
a unique signal generated from real-world data that will probably never
appear sorted

You can create a temporary index for the signal data.
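One way to read Dr.Ruud's suggestion (my interpretation, not his code): sort a list of *indices* into the signal rather than the signal itself, so the signal keeps its original order while the index gives a sorted view of it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @signal = (31, 7, 19, 2, 25);    # order must be preserved
my @idx    = sort { $signal[$a] <=> $signal[$b] } 0 .. $#signal;

# @signal[@idx] is the sorted view; @signal itself is untouched.
my @sorted_view = @signal[@idx];
```

Results computed on the sorted view map straight back to the original positions through `@idx`.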
 

robic0

Alf McLaughlin said:
Thanks, Anno. I am realizing after posting with Xho that I did not
state my question precisely right, but you did deduce what I am
interested in I think. Basically, I have a list of 500 values and I
would like to take another list of values (which happens to be 80 long
at the moment, but this could be longer as well) and for each of these
80 values I would like to find which 2 of the 500 values are the next
greater value and the next lesser value.

That's exactly what binary search does, in log n time for lists of
length n.

# return the index $i, such that $a->[$i] <= $x and $x < $a->[$i+1];
# if no such index exists, return empty
sub binsearch {
    my ($x, $a) = @_;
    return if $x < $a->[0] or $x >= $a->[-1];
    my ($l, $h) = (0, $#$a + 1);
    while ($h - $l > 1) {
        my $mid = int(($l + $h)/2);
        ($a->[$mid] <= $x ? $l : $h) = $mid;    # ternary used as an lvalue
    }
    return $l;
}
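To see the bracketing in action, here is a runnable copy of Anno's sub with a quick check against the OP's style of data (the example values are mine):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Anno's sub, unchanged in behavior: returns $i with $a->[$i] <= $x < $a->[$i+1],
# or empty when $x falls outside the array's range.
sub binsearch {
    my ($x, $a) = @_;
    return if $x < $a->[0] or $x >= $a->[-1];
    my ($l, $h) = (0, $#$a + 1);
    while ($h - $l > 1) {
        my $mid = int(($l + $h)/2);
        ($a->[$mid] <= $x ? $l : $h) = $mid;
    }
    return $l;
}

my @db = (60, 100, 1500, 3000);
my $i = binsearch(200, \@db);
print "200 lies between $db[$i] and $db[$i+1]\n";    # between 100 and 1500
```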

I've been doing this (in
pseudocode, please take syntax mistakes and finer points lightly):

my @val1 = (my 80 values); my @val2 = (my 500 values);

foreach my $i (@val1) {
for (my $j = 0; $j < @val2 - 1; $j++) {
my $current = $val2[$j];
my $next = $val2[$j + 1];
if ($i > $current && $i < $next) {

#then this value is between $current and next.
}
}
}

Apparently you are already supposing that the list @val2 is sorted. Your
search time is linear in the size of @val2. Binary search will greatly
speed it up.

[rest snipped]

Anno

Ok Anno, after reading everybody's responses, I finally found a place to reply.
It was a hard choice but I think you're the winner here. I actually speed-read
most of the posts so I may not have all the details entirely. However I did
understand the problem and the solution immediately after reading the OP's post.

For instance, I don't know for sure if the "data" array is static or is being
added to as time goes by. I also don't know if it's a single column, as in an
array of numbers. Is it relational, an array of indexes/references/pointers?

Hey, it all means something I guess. My first impression of this entire thread
is that there's a woeful lack of knowledge of columns and records.

But it appears trivial, as some have suggested: just sort the array and do a
"binary" search as you have shown above.

Disregarding the prospect of a dynamic array, i.e. adding and removing elements:
clearly, sorting is not necessary unless it's static. Binary search is used for searching
and basically yields a logarithmic halving. Whether the element is removed or added,
or the array resized, depends on the usage.

So the array could be a monolithic standalone or an index (offset) into another
array, like an array of pointers to more meaningful data such as strings and such.
This is commonly known as an index in DB terms.

An "index" array is an "indirection" array into the body of real data that lies static
in some other location. The index array is maintained in an order that represents
some logical concept, such that when its data is accessed serially it will address the
static data in that logical order. Yeah, it's sorted. This allows multiple indexes
to be maintained on the same static data, which will show different views of it when
accessed serially.

Before big DB managers arrived on the scene (with SQL), we did it like this (I'm sure
nothing's changed).
The indexes (record views) were grown (from defaults) starting from scratch, with zero
size. As a record (a common structure) was created, space was malloc'd, and the pointer
was stored into an array. The index routine of interest was called with the record
pointer's index. That routine knew what field(s) to compare. It compared the field(s)
with its current list. But *NOT* linearly; a "binary" search was used, as Anno shows above.
So the log-n binary search located the position in the index array and shifted things
down to "insert" the new index. Now in 'C' the *index* array is not an array. It is
a "block" of memory alloc'd with a fixed size, so the actual *index* is a pointer,
malloc'd and re-alloc'd. So shifting a sequential block was as simple as a memmove.
And indexing was as simple as: int *Xndxptr; Xndxptr[insert_loc] = index_to_static_mallocd_record;
Contracting was the same. Although nowadays it's voided until a garbage-collection threshold
is reached, then contracted. And some block optimizations for expansion exist also.

Lots of blow-hard shit huh? But that's the way I did it back in the old days.

The important thing to remember, as Anno said, is the "binary" search.
I've done this many times in C apps but never in Perl. The thing that really,
really, really impresses me about the "binary" search is the meticulous attention needed
for *boundary conditions*. This is *not* something that can be perfected in 5 minutes.

The OP just mentioned he might want to do this in C++ so I thought I would drop in here
a simple "real" binary search, albeit custom (but they all are). The important items of
note are the *optimizations* I use, crafted over the years.
I have not had a need to do this in Perl. Hell, I'm just floored when I see linked lists
in Perl.

gluck
robic0



// Get ptr to unit rec (cstr)
UNIT_REC * CUnitsServ::GetPtrToUnitRec(LANG_REC *pLangRec, CString sUnitName)
{
    sUnitName.MakeUpper(); sUnitName.TrimLeft(); sUnitName.TrimRight();
    if (sUnitName == _T(""))
        return NULL;

    if (pLangRec)
    {
        // New 2/11/02 - Binary search via Units Name Hash array
        int width = pLangRec->vaUhash_Ary.size();
        if (width <= 0)
            return NULL;

        int nret;
        int pos = 0;
        int last = width;
        int cnt = 0;
        int chkndx = pos;

        while (1)
        {
            nret = pLangRec->vaUhash_Ary[pos].sUname.CompareNoCase((LPCTSTR)sUnitName);

            if (nret > 0)          // if pos > user name
            {
                last = pos;
                pos = pos - width;
                chkndx = pos;
            }
            else if (nret < 0)     // if pos < user name
            {
                chkndx = pos;
                width = (last - pos) / 2;
                pos = pos + width;
            }
            else                   // if pos == user name
            {
                return &(pLangRec->vaUnits_Ary[pLangRec->vaUhash_Ary[pos].nUndx]);
            }
            if ((last - chkndx) < 5)
            {
                // Down to a handful of candidates: finish linearly.
                for (int i = chkndx; i < last; i++)
                {
                    if (pLangRec->vaUhash_Ary[i].sUname == sUnitName)
                        return &(pLangRec->vaUnits_Ary[pLangRec->vaUhash_Ary[i].nUndx]);
                    cnt++;
                }
                return NULL;
            }
            cnt++;
            if (cnt > 200)
                return NULL;
        }
    }
    return NULL;
}
 

Alf McLaughlin

You can create a temporary index for the signal data.

I could do that; would this really help gain efficiency?
 

Dr.Ruud

Alf McLaughlin schreef:

Alf, your quoting style is not effective. See the Posting Guidelines.
You persist in posting regardless replies. You abject appropriate
attribution, that even google now includes in their default followup
format. You don't sufficiently skip superfluous text.
 

Dr.Ruud

Tad McClellan schreef:
Dr.Ruud:

That has already been pointed out to him last November.

I hadn't seen him for a while, but he used a different NNTP-Posting-Host
for this thread, until today. My log showed a killed message that was a
reaction to something I posted, so it got my attention.

But did you like my alliterations?
 

Alf McLaughlin

You don't sufficiently skip superfluous text.

Are you commenting on the fact that I wrote "sweet!" on the top of my
message? Please tell me that isn't what you are complaining about.
 

robic0

Sweet! Thanks a lot everyone who helped out. I implemented this
algorithm into my program and it is 10 times faster now (probably good
enough, but I will re-code to C if I require more speed). I consider
the issue resolved (for me, anyway). Here is a mini program that
demonstrates (more or less) what I implemented:
[snipped for space]
sub bsearch {
    my ($x, $a) = @_;              # search for x in array a
    my ($l, $u) = (0, @$a - 1);    # lower, upper end of search interval
    my $i;                         # index of probe
    while ($l <= $u) {
        $i = int(($l + $u)/2);
        if ($a->[$i] < $x) {
            $l = $i + 1;
        }
        elsif ($a->[$i] > $x) {
            $u = $i - 1;
        }
        else {
            return [ $a->[$i] ];           # found
        }
    }
    return [ $a->[$u], $a->[$l] ];         # bracketing pair
}

Is this the first time you've heard of a binary search?
You need to implement "honing": when a small threshold
is reached, halving becomes a penalty. You also need a better
scheme for boundary-condition checks.

See the binary search 'C' code I posted in this thread.
Try to emulate it in your Perl sub. The technique, statistically,
has a superior benchmark.
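For the record, a Perl version of what robic0 seems to mean by "honing" might look like this (my reading and my code, not his; whether the threshold actually pays off in Perl, where sub-call overhead dominates, is something you'd have to benchmark):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Binary search that drops to a linear scan once the window shrinks
# below a small threshold. Returns the index of $x in sorted @$a, or -1.
sub bsearch_honed {
    my ($x, $a) = @_;
    my ($l, $u) = (0, $#$a);
    while ($u - $l > 4) {              # 4 is an arbitrary threshold
        my $m = int(($l + $u) / 2);
        if ($a->[$m] < $x) { $l = $m + 1 } else { $u = $m }
    }
    for my $i ($l .. $u) {             # finish linearly
        return $i if $a->[$i] == $x;
    }
    return -1;
}
```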

It might have been asked (dunno), but obviously you're keeping a column sorted
and the search yields the head; what value can single-column data hold
in a real-world algo when you have to keep it sorted and do binary searches on it?

What you describe, your column of numbers, is typically an index into a larger
dataset. But you have sorted the numbers themselves, making them the object of comparison.
One more level of indirection and it's an index. What you are doing is skipping
the index step. Your column of numbers (which should be an index) is actually
your data objects.

The funny thing is "binary searches" rely on columns (indexes) to be pre-sorted
indirections to actual data. Your "data" is actually your column.

Since there's no relationship of your column data to anything else, it's almost
stupid that such a thing is given consideration in any context.

If you understand what I am saying provide some explanation here.
I've seen some stupid things in my life but probably none as stupid as this.

robic0
 
