Need Search::Binary examples

A

Arvin Portlock

I need to extract values from very large but numerically
sorted arrays. I can think of lots of ways to do it but
I'd like to try a binary search and prefer to use an exist-
ing module rather than reinvent the wheel.

Unfortunately I can't make sense of the documentation
in Search:Binary (perhaps there's another more appropriate
module?). Are there any scripts out there that use
Search::Binary which one can find on the web?

I have a set of tens of thousands of books and need to
take a random page sample from the set, i.e., examine
the 102,411th page of the set which might occur in the
971st book. My array is an ordered list of cumulative
page counts for each book. So if each book has a hundred
pages the first element in the array would be 1, the
second would be 101, the third 201, etc. If the random
sample required page 114, then the search would discover
it was page 14 of the second book.

Here's an inefficient way to do it. (Yes, I could
narrow each successive search by incrementing array
index start positions or shifting elements off the
array I know will never match as the random numbers
increase. But I'm not doing that here for simplicity
since I want to illustrate the problem and really
want to learn how to use Search::Binary anyway.)

# Some processing of raw data here. The result is a hash
# and an array. The hash has as keys the place within the
# larger sequence where the first page of the book falls.

my %books = (
1 => 'book 1',
116 => 'book 2',
433 => 'book 3',
762 => 'book 4'
);

# The array consists of the sorted keys of %books above.
# Created when raw data is read so no need for expensive
# sort keys.

my @pages = (1, 116, 433, 762);

# Random numbers are already sorted numerically

my @random_nos = (13, 18, 344, 390, 601);

RANDOM: foreach my $no (@random_nos) {
foreach my $page (@pages) {
if ($no >= $page) {
print "$no is in $books{$page}\n";
next RANDOM;
}
}
}
 
P

Peter J. Holzer

Arvin said:
I need to extract values from very large but numerically
sorted arrays. I can think of lots of ways to do it but
I'd like to try a binary search and prefer to use an exist-
ing module rather than reinvent the wheel.

Unfortunately I can't make sense of the documentation
in Search:Binary (perhaps there's another more appropriate
module?). Are there any scripts out there that use
Search::Binary which one can find on the web?

Search::Binary is obviously written to search files, not arrays. But it
can be used to search arrays, too. Some of the generality for searching
files isn't needed for searching arrays (for example, in a file not
every possible position is the start of a record).
I've changed your script to use Search::Binary, and included the
relevant part of the documentation before each line. That's quite a lot
of documentation per line of code, I hope it helps you to make sense
both of the docs and the code.


#!/usr/bin/perl
use warnings;
use strict;

use Search::Binary;

my %books = (
1 => 'book 1',
116 => 'book 2',
433 => 'book 3',
762 => 'book 4'
);

# The array consists of the sorted keys of %books above.
# Created when raw data is read so no need for expensive
# sort keys.

my @pages = (1, 116, 433, 762);

# For Search::Binary we need the last page of each book, not the first:
# ("whose index value is greater than or equal to $val")
my @lastpages = (115, 432, 761, 999);


# Random numbers are already sorted numerically

my @random_nos = (13, 18, 344, 390, 601);


for my $r (@random_nos) {

my $lastpos;
my $pos = binary_search(
# The input parameters $min and $max are positions
# and represents the extent of the search. Only
# records which begin at positions within this range
# (inclusive) will be searched. Moreover, $min must
# be the starting position of a record.
0, $#lastpages,
# "binary_search" implements a generic binary search
# algorithm returning the position of the first
# record whose index value is greater than or equal
# to $val.
$r,
sub {
# During the search the read function will
# be called with three arguments: the input
# parameters $handle and $val, and a
# position.
my ($handle, $val, $pos) = @_;

# If the position is not "undef", the read
# function should read the first whole
# record starting at or after the position;
# otherwise, the read function should read
# the record immediately following the last
# record it read.
$pos = $lastpos + 1 unless defined($pos);
$lastpos = $pos;

# The read function needs to return a two
# element array consisting of the result of
# comparing $val with the index value of the
# read record and the position of the read
# record. The comparison value must be
# positive if $val is strictly greater than
# the index value of the read record, 0 if
# equal, and negative if strictly less.
# Furthermore, the returned position value
# must be greater than or equal to the
# position the read function was called
# with.
return ($val <=> $handle->[$pos], $pos)
},
# The value of $handle is of no consequence to the
# binary search algorithm; it is merely passed as a
# convenience to the read function.
\@lastpages);
print "page $r is in book $pos with the title '$books{$pages[$pos]}'\n";
}
__END__

hp
 
A

Arvin Portlock

Peter said:
Search::Binary is obviously written to search files, not arrays. But it
can be used to search arrays, too. Some of the generality for searching
files isn't needed for searching arrays (for example, in a file not
every possible position is the start of a record).
I've changed your script to use Search::Binary, and included the
relevant part of the documentation before each line. That's quite a lot
of documentation per line of code, I hope it helps you to make sense
both of the docs and the code.

Hey, it works! I can't thank you enough for the example. I'll
be interested to do some benchmarks with a real data set to see
how much time I saved over something like simply incrementing
the initial array index for each random number when possible.
Still, this is a technique I've wanted to add to my tool belt
for some time.

Some of the documentation makes a bit more sense now. $handle
was confusing to me despite the comments in the documentation.
I probably would have named it $param myself. I guarantee I
NEVER would have figured out what I was supposed to do in $read,
though I understood at least that it is a callback. I've never
familiarized myself with specific binary search algorithms so
that's probably why the documentation was so opaque to me. Though
I thought modern libraries were supposed to shield users from
having to know algorithms.

The program is for QA of one of the large book scanning
collaborations between commercial search engines and libraries.
The vendor returns batches of scanned image files of around
2.5 million files and a random sampling is taken from that
and looked at by a human being. My program will read the
vendor's tab-delimited data and grab the selected files from
a website. Partial program below, minus the web grabbing
stuff which I haven't implemented yet.

Thanks again for your help!

use Term::ReadKey;
use Search::Binary;
use strict 'vars';

my $percent = 12;

my $textfile = $ARGV[0];
if ($textfile) {
if (-e $textfile) {
open (STATFILE, $textfile);
my ($current_sum, %cumulative_page_counts, %page_counts,
@first_pages, @last_pages);
while (my $line = <STATFILE>) {
if ($line =~ /^([^\t]+)\t(\d+)\s*$/) {
my $filename = $1;
my $pagecount = $2;
$cumulative_page_counts{$current_sum + 1} = $filename;
$page_counts{$filename} = $pagecount;
push @first_pages, $current_sum + 1;
$current_sum += $pagecount;
push @last_pages, $current_sum;
}
}
close (STATFILE);
my $randcount = int (($current_sum * ($percent / 100)) + .5);
my @random_numbers;
until ($#random_numbers + 1 >= $randcount) {
my $r = int (rand ($current_sum)) + 1;
push @random_numbers, $r unless grep (/^$r$/, @random_numbers);
}
foreach my $r (sort {$a <=> $b} @random_numbers) {
my $lastpos;
my $pos = binary_search (0, $#last_pages, $r,
sub {
my ($handle, $val, $pos) = @_;
$pos = $lastpos + 1 unless defined ($pos);
$lastpos = $pos;
return ($val <=> $handle->[$pos], $pos);
},
\@last_pages);
my $bookpage = ($r - $first_pages[$pos]) + 1;
my $filename = $cumulative_page_counts{$first_pages[$pos]};
# Grab file from website here
print "Cumulative page $r is page $bookpage in $filename\n";
}

} else {
print STDERR "I can't seem to find a file named\n";
print STDERR "$textfile.\n\n";
print STDERR "Press any key to exit this window and try again.\n";
ReadMode('cbreak');
my $key = ReadKey(0);
ReadMode('normal');
}
} else {
print STDERR "Hi. Don't just click the desktop icon to run this
program. That doesn't work.\n";
print STDERR "You need to drag your statistics file onto the desktop
icon and release it\n";
print STDERR "to run the program.\n\n";
print STDERR "Press any key to exit this window and try again.\n";
ReadMode('cbreak');
my $key = ReadKey(0);
ReadMode('normal');
}

__END__

E.g.,

%cumulative_page_counts = (
1 => file1,
237 => file2,
365 => file3,
872 => file4,
1084 => file5,
1481 => file6,
);

%page_counts = (
file1 => 236,
file2 => 128,
file3 => 507,
file4 => 212,
file5 => 397,
file6 => 78,
);

@first_pages = (1, 237, 365, 872, 1084, 1481);
@last_pages = (236, 364, 871, 1083, 1480, 1558);
 
J

John W. Krahn

Arvin said:
The program is for QA of one of the large book scanning
collaborations between commercial search engines and libraries.
The vendor returns batches of scanned image files of around
2.5 million files and a random sampling is taken from that
and looked at by a human being. My program will read the
vendor's tab-delimited data and grab the selected files from
a website. Partial program below, minus the web grabbing
stuff which I haven't implemented yet.

Thanks again for your help!

use Term::ReadKey;
use Search::Binary;
use strict 'vars';

my $percent = 12;

my $textfile = $ARGV[0];
if ($textfile) {
if (-e $textfile) {
open (STATFILE, $textfile);

[snip]

} else {
print STDERR "I can't seem to find a file named\n";
print STDERR "$textfile.\n\n";
print STDERR "Press any key to exit this window and try again.\n";
ReadMode('cbreak');
my $key = ReadKey(0);
ReadMode('normal');
}
} else {

You should *ALWAYS* verify that the file opened correctly. Even though the
stat() showed that the file exists that does not mean that you would be able
to open it. You should also include the $! and/or $^E variables to indicate
exactly why the file could not be opened.

Also, the string '0' is false so if you had a file named '0' you would not
find it. Use the defined() function to properly test whether a variable
contains a value.

if ( defined $textfile ) {
if ( open STATFILE, '<', $textfile ) {

[snip]

} else {
warn "Cannot open '$textfile' ", $! || $^E;
warn "Press any key to exit this window and try again.\n";
...



John
 
T

Tad McClellan

Arvin Portlock said:
if (-e $textfile) {
open (STATFILE, $textfile);


You should always, yes *always*, check the return value from open():

open(STATFILE, $textfile) or die "could not open '$textfile' $!";
 
A

Arvin Portlock

You should *ALWAYS* verify that the file opened correctly.
Even though the
stat() showed that the file exists that does not mean that you would
be able
to open it. You should also include the $! and/or $^E variables to
indicate
exactly why the file could not be opened.

Also, the string '0' is false so if you had a file named '0' you would not
find it. Use the defined() function to properly test whether a variable
contains a value.

if ( defined $textfile ) {
if ( open STATFILE, '<', $textfile ) {
Thanks for the feedback. I do have a bad habit of not
checking the status of a file open. Mostly because I
just don't like programs dieing on my users. I know about
$! but $^E is new to me. I'll check it out. Thanks also
for pointing out the string '0' bug. For some reason I
have had problems with defined() working as advertised so
typically shy away from it aside from checking for the
existence of hash keys. Don't ask me what the problem
was exactly, I simply remember having problems with it.
When it's critical I actually resort to something like:
if ($var or $var eq '0'). An ugly personal idiom that
I'd just as soon drop. I'll try and use defined () from
now on and possibly post here someday if I have problems
with it that I can't figure out.

I'm also going through and doing a few more optimizations
as they occur to me. For example I believe I don't gain
anything by sorting my random numbers. Hmm, there was one
other that I thought of but has gone out of my mind now.

Arvin
 
A

A. Sinan Unur

Thanks for the feedback. I do have a bad habit of not
checking the status of a file open. Mostly because I
just don't like programs dieing on my users.

You, as the progammer, can handle the failure in another way than just
invoking die. However, I cannot see the point of carrying on, as if
nothing is wrong, after a call to open fails. What you do is up to you,
but you *must* do something.

Sinan











I know about
$! but $^E is new to me. I'll check it out. Thanks also
for pointing out the string '0' bug. For some reason I
have had problems with defined() working as advertised so
typically shy away from it aside from checking for the
existence of hash keys. Don't ask me what the problem
was exactly, I simply remember having problems with it.
When it's critical I actually resort to something like:
if ($var or $var eq '0'). An ugly personal idiom that
I'd just as soon drop. I'll try and use defined () from
now on and possibly post here someday if I have problems
with it that I can't figure out.

I'm also going through and doing a few more optimizations
as they occur to me. For example I believe I don't gain
anything by sorting my random numbers. Hmm, there was one
other that I thought of but has gone out of my mind now.

Arvin



--
--
A. Sinan Unur <[email protected]>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
 
T

Tad McClellan

Arvin Portlock said:
I do have a bad habit of not
checking the status of a file open. Mostly because I
just don't like programs dieing on my users.


If the program's job is to process a file, and the program cannot
open the file, what is it _supposed_ to do instead?

Silently proceed as if the file was empty or fail noisily?

I don't like my programs silently failing to do their job. :)

For some reason I
have had problems with defined() working as advertised so
typically shy away from it aside from checking for the
existence of hash keys.


You cannot use defined() to check the existence of hash keys!

You must use exists() for that.


Check the output from:

my %hash = ( foo => 'FOO', bar => undef );
foreach ( 'not_a_key', keys %hash ) {
print "key '$_' defined\n" if defined $hash{$_};
print "key '$_' exists\n" if exists $hash{$_};
}
 
A

Arvin Portlock

Tad said:
You cannot use defined() to check the existence of hash keys!

You must use exists() for that.

I get confused and I'm too lazy to look up things I
don't use enough to remember. Hence my personal idiom
if ($var or $var eq '0'). There's another personal
idiom I have. I'll often use [\w\W] in regular expressions
because I'm too lazy to remember or look up when . matches
\n or \r (I think it does if you put a /m at the end
but [\w\W] is faster than looking it up to be certain).

Yeah, everyone is apalled I'm sure. Just thought I'd
post a confession from the Dark Side for if you ever
wonder why some people do the craziest things in perl.
Answer: we're lazy and it's faster/less work than
looking it up
 
B

Ben Morrow

Quoth Arvin Portlock said:
I get confused and I'm too lazy to look up things I
don't use enough to remember. Hence my personal idiom
if ($var or $var eq '0').

This doesn't cover '', which is also false. It also doesn't cover
existing-but-undef hash values, which was Tad's point.
There's another personal
idiom I have. I'll often use [\w\W] in regular expressions
because I'm too lazy to remember or look up when . matches
\n or \r (I think it does if you put a /m at the end
but [\w\W] is faster than looking it up to be certain).

No, it's /s. The way I remember it is

/s affects a *s*ingle metachar (.)
/m affects *m*any metachars (^ and $)

I find this easier than the 'single/many lines' they are meant to stand
for.

Using this will probably cause the regex optimizer to produce a slower
regex (I'm not sure if it's clever enough to convert /[\w\W]/ -> /./),
and will *certainly* confuse readers of your code.
Yeah, everyone is apalled I'm sure. Just thought I'd
post a confession from the Dark Side for if you ever
wonder why some people do the craziest things in perl.
Answer: we're lazy and it's faster/less work than
looking it up

This is false Laziness. Especially 'cos it's wrong :).

Ben
 
P

Peter J. Holzer

Arvin said:
Hey, it works! I can't thank you enough for the example. I'll
be interested to do some benchmarks with a real data set to see
how much time I saved over something like simply incrementing
the initial array index for each random number when possible.

If you do, also play around with the optional $size parameter. The
default is 512 is which probably much too high for in-memory lookups. I
suspect that the optimal value in this case is 1 but haven't tested it.
Some of the documentation makes a bit more sense now. $handle
was confusing to me despite the comments in the documentation.
I probably would have named it $param myself.

$param is an awfully non-descriptive name. All parameters are
parameters, why would you call one of them $param, but not the others?
$handle makes a lot of sense if you search a file: Most probably you
would just use the file handle. There are also other handles (database
handles, window handles), so using $handle as a generic term for the
parameter which is used to access the data to be searched seems ok to
me.
I guarantee I NEVER would have figured out what I was supposed to do
in $read, though I understood at least that it is a callback. I've
never familiarized myself with specific binary search algorithms so
that's probably why the documentation was so opaque to me. Though I
thought modern libraries were supposed to shield users from having to
know algorithms.

That it does. You don't have to know anything about the binary search
algorithm to use it (In fact, Search::Binary doesn't implement a pure
binary search - it switches to linear search for small amounts of data).
The model is rather the same as for sort: You don't have to know how the
sorting is done, but you need to supply a function which does the
comparison.

I do agree that the documentation isn't easy to understand. I had to
step through binary_search in the debugger to find an error I made at
first.

hp
 
A

Arvin Portlock

Ben said:
Quoth Arvin Portlock :


This doesn't cover '', which is also false.

Lightbulb! That explains, I think, the problems I've had with
defined and why I stopped using it. In all my programs where
I've used "if ($var or $var eq '0')" I was not interested in
'' and wanted to skip it. Looks I was doing the right thing
there after all. When I tried to use defined I didn't realize
it was true for ''. I've learned something new.
There's another personal
idiom I have. I'll often use [\w\W] in regular expressions
because I'm too lazy to remember or look up when . matches
\n or \r (I think it does if you put a /m at the end
but [\w\W] is faster than looking it up to be certain).

No, it's /s. The way I remember it is

/s affects a *s*ingle metachar (.)
/m affects *m*any metachars (^ and $)

Heh, yeah. See, I told you. Your mnemonic is good. I think
I'll really try and dump that [\w\W] idiom I use which I've
always disliked and, as pointed out, is confusing for Those
Who Come After. If someone pointed a gun at my head and made
me guess I would have said that /s was what you do when you
want to pretty up your regular expressions with white Space
and comments. I'm not insane, though--I would have looked it
up first before actually trying to use it.

Arvin
 
D

David Combs

If you do, also play around with the optional $size parameter. The
default is 512 is which probably much too high for in-memory lookups. I
suspect that the optimal value in this case is 1 but haven't tested it.


$param is an awfully non-descriptive name. All parameters are
parameters, why would you call one of them $param, but not the others?
$handle makes a lot of sense if you search a file: Most probably you
would just use the file handle. There are also other handles (database
handles, window handles), so using $handle as a generic term for the
parameter which is used to access the data to be searched seems ok to
me.


That it does. You don't have to know anything about the binary search
algorithm to use it (In fact, Search::Binary doesn't implement a pure
binary search - it switches to linear search for small amounts of data).
The model is rather the same as for sort: You don't have to know how the
sorting is done, but you need to supply a function which does the
comparison.

I do agree that the documentation isn't easy to understand. I had to
step through binary_search in the debugger to find an error I made at
first.

Maybe this followup is (way) too late for this,
but if now you see how you could (greatly) *improve*
that doc, at least here and there, why not give
it a shot, and then send it to the module-author?

Thanks,

David





hp

--
_ | Peter J. Holzer | Man könnte sich [die Diskussion] auch
|_|_) | Sysadmin WSR/LUGA | sparen, wenn man sie sich einfach sparen
| | | (e-mail address removed) | würde.
__/ | http://www.hjp.at/ | -- Ralph Angenendt in dang 2006-04-15
 
P

Peter J. Holzer

[41 lines deleted. Please quote only text relevant to your answer]
Maybe this followup is (way) too late for this,
but if now you see how you could (greatly) *improve*
that doc, at least here and there, why not give
it a shot, and then send it to the module-author?

You are of course right, that is what I should do.
There are some reasons why I didn't:

* I never used Search::Binary before. Installing it, fixing the OP's
program and explaining it was already more work than I had planned to
spend on answering a usenet posting.

* I'm not very good at improving the writing of other people. I stick
too much to what is already there, so the result is often rather
awkward (at work I tend to rather rewrite a whole chapter from scratch
rather than try to improve somebody else's writing).

* My experience with sending unsolicited patches to CPAN authors isn't
very good. Many CPAN modules seem not to be maintained any more, so
unless I have some personal interest in improving a module (e.g.,
because I'm using it myself) I won't invest time into writing a patch
which will probably just vanish into a black whole.

hp
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,581
Members
45,057
Latest member
KetoBeezACVGummies

Latest Threads

Top