Algorithm for searching sorted strings in binary file

Hunk

Hi

I have a binary file which contains records sorted by identifiers,
which are strings. The identifiers are stored in ascending order. I
have to write a routine that returns the record for a given identifier.
The logical way would be to read the records once, put them in an STL
container such as a vector, and then use lower_bound to search for a
given identifier. But for some strange reason I'm asked not to use a
container but instead to search the file itself (let's leave it at
that).
What is the best algorithm for searching sorted strings? I have read a
reference to the Boyer-Moore-Horspool string-searching algorithm, but I
have no idea about its implementation or speed. I believed that for
sorted strings, binary search is the best technique. Am I wrong in my
assumptions?
Here's the format of the file:

Record Count
----------------------
Record1
----------------------
Record2
----------------------

Each record is made up of a string identifier and a record pointer:
Record = Identifier (36 bytes, unique) + Record_pointer (4 bytes)
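(In C++ terms one record corresponds to something like the struct below.
That the identifier is NUL-padded rather than space-padded, and that the
pointer is an unsigned 32-bit offset, are assumptions on my part.)

#include <cstdint>

// One fixed-length record as laid out in the file: 36 + 4 = 40 bytes.
// With these member types there is no padding, so sizeof(Record) == 40.
struct Record {
    char          identifier[36];   // assumed NUL-padded, not NUL-terminated
    std::uint32_t record_pointer;   // assumed 4-byte unsigned offset
};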
 
Jim Langston

Hunk said:
[...] I believed that for sorted strings, binary search is the best
technique. Am I wrong in my assumptions? [...]

Since the records are fixed length, I believe the old divide-by-two would do
the trick. You need 3 counters: lower bound, upper bound, current.

At the beginning, lowerbound = 1, upperbound = number of records, and
current = upperbound / 2.

Read the current record (seek to its position). If its identifier is greater
than your search string, upperbound becomes current; if it is less,
lowerbound becomes current. Either way, the new current becomes
lowerbound + (upperbound - lowerbound) / 2.

Continue until the match is found.

This uses the power of 2 to determine how many reads are required to find
the correct record. With 128 records (2^7), at most 7 reads are required;
with 256, 8 reads; and so on. So even in large files there are not a lot of
reads required. 4 billion records would only require 32 reads.
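In C++ that loop might look something like this (a sketch, not code from
the thread: it assumes the record count is a 4-byte integer at offset 0,
a NUL-padded 36-byte identifier, a 4-byte pointer, and the platform's
native byte order; findRecord is just an illustrative name):

#include <cstdio>
#include <cstring>
#include <cstdint>

const long HEADER_SIZE = 4;            // assumed size of the record-count field
const long ID_SIZE     = 36;
const long RECORD_SIZE = ID_SIZE + 4;  // identifier + record pointer

// Binary search on the file itself: one seek and one read per probe.
// Returns true and fills 'pointer' if 'id' is found.
bool findRecord(std::FILE* f, const char* id, std::uint32_t& pointer)
{
    std::uint32_t count = 0;
    std::fseek(f, 0, SEEK_SET);
    if (std::fread(&count, sizeof count, 1, f) != 1)
        return false;

    long lo = 0;                       // first candidate record (0-based)
    long hi = count;                   // one past the last candidate
    char buf[RECORD_SIZE];
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        std::fseek(f, HEADER_SIZE + mid * RECORD_SIZE, SEEK_SET);
        if (std::fread(buf, RECORD_SIZE, 1, f) != 1)
            return false;
        int cmp = std::strncmp(buf, id, ID_SIZE);
        if (cmp == 0) {                // exact match: pull out the pointer
            std::memcpy(&pointer, buf + ID_SIZE, sizeof pointer);
            return true;
        }
        if (cmp < 0)
            lo = mid + 1;              // record sorts before the key
        else
            hi = mid;                  // record sorts after the key
    }
    return false;                      // identifier not present
}

Each probe is one seek plus one 40-byte read, so the power-of-2 bound
above is also the bound on disk reads.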
 
Hunk

Jim Langston said:
[...] Since the records are fixed length, I believe the old divide-by-two
would do the trick. [...]
Yeah, I get you. You are suggesting a binary search. I was also of the
opinion that this would be the best possible approach. But wouldn't using
strcmp be costly? Then again, any STL search would also use a string
comparison. How about other searching algorithms?
 
James Kanze

On Mar 30, 1:27 pm, Hunk wrote:
Yeah, I get you. You are suggesting a binary search. I was also of the
opinion that this would be the best possible approach. But wouldn't using
strcmp be costly? Then again, any STL search would also use a string
comparison.

Costly compared to what? You're doing a disk read for each
compare. That's going to take several orders of magnitude more
time than the strcmp.
How about other searching algorithms?

How many searches do you make? Would it be worthwhile to read
the entire file once and create some auxiliary structure
(e.g., a hash table)?

Other than that, you might be able to reduce the number of reads
slightly by reading a "block" (some size optimized for the
particular filesystem) and immediately checking all of the
records in it, so that the next step would be n/2 - \epsilon
rather than just n/2.
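A sketch of that block idea, reusing the 40-byte layout assumed earlier
(the 4 KB block size is only a stand-in for whatever the filesystem
favors, and findRecordBlocked is an illustrative name):

#include <cstdio>
#include <cstring>
#include <cstdint>

const long ID_SIZE     = 36;
const long RECORD_SIZE = ID_SIZE + 4;
const long BLOCK_RECS  = 4096 / RECORD_SIZE;   // records per ~4 KB read

// Each probe reads a block of records starting at the midpoint and
// narrows the interval against all of them, so one read eliminates
// slightly more than half of the remaining records.
bool findRecordBlocked(std::FILE* f, const char* id, long recordCount,
                       long headerSize, std::uint32_t& pointer)
{
    char block[BLOCK_RECS * RECORD_SIZE];
    long lo = 0, hi = recordCount;
    while (lo < hi) {
        long first = lo + (hi - lo) / 2;
        long n = BLOCK_RECS;
        if (first + n > hi)
            n = hi - first;                    // clip the block to the interval
        std::fseek(f, headerSize + first * RECORD_SIZE, SEEK_SET);
        if (std::fread(block, RECORD_SIZE, n, f) != (std::size_t)n)
            return false;
        long i = 0;
        for (; i < n; ++i) {
            int cmp = std::strncmp(block + i * RECORD_SIZE, id, ID_SIZE);
            if (cmp == 0) {                    // found it inside the block
                std::memcpy(&pointer, block + i * RECORD_SIZE + ID_SIZE,
                            sizeof pointer);
                return true;
            }
            if (cmp > 0)                       // key sorts before this record
                break;
        }
        if (i < n)
            hi = first + i;                    // key lies below record first+i
        else
            lo = first + n;                    // key lies above the whole block
    }
    return false;
}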
 
Hunk

James Kanze said:
[...] How many searches do you make? Would it be worthwhile to read
the entire file once and create some auxiliary structure (e.g., a hash
table)? [...]

Yep, I would be reading the entire file into memory and then working on
it, so that solves the multiple-read problem. Given that, is binary
search still a viable option? People here are of the opinion not to put
the data in a structure such as a hash map, for example. Are there any
other string-searching algorithms that would be useful? There will be a
large number of searches, but the already-sorted list will never change.
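Once the file is in memory, a plain binary search over the raw records
still seems the natural fit for a sorted list that never changes, e.g.
with std::lower_bound and a comparator (a sketch, assuming the same
40-byte layout and that the buffer is suitably aligned for the struct):

#include <cstring>
#include <cstdint>
#include <algorithm>

struct Record {                        // matches the assumed file layout,
    char          identifier[36];      // so sizeof(Record) == 40
    std::uint32_t record_pointer;
};

// Strict weak ordering for lower_bound: record identifier vs. search key.
inline bool recordLess(const Record& r, const char* key)
{
    return std::strncmp(r.identifier, key, sizeof r.identifier) < 0;
}

// 'data' points at the first record of the in-memory copy of the file
// (just past the record-count header); 'count' is the record count.
const Record* findInMemory(const Record* data, long count, const char* key)
{
    const Record* end = data + count;
    const Record* it  = std::lower_bound(data, end, key, recordLess);
    if (it != end
        && std::strncmp(it->identifier, key, sizeof it->identifier) == 0)
        return it;
    return 0;                          // not found
}

With everything in memory each comparison is just a strncmp, so the cost
concern above disappears; a hash map would only pay off if the log2(n)
comparisons per lookup ever became measurable.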
 
