[SUMMARY] IP to Country (#139)

R

Ruby Quiz

Matthias Wätcher made a classic mistake this week. He sent in his first ever
quiz solution and told us not to expect anything special out of it. We later
learned that it was a lot faster than all of the other solutions, thanks to more
effort from Matthias. His code taught me some new tricks and I want to share
those with you.

But first, let's talk speed.

Most of the solutions are quite quick, relatively speaking. What's the shared
secret of speed? Binary search. The records are in sorted order, so it's
possible to perform a simple cut-the-possible-matches-in-half-at-each-step
lookup. That favored algorithm makes short work of what could otherwise be a
lengthy search.

Now you can do a binary search on the database file as is and several solutions
did. This is a little trickier, because the records are not of a standard size.
You have to be super careful not to accidently skip records. While the
solutions handled the challenge very well, you can make things quite a bit
easier if you are willing to preprocess the file.

You can also speed things up with some clever preprocessing. That was the other
trick Matthias used to find answers so quickly.

Matthias realized that while the actual records contain quite a bit of data, we
only really care about the start of a range, the end of a range, and the country
code. You can fit all of that in just ten bytes with four for each address and
two for the country.

The file also contains some consecutive ranges that can be collapsed for the
purposes of our search. Such ranges may differ in some attributes, but as long
as their country codes are the same we can treat them as a single unit.

Now that you know the goal, let's take a look at the packing code Matthias sent
in:

#!/usr/bin/ruby
# comment

last_start=nil
last_end=nil
last_country=nil
File.open("packed-ip.dat","wb") do |wfile|
IO.foreach("geo-ip.csv") do |line|
next if !(line =~ /^"/ )
s,e,d1,d2,co=line.delete!("\"").split(",")
s,e = s.to_i,e.to_i
if !last_start
# initialize with first entry
last_start,last_end,last_country = s,e,co
else
if s==last_end+1 and co==last_country
# squeeze if successive ranges have zero gap
last_end=e
else
# append last entry, remember new one
wfile << [last_start,last_end,last_country].pack("NNa2")
last_start,last_end,last_country = s,e,co
end
end
end
# print last entry
if last_start
wfile << [last_start,last_end,last_country].pack("NNa2")
end
end

The three variables declared up front are for tracking the ranges. These will
be used to collapse consecutive ranges.

The next bit of code creates the binary database we will write into and parses
the CSV formated data. Though the data is in the CSV format, the actual content
is trivial and you don't need a proper parser to get at it. As you can see,
Matthias just checks for lines starting with a quote, pulls out all quote
characters, and split()s on commas. This is faster than using a parser.

The if/else chain in the middle of the code does most of the work. First, it
stores the initial range in the tracking variables. For all other ranges it
tests to see if it is consecutive with the last one recorded and has the same
country code. When it does, the endpoint of the last range is just bumped to
include them both. When it doesn't, the last range is written to the file and
the code starts tracking the new range. The final if statement ensures that we
write out the final range before exiting.

This really isn't too much work and it runs in under two seconds on my box.
That's a small price to pay for a speed gain every time we look up an address.

Let's dive into the search code now. Here's how it begins:

#!/usr/bin/ruby

# take command line or stdin -- the latter has performance advantage
# for long lists
if ARGV[0]
arr=ARGV
else
arr=$stdin
end

# ...

As the comment tells us, Matthias added another way to feed the program
addresses. Where I said we should take one from the command-line arguments,
this code actually handles any number of addresses from command-line arguments
or STDIN.

This next bit of code opens up our database:

# ...

# the binary table file is looked up with each request
File.open("packed-ip.dat","rb") do |rfile|
rfile.seek(0,IO::SEEK_END)
record_max=rfile.pos/10-1

# ...

The seek() and pos() calls here are used to find the end of the file. You need
to know both ends of a data set to perform a binary search. Note that dividing
by ten gives us the count of records since they are a fixed width and Matthias
subtracts one because we will never need to seek() to the end of the last record
again.

Now there's a touch more preparation for each address we want to find:

# ...

arr.each { |argv|
# build a 4-char string representation of IP address
# in network byte order so it can be a string compare below
ipstr= argv.split(".").map {|x| x.to_i.chr}.join

# ...

In order to avoid a bunch of to_i() calls on each range extracted from the
database, Matthias just encodes the address we are hunting for into the
character encoding used in the database. This way simple String comparisons can
determine if the address it in the range.

We're now ready for the actual search code:

# ...

# low/high water marks initialized
low,high=0,record_max
while true
mid=(low+high)/2 # binary search median
rfile.seek(10*mid) # one record is 10 byte, seek to position
str=rfile.read(8) # for range matching, we need only 8 bytes
# at comparison, values are big endian, i.e. packed("N")
if ipstr>=str[0,4] # is this IP not below the current range?
if ipstr<=str[4,4] # is this IP not above the current range?
puts rfile.read(2) # a perfect match, voila!
break
else
low=mid+1 # binary search: raise lower limit
end
else
high=mid-1 # binary search: reduce upper limit
end
if low>high # no entries left? nothing found
puts "no country"
break
end
end
}
end

This is a pretty textbook binary search. You always begin by going to the
middle of the current low and high. In this case that's a seek() call and we
have to remember to multiply the midpoint by our record size of ten.

Once we are at the right record, the code read()s the first eight bytes and
compares the address against the low and high for the range. When it is in the
range, the final two bytes are read and printed. If the address is below the
current range, we drop the high to below the current range. If it's above, we
raise the low to above the current range. A final check is added to catch the
cases where no match was found, in which case our low will bypass the high.

Remember that a second goal of the quiz described search was to be memory
friendly. This code does terrific on that front since only one record needs to
be in memory at a time. Once the checks are done, it can be replaced by the
next record.

My thanks to all who showed so many great examples of this classic algorithm.

Tomorrow we will write programs that can help me to better understand my
wardrobe...
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,754
Messages
2,569,521
Members
44,995
Latest member
PinupduzSap

Latest Threads

Top