[QUIZ] IP to Country (#139)


James Edward Gray II

Ah, OK. I get:

$ ruby 139_ip_to_country.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1 5.1.1.1
ZZ
(1.1.1.1 not found)
(2.1.1.1 not found)
US
US
(5.1.1.1 not found)

Well, we weren't all broken:

$ ruby ip_to_country.rb 5.1.1.1
Unknown

James Edward Gray II
 

Jesús Gabriel y Galán

After this:

Here's my submission. It's almost exactly the same as Jesus'. [...]
while low <= high
[...]

and this:

BTW, all solutions already submitted will lie for subnets 1, 2 and 5 :)
Most (but not all) will break on out-of-bounds input (256.256.256.256,
or 0.0.0.-1 -- the latter if comments are stripped out).
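
A cheap guard against such out-of-bounds input (my illustration, not
part of any submission) would be to validate the four octets before
converting:

def valid_ip?(s)
  parts = s.split(".", -1)
  parts.length == 4 && parts.all? { |o| o =~ /\A\d+\z/ && o.to_i <= 255 }
end

valid_ip?("256.256.256.256")  # => false ("256" is out of range)
valid_ip?("0.0.0.-1")         # => false ("-1" is not a bare number)
valid_ip?("4.1.1.1")          # => true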

I made a couple of modifications to my code. Here is the latest version:

require 'ftools'

# Convert the dotted-quad input to a single integer for range comparison.
ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'

File.open(file) do |f|
  low  = 0
  high = f.stat.size
  f.seek(high / 2)
  while low < high
    # Walk backwards byte by byte to the start of the current line.
    while (((a = f.getc) != 10) && (f.pos > 2))
      f.seek(-2, IO::SEEK_CUR)
    end
    pos  = f.pos
    line = f.readline.split(",")
    low_range  = line[0][1..-2].to_i    # [1..-2] strips the quotes.
    high_range = line[1][1..-2].to_i
    if (low_range > ip)
      # Target lies earlier in the file: halve the window, seek back.
      high   = pos
      offset = (f.pos - pos) + ((high - low) / 2)
      f.seek(-offset, IO::SEEK_CUR)
    elsif (high_range < ip)
      # Target lies later in the file: halve the window, seek forward.
      low = f.pos
      f.seek((high - low) / 2, IO::SEEK_CUR)
    else
      puts line[4][1..-2]    # Found the range; print the country code.
      exit
    end
  end
  puts "No country found"
end

I changed the condition to while low < high, and the backwards walk also
failed when reading the first line (btw, Steve, won't your solution also
fail in that case?).

Now:


$ time ruby quiz139b.rb 2.1.1.1
No country found

real 0m0.030s
user 0m0.016s
sys 0m0.004s

$ time ruby quiz139b.rb 5.1.1.1
No country found

real 0m0.032s
user 0m0.008s
sys 0m0.008s

$ time ruby quiz139b.rb 1.1.1.1
No country found

real 0m0.033s
user 0m0.016s
sys 0m0.000s

$ time ruby quiz139b.rb 256.256.256.256
No country found

real 0m0.096s
user 0m0.008s
sys 0m0.000s

Thanks,

Jesus.
 

Erik Veenstra

I decided to make one extra, reusable (?) abstraction layer: class
OrderedLinesFile implements the method find_line, which is given a
block. The block is called with a line of the file; it has to parse
this line and return -1, 0 or 1 for "go back", "bingo!" and "go
forward", respectively. Method find_line then repositions the pointer
in the file and calls the block again, if necessary.

Please shoot...

gegroet,
Erik V. - http://www.erikveen.dds.nl/

BTW, lines 17899 and 17900 of the downloaded file both contain
"2624585728", which is strange... (Downloaded: 2007-09-17 08:34
(UTC).)

----------------------------------------------------------------

$ time ruby quiz139a.rb 11.22.33.44
11.22.33.44 US

real 0m0.011s
user 0m0.006s
sys 0m0.004s

----------------------------------------------------------------

$ wc all_ip_blocks
82358 82358 905115 all_ip_blocks

$ time ruby quiz139b.rb all_ip_blocks

real 0m45.681s # That's 1800 IPs per second.
user 0m40.351s
sys 0m5.300s

----------------------------------------------------------------

class OrderedLinesFile
  def initialize(file_name)
    @file_name = File.expand_path(file_name)
  end

  def find_line(&block)
    line = nil

    File.open(@file_name) do |io|
      position = 0
      delta    = File.size(@file_name)/2
      line     = io.gets    # The first line of the file.
      line     = io.gets while /^\s*#/ =~ line or /^\s*$/ =~ line

      while delta > 0 and line and (space_ship = block.call(line)) != 0
        position += space_ship < 0 ? -delta : +delta
        delta    /= 2

        if position > 0
          io.pos      = position
          broken_line = io.gets    # Skip the current (broken) line.
          line        = io.gets    # The current line of the file.
          line        = io.gets while /^\s*#/ =~ line or /^\s*$/ =~ line
        else
          delta = 0    # Stop.
        end
      end

      line = nil if delta == 0    # Nothing found.
    end

    line
  end
end

class IpToCountry
  FILE = "IpToCountry.csv"

  def country_of(ip_addr)
    ip_addr = ip_addr.split(/\./).collect{|n| n.to_i}.inject{|n, m| n*256+m}    # "1.2.3.4" --> 16909060
    olf     = OrderedLinesFile.new(FILE)
    res     = nil

    olf.find_line do |line|
      ip_from, ip_to, registry, assigned, ctry, cntry, country = line.gsub(/"/, "").split(/,/, 7)

      if ip_addr < ip_from.to_i
        -1    # Go back in the file.
      elsif ip_addr > ip_to.to_i
        +1    # Go forward in the file.
      else
        res = ctry
        0     # Bingo!
      end
    end

    res
  end
end

itc = IpToCountry.new

ARGV.each do |ip_addr|
  puts "%-16s %s" % [ip_addr, itc.country_of(ip_addr)||"??"]
end

----------------------------------------------------------------
 

Erik Veenstra

$ ruby quiz139.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1 5.1.1.1
0.1.1.1 ZZ
1.1.1.1 ??
2.1.1.1 ??
3.1.1.1 US
4.1.1.1 US
5.1.1.1 ??

:}

gegroet,
Erik V.
 

Matthias Wächter

I just performed the tests another time on my desktop computer with a
freshly installed Sabayon Linux. I don't know why this one performs that
much better -- it's gentoo-based, it's the same file system, but it has
no SCSI controller - maybe that makes it faster? Or is it the dual-core
... anyway.

It's about twice as fast and requires a fraction of the "sys" times.
Also the "empty" ruby run is only 5 ms. Mysteriously.

The packer is down at 0.8s, and 100000 lookups are at 6.2 s (file-based
aka memory efficient) and 3.7 s (everything in-memory).

Earlier I wrote: "Anyway -- I'd like to see a 100000 lookups comparison
:) *hehe*". My results for 100000 lookups are below; in each pair of
runs, the first timing is from the old server and the second from the
new desktop:

$ time ./packer.rb                  # old server

real 0m1.634s
user 0m1.576s
sys 0m0.056s

$ time ./packer.rb                  # new desktop

real 0m0.716s
user 0m0.708s
sys 0m0.008s

$ time ./rand_ip.rb 100000 byline > ip-test-list    # old server

real 0m0.797s
user 0m0.740s
sys 0m0.056s

$ time ./rand_ip.rb 100000 byline > ip-test-list    # new desktop

real 0m0.322s
user 0m0.317s
sys 0m0.005s

$ time ./search_file.rb < ip-test-list > /dev/null  # old server

real 0m11.091s
user 0m9.673s
sys 0m1.420s

$ time ./search_file.rb < ip-test-list > /dev/null  # new desktop

real 0m6.201s
user 0m5.316s
sys 0m0.885s

$ time ./search_str.rb < ip-test-list > /dev/null   # old server

real 0m7.131s
user 0m6.960s
sys 0m0.168s

$ time ./search_str.rb < ip-test-list > /dev/null   # new desktop

real 0m3.714s
user 0m3.707s
sys 0m0.006s

btw: I don't understand the following result -- can you help me improve
this? 129 ms just for firing Ruby up! Can't be!

$ time ruby /dev/null               # old server

real 0m0.129s
user 0m0.120s
sys 0m0.012s

$ time ruby /dev/null               # new desktop

real 0m0.005s
user 0m0.004s
sys 0m0.001s

$ uname -a                          # old server
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux

$ uname -a                          # new desktop
Linux sabayon2me 2.6.22-sabayon #1 SMP Mon Sep 3 00:33:06 UTC 2007
x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux

$ ruby --version                    # old server
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

$ ruby --version                    # new desktop
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

- Matthias
 

Matthias Wächter

James said:
Great. Run it for us and let us know how we do. ;)


Here are the results of the supplied solutions so far, and it looks like
my solution can take the 100k-performance victory :)

First Table: Compilation (Table Packing)

                real     user     sys
Adam[*]        0.005    0.002   0.003
Luis           0.655    0.648   0.007
James[**]     21.089   18.142   0.051
Jesse          1.314    1.295   0.020
Matthias       0.718    0.711   0.008

[*]: Adam does not perform real compression; he just finds and saves two
boundaries within the original .csv, which he subsequently searches.
[**]: Upon rebuild, James fetches the .csv sources from the web, which
makes his solution look slow. This figure depends heavily on your --
actually my -- ISP speed.


Second Table: Run (100_000 Addresses)

                real     user     sys
Adam          24.943   22.993   1.951
Bill          35.080   33.029   2.051
Luis          16.149   13.706   2.444
Eugene[*]     52.307   48.689   3.620
Eugene        65.790   61.984   3.805
James         14.803   12.449   2.356
Jesse         14.016   12.343   1.673
Jesus_a[**]        -        -       -
Jesus_b[**]        -        -       -
Kevin[***]         -        -       -
Matt_file      6.192    5.332   0.859
Matt_str       3.704    3.699   0.005
Simon         69.417   64.679   4.706
Justin        56.639   53.292   3.345
steve         63.659   54.355   9.294

[*]: Eugene already implements a random generator. To make things fair,
I changed his implementation to read the same values from $stdin as all
the other implementations. The starred version uses his own random
generator and runs outside the competition; the starless version is my
modified one.
[**]: O Jesus :), I can't make your FasterCSV version (a) run, and in
the later version you sent, your direct parsing breaks when it comes to
detecting the commented lines in the first part of the file. I couldn't
manage to make it run, sorry.
[***]: Although I managed to write the missing SQL insertion script and
even to add separate indexes for the address limits, Kevin's SQLite3
version simply took too long: I estimated a run time of over an hour. I
am willing to replay the test if someone tells me how to speed things up
with SQLite3 to make it competitive.

Note that I slightly changed all implementations to contain a loop that
iterates over $stdin.each (see the sketch below) instead of reading ARGV
or just ARGV[0]. For the test, each script was run only once and was
supplied with all addresses in one run. The test set consisted of
100_000 freshly generated random IP addresses, written to a file and
supplied using the following syntax:

$ (time ruby IpToCountry.rb <IP100k > /dev/null) 2>100k.time
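
The driver-loop change looked roughly like this (a sketch only;
country_of stands for whatever lookup routine each solution provides --
the name is mine):

# Before: one lookup per command-line argument.
# ARGV.each { |ip| puts country_of(ip) }

# After: read all test addresses from $stdin instead.
$stdin.each do |line|
  puts country_of(line.chomp)
end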

I didn't check the output of the scripts, although I checked one address
upfront. This was mainly because every script has a different output
format; my tests were just for measuring performance.


Just for Info:

$ uname -a
Linux sabayon2me 2.6.22-sabayon #1 SMP Mon Sep 3 00:33:06 UTC 2007
x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]
$ cat /etc/sabayon-release
Sabayon Linux x86-64 3.4

- Matthias
 

Bill Kelly

From: "Matthias Wächter said:
Here are the results of the supplied solutions so far, and it looks like my solution can take the 100k-performance victory :)

Thanks for putting that together! Fun to see the different
times.

If I've understood correctly, it looks like my solution seems
to be the fastest (so far) of those that operate on the
unmodified .csv file?

I wasn't expecting that, at all...

I would have bet Simon's would be faster. Strange!


Regards,

Bill
 

Adam Shelly

Bill said:
If I've understood correctly, it looks like my solution seems
to be the fastest (so far) of those that operate on the
unmodified .csv file?

It depends what you mean by unmodified - my algorithm runs off the
original file; the only "modification" I do in the setup stage is
searching for and saving the byte offsets of the first and last
records. It looks like I could have done that every time my script
was run and only added 5 ms.
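
That setup step might look something like this (my sketch, assuming the
standard IpToCountry.csv with leading # comments -- not Adam's actual
code):

# Find the byte offsets of the first and last data records.
def record_bounds(path)
  File.open(path) do |f|
    first = nil
    while line = f.gets
      unless line =~ /^\s*#/ or line =~ /^\s*$/
        first = f.pos - line.length    # Start of the first data record.
        break
      end
    end
    # Scan the tail of the file for the start of its last line.
    tail = [f.stat.size, 4096].min
    f.seek(-tail, IO::SEEK_END)
    chunk = f.read(tail)
    last  = f.stat.size - tail + chunk.rstrip.rindex("\n") + 1
    [first, last]
  end
end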
Bill said:
I would have bet Simon's would be faster. Strange!

I thought block file reads would be faster too, that was the next
thing I was planning to try. Maybe it's the regexp that slowed it
down.


-Adam
 

Bill Kelly

From: "Adam Shelly" <[email protected]>
It depends what you mean by unmodified - my algorithm runs off the
original file, the only "modification" I am doing in the setup stage
is searching for and saving the byte offset of the first and last
records. It looks like l could have done that every time my script
was run and only added 5 ms.

Ah, I see. Cool.

Incidentally, since the file format description indicates comment
lines may appear anywhere in the file, I allowed for that. However,
I doubt adding a loop to your gets/split logic to keep going until a
valid record is found would affect your time much at all.
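
Such a loop is small -- a sketch of the idea (my code, not Bill's or
Adam's):

# Keep reading until we get a real data record, skipping comments/blanks.
def next_record(f)
  while line = f.gets
    next if line =~ /^\s*#/ or line =~ /^\s*$/
    return line.chomp.split(",")
  end
  nil    # EOF without finding a record.
end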

Nice job :)


Regards,

Bill
 

James Edward Gray II

Matthias said:
Here are the results of the supplied solutions so far, and it looks
like my solution can take the 100k-performance victory :)

Thank you very much for putting this together and wow, your code is
lightning quick.

James Edward Gray II
 

Jesús Gabriel y Galán

James said:
Great. Run it for us and let us know how we do. ;)

Matthias said:
[**]: O Jesus :), I can't make your FasterCSV version (a) run, and in
the later version you sent your direct parsing breaks when it comes to
detecting the commented lines in the first part of the file. I couldn't
manage to make it run, sorry.

Hi,

Yes, I only tested with my version of the file, from which I manually
removed the comments. I don't think I'll have time to fix that, at
least this week. Anyway, I find it strange that the FasterCSV version
doesn't work, because it delegates the parsing of the file to that gem,
and the rest is pretty simple. On the other hand, I don't expect the
first version to perform anywhere near the other solutions, so it's not
so important :).

Jesus.
 

James Edward Gray II

James said:
On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
Anyway -- I'd like to see a 100000 lookups comparison :) *hehe*
Great. Run it for us and let us know how we do. ;)

Matthias said:
[**]: O Jesus :), I can't make your FasterCSV version (a) run, and in
the later version you sent your direct parsing breaks when it comes to
detecting the commented lines in the first part of the file. I couldn't
manage to make it run, sorry.

Jesús said:
Hi,

Yes, I only tested with my version of the file, from which I manually
removed the comments. I don't think I'll have time to fix that, at
least this week. Anyway, I find it strange that the FasterCSV version
doesn't work, because it delegates the parsing of the file to that gem,
and the rest is pretty simple.

Comments are not a part of the CSV specification, so FasterCSV
doesn't address them. I would like to add ignore patterns at some
point, though.
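
In the meantime, a workaround is to strip the comment lines yourself
before handing the data to FasterCSV -- a sketch (mine, not James's):

require 'rubygems'
require 'faster_csv'

data = File.readlines("IpToCountry.csv").
         reject { |l| l =~ /^\s*#/ or l =~ /^\s*$/ }.join
FCSV.parse(data) do |row|
  # row => [IP_FROM, IP_TO, REGISTRY, ASSIGNED, CTRY, CNTRY, COUNTRY]
end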

James Edward Gray II
 

Mark Thomas

Ok, here's the result of mine:

mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK

real 0m0.004s
user 0m0.004s
sys 0m0.000s


Here's my code:

#!/usr/bin/ruby
ARGV.each { |ip|
  f = ip.split(/\./).join "/"
  puts File.open(f).readlines[0] rescue puts "Unknown"
}

I think it's pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not? :)
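
(Mark doesn't show the preparation step; presumably it expands every
range in the .csv into one file per address, named by octets. A hedged
sketch of that idea -- the path layout follows Mark's lookup, everything
else is my guess:)

#!/usr/bin/ruby
# Hypothetical prep step: expand IpToCountry.csv into one file per
# address. WARNING: this literally creates a file for every listed IPv4
# address -- the MASSIVE disk waste mentioned above.
require 'fileutils'

File.foreach("IpToCountry.csv") do |line|
  next if line =~ /^\s*#/ or line =~ /^\s*$/
  from, to, _reg, _asgn, ctry = line.delete('"').split(",")
  (from.to_i..to.to_i).each do |n|
    octets = [24, 16, 8, 0].map { |s| (n >> s) & 255 }
    path   = octets.join("/")                 # e.g. "195/135/211/255"
    FileUtils.mkdir_p(File.dirname(path))
    File.open(path, "w") { |f| f.puts ctry }
  end
end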

- Mark.
 

James Edward Gray II

Mark said:
Ok, here's the result of mine:

mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK

real 0m0.004s
user 0m0.004s
sys 0m0.000s


Here's my code:

#!/usr/bin/ruby
ARGV.each { |ip|
  f = ip.split(/\./).join "/"
  puts File.open(f).readlines[0] rescue puts "Unknown"
}

I think it's pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not? :)

Pretty clever.

I bet with the right prep, this could even be a pretty viable
approach. Instead of building a file for each address, you could
create a directory structure from the hexadecimal representations of
each piece of the address. The final layer could be handled as you
have here, or with a search through a much smaller file.
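
(One way that lookup could go -- a sketch only; the hex layout and the
per-directory "ranges" file are my invention, not James's:)

# Hypothetical lookup for a layout like "c3/87/ranges", where each leaf
# file holds "from,to,country" lines covering just that /16 slice.
def country_of(ip)
  a, b, c, d = ip.split(".").map { |o| o.to_i }
  dir = "%02x/%02x" % [a, b]               # e.g. "c3/87" for 195.135.x.x
  n   = ((a * 256 + b) * 256 + c) * 256 + d
  File.foreach(File.join(dir, "ranges")) do |line|
    from, to, ctry = line.chomp.split(",")
    return ctry if n >= from.to_i && n <= to.to_i
  end
  "Unknown"
rescue Errno::ENOENT
  "Unknown"
end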

James Edward Gray II
 

Bill Kelly

From: "James Edward Gray II said:
Ok, here's the result of mine:

mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK

real 0m0.004s
user 0m0.004s
sys 0m0.000s


Here's my code:

#!/usr/bin/ruby
ARGV.each { |ip|
  f = ip.split(/\./).join "/"
  puts File.open(f).readlines[0] rescue puts "Unknown"
}

I think it's pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not? :)

LOL! Nice... :)

James said:
Pretty clever.

I bet with the right prep, this could even be a pretty viable
approach. Instead of building a file for each address you could
create a directory structure for the hexadecimal representations of
each piece of the address. The final layer could be handled as you
have here or with a search through a much smaller file.

Indeed, I was thinking last night about preprocessing the data
into an on-disk hash table. I was thinking about a flat file
at the time, but one could use a subdirectory technique like the
above... using hex components of the hash value to index through
a couple subdirs to reach a leaf file containing one or more
records.

Another thing I'd like to try but won't have time for is to use
ruby-mmap somehow. (Preprocess the data into a flat binary
file, then use ruby-mmap when performing the binary search.)
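
(The flat-binary-file part might look like this -- my sketch, with the
mmap step left out; a plain File with seek shows the same access
pattern. The 10-byte record layout is an assumption:)

RECORD = 10    # 4-byte "from" + 4-byte "to" + 2-byte country code

# Hypothetical packer: CSV -> fixed-width binary records (already sorted).
def pack(csv, bin)
  File.open(bin, "wb") do |out|
    File.foreach(csv) do |line|
      next if line =~ /^\s*#/ or line =~ /^\s*$/
      from, to, _reg, _asgn, ctry = line.delete('"').split(",")
      out.write([from.to_i, to.to_i].pack("NN") + ctry[0, 2])
    end
  end
end

# Binary search over the fixed-width records.
def lookup(bin, ip)
  n = ip.split(".").inject(0) { |acc, o| acc * 256 + o.to_i }
  File.open(bin, "rb") do |f|
    lo, hi = 0, f.stat.size / RECORD - 1
    while lo <= hi
      mid = (lo + hi) / 2
      f.seek(mid * RECORD)
      from, to = f.read(8).unpack("NN")
      if    n < from then hi = mid - 1
      elsif n > to   then lo = mid + 1
      else  return f.read(2)    # The country code follows the bounds.
      end
    end
  end
  nil
end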

Anyway thanks for the fun quiz. :)


Regards,

Bill
 

Simon Kröger

Adam said:
I thought block file reads would be faster too, that was the next
thing I was planning to try. Maybe it's the regexp that slowed it
down.

Without looking at the other solutions in detail, I think one of the
problems may be that my solution opens the file for each lookup -
that's of course easy to fix. I don't know if that's the problem or the
overhead of creating ten thousand IPAddr objects - I refuse to analyse
this in depth because I don't have a use case for looking up that many
locations in a single run.

(On the other hand, I do understand how much fun it can be to optimize
such a problem to death - so go on if you like; I don't have the
motivation - this time :)

cheers

Simon
 
