suggestions for comparing two large data sets requested

T

Terry L. Ridder

hello;

background:
using the daily statistic files from the four regional internet
registries (rir), apnic, arin, lacnic, and ripe, i create a 'holes' data
set. 'holes' are network address blocks that are not reserved by iana
nor allocated or assigned by any of the four rirs. the current 'holes'
file contains 149190 entries in cidr notation.

a third party file contains ip addresses which in theory should be
blocked are various reasons. this flat ascii file has 16420 entries in
cidr notation. a review of the third party file shows ip addresses
listed which are really 'holes', i.e. they are neither reserved,
allocated, nor assigned by iana, apnic, arin, lacnic, or ripe.
however, that does not rule out that someone may actually be attempting
to use them.

the 'holes' data and the third party data need to be compared.
there are several possibilities:

for each 'holes' entry:
$holes_lo == begin ip address of network address block.
$holes_hi == end ip address of network address block.

for each 'third party' entry:
$block_lo == begin ip address of network address block.
$block_hi == end ip address of network address block.

$holes_lo < $block_lo &&
$holes_lo < $block_hi &&
$holes_hi > $block_lo &&
$holes_hi < $block_hi
partial overlap;
flag block entry;

$holes_lo > $block_lo &&
$holes_lo < $block_hi &&
$holes_hi > $block_lo &&
$holes_hi > $block_hi
partial overlap;
flag block entry;

$holes_lo > $block_lo &&
$holes_lo > $block_hi &&
$holes_hi > $block_lo &&
$holes_hi > $block_hi
no overlap;
ok;

$holes_lo < $block_lo &&
$holes_lo < $block_hi &&
$holes_hi < $block_lo &&
$holes_hi < $block_hi
no overlap;
ok;

$holes_lo < $block_lo &&
$holes_lo < $block_hi &&
$holes_hi > $block_lo &&
$holes_hi > $block_hi
total overlap;
flag block entry;

the flagged block entries will be check against the bgp routing tables
by querying the router for announced routes just to make sure someone
is not attempting to use it.

using foreach loops would be braindead given the number of entries.
149190 x 16420. ( which change daily. )

please note:
all ip addresses are stored as numbers and *not* as dotted quads.

the reason for doing this is to provide feedback to the third party
concerning their listings and to request clarification as to why they
are listing network address blocks which are neither reserved,
allocated, assigned, nor routed.

i would be the first to agree that the third party should be checking
their listings, but for whatever reason they are not. i have pointed
out several 'errors' to the third party but it falls of deaf ears or
blind eyes depending on your perspective.
 
J

James Willmore

On Tue, 14 Oct 2003 00:14:38 -0500
<snip>

There are various Perl modules you could use to aid in this task.
Check out:
http://search.cpan.org

Since you don't have a specific Perl question, this is the best answer
I can give you. Others may have somethign else to offer.

HTH

--
Jim

Copyright notice: all code written by the author in this post is
released under the GPL. http://www.gnu.org/licenses/gpl.txt
for more information.

a fortune quote ...
For some reason, this fortune reminds everyone of Marvin
Zelkowitz.
 
J

John W. Krahn

Terry L. Ridder said:
[snip]

the flagged block entries will be check against the bgp routing tables
by querying the router for announced routes just to make sure someone
is not attempting to use it.

using foreach loops would be braindead given the number of entries.
149190 x 16420. ( which change daily. )

please note:
all ip addresses are stored as numbers and *not* as dotted quads.

It sounds like you could use a bit vector to store the 'holes' data
which would require a 536,870,912 byte string and have look-ups of O(1).


John
 
A

Anno Siegel

John W. Krahn said:
Terry L. Ridder said:
[snip]

the flagged block entries will be check against the bgp routing tables
by querying the router for announced routes just to make sure someone
is not attempting to use it.

using foreach loops would be braindead given the number of entries.
149190 x 16420. ( which change daily. )

please note:
all ip addresses are stored as numbers and *not* as dotted quads.

It sounds like you could use a bit vector to store the 'holes' data
which would require a 536,870,912 byte string and have look-ups of O(1).

Ah, the sophistication of brute force :)

It may cost some time to set up the table in the first place. A (probably
substantial) subset of 8*536,870,912 bits must be set, more or less
individually.

Alternatively, one could use binary search to find the starting and
ending points of a possible enclosing interval.

Anno
 
Q

Quantum Mechanic

Alternatively, one could use binary search to find the starting and
ending points of a possible enclosing interval.

Assuming the begin/end pairs in each list are sorted, use the merge
sort algorithm, with some stream state thrown in, and it should take
linear time.

-QM
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,482
Members
44,900
Latest member
Nell636132

Latest Threads

Top