suggestions for comparing two large data sets requested

Discussion in 'Perl Misc' started by Terry L. Ridder, Oct 14, 2003.

  1. hello;

    background:
    using the daily statistic files from the four regional internet
    registries (rir), apnic, arin, lacnic, and ripe, i create a 'holes' data
    set. 'holes' are network address blocks that are not reserved by iana
    nor allocated or assigned by any of the four rirs. the current 'holes'
    file contains 149190 entries in cidr notation.

    a third party file contains ip addresses which in theory should be
    blocked are various reasons. this flat ascii file has 16420 entries in
    cidr notation. a review of the third party file shows ip addresses
    listed which are really 'holes', i.e. they are neither reserved,
    allocated, nor assigned by iana, apnic, arin, lacnic, or ripe.
    however, that does not rule out that someone may actually be attempting
    to use them.

    the 'holes' data and the third party data need to be compared.
    there are several possibilities:

    for each 'holes' entry:
    $holes_lo == begin ip address of network address block.
    $holes_hi == end ip address of network address block.

    for each 'third party' entry:
    $block_lo == begin ip address of network address block.
    $block_hi == end ip address of network address block.

    $holes_lo < $block_lo &&
    $holes_lo < $block_hi &&
    $holes_hi > $block_lo &&
    $holes_hi < $block_hi
    partial overlap;
    flag block entry;

    $holes_lo > $block_lo &&
    $holes_lo < $block_hi &&
    $holes_hi > $block_lo &&
    $holes_hi > $block_hi
    partial overlap;
    flag block entry;

    $holes_lo > $block_lo &&
    $holes_lo > $block_hi &&
    $holes_hi > $block_lo &&
    $holes_hi > $block_hi
    no overlap;
    ok;

    $holes_lo < $block_lo &&
    $holes_lo < $block_hi &&
    $holes_hi < $block_lo &&
    $holes_hi < $block_hi
    no overlap;
    ok;

    $holes_lo < $block_lo &&
    $holes_lo < $block_hi &&
    $holes_hi > $block_lo &&
    $holes_hi > $block_hi
    total overlap;
    flag block entry;

    the flagged block entries will be check against the bgp routing tables
    by querying the router for announced routes just to make sure someone
    is not attempting to use it.

    using foreach loops would be braindead given the number of entries.
    149190 x 16420. ( which change daily. )

    please note:
    all ip addresses are stored as numbers and *not* as dotted quads.

    the reason for doing this is to provide feedback to the third party
    concerning their listings and to request clarification as to why they
    are listing network address blocks which are neither reserved,
    allocated, assigned, nor routed.

    i would be the first to agree that the third party should be checking
    their listings, but for whatever reason they are not. i have pointed
    out several 'errors' to the third party but it falls of deaf ears or
    blind eyes depending on your perspective.

    --
    terry l. ridder ><>
    postmaster at blauedonau.com
     
    Terry L. Ridder, Oct 14, 2003
    #1
    1. Advertising

  2. On Tue, 14 Oct 2003 00:14:38 -0500
    "Terry L. Ridder" <> wrote:
    <snip>

    There are various Perl modules you could use to aid in this task.
    Check out:
    http://search.cpan.org

    Since you don't have a specific Perl question, this is the best answer
    I can give you. Others may have somethign else to offer.

    HTH

    --
    Jim

    Copyright notice: all code written by the author in this post is
    released under the GPL. http://www.gnu.org/licenses/gpl.txt
    for more information.

    a fortune quote ...
    For some reason, this fortune reminds everyone of Marvin
    Zelkowitz.
     
    James Willmore, Oct 14, 2003
    #2
    1. Advertising

  3. "Terry L. Ridder" wrote:
    >
    > [snip]
    >
    > the flagged block entries will be check against the bgp routing tables
    > by querying the router for announced routes just to make sure someone
    > is not attempting to use it.
    >
    > using foreach loops would be braindead given the number of entries.
    > 149190 x 16420. ( which change daily. )
    >
    > please note:
    > all ip addresses are stored as numbers and *not* as dotted quads.


    It sounds like you could use a bit vector to store the 'holes' data
    which would require a 536,870,912 byte string and have look-ups of O(1).


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Oct 14, 2003
    #3
  4. Terry L. Ridder

    Anno Siegel Guest

    John W. Krahn <> wrote in comp.lang.perl.misc:
    > "Terry L. Ridder" wrote:
    > >
    > > [snip]
    > >
    > > the flagged block entries will be check against the bgp routing tables
    > > by querying the router for announced routes just to make sure someone
    > > is not attempting to use it.
    > >
    > > using foreach loops would be braindead given the number of entries.
    > > 149190 x 16420. ( which change daily. )
    > >
    > > please note:
    > > all ip addresses are stored as numbers and *not* as dotted quads.

    >
    > It sounds like you could use a bit vector to store the 'holes' data
    > which would require a 536,870,912 byte string and have look-ups of O(1).


    Ah, the sophistication of brute force :)

    It may cost some time to set up the table in the first place. A (probably
    substantial) subset of 8*536,870,912 bits must be set, more or less
    individually.

    Alternatively, one could use binary search to find the starting and
    ending points of a possible enclosing interval.

    Anno
     
    Anno Siegel, Oct 14, 2003
    #4
  5. -berlin.de (Anno Siegel) wrote in message news:<bmgils$anq$-Berlin.DE>...
    > Alternatively, one could use binary search to find the starting and
    > ending points of a possible enclosing interval.


    Assuming the begin/end pairs in each list are sorted, use the merge
    sort algorithm, with some stream state thrown in, and it should take
    linear time.

    -QM
     
    Quantum Mechanic, Oct 14, 2003
    #5
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Andy Fish

    xpath: comparing two node sets

    Andy Fish, Mar 9, 2005, in forum: XML
    Replies:
    3
    Views:
    2,859
    Andy Fish
    Mar 10, 2005
  2. Eric Lilja
    Replies:
    9
    Views:
    377
    Old Wolf
    May 26, 2005
  3. John Salerno

    comparing values in two sets

    John Salerno, May 14, 2006, in forum: Python
    Replies:
    11
    Views:
    566
    Gerard Flanagan
    May 15, 2006
  4. java
    Replies:
    7
    Views:
    291
  5. alt.testing@{g}mail.com

    suggestions on intelligent processing of data sets in a file

    alt.testing@{g}mail.com, May 9, 2007, in forum: Perl Misc
    Replies:
    2
    Views:
    81
    alt.testing@{g}mail.com
    May 14, 2007
Loading...

Share This Page