Efficient Integer Comparison Algorithm

Discussion in 'Java' started by JThangiah, Mar 4, 2008.

  1. JThangiah

    JThangiah Guest

    Hi all,
    I need some help to find an efficient range or pair comparison
    algorithm for integers.
    My requirement is to link records in two different databases with
    different set of records. Both have a pair of values called co-
    ordinates which are similar and can be used to match the records.

    My Solution:
    I am going to pull a set of records from both databases and sort them
    based on these co-ordinates. I will put them in two lists so that the
    indexes for the records to be compared in both lists are the same.
    I will then do a comparison of both lists to find the exact match.
    If I find the match, I need to record the matched records elsewhere to
    create an XML file with those results.

    Constraints:
    Both the databases are so large that I have to find the most efficient
    algorithm and datastructure to do the job.
    The execution will be done in a batch job.
    The records pulled from both databases to form the lists are unique
    and have no duplicates but each have a separate unique identifier.
    There may be excess data that cannot be compared in either of the
    databases.

    Questions:
    1. Is list the best data-structure for this since speed is my main
    concern. Or will the autoboxing of integers cause an overhead?
    2. Are there any good range-comparison or pair-comparison algorithms
    for integers that I can use?
    3. Can I make use of concurrency algorithms for this?
    JThangiah, Mar 4, 2008
    #1
    1. Advertising

  2. On 04.03.2008 18:27, JThangiah wrote:
    > I need some help to find an efficient range or pair comparison
    > algorithm for integers.
    > My requirement is to link records in two different databases with
    > different set of records. Both have a pair of values called co-
    > ordinates which are similar and can be used to match the records.


    In other words: you consider records equivalent that have the exact same
    values for both coordinates. What are the properties of your coords?
    Are there positive and / or negative values? How many are there?

    > My Solution:
    > I am going to pull a set of records from both databases and sort them
    > based on these co-ordinates. I will put them in two lists so that the
    > indexes for the records to be compared in both lists are the same.


    How can you make sure that indexes are identical? If you just pull an
    ordered result from the database you cannot expect that (4,-5) is at the
    same position in both results. Or is there another precondition you did
    not mention so far?

    > I will then do a comparison of both lists to find the exact match.
    > If I find the match, I need to record the matched records elsewhere to
    > create an XML file with those results.


    You do not mention database products that you are going to use. If at
    all possible I would use a database link (Oracle) or a similar feature
    of another product and let the DB do the work, by joining the local and
    the remote table. It's at least worth a try.

    Database extensions for spatial data are another set of features that
    might help here.

    > Constraints:
    > Both the databases are so large that I have to find the most efficient
    > algorithm and datastructure to do the job.


    How large is "large"? If we're talking about a few million records then
    that can be easily handled inside the JVM - just pull all the data into
    an appropriate data structure (see below) and go from there.

    > The execution will be done in a batch job.
    > The records pulled from both databases to form the lists are unique
    > and have no duplicates but each have a separate unique identifier.
    > There may be excess data that cannot be compared in either of the
    > databases.
    >
    > Questions:
    > 1. Is list the best data-structure for this since speed is my main
    > concern. Or will the autoboxing of integers cause an overhead?


    If you use a List you need to have entries ordered, otherwise you risk
    O(n*n) which is unnecessary.

    > 2. Are there any good range-comparison or pair-comparison algorithms
    > for integers that I can use?


    ad 1 and 2: why not create a pair type that carries two ints plus
    additional information and can be stuffed into a HashSet?

    Alternative: query both databases ordered by both coordinates and
    compare records individually. You can then decide which ResultSet you
    need to step forward based on the current comparison.

    > 3. Can I make use of concurrency algorithms for this?


    Yes, certainly. You could, for example, partition your data set into
    smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
    number of threads which each processes one chunk. You only need to make
    sure that you only have one thread that will generate the output XML.

    I have the feeling that at the moment the problem is underspecified.
    Missing bits I can see so far:

    1. more about the nature of the data
    2. database products involved
    3. volume of data

    Kind regards

    robert
    Robert Klemme, Mar 4, 2008
    #2
    1. Advertising

  3. JThangiah

    Lord Zoltar Guest

    JThangiah wrote:
    > Hi all,
    > I need some help to find an efficient range or pair comparison
    > algorithm for integers.
    > My requirement is to link records in two different databases with
    > different set of records. Both have a pair of values called co-
    > ordinates which are similar and can be used to match the records.
    >


    How are you defining range comparisons? what does it mean for range A
    > range B?

    is it A.size > B.size ?
    is it A.end > B.end && A.start > B.start ?
    is it just A.end > B.end ?

    Do you want to have a special way of handling a situation where A
    contains B?
    Lord Zoltar, Mar 4, 2008
    #3
  4. JThangiah

    Roedy Green Guest

    On Tue, 4 Mar 2008 09:27:30 -0800 (PST), JThangiah
    <> wrote, quoted or indirectly quoted
    someone who said :

    >I need some help to find an efficient range or pair comparison
    >algorithm for integers.


    There are method for doing this in the SortedArrayList class.

    See http://mindprod.com/products2.html#SORTED
    --

    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Mar 5, 2008
    #4
  5. In article
    <>,
    JThangiah <> wrote:

    > Hi all,
    > I need some help to find an efficient range or pair comparison
    > algorithm for integers.
    > My requirement is to link records in two different databases with
    > different set of records. Both have a pair of values called co-
    > ordinates which are similar and can be used to match the records.
    >
    > My Solution:
    > I am going to pull a set of records from both databases and sort them
    > based on these co-ordinates. I will put them in two lists so that the
    > indexes for the records to be compared in both lists are the same.
    > I will then do a comparison of both lists to find the exact match.
    > If I find the match, I need to record the matched records elsewhere to
    > create an XML file with those results.
    >
    > Constraints:
    > Both the databases are so large that I have to find the most efficient
    > algorithm and datastructure to do the job.
    > The execution will be done in a batch job.
    > The records pulled from both databases to form the lists are unique
    > and have no duplicates but each have a separate unique identifier.
    > There may be excess data that cannot be compared in either of the
    > databases.
    >
    > Questions:
    > 1. Is list the best data-structure for this since speed is my main
    > concern. Or will the autoboxing of integers cause an overhead?
    > 2. Are there any good range-comparison or pair-comparison algorithms
    > for integers that I can use?
    > 3. Can I make use of concurrency algorithms for this?


    The biggest optimization here is to make sure that each database returns
    results sorted in the same order. It might require an index for the
    databases to do this efficiently. Your DBA will know how to optimize it.

    When the databases have sorted results, you simply consume rows from the
    database with the earliest sorted value and compare to the last value of
    the other database. Now you only need enough memory for... four rows I
    think. That's not too bad. Wrap the matches in XML tags and stream it
    out. No matter how many results you have, the memory consumption stays
    at a small constant value.

    If the results weren't sorted you'd have to buffer a very large number
    of rows before matching pairs could be pulled out. Don't assemble the
    XML into an in-memory document either.

    --
    I don't read Google's spam. Reply with another service.
    Kevin McMurtrie, Mar 5, 2008
    #5
  6. JThangiah

    JThangiah Guest


    > How are you defining range comparisons? what does it mean for range A> range B?
    >
    > is it A.size > B.size ?
    > is it A.end > B.end && A.start > B.start ?
    > is it just A.end > B.end ?
    >
    > Do you want to have a special way of handling a situation where A
    > contains B?


    I am not sure if I used the right term when I said "range comparison".
    By comparison, I meant exact matches of co-ordinates.

    Eg. If (190,492) is a co-ordinate pair, , if this is the same in the
    other database I would know that both records match.
    JThangiah, Mar 5, 2008
    #6
  7. JThangiah

    JThangiah Guest

    Hi Robert,
    Thank you so much for your detailed reply. I am so glad I found
    some help.
    I have the answers for your questions.

    1. Nature of data:
    I am doing this for linking two Bio-Informatics databases. One is my
    local system which uses Oracle. The other database is external and I
    only get the data in a file format.
    If you want, you can look at the file here:
    ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz

    It is nearly 430MB and following are the fields of interest to me:
    tax_id Integer
    GeneID Integer
    start_position_on_the_genomic_accession Integer
    end_position_on_the_genomic_accession Integer


    The corresponding data in my local database has the following fields:
    Taxon Integer
    EXTFEATUREID String
    LEFTEND Integer
    RIGHTEND Integer

    If the leftend and rightend here is the same as the start_position and
    end_position in the external database, then I know that both represent
    the same Gene.

    These are just positive numbers and only one pair of co-ordinates.

    PreCondition:
    For a particular genome(taxon), if I sort both set of records in
    ascending order of start position, the indexes in both should be
    identical.

    So, I cannot sort the entire DB that way. It should only be sorted for
    one genome at a time.

    2. Database products:
    Do you think there is a way to link my Oracle DB and the file input
    directly to match the records?

    Since I have a key and set of values for each entry in both DB's,
    should I use a Map instead of a List? By using a TreeMap, I can keep
    the data sorted on addition and so I'm just left with the mapping of
    coordinates in the corresponding indexes.

    Will this be faster or is there a better way to do it like using a
    SortedArrayList?

    By "pair type", do you mean an object with two ints and a String value
    as members?

    Is HashSet faster than any other datastructure for this purpose? Do
    you think this will be the best one for this?


    3. Volume of data:
    It is not too large as I exaggerated and is only around 3,48000
    records in my local DB. The external DB has much more data than this
    which I do not need to deal with since I am only looking for a match
    on all records in my local DB.

    > Alternative: query both databases ordered by both coordinates and
    > compare records individually. You can then decide which ResultSet you
    > need to step forward based on the current comparison.



    Sorry, I could not understand your alternate solution. Is this
    comparing done by first sorting by start positions and then comparing
    corresponding index values?


    >
    > Yes, certainly. You could, for example, partition your data set into
    > smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
    > number of threads which each processes one chunk. You only need to make
    > sure that you only have one thread that will generate the output XML.
    >
    >


    I am not used to Concurrent programming till now. Do you know of any
    sample code that I can refer?

    Thanks again for your timely help!

    Warm Regards,
    Jeyanthi
    JThangiah, Mar 5, 2008
    #7
  8. On 05.03.2008 16:49, JThangiah wrote:
    > Hi Robert,
    > Thank you so much for your detailed reply. I am so glad I found
    > some help.
    > I have the answers for your questions.
    >
    > 1. Nature of data:
    > I am doing this for linking two Bio-Informatics databases. One is my
    > local system which uses Oracle. The other database is external and I
    > only get the data in a file format.
    > If you want, you can look at the file here:
    > ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz
    >
    > It is nearly 430MB and following are the fields of interest to me:
    > tax_id Integer
    > GeneID Integer
    > start_position_on_the_genomic_accession Integer
    > end_position_on_the_genomic_accession Integer
    >
    >
    > The corresponding data in my local database has the following fields:
    > Taxon Integer
    > EXTFEATUREID String
    > LEFTEND Integer
    > RIGHTEND Integer
    >
    > If the leftend and rightend here is the same as the start_position and
    > end_position in the external database, then I know that both represent
    > the same Gene.
    >
    > These are just positive numbers and only one pair of co-ordinates.
    >
    > PreCondition:
    > For a particular genome(taxon), if I sort both set of records in
    > ascending order of start position, the indexes in both should be
    > identical.


    I am not sure what you mean by "indexes". I certainly cannot see the
    termin "index" mentioned above.

    > So, I cannot sort the entire DB that way. It should only be sorted for
    > one genome at a time.


    I don't understand. You said, you wanted to match records based on
    identical value pairs (start, end). You can (in a relational database
    and also in Java code) easily sort by two criteria.

    > 2. Database products:
    > Do you think there is a way to link my Oracle DB and the file input
    > directly to match the records?


    Oracle can work with files, this is called "external tables". Please
    have a look at the docs at http://tahiti.oracle.com e.g.

    http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref777

    That's probably what I'd do if the format of your data can be easily
    made available via Oracle. Note though that the file needs to be on the
    server IIRC.

    > Since I have a key and set of values for each entry in both DB's,
    > should I use a Map instead of a List? By using a TreeMap, I can keep
    > the data sorted on addition and so I'm just left with the mapping of
    > coordinates in the corresponding indexes.


    Since you only want to compare according to equality of your two values
    a HashMap is more efficient.

    > Will this be faster or is there a better way to do it like using a
    > SortedArrayList?


    Don't even think about using lists for this. That's too slow. You
    probably should get yourself a book about data structures and algorithms
    to get a solid foundation.

    > By "pair type", do you mean an object with two ints and a String value
    > as members?


    .... and with properly implemented equals() and hashCode() methods, exactly.

    > Is HashSet faster than any other datastructure for this purpose? Do
    > you think this will be the best one for this?


    Yes.

    > 3. Volume of data:
    > It is not too large as I exaggerated and is only around 3,48000
    > records in my local DB. The external DB has much more data than this


    Did you mean 348,000 records? That's easily handled in memory.

    > which I do not need to deal with since I am only looking for a match
    > on all records in my local DB.


    Then it's easy: create your class (see above), read in your local DB,
    store it in a HashSet or HashMap, then query the DB and for each row
    create a temporary instance of the class and look it up in the map / set.

    >> Alternative: query both databases ordered by both coordinates and
    >> compare records individually. You can then decide which ResultSet you
    >> need to step forward based on the current comparison.

    >
    > Sorry, I could not understand your alternate solution. Is this
    > comparing done by first sorting by start positions and then comparing
    > corresponding index values?


    I don't understand why you bring up these indexes. What indexes?

    No, select from the DB with an ORDER BY with two columns. Same for the
    local store. Then iterate through the ResultSet and your local data and
    compare. If the first ordered column in the local store data is less
    than in the DB data, increase it until they are same or greater. If
    same do the same for the second column until a match is found or the end
    of one of the two sources is reached.

    >> Yes, certainly. You could, for example, partition your data set into
    >> smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
    >> number of threads which each processes one chunk. You only need to make
    >> sure that you only have one thread that will generate the output XML.

    >
    > I am not used to Concurrent programming till now. Do you know of any
    > sample code that I can refer?


    http://www.amazon.com/Concurrent-Programming-Java-Principles-Patterns/dp/0201695812
    http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601

    But it's probably not worthwhile in this case.

    Cheers

    robert
    Robert Klemme, Mar 5, 2008
    #8
  9. JThangiah

    Guest

    Thanks Robert! I now have a better understanding of the problem and a
    clear picture on how to proceed.

    This is what I intend to do based on all your suggestions:
    1. Create an external Oracle table for the external file input.
    2. Read in records from my local DB in a sorted order and load them as
    a pair type into a HashSet.
    3. Query the external table and for each row create a temporary
    instance of the class and look it up in the set. This is done by
    iterating through the ResultSet and local data and comparing each.
    a. If the first ordered column in the local store data is less than
    in the DB data, increase it until they are same or greater.
    b. If same do the same for the second column until a match is found
    or the end of one of the two sources is reached.
    4. I would wrap the matches in XML tags and stream it out.

    Thanks a lot for all your help!!!
    , Mar 6, 2008
    #9
  10. JThangiah

    JThangiah Guest

    Thanks Robert! I now have a better understanding of the problem and a
    clear picture on how to proceed.

    This is what I intend to do based on all your suggestions:
    1. Create an external Oracle table for the external file input.
    2. Read in records from my local DB in a sorted order and load them as
    a pair type into a HashSet.
    3. Query the external table and for each row create a temporary
    instance of the class and look it up in the set. This is done by
    iterating through the ResultSet and local data and comparing each.
    a. If the first ordered column in the local store data is less than
    in the DB data, increase it until they are same or greater.
    b. If same do the same for the second column until a match is found
    or the end of one of the two sources is reached.
    4. I would wrap the matches in XML tags and stream it out.

    Thanks a lot for all your help!!!
    JThangiah, Mar 6, 2008
    #10
  11. On 06.03.2008 15:48, wrote:
    > Thanks Robert! I now have a better understanding of the problem and a
    > clear picture on how to proceed.


    I am not so sure. Or maybe *I* did not understand everything properly.
    Your explanation below seems to deal with three data sources. From
    what I read earlier I understood that you have two sources: the Oracle
    DB and your local data in a file.

    > This is what I intend to do based on all your suggestions:
    > 1. Create an external Oracle table for the external file input.
    > 2. Read in records from my local DB in a sorted order and load them as
    > a pair type into a HashSet.


    If you make your local DB available as external table in Oracle then you
    can simply do the matching in the DB via a join. The ResultSet from the
    join then contains only matching records and you can directly output these.

    > 3. Query the external table and for each row create a temporary
    > instance of the class and look it up in the set. This is done by
    > iterating through the ResultSet and local data and comparing each.
    > a. If the first ordered column in the local store data is less than
    > in the DB data, increase it until they are same or greater.
    > b. If same do the same for the second column until a match is found
    > or the end of one of the two sources is reached.


    No. If you have read the local data in a Set or Map already you need to
    only look the Oracle data up via the temp instance in the Set / Map. No
    additional iterating tricks needed.

    > 4. I would wrap the matches in XML tags and stream it out.
    >
    > Thanks a lot for all your help!!!


    You're welcome.

    Kind regards

    robert
    Robert Klemme, Mar 6, 2008
    #11
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Dave
    Replies:
    2
    Views:
    7,940
    Marcin Kalicinski
    Mar 5, 2004
  2. Replies:
    6
    Views:
    1,064
  3. Dave Woodworth

    more efficient date range comparison

    Dave Woodworth, Apr 12, 2009, in forum: Ruby
    Replies:
    7
    Views:
    142
    Dave Woodworth
    Apr 16, 2009
  4. Krishna Chaitanya

    Most efficient way to do set-comparison

    Krishna Chaitanya, Mar 13, 2009, in forum: Perl Misc
    Replies:
    17
    Views:
    198
    Krishna Chaitanya
    Mar 23, 2009
  5. Deepu
    Replies:
    1
    Views:
    228
    ccc31807
    Feb 7, 2011
Loading...

Share This Page