Efficient Integer Comparison Algorithm

J

JThangiah

Hi all,
I need some help to find an efficient range or pair comparison
algorithm for integers.
My requirement is to link records in two different databases with
different set of records. Both have a pair of values called co-
ordinates which are similar and can be used to match the records.

My Solution:
I am going to pull a set of records from both databases and sort them
based on these co-ordinates. I will put them in two lists so that the
indexes for the records to be compared in both lists are the same.
I will then do a comparison of both lists to find the exact match.
If I find the match, I need to record the matched records elsewhere to
create an XML file with those results.

Constraints:
Both the databases are so large that I have to find the most efficient
algorithm and datastructure to do the job.
The execution will be done in a batch job.
The records pulled from both databases to form the lists are unique
and have no duplicates but each have a separate unique identifier.
There may be excess data that cannot be compared in either of the
databases.

Questions:
1. Is list the best data-structure for this since speed is my main
concern. Or will the autoboxing of integers cause an overhead?
2. Are there any good range-comparison or pair-comparison algorithms
for integers that I can use?
3. Can I make use of concurrency algorithms for this?
 
R

Robert Klemme

I need some help to find an efficient range or pair comparison
algorithm for integers.
My requirement is to link records in two different databases with
different set of records. Both have a pair of values called co-
ordinates which are similar and can be used to match the records.

In other words: you consider records equivalent that have the exact same
values for both coordinates. What are the properties of your coords?
Are there positive and / or negative values? How many are there?
My Solution:
I am going to pull a set of records from both databases and sort them
based on these co-ordinates. I will put them in two lists so that the
indexes for the records to be compared in both lists are the same.

How can you make sure that indexes are identical? If you just pull an
ordered result from the database you cannot expect that (4,-5) is at the
same position in both results. Or is there another precondition you did
not mention so far?
I will then do a comparison of both lists to find the exact match.
If I find the match, I need to record the matched records elsewhere to
create an XML file with those results.

You do not mention database products that you are going to use. If at
all possible I would use a database link (Oracle) or a similar feature
of another product and let the DB do the work, by joining the local and
the remote table. It's at least worth a try.

Database extensions for spatial data are another set of features that
might help here.
Constraints:
Both the databases are so large that I have to find the most efficient
algorithm and datastructure to do the job.

How large is "large"? If we're talking about a few million records then
that can be easily handled inside the JVM - just pull all the data into
an appropriate data structure (see below) and go from there.
The execution will be done in a batch job.
The records pulled from both databases to form the lists are unique
and have no duplicates but each have a separate unique identifier.
There may be excess data that cannot be compared in either of the
databases.

Questions:
1. Is list the best data-structure for this since speed is my main
concern. Or will the autoboxing of integers cause an overhead?

If you use a List you need to have entries ordered, otherwise you risk
O(n*n) which is unnecessary.
2. Are there any good range-comparison or pair-comparison algorithms
for integers that I can use?

ad 1 and 2: why not create a pair type that carries two ints plus
additional information and can be stuffed into a HashSet?

Alternative: query both databases ordered by both coordinates and
compare records individually. You can then decide which ResultSet you
need to step forward based on the current comparison.
3. Can I make use of concurrency algorithms for this?

Yes, certainly. You could, for example, partition your data set into
smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
number of threads which each processes one chunk. You only need to make
sure that you only have one thread that will generate the output XML.

I have the feeling that at the moment the problem is underspecified.
Missing bits I can see so far:

1. more about the nature of the data
2. database products involved
3. volume of data

Kind regards

robert
 
L

Lord Zoltar

JThangiah said:
Hi all,
I need some help to find an efficient range or pair comparison
algorithm for integers.
My requirement is to link records in two different databases with
different set of records. Both have a pair of values called co-
ordinates which are similar and can be used to match the records.

How are you defining range comparisons? what does it mean for range A
is it A.size > B.size ?
is it A.end > B.end && A.start > B.start ?
is it just A.end > B.end ?

Do you want to have a special way of handling a situation where A
contains B?
 
K

Kevin McMurtrie

JThangiah said:
Hi all,
I need some help to find an efficient range or pair comparison
algorithm for integers.
My requirement is to link records in two different databases with
different set of records. Both have a pair of values called co-
ordinates which are similar and can be used to match the records.

My Solution:
I am going to pull a set of records from both databases and sort them
based on these co-ordinates. I will put them in two lists so that the
indexes for the records to be compared in both lists are the same.
I will then do a comparison of both lists to find the exact match.
If I find the match, I need to record the matched records elsewhere to
create an XML file with those results.

Constraints:
Both the databases are so large that I have to find the most efficient
algorithm and datastructure to do the job.
The execution will be done in a batch job.
The records pulled from both databases to form the lists are unique
and have no duplicates but each have a separate unique identifier.
There may be excess data that cannot be compared in either of the
databases.

Questions:
1. Is list the best data-structure for this since speed is my main
concern. Or will the autoboxing of integers cause an overhead?
2. Are there any good range-comparison or pair-comparison algorithms
for integers that I can use?
3. Can I make use of concurrency algorithms for this?

The biggest optimization here is to make sure that each database returns
results sorted in the same order. It might require an index for the
databases to do this efficiently. Your DBA will know how to optimize it.

When the databases have sorted results, you simply consume rows from the
database with the earliest sorted value and compare to the last value of
the other database. Now you only need enough memory for... four rows I
think. That's not too bad. Wrap the matches in XML tags and stream it
out. No matter how many results you have, the memory consumption stays
at a small constant value.

If the results weren't sorted you'd have to buffer a very large number
of rows before matching pairs could be pulled out. Don't assemble the
XML into an in-memory document either.
 
J

JThangiah

How are you defining range comparisons? what does it mean for range A> range B?

is it A.size > B.size ?
is it A.end > B.end && A.start > B.start ?
is it just A.end > B.end ?

Do you want to have a special way of handling a situation where A
contains B?

I am not sure if I used the right term when I said "range comparison".
By comparison, I meant exact matches of co-ordinates.

Eg. If (190,492) is a co-ordinate pair, , if this is the same in the
other database I would know that both records match.
 
J

JThangiah

Hi Robert,
Thank you so much for your detailed reply. I am so glad I found
some help.
I have the answers for your questions.

1. Nature of data:
I am doing this for linking two Bio-Informatics databases. One is my
local system which uses Oracle. The other database is external and I
only get the data in a file format.
If you want, you can look at the file here:
ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz

It is nearly 430MB and following are the fields of interest to me:
tax_id Integer
GeneID Integer
start_position_on_the_genomic_accession Integer
end_position_on_the_genomic_accession Integer


The corresponding data in my local database has the following fields:
Taxon Integer
EXTFEATUREID String
LEFTEND Integer
RIGHTEND Integer

If the leftend and rightend here is the same as the start_position and
end_position in the external database, then I know that both represent
the same Gene.

These are just positive numbers and only one pair of co-ordinates.

PreCondition:
For a particular genome(taxon), if I sort both set of records in
ascending order of start position, the indexes in both should be
identical.

So, I cannot sort the entire DB that way. It should only be sorted for
one genome at a time.

2. Database products:
Do you think there is a way to link my Oracle DB and the file input
directly to match the records?

Since I have a key and set of values for each entry in both DB's,
should I use a Map instead of a List? By using a TreeMap, I can keep
the data sorted on addition and so I'm just left with the mapping of
coordinates in the corresponding indexes.

Will this be faster or is there a better way to do it like using a
SortedArrayList?

By "pair type", do you mean an object with two ints and a String value
as members?

Is HashSet faster than any other datastructure for this purpose? Do
you think this will be the best one for this?


3. Volume of data:
It is not too large as I exaggerated and is only around 3,48000
records in my local DB. The external DB has much more data than this
which I do not need to deal with since I am only looking for a match
on all records in my local DB.
Alternative: query both databases ordered by both coordinates and
compare records individually. You can then decide which ResultSet you
need to step forward based on the current comparison.


Sorry, I could not understand your alternate solution. Is this
comparing done by first sorting by start positions and then comparing
corresponding index values?

Yes, certainly. You could, for example, partition your data set into
smaller chunks, say x=0...999, 1000..1999, 2000..2999 etc. and have a
number of threads which each processes one chunk. You only need to make
sure that you only have one thread that will generate the output XML.

I am not used to Concurrent programming till now. Do you know of any
sample code that I can refer?

Thanks again for your timely help!

Warm Regards,
Jeyanthi
 
R

Robert Klemme

Hi Robert,
Thank you so much for your detailed reply. I am so glad I found
some help.
I have the answers for your questions.

1. Nature of data:
I am doing this for linking two Bio-Informatics databases. One is my
local system which uses Oracle. The other database is external and I
only get the data in a file format.
If you want, you can look at the file here:
ftp://ftp.ncbi.nih.gov/gene/DATA/gene2refseq.gz

It is nearly 430MB and following are the fields of interest to me:
tax_id Integer
GeneID Integer
start_position_on_the_genomic_accession Integer
end_position_on_the_genomic_accession Integer


The corresponding data in my local database has the following fields:
Taxon Integer
EXTFEATUREID String
LEFTEND Integer
RIGHTEND Integer

If the leftend and rightend here is the same as the start_position and
end_position in the external database, then I know that both represent
the same Gene.

These are just positive numbers and only one pair of co-ordinates.

PreCondition:
For a particular genome(taxon), if I sort both set of records in
ascending order of start position, the indexes in both should be
identical.

I am not sure what you mean by "indexes". I certainly cannot see the
termin "index" mentioned above.
So, I cannot sort the entire DB that way. It should only be sorted for
one genome at a time.

I don't understand. You said, you wanted to match records based on
identical value pairs (start, end). You can (in a relational database
and also in Java code) easily sort by two criteria.
2. Database products:
Do you think there is a way to link my Oracle DB and the file input
directly to match the records?

Oracle can work with files, this is called "external tables". Please
have a look at the docs at http://tahiti.oracle.com e.g.

http://download.oracle.com/docs/cd/B19306_01/server.102/b14220/schema.htm#sthref777

That's probably what I'd do if the format of your data can be easily
made available via Oracle. Note though that the file needs to be on the
server IIRC.
Since I have a key and set of values for each entry in both DB's,
should I use a Map instead of a List? By using a TreeMap, I can keep
the data sorted on addition and so I'm just left with the mapping of
coordinates in the corresponding indexes.

Since you only want to compare according to equality of your two values
a HashMap is more efficient.
Will this be faster or is there a better way to do it like using a
SortedArrayList?

Don't even think about using lists for this. That's too slow. You
probably should get yourself a book about data structures and algorithms
to get a solid foundation.
By "pair type", do you mean an object with two ints and a String value
as members?

.... and with properly implemented equals() and hashCode() methods, exactly.
Is HashSet faster than any other datastructure for this purpose? Do
you think this will be the best one for this?
Yes.

3. Volume of data:
It is not too large as I exaggerated and is only around 3,48000
records in my local DB. The external DB has much more data than this

Did you mean 348,000 records? That's easily handled in memory.
which I do not need to deal with since I am only looking for a match
on all records in my local DB.

Then it's easy: create your class (see above), read in your local DB,
store it in a HashSet or HashMap, then query the DB and for each row
create a temporary instance of the class and look it up in the map / set.
Sorry, I could not understand your alternate solution. Is this
comparing done by first sorting by start positions and then comparing
corresponding index values?

I don't understand why you bring up these indexes. What indexes?

No, select from the DB with an ORDER BY with two columns. Same for the
local store. Then iterate through the ResultSet and your local data and
compare. If the first ordered column in the local store data is less
than in the DB data, increase it until they are same or greater. If
same do the same for the second column until a match is found or the end
of one of the two sources is reached.
I am not used to Concurrent programming till now. Do you know of any
sample code that I can refer?

http://www.amazon.com/Concurrent-Programming-Java-Principles-Patterns/dp/0201695812
http://www.amazon.com/Java-Concurrency-Practice-Brian-Goetz/dp/0321349601

But it's probably not worthwhile in this case.

Cheers

robert
 
J

jeyanthi.m

Thanks Robert! I now have a better understanding of the problem and a
clear picture on how to proceed.

This is what I intend to do based on all your suggestions:
1. Create an external Oracle table for the external file input.
2. Read in records from my local DB in a sorted order and load them as
a pair type into a HashSet.
3. Query the external table and for each row create a temporary
instance of the class and look it up in the set. This is done by
iterating through the ResultSet and local data and comparing each.
a. If the first ordered column in the local store data is less than
in the DB data, increase it until they are same or greater.
b. If same do the same for the second column until a match is found
or the end of one of the two sources is reached.
4. I would wrap the matches in XML tags and stream it out.

Thanks a lot for all your help!!!
 
J

JThangiah

Thanks Robert! I now have a better understanding of the problem and a
clear picture on how to proceed.

This is what I intend to do based on all your suggestions:
1. Create an external Oracle table for the external file input.
2. Read in records from my local DB in a sorted order and load them as
a pair type into a HashSet.
3. Query the external table and for each row create a temporary
instance of the class and look it up in the set. This is done by
iterating through the ResultSet and local data and comparing each.
a. If the first ordered column in the local store data is less than
in the DB data, increase it until they are same or greater.
b. If same do the same for the second column until a match is found
or the end of one of the two sources is reached.
4. I would wrap the matches in XML tags and stream it out.

Thanks a lot for all your help!!!
 
R

Robert Klemme

Thanks Robert! I now have a better understanding of the problem and a
clear picture on how to proceed.

I am not so sure. Or maybe *I* did not understand everything properly.
Your explanation below seems to deal with three data sources. From
what I read earlier I understood that you have two sources: the Oracle
DB and your local data in a file.
This is what I intend to do based on all your suggestions:
1. Create an external Oracle table for the external file input.
2. Read in records from my local DB in a sorted order and load them as
a pair type into a HashSet.

If you make your local DB available as external table in Oracle then you
can simply do the matching in the DB via a join. The ResultSet from the
join then contains only matching records and you can directly output these.
3. Query the external table and for each row create a temporary
instance of the class and look it up in the set. This is done by
iterating through the ResultSet and local data and comparing each.
a. If the first ordered column in the local store data is less than
in the DB data, increase it until they are same or greater.
b. If same do the same for the second column until a match is found
or the end of one of the two sources is reached.

No. If you have read the local data in a Set or Map already you need to
only look the Oracle data up via the temp instance in the Set / Map. No
additional iterating tricks needed.
4. I would wrap the matches in XML tags and stream it out.

Thanks a lot for all your help!!!

You're welcome.

Kind regards

robert
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,051
Latest member
CarleyMcCr

Latest Threads

Top