sorting/merging in C

O

Osiris

I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
 
M

Malcolm McLean

Osiris said:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?
It isn't really a C question.
Chug through the A-records, and keep a list of keys of B records.
Sort the B record keys.
Chug through the B records and check if each record is on the key list.
If it is not, mark as "unreferenced".

If the list of keys fits in memory there is no problem.
If it is hundreds of millions rather than tens of millions then it might not
fit into a standard PC memory, and you've got to look at other strategies.
 
R

Richard Heathfield

Osiris said:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id > knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?

(Translation to C is left as an exercise.)
 
R

Richard Harter

Osiris said:


It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id > knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?

Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall. Better is to sort both allB and do a compare of sorted
lists which is O(n).

There are two kinds of errors, type I and type II. Type I errors are
references to non-existent B records. Type II errors are B records without
errors. (OP does not discuss type I errors but we can get them for free.

The comparison loop has three possible terminations, (a) exhausting
"knownrefs" only, (b) exhausting "allB" only, or (c) exhausting both
simultaneously. In case (a) all remaining "allB" records are type II
errors; in case (b) all remaining "knownref" records are type I errors.
And, of course, case (c) is a successful termination with no error cleanup.


The inner comparison loop logic is something like this:

while (knownrefs.id < allB.id)
write knownrefs.idA to type I error file.
advance knownrefs
while (allB.id < knownrefs.id )
write allB.id to type II error file
advance allB
advance knownrefs
advance allB

An alternative would be to use a hash table for allB with a "seen" tag;
however despite the fact that hash tables are O(1) (modulo ignored
performance factors) using a hash table may be a poor choice due to cache
and page miss costs.
(Translation to C is left as an exercise.)

And a good one; comparison loops and merge loops are trickier to write than
is immediately apparent. Example: in the pseudo code above there is a
potential "run off the end" bug. How to avoid it is an exercise for the
programmer. :)
 
R

Richard Heathfield

Richard Harter said:
Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall.

Oh darn, I was writing Phase 4 as if both files were sorted. Yeah. Sort
that one too. Oops.
 
J

Joe Wright

Osiris said:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,

How does an A record reference a B record? How many B records may be
referenced by an A record? How?
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.
However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

Answer my questions and I'll try to answer yours.
 
O

Osiris

How does an A record reference a B record? How many B records may be
referenced by an A record? How?

just by record ID nr. A may have a B-record ID in it.
Most of the B records are referenced by A records.
Say, maybe 5% are not....

I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.

sure, but the initial question is : how many. Later I can correct.
That will be the easy part.
The issue here is performance. The job may take 10 hours
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,579
Members
45,053
Latest member
BrodieSola

Latest Threads

Top