sorting/merging in C

Osiris · Jun 16, 2007

I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

Malcolm McLean · Jun 16, 2007

Osiris said:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

It isn't really a C question.
Chug through the A-records, and keep a list of keys of B records.
Sort the B record keys.
Chug through the B records and check if each record is on the key list.
If it is not, mark as "unreferenced".

If the list of keys fits in memory there is no problem.
If it is hundreds of millions rather than tens of millions then it might not
fit into a standard PC memory, and you've got to look at other strategies.

Richard Heathfield · Jun 16, 2007

Osiris said:

I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,
But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id > knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?

(Translation to C is left as an exercise.)

Richard Harter · Jun 16, 2007

Osiris said:

It's tempting to answer "do it in SQL instead, because it'll be
quicker", but let's see...

PASS 1:
for each record in the database
if it's a type-A
write down the B reference in a separate file, "knownrefs"
else
write down the B ID in a separate file, "allB"
endif
endfor

That's it for the database, but we'll continue to assume that random
access is expensive.

PASS 2:
merge-sort "knownrefs"

PASS 3 (optional):
discard duplicates from "knownrefs"

PASS 4:
open "knownrefs", and read first record
for each record in "allB"
if allB.id < knownrefs.id
write allB.id to "errors"
else if allB.id > knownrefs.id
read next record from "knownrefs"
endif
endfor

This involves only one sort step. Can anyone improve on this?

Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall. Better is to sort both allB and do a compare of sorted
lists which is O(n).

There are two kinds of errors, type I and type II. Type I errors are
references to non-existent B records. Type II errors are B records without
errors. (OP does not discuss type I errors but we can get them for free.

The comparison loop has three possible terminations, (a) exhausting
"knownrefs" only, (b) exhausting "allB" only, or (c) exhausting both
simultaneously. In case (a) all remaining "allB" records are type II
errors; in case (b) all remaining "knownref" records are type I errors.
And, of course, case (c) is a successful termination with no error cleanup.

The inner comparison loop logic is something like this:

while (knownrefs.id < allB.id)
write knownrefs.idA to type I error file.
advance knownrefs
while (allB.id < knownrefs.id )
write allB.id to type II error file
advance allB
advance knownrefs
advance allB

An alternative would be to use a hash table for allB with a "seen" tag;
however despite the fact that hash tables are O(1) (modulo ignored
performance factors) using a hash table may be a poor choice due to cache
and page miss costs.

(Translation to C is left as an exercise.)

And a good one; comparison loops and merge loops are trickier to write than
is immediately apparent. Example: in the pseudo code above there is a
potential "run off the end" bug. How to avoid it is an exercise for the
programmer.

Richard Heathfield · Jun 16, 2007

Richard Harter said:

Your pass 4 is O(n) for each record in allB, which makes it an O(n**2)
algorithm overall.

Oh darn, I was writing Phase 4 as if both files were sorted. Yeah. Sort
that one too. Oops.

Joe Wright · Jun 16, 2007

Osiris said:
I have a *VAST* database of 10's of millions records.
THere are two kinds of records in it, A an B.
An A-record may be referencing B records. B-records should be
referenced by one or more A-records,

How does an A record reference a B record? How many B records may be
referenced by an A record? How?

But some B records may not be refererenced by any A record. That is
considered an error in the database that must be corrected.
I want to know how many of such (unused, surplus) B-records there are
in the database.

I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.

However: random access to the database is expensive and therefor not
allowed. Only sequentially reading the records is allowed.
Performance is important here.

How would I do it in C ?

Answer my questions and I'll try to answer yours.

Osiris · Jun 17, 2007

How does an A record reference a B record? How many B records may be
referenced by an A record? How?

just by record ID nr. A may have a B-record ID in it.
Most of the B records are referenced by A records.
Say, maybe 5% are not....

I suppose you need to know which B records are not referenced by A
records so that they can be removed. A count of them is a byproduct of
identifying them.

sure, but the initial question is : how many. Later I can correct.
That will be the easy part.
The issue here is performance. The job may take 10 hours

Parallel sorting algorithms...	0	Sep 7, 2012
Paralle sorting algorithms...	0	Sep 7, 2012
About parallel sorting...	2	Sep 4, 2012
Tree Merging Problem	6	Jul 6, 2007
need help in sorting	1	Nov 25, 2006
In the Matter of Herb Schildt: a Detailed Analysis of "C: TheComplete Nonsense"	109	Apr 3, 2010
[RFC] Mysql::DBLink	3	Dec 8, 2013
Algorithms in C - Robert Sedgewick	23	Sep 8, 2009

sorting/merging in C

Osiris

Malcolm McLean

Richard Heathfield

Richard Harter

Richard Heathfield

Joe Wright

Osiris

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads