C
ccc31807
I've been given an assignment to revive a project that was apparently
abandoned by the original developers, and I need some help thinking
through the logic.
The DB has Person table with the name, address, etc., of individual
people. This should have been called the Actors table, but it's too
late now. The names are identifiers, and I know names are not unique,
individuals having several names (J. Smith, Jim Smith, and James Smith
are all the same person) and names being shared by individuals (three
people all named Jim Smith) but we are dealing with this as I indicate
below.
The DB also has an Accounts table for discrete events. Should have
been called the Events table, but never mind. This list events by ID,
which is a unique identifier. The idea is that an Even (or Account)
has a primary Actor (Person) with other actors associated with the
event in one way or another. There are often multiple accounts
concerning the same person with different ID numbers. For example,
medical events may record a broken arm, sprained ankle, and cuts and
abrasions for the same person in three different records.
Accounts are pushed in CSV files and inserted into the database in an
automated manner. Each row contains the data defining the account and
the primary actor. These rows are not currently split, and all the
data gets inserted into the Account table. The secondary actors are
inserted into the Person table.
I would like to split the account records, insert the account specific
data into the Account table and the person specific data into the
Person table. This means associating each account with the primary
person, which means checking the import names against the Person
table. (The fact that we can't uniquely identify the people means that
the accounts must be manually worked, and the client has a staff that
already does this. This is a data problem, not a software problem,
that I am not responsible for.)
The Person table isn't large now, but we anticipate that it will grow
to several million records. Each import file, the accounts, has from
20 to 50 records. There can be as many as 50 per day.
Each day, I read the Person table into a hash that has elements that
look like this: %person --> $person{$last}{$first}{$middle}{$suffix}
I split each record into a series of scalars, including a hash that
looks like this: %account_person --> $account_person{$last}{$first}
($middle}($suffix}.
If there isn't a match, the account is read into Account and the
person is read into Person. If there is a match, the record is flagged
for manual handling.
I've been thinking about Schwartzian Transforms and Guttman-Rosler
Transforms, and am wondering of this is the best way to handle this.
I'm not concerned with real time processing since this is handled off-
line, but I'd like to make it as efficient as I can.
Thanks, CC.
abandoned by the original developers, and I need some help thinking
through the logic.
The DB has Person table with the name, address, etc., of individual
people. This should have been called the Actors table, but it's too
late now. The names are identifiers, and I know names are not unique,
individuals having several names (J. Smith, Jim Smith, and James Smith
are all the same person) and names being shared by individuals (three
people all named Jim Smith) but we are dealing with this as I indicate
below.
The DB also has an Accounts table for discrete events. Should have
been called the Events table, but never mind. This list events by ID,
which is a unique identifier. The idea is that an Even (or Account)
has a primary Actor (Person) with other actors associated with the
event in one way or another. There are often multiple accounts
concerning the same person with different ID numbers. For example,
medical events may record a broken arm, sprained ankle, and cuts and
abrasions for the same person in three different records.
Accounts are pushed in CSV files and inserted into the database in an
automated manner. Each row contains the data defining the account and
the primary actor. These rows are not currently split, and all the
data gets inserted into the Account table. The secondary actors are
inserted into the Person table.
I would like to split the account records, insert the account specific
data into the Account table and the person specific data into the
Person table. This means associating each account with the primary
person, which means checking the import names against the Person
table. (The fact that we can't uniquely identify the people means that
the accounts must be manually worked, and the client has a staff that
already does this. This is a data problem, not a software problem,
that I am not responsible for.)
The Person table isn't large now, but we anticipate that it will grow
to several million records. Each import file, the accounts, has from
20 to 50 records. There can be as many as 50 per day.
Each day, I read the Person table into a hash that has elements that
look like this: %person --> $person{$last}{$first}{$middle}{$suffix}
I split each record into a series of scalars, including a hash that
looks like this: %account_person --> $account_person{$last}{$first}
($middle}($suffix}.
If there isn't a match, the account is read into Account and the
person is read into Person. If there is a match, the record is flagged
for manual handling.
I've been thinking about Schwartzian Transforms and Guttman-Rosler
Transforms, and am wondering of this is the best way to handle this.
I'm not concerned with real time processing since this is handled off-
line, but I'd like to make it as efficient as I can.
Thanks, CC.