logic, algorithms, and (Perl) data structures


ccc31807

I've been given an assignment to revive a project that was apparently
abandoned by the original developers, and I need some help thinking
through the logic.

The DB has a Person table with the name, address, etc., of individual
people. It should have been called the Actors table, but it's too late
now. The names are the identifiers, and I know names are not unique:
one individual can go by several names (J. Smith, Jim Smith, and James
Smith are all the same person) and one name can be shared by several
individuals (three people all named Jim Smith), but we are dealing with
this as I indicate below.

The DB also has an Accounts table for discrete events. It should have
been called the Events table, but never mind. It lists events by ID,
which is a unique identifier. The idea is that an Event (or Account)
has a primary Actor (Person), with other actors associated with the
event in one way or another. There are often multiple accounts
concerning the same person, each with a different ID number. For
example, medical events may record a broken arm, a sprained ankle, and
cuts and abrasions for the same person in three different records.

Accounts are pushed to us as CSV files and inserted into the database
automatically. Each row contains the data defining the account and the
primary actor. These rows are not currently split, so all of the data
gets inserted into the Account table. The secondary actors are inserted
into the Person table.

I would like to split the account records, inserting the
account-specific data into the Account table and the person-specific
data into the Person table. This means associating each account with
its primary person, which means checking the imported names against
the Person table. (The fact that we can't uniquely identify people
means that the accounts must be worked manually, and the client has a
staff that already does this. That is a data problem, not a software
problem, and it isn't my responsibility.)

The Person table isn't large now, but we anticipate that it will grow
to several million records. Each import file of accounts has from 20
to 50 records, and there can be as many as 50 files per day.

Each day, I read the Person table into a hash that has elements that
look like this:

%person --> $person{$last}{$first}{$middle}{$suffix}

I split each record into a series of scalars, including a hash that
looks like this:

%account_person --> $account_person{$last}{$first}{$middle}{$suffix}

If there isn't a match, the account is inserted into the Account table
and the person into the Person table. If there is a match, the record
is flagged for manual handling.
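
To make the matching step concrete, here is roughly the shape of the
code. This is a simplified sketch: the connection details, the column
names, @import_rows, and the flag_for_manual_review() / insert_person()
/ insert_account() helpers are all placeholders for whatever the real
import code does.

use strict;
use warnings;
use DBI;

# Placeholder connection -- DSN, credentials, and column names below
# are illustrative, not the real schema.
my $dbh = DBI->connect( 'dbi:mysql:dbname=example', 'user', 'password',
                        { RaiseError => 1 } );

# Build the lookup hash from the Person table.
my %person;
my $sth = $dbh->prepare(
    'SELECT person_id, last, first, middle, suffix FROM Person' );
$sth->execute;
while ( my $row = $sth->fetchrow_hashref ) {
    $person{ $row->{last} }{ $row->{first} }
           { $row->{middle} }{ $row->{suffix} } = $row->{person_id};
}

# Check each imported row against the lookup. @import_rows is assumed
# to hold one hashref per CSV row (the parsing code is not shown), and
# the three subs called below are placeholders.
my @import_rows;    # filled in by the CSV-parsing code

for my $acct (@import_rows) {
    my ( $last, $first, $middle, $suffix ) =
        @{$acct}{qw( last first middle suffix )};

    # Note: exists() on a nested hash autovivifies the intermediate
    # levels; join the name parts into a single key if that matters.
    if ( exists $person{$last}{$first}{$middle}{$suffix} ) {
        flag_for_manual_review($acct);   # name already known: manual handling
    }
    else {
        insert_person($acct);            # new person ...
        insert_account($acct);           # ... and the new account
    }
}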

I've been thinking about Schwartzian Transforms and Guttman-Rosler
Transforms, and am wondering if this is the best way to handle it. I'm
not concerned with real-time processing, since this is handled offline,
but I'd like to make it as efficient as I can.
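
(For what it's worth, by a Schwartzian Transform I just mean the usual
map/sort/map idiom, along these lines; the field names and sample data
here are made up:)

# Toy data standing in for the imported rows.
my @import_rows = (
    { last => 'Smith', first => 'Jim' },
    { last => 'Adams', first => 'Ann' },
);

# Schwartzian Transform: compute each sort key once, sort on it,
# then throw the keys away.
my @sorted =
    map  { $_->[1] }                    # 3. keep only the original record
    sort { $a->[0] cmp $b->[0] }        # 2. sort on the precomputed key
    map  { [ lc $_->{last}, $_ ] }      # 1. pair each record with its key
    @import_rows;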

Thanks, CC.
 

smallpond

I've been thinking about Schwartzian Transforms and Guttman-Rosler
Transforms, and am wondering if this is the best way to handle it. I'm
not concerned with real-time processing, since this is handled offline,
but I'd like to make it as efficient as I can.

Thanks, CC.


My advice:
1) define the operations you want to do
2) lay out tables to simplify the operations
3) write the SQL needed to perform the operations
4) write the perl glue (see the sketch after the list)
5) test, test, test, test, test
....
97) worry about the efficiency of the perl code
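
For 3) and 4), the glue can stay very thin, something like this. The
table, column, and variable names here are made up, and $row stands for
one parsed CSV record:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:mysql:dbname=example', 'user', 'password',
                        { RaiseError => 1, AutoCommit => 0 } );

# One parsed CSV record (placeholder data).
my $row = { last => 'Smith', first => 'Jim', middle => 'A', suffix => '',
            account_id => 12345, description => 'broken arm' };

# Insert the primary person, then the account that points at them.
$dbh->do(
    'INSERT INTO Person (last, first, middle, suffix) VALUES (?,?,?,?)',
    undef, @{$row}{qw( last first middle suffix )} );

my $person_id = $dbh->last_insert_id( undef, undef, 'Person', 'person_id' );

$dbh->do(
    'INSERT INTO Account (account_id, person_id, description) VALUES (?,?,?)',
    undef, $row->{account_id}, $person_id, $row->{description} );

$dbh->commit;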
 

ccc31807

My advice:
1) define the operations you want to do
2) lay out tables to simplify the operations
3) write the SQL needed to perform the operations
4) write the perl glue
5) test, test, test, test, test
...
97) worry about the efficiency of the perl code

Absolutely great advice. Two points:

3.5) write the user interface. Users must interact with the system,
and I find that writing the interface before I write the code (the
Perl glue) gives me both the input and the output that the code needs
to handle.

More importantly, people can be active in different capacities in
different events, and I'm drowning in complexity. I'd like to have ONE
(1) table with all my people data. The prior developers built about
ten tables representing different capacities, with the problem that if
a person's address or email changes, you have to chase that person
across (worst case) ten different tables. My way out is to match the
incoming names against the existing names, which won't tag duplicates
but will at least tag uniques. I'll let the client sort out the
duplicates; he understands that this isn't a software problem.

Truthfully, this is the first application that I have been involved
with that has to track people in this way, and I'm having a problem
dealing with it. Thanks for your help.

CC
 

Uri Guttman

c> I've been thinking about Schwartzian Transforms and Guttman-Rosler
c> Transforms, and am wondering if this is the best way to handle it.
c> I'm not concerned with real-time processing, since this is handled
c> offline, but I'd like to make it as efficient as I can.

as smallpond has said, worry about efficiency after you get it working
correctly. but in any case if you do need easier and faster sorting, use
Sort::Maker which has both. and you can select the sort style you want
and even print out the generated code for ejimication.
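
something along these lines (an untested sketch from memory; the hash
fields and sample data are made up, and the exact key syntax is in the
Sort::Maker docs):

use Sort::Maker;

# build an ST-style sorter keyed on last name, then first name.
my $sorter = make_sorter(
    'ST',
    string => '$_->{last}',
    string => '$_->{first}',
);

my @people = (
    { last => 'Smith', first => 'Jim' },
    { last => 'Adams', first => 'Ann' },
);

my @sorted = $sorter->(@people);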

uri
 

smallpond

Absolutely great advice. Two points:

3.5) write the user interface. Users must interact with the system,
and I find that writing the interface before I write the code (the
Perl glue) gives me both the input and the output that the code needs
to handle.

More importantly, people can be active in different capacities in
different events, and I'm drowning in complexity. I'd like to have ONE
(1) table with all my people data. The prior developers built about
ten tables representing different capacities, with the problem that if
a person's address or email changes, you have to chase that person
across (worst case) ten different tables. My way out is to match the
incoming names against the existing names, which won't tag duplicates
but will at least tag uniques. I'll let the client sort out the
duplicates; he understands that this isn't a software problem.

Truthfully, this is the first application that I have been involved
with that has to track people in this way, and I'm having a problem
dealing with it. Thanks for your help.

CC

My point is that the database design will determine 99% of the
system's stability and efficiency. Ten tables are OK as long as they
are linked properly, i.e. not by selecting on non-unique strings but
by using unique keys. Solve that problem first.

You are correct that if you have multiple tables indexed by the same
thing (a person), then they should be combined, especially in any
environment where you have to worry about multiple processes doing
updates, like a website.
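
Concretely, something along these lines. This is a made-up schema with
MySQL-flavoured DDL (adjust for your database); the column names are
placeholders:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'dbi:mysql:dbname=example', 'user', 'password',
                        { RaiseError => 1 } );

# Link everything by unique keys, never by the (non-unique) name strings.
$dbh->do(<<'SQL');
CREATE TABLE Person (
    person_id  INTEGER NOT NULL AUTO_INCREMENT PRIMARY KEY,
    last       VARCHAR(64),
    first      VARCHAR(64),
    middle     VARCHAR(64),
    suffix     VARCHAR(16)
)
SQL

$dbh->do(<<'SQL');
CREATE TABLE Account (
    account_id  INTEGER NOT NULL PRIMARY KEY,
    person_id   INTEGER NOT NULL REFERENCES Person(person_id),
    event_data  TEXT
)
SQL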
 
