Best way to parse file into db-type layout?

Peter A. Schott

Thanks for the comments. Quick notes:
1. Yes, ultimate goal is to output various fields into Excel, but I need to
enter the data into a DB as well to avoid having any strange problems. That may
be handled by a completely different process. I should have been more exact in
saying that my current goal is to get this data into a database.

2. I'm relatively new to Python and definitely new to trying to parse
Ragged-right type CSV files that are pairs of data. I'm trying to get my syntax
correct when wording my questions/phrases - please bear with me. :) Your help
(all posters) is quite appreciated.

3. regarding read/write each row - I started thinking about working my way
through the dictionary. This file could have up to 1900 pairs of data per row.
That's an awful lot of columns to put into a database table, especially if quite
a few will be NULL. I started thinking of stepping through each pair by each
row. The rows are identified by Letter type and an account number (always the
same field #). I would want to use those, and insert something like:
LetterType, Account #, Field#, Field Data, (current date)
I wasn't quite sure how to phrase that and apologize for being unclear. I
guess if I'm doing row-by-row, I can hold that for each row without too much
trouble.

4. No control over file format. I can define a lookup table that will match
Field# to a real name, perhaps code that into my scripts, but I'm stuck with
that. As to other posters' responses - it's ANSI-coded, CSV, No Quotes (no
Commas in values). Coming from the mainframe, I don't have to worry about
embedded special characters within the file as they can't be entered into these
fields to start with.
File always starts with letter_type, 1 - no idea why they didn't do that as
0, letter_type for consistency but it isn't my file. (Heck, I would have just
specified to output all fields all the time and avoid any need to do special
parsing :)
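For what it's worth, a minimal sketch of that pair-splitting, assuming the layout described above (letter_type, a constant 1, then alternating field#/value pairs) and assuming, purely for illustration, that the account number lives in field 2:

```python
from datetime import date

ACCOUNT_FIELD = "2"  # hypothetical: which field# carries the account number


def parse_line(row):
    """Split one CSV row into (letter_type, {field#: value}).

    Assumes the layout described above: letter_type, a constant '1',
    then alternating field#/value pairs.
    """
    letter_type = row[0]
    pairs = row[2:]  # skip letter_type and the constant '1'
    fields = dict(zip(pairs[0::2], pairs[1::2]))
    return letter_type, fields


def db_rows(row, run_date=None):
    """Yield (LetterType, Account#, Field#, FieldData, date) tuples."""
    run_date = run_date or date.today().isoformat()
    letter_type, fields = parse_line(row)
    acct = fields.get(ACCOUNT_FIELD)
    for field_no, value in fields.items():
        yield (letter_type, acct, field_no, value, run_date)


# Example with a made-up line: letter type ABC, account 12345, one more field
sample = ["ABC", "1", "2", "12345", "7", "hello"]
for rec in db_rows(sample, "2024-01-01"):
    print(rec)
```

The dict comes out in field order (Python 3.7+ keeps insertion order), so the database rows land in the same sequence as the pairs on the line.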

Anyway, I appreciate the help. The code looks like it will work for me and I'll
be doing some basic tests to get up to speed on splitting/parsing the data and
putting it into a form that will be usable by all. I may have to write certain
rows out to a new file if they don't match what I'm looking for. I don't
anticipate any issues there - just read the row, write it to a new file if it
doesn't match my criteria.
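That read-then-reroute step can be as simple as partitioning rows on a predicate; a sketch, assuming the letter type in column 0 is the criterion (the wanted set here is made up):

```python
def split_rows(rows, wanted_types):
    """Partition rows: keep those whose letter type (column 0) is wanted,
    collect the rest for writing out to a side file."""
    kept, rejected = [], []
    for row in rows:
        (kept if row and row[0] in wanted_types else rejected).append(row)
    return kept, rejected


rows = [["ABC", "1", "2", "111"], ["XYZ", "1", "2", "222"]]
kept, rejected = split_rows(rows, {"ABC"})
# the rejected rows would then go through csv.writer into the new file
```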

Thanks to all. Off to search for dictionaries and databases now. :)

-Pete

John Machin said:
That looks promising.
The field numbers are pre-defined at the mainframe level.

Of course. Have you managed to acquire a copy of the documentation, or
do you have to reverse-engineer it?
This may help me get to my ultimate goal which is to pump these into a DB on a
row-by-row basis ( :p )

That's your *ultimate* goal? Are you running a retro-computing museum
or something? Don't you want to *USE* the data?
I'll have to do some playing around with this. I
knew that it looked like a dictionary, but wasn't sure how best to handle this.

One follow-up question: I'll end up getting multiple records for each "type".

What does that mean?? If it means that more than one customer will get
the "please settle your account" letter, and more than one customer
will get the "please buy a spangled fritzolator, only $9.99" letter,
you are stating the obvious -- otherwise, please explain.
Would I be referencing these by row[#][field#]?

Not too sure what you mean by that -- whether you can get away with a
(read a row, write a row) way of handling the data depends on its
structure (like what are the relationships if any between different
rows) and what you want to do with it -- both murky concepts at the
moment.
Minor revision to the format is that it starts like:
###,1,1,val_1,....

How often do these "minor revisions" happen? How flexible do you have
to be? And the extra "1" means what? Is it ever any other number?
I think right now the plan is to parse through the file and insert the pairs
directly into a DB table. Something like RowID, LetterType, date, Field#,
Value.

Again, I'd recommend you lose the "Field#" in favour of a better
representation, ASAP.
I can get RowID and LetterType overall, date is a constant, the rest
would involve reading each pair and inserting both values into the table. Time
to hit the books a little more to get up to speed on all of this.

What you need is (a) a clear appreciation of what you are trying to do
with the data at a high level (b) then develop an understanding of
what is the underlying data model (c) then and only then worry about
technical details.

Good luck,
John
 

John Machin

Thanks for the comments. Quick notes:
1. Yes, ultimate goal is to output various fields into Excel,

Peter, (again) I'd like to suggest that putting some or all of the
data into storage of type X (whether X is a database or Excel or 5x3
index cards) is unlikely to be the true ultimate goal ...
but I need to
enter the data into a DB as well to avoid having any strange problems. That may
be handled by a completely different process. I should have been more exact in
saying that my current goal is to get this data into a database.

Let me get this straight: Process 1 is going to put some fields into
Excel. Process 2 is going to put the data into a DB to avoid "any
strange problems" -- with what? The (completely different) process 1?
3. regarding read/write each row - I started thinking about working my way
through the dictionary. This file could have up to 1900 pairs of data per row.
That's an awful lot of columns to put into a database table, especially if quite
a few will be NULL.

Focussing (briefly) on the storage details, it's also an awful lot
compared to the max 256 columns in Excel. Dragging ourselves back up
to the live situation that your data is modelling, it's also an awful
lot of attributes for any object (whether real or virtual) to have. It
must be possible to break each line down further. It's extremely
unlikely to be stored like that on the mainframe; it's just been spat
out like that to go to the letter-printing bureau. You may like to
consider talking to the mainframe folk about their data structures,
and how you can more or less reconstruct those structures from the
letter file.
I started thinking of stepping through each pair by each
row. The rows are identified by Letter type and an account number (always the
same field #). I would want to use those, and insert something like:
LetterType, Account #, Field#, Field Data, (current date)

This method of representing the data is close to the other extreme,
compared to the 1900 columns per row. With this method, the five or so
components of the customer address (for example) will be in different
rows in the database i.e. not grouped together with a handle that says
"address". It doesn't appear to me to be any more usable than the
letter file.

If the customer can have more than one letter of the same type on the
same date, you will need to store also something that uniquely
identifies the letter e.g. the line number in the file; otherwise the
data from multiple letters will be mingled. Even if "they" say it's
not possible, and you believe them, it might be a good idea anyway --
might make it easier than working with (letter_type, acct_num) when
you are trying to glue the fragments back together.

If it's not possible because the 1900 fields = something like 100
fields x an allowance of up to 19 repetitions of a similar scenario, then
you know what to do next :)

You probably mean "effective date as at which the file was produced"
rather than "current date". Only one file per day? There may be a
backlog when you start this exercise, there may be a backlog in the
future after a disruption in the pipeline. You may like to check if
the mainframe assigns a unique name/number to such files and use that
instead as part of your composite key.
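One way to carry such a composite key through, sketched here with sqlite3 and entirely made-up file, table, and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # swap for a real database
conn.execute("""
    CREATE TABLE letter_fields (
        file_id     TEXT,     -- unique file name/number from the mainframe
        line_no     INTEGER,  -- line within the file; distinguishes letters
        letter_type TEXT,
        acct_num    TEXT,
        field_no    TEXT,
        value       TEXT,
        PRIMARY KEY (file_id, line_no, field_no)
    )
""")
rows = [
    ("F20240101", 1, "ABC", "12345", "7", "hello"),
    ("F20240101", 1, "ABC", "12345", "9", "world"),
]
conn.executemany("INSERT INTO letter_fields VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()
```

With (file_id, line_no) in the key, two letters of the same type to the same account on the same date can no longer mingle their fields.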
No Quotes (no
Commas in values). Coming from the mainframe, I don't have to worry about
embedded special characters within the file as they can't be entered into these
fields to start with.

Yes, peace in our time, the cheque's in the mail, and of course
they'll still love you in the morning :)
Thanks to all. Off to search for dictionaries and databases now. :)

Searching for dictionaries???

As we are by now way OT (even by this group's standards), I'll just
leave you with my previous comments:
It's been nice kibitzing your game.
Lots of luck,
John
 
