Process header record and concatenate files


Scott Bass

Hi,

I'm not looking for a full blown solution, just architectural advice
for the following design criteria...

Input File(s): (tilde delimited)
Line 1:
Header Record:
SourceSystem~EffectiveDate~ExtractDateAndTime~NumberRecords~FileFormatVersion

RemainingRecords:
72 columns of delimited data

Output File:
Concatenate the input files into a single output file. A subset of
the header fields are prepended to the data lines as follows:

SourceSystem~EffectiveDate~ExtractDateAndTime~72 columns of delimited
data

Design Criteria:
1) If the number of records in the file does not match the number of
records reported in the header (incomplete FTP), abort the entire
file, print an error message, but continue processing the remaining
files.

(I'll use split and join to process the header and prepend to the
remainder).

2) Specify the list of input files on the command line. Specify the
output file on the command line. For example:

concat.pl -in foo.dat bar.dat blah.dat -out concat.dat

or possibly:

concat.pl -in src_*.dat -out concat.dat

(I'll use GetOptions to process the command line)
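
Something along these lines, for instance, using Getopt::Long's
multi-value option syntax (with the wildcard form, a Unix-style shell
expands src_*.dat before the script sees it):

use strict;
use warnings;
use Getopt::Long;

my (@in_files, $out_file);
GetOptions(
    'in=s{1,}' => \@in_files,   # one or more input file names
    'out=s'    => \$out_file,   # single output file name
) or die "Usage: $0 -in FILE... -out FILE\n";

die "No input files given\n" unless @in_files;
die "No output file given\n" unless defined $out_file;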

My thoughts:

1) Slurp the file into an array (minus first record). Count the
elements in the array. Abort if not equal to the number in the
header, else concat to the output file.

2) Process the file, reading records. At EOF, get record number from
$. . If correct, rewind to beginning of file handle and concat to
output file. (Not sure how to do the rewind bit).

3) Process the file, writing to a temp file. At EOF, get record
number from $. . If correct, concat the temp file to the output file.
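
For example, a rough sketch of #3 (File::Temp for the scratch file; the
sub and variable names are just illustrative, and $out is assumed to be
an already-open output handle):

use File::Temp qw(tempfile);

sub process_one_file {
    my ($path, $out) = @_;     # $out is the already-open output handle
    open my $in, '<', $path or do { warn "Can't open $path: $!\n"; return };

    my $header = <$in>;
    unless ( defined $header ) { warn "$path: empty file, skipped\n"; return }
    chomp $header;
    my ($src, $eff, $ext, $expected) = split /~/, $header;
    my $prefix = join '~', $src, $eff, $ext;

    my $tmp = tempfile();      # anonymous temp file, deleted automatically
    my $count = 0;
    while ( my $line = <$in> ) {
        print {$tmp} "$prefix~$line";
        $count++;
    }
    close $in;

    if ( $count != $expected ) {
        warn "$path: header says $expected records, found $count -- skipped\n";
        return;
    }

    seek $tmp, 0, 0;           # "the rewind bit"
    while ( my $line = <$tmp> ) {
        print {$out} $line;    # append the verified records to the output
    }
    close $tmp;
    return 1;
}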

Questions:

A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

B) When that happens, how do I reset $. to 1?

C) Of the three approaches above, which is the "best"? Performance
is important but not critical. I lean toward #3, since I need to
cater for files too large for #1. Or if you have a better idea please
let me know.

I hope this wasn't too cryptic...I was trying to keep it short.

Thanks,
Scott
 

John W. Krahn

Scott said:
[ snip ]

Questions:

A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

perldoc -f eof

[ snip ]

In a "while (<>)" loop, "eof" or "eof(ARGV)" can be used to
detect the end of each file, "eof()" will only detect the
end of the last file.

B) When that happens, how do I reset $. to 1?

When you reach end-of-file, as determined by the eof function, close the
ARGV filehandle and $. will be reset.
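
Applied here, that idiom looks roughly like this:

while (<>) {
    # ... process the current line; $ARGV holds the current file's name ...

    if ( eof ) {        # end of the file currently being read from @ARGV
        print "$ARGV: $. lines\n";
        close ARGV;     # closing ARGV resets $. for the next file
    }
}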

C) Of the three approaches above, which is the "best"? Performance
is important but not critical.

You'd probably have to test them with real data to determine the "best".




John
 

Tad J McClellan

Scott Bass said:
Questions:

A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

B) When that happens, how do I reset $. to 1?


perldoc -f eof
 

Eric Pozharski

On 2009-04-05, Scott Bass said:
A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

B) When that happens, how do I reset $. to 1?

If I got your problem right, you've missed I<@ARGV>; then you could
make your own loop over the command-line files. Or alternatively monitor
I<$ARGV>.
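
For instance, a minimal sketch of the first suggestion (loop over @ARGV
yourself with lexical filehandles, so $. starts fresh for each file,
which also answers B):

for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "Can't open $file: $!\n"; next };
    while ( my $line = <$fh> ) {
        # $. follows the last handle read, so it starts at 1
        # for every file opened this way
    }
    close $fh;
}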
Scott said:
C) Of the three approaches above, which is the "best"? Performance
is important but not critical. I lean toward #3, since I need to
cater for files too large for #1. Or if you have a better idea please
let me know.

use Your::Taste qw| full |;

or

use Your::Intuition qw| reverse |;

or

use Benchmark qw| timethese |;

I hope this wasn't too cryptic...I was trying to keep it short.

You better show your code. Perl is powerfully expressive or
expressively powerful (I doubt I would ever get that right).
 

sln

Hi,

I'm not looking for a full blown solution, just architectural advice
for the following design criteria...

Just submit this to my office for a bid. Arch advice is not available
however, since we consider that non-advice and a free service, which
we don't offer.

Any other advice is free however. Like should you buy GM stock.
That advice is always free.

-sln
 

smallpond

[ snip ]

My thoughts:

1) Slurp the file into an array (minus first record). Count the
elements in the array. Abort if not equal to the number in the
header, else concat to the output file.

2) Process the file, reading records. At EOF, get record number from
$. . If correct, rewind to beginning of file handle and concat to
output file. (Not sure how to do the rewind bit).

3) Process the file, writing to a temp file. At EOF, get record
number from $. . If correct, concat the temp file to the output file.

Questions:

A) If I've globbed the files on the command line and am processing
the file handle <>, how do I know when the file name has changed?

B) When that happens, how do I reset $. to 1?

C) Of the three approaches above, which is the "best"? Performance
is important but not critical. I lean toward #3, since I need to
cater for files too large for #1. Or if you have a better idea please
let me know.

If you want to process your input in one pass, use tell to save the
position of the output at the start of each input file. If you get
to the end of input and the number of records is wrong, use seek
to discard it.
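
A minimal sketch of that one-pass idea, assuming @in_files and $out_file
come from the command-line handling and the output is a plain, seekable
file (truncate is added here to actually drop the rejected bytes):

open my $out, '>', $out_file or die "Can't open $out_file: $!\n";

for my $file (@in_files) {
    my $mark = tell $out;      # where this file's records begin in the output
    open my $in, '<', $file or do { warn "Can't open $file: $!\n"; next };

    my $header = <$in>;
    unless ( defined $header ) { warn "$file: empty file, skipped\n"; close $in; next }
    chomp $header;
    my ($src, $eff, $ext, $expected) = split /~/, $header;

    my $count = 0;
    while ( my $line = <$in> ) {
        print {$out} "$src~$eff~$ext~$line";
        $count++;
    }
    close $in;

    if ( $count != $expected ) {
        warn "$file: expected $expected records, got $count -- discarded\n";
        seek $out, $mark, 0;       # rewind the output to where this file started...
        truncate $out, $mark;      # ...and chop off its partial records
    }
}
close $out;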
 
