How do I: Split a large file on record and data (file = 3GB)

seansan

Hi,


I have been asked to investigate how to split a large file in Perl.
My question is as follows.

I have a large file that is made up of data chunks (record sets).
Every new record set starts with /^010/ and continues for some lines
(the number varies) until we find the next '010' record set. Finding
these records doesn't seem so difficult, but other aspects that I am
not familiar with are fogging my mind.

I was thinking of opening 2 output files. I want to loop through the
file and, depending on the 3 characters at the 1235th column (or
text position) of the 010 line, print to either file A or file B.
How do I accomplish this?
- How do I read the 3-character identifier at column 1235?
- How do I switch between output files? (Remember, I have to write
  several lines to file A or B, until the next 010 line is encountered.)
- What considerations should I make for working with 3-4 GB files?

Any help or examples will be appreciated.

Sean Heukels
 
gnari

> I was thinking of opening 2 output files. I want to loop through the
> file and, depending on the 3 characters at the 1235th column (or
> text position) of the 010 line, print to either file A or file B.
> How do I accomplish this?

it would help if we knew exactly what your problem is.
what have you tried, and how does it fail?

> - How do I read the 3-character identifier at column 1235?

many ways spring to mind:
substr()
a regex match (m//)
split // and array manipulation

> - How do I switch between output files?

again many ways, among them plain old if/else

> - What considerations should I make for working with 3-4 GB files?

depends on your OS, probably.
if there is a problem, just split the file

> Any help or examples will be appreciated

again, what have you done (or planned) and
what exactly is your problem?

if you want us to do the program for you, just say so.
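
For readers who want to compare the approaches named in this reply (substr, a regex match, split), here is a minimal sketch on a short string. The helper names and the sample line are made up for illustration, and the field sits at column 10 rather than 1235 so the example stays readable. Columns are taken to be 1-based, as in the question.

```perl
use strict;
use warnings;

# Hypothetical helpers: $col is a 1-based column, $len the field width.
sub field_substr {
    my ($line, $col, $len) = @_;
    return substr($line, $col - 1, $len);    # substr offsets are 0-based
}

sub field_regex {
    my ($line, $col, $len) = @_;
    my $skip = $col - 1;                     # characters before the field
    return $line =~ /\A.{$skip}(.{$len})/s ? $1 : undef;
}

sub field_split {
    my ($line, $col, $len) = @_;
    my @chars = split //, $line;             # one element per character
    return join '', @chars[ $col - 1 .. $col + $len - 2 ];
}

my $line = '010abcdefXYZrest';               # made-up sample, field at column 10
print field_substr($line, 10, 3), "\n";
print field_regex($line, 10, 3), "\n";
print field_split($line, 10, 3), "\n";
```

All three return the same field here. For a fixed column, substr is probably the simplest choice, since it builds neither a regex nor a 1200-element array per record.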

gnari
 
Walter Roberson

:I was thinking of opening 2 output files. I want to loop through the
:file and, depending on the 3 characters at the 1235th column (or
:text position) of the 010 line, print to either file A or file B.
:How do I accomplish this?
:- How do I read the 3-character identifier at column 1235?

If you already have the line read into a string, then
use substr $string, 1234, 3
(offsets start at 0, so the 1235th column is offset 1234)

:- How do I switch between output files? (Remember, I have to write
:several lines to file A or B, until the next 010 line is encountered.)

Switching between output files:

$ perldoc -f print
=item print FILEHANDLE LIST

Prints a string or a comma-separated list of strings. Returns TRUE
if successful. FILEHANDLE may be a scalar variable name, in which case
the variable contains the name of or a reference to the filehandle, thus
introducing one level of indirection.
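
As a rough illustration of that indirection, the sketch below keeps the "current" output handle in a scalar and re-points it whenever a 010 line shows up. In-memory buffers stand in for file A and file B so it runs anywhere, and the rule that the 5th character selects the file is invented for the example.

```perl
use strict;
use warnings;

# In-memory buffers stand in for file A and file B (an assumption for the demo).
my ($buf_a, $buf_b) = ('', '');
open my $out_a, '>', \$buf_a or die "Can't open buffer A: $!";
open my $out_b, '>', \$buf_b or die "Can't open buffer B: $!";

my $out = $out_a;                      # the currently selected output handle
for my $line ("010 A first\n", "020 detail\n", "010 B second\n", "020 detail\n") {
    if ($line =~ /^010/) {
        # Hypothetical rule: the 5th character of the 010 line picks the file.
        $out = substr($line, 4, 1) eq 'A' ? $out_a : $out_b;
    }
    print {$out} $line;                # block form makes the indirection explicit
}
close $_ for $out_a, $out_b;

print "A got:\n$buf_a", "B got:\n$buf_b";
```

Because $out only changes on a 010 line, every following line lands in the same file until the next record set begins.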


:What considerations should I make for working with 3-4 GB files?

If you are just doing linear processing you should be okay, provided
your filesystem supports files that are large enough.

If, though, you need to skip around in the file, you need
to use 'seek' and 'tell' (or sysseek instead of either),
and that can be a problem because on many Unix systems the
underlying system calls take *signed* 32-bit offsets,
which give out after 2 GB.
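
One way to see whether a given perl is affected is to ask its build configuration. This is only a sketch: it reports how perl itself was built (the `uselargefiles` and `lseeksize` entries of the core Config module) and says nothing about limits in the filesystem.

```perl
use strict;
use warnings;
use Config;

# 'define' means this perl was built with large file support.
my $large = ($Config{uselargefiles} || '') eq 'define';
printf "large file support: %s\n", $large ? 'yes' : 'no';

# Size in bytes of the file offset type; 8 allows offsets well beyond 2 GB.
printf "file offset size:   %s bytes\n", $Config{lseeksize};
```

The same information is available from the shell with `perl -V:uselargefiles`.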


Other than that... the usual tricks. e.g., if your filesystem
supports "holes" and you are writing bunches of binary zeroes,
use seek to position to the new location rather than
writing the zeroes: systems that support holes often do not
convert blocks of zeroes to holes, and instead require
repositioning to accomplish it. This isn't a trick specific
to very large files, but it's hard to put a large hole in a
small file ;-)
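
A small sketch of the seek trick, assuming a temporary file from File::Temp: it writes four bytes, jumps 1 MiB forward, and writes four more. Whether the gap is actually stored as a hole depends on the filesystem; the script only shows that the logical size includes it.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);   # removed again when the script exits
binmode $fh;

print {$fh} "head";
# Jump 1 MiB past the four bytes written so far instead of printing "\0" bytes.
seek($fh, 4 + 1024 * 1024, SEEK_SET) or die "seek failed: $!";
print {$fh} "tail";
close $fh or die "close failed: $!";

my $size = -s $path;                       # logical size, gap included
print "logical size: $size bytes\n";
```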
 
Anno Siegel

First off, make "\n010" the input record separator. Then each "line"
will essentially contain one chunk of data.

Then loop over the chunks, determine the output file for each, and
print it out.

There will be a certain skew, since each chunk ends with the initial bit
of the *following* record (if any) and, except for the very first chunk,
has lost its own leading "010". The code below tries to take that into
account, but these things are *never* correct on the first try, so get
yourself a smallish test file and debug it. Untested:

open my $in, '<', $infile or die "Can't read $infile: $!";
open my $out1, '>', $outfile1 or die "Can't create $outfile1: $!";
open my $out2, '>', $outfile2 or die "Can't create $outfile2: $!";

$/ = "\n010";
while ( <$in> ) {
    my $cut = chomp; # remove record separator, if any (the last record has none)
    # every record but the first has lost its leading "010" to the separator
    $_ = "010$_" unless $. == 1;
    # the 1235th column is offset 1234
    my $tag = substr( $_, 1234, 3);
    # decide which output file to use (pseudocode)
    my $out = $tag =~ /.../ ? $out1 : $out2;
    print $out $_;
    print $out "\n" if $cut; # put back the linefeed that chomp removed
}
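
For comparison, here is a sketch of a different technique: plain line-by-line reading with the default record separator, which avoids the skew altogether. It is not the separator approach described in this post. The helper name split_records, the callback argument, the sample data and the tag position (column 5 instead of 1235) are all assumptions made so the example can run against in-memory handles.

```perl
use strict;
use warnings;

# split_records is a hypothetical helper. $choose receives the tag and returns
# true for the first output handle, false for the second.
sub split_records {
    my ($in, $out_a, $out_b, $col, $len, $choose) = @_;
    my $out;                                   # nothing chosen before the first 010 line
    while (my $line = <$in>) {
        if ($line =~ /^010/) {
            my $tag = substr($line, $col - 1, $len);   # $col is a 1-based column
            $tag = '' unless defined $tag;             # header line shorter than $col
            $out = $choose->($tag) ? $out_a : $out_b;
        }
        print {$out} $line if $out;            # lines before any 010 line are skipped
    }
    return;
}

# Tiny demonstration with in-memory handles and the tag at column 5.
my $data = "010 AAA one\n020 x\n010 BBB two\n030 y\n010 AAA three\n";
my ($buf_a, $buf_b) = ('', '');
open my $in,    '<', \$data  or die "Can't read sample data: $!";
open my $out_a, '>', \$buf_a or die "Can't open buffer A: $!";
open my $out_b, '>', \$buf_b or die "Can't open buffer B: $!";
split_records($in, $out_a, $out_b, 5, 3, sub { $_[0] eq 'AAA' });
close $_ for $in, $out_a, $out_b;

print "A:\n$buf_a", "B:\n$buf_b";
```

Only one line is held in memory at a time, so the size of the input file should not matter as long as the filesystem and perl can handle it.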

Anno
 
