How do I: Split a large file on record and data (file = 3GB)

Discussion in 'Perl Misc' started by seansan, Jan 5, 2004.

  1. seansan

    seansan Guest

    Hi,


    I have been asked to investigate how to split a large file in Perl.
    My question is as follows.

    I have a large file that is built up of data chunks (record sets).
    Every new record set starts with /^010/ and continues for some lines
    (the number varies) until we find the next '010' record set. Finding
    these records doesn't seem so difficult, but other aspects that I am
    not familiar with fog my mind.

    I was thinking of opening 2 output files. I want to loop through the
    file and, depending on the 3 characters at the 1235th column (or
    text position) of the 010 line, print to either file A or file B.
    How do I accomplish this?
    - How do I read the 3-character identifier at column 1235?
    - How do I switch between output files? (Remember, I have to write
      several lines to file A or B, until the next 010 line is
      encountered.)
    - What considerations should I make for working with 3-4 GB files?

    Any help, or examples will be appreciated

    Sean Heukels
    seansan, Jan 5, 2004
    #1

  2. seansan

    gnari Guest

    "seansan" <> wrote in message
    news:...

    > I was thinking of opening 2 output files. I want to loop through the
    > file and, depending on the 3 characters at the 1235th column (or
    > text position) of the 010 line, print to either file A or file B.
    > How do I accomplish this?


    it would help if we knew exactly what your problem is.
    what have you tried, how does it fail ?

    > - How do I read the 3-character identifier at column 1235?


    many ways spring to mind:
    substr()
    a regex match (//)
    split // and array manipulations
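
    A minimal sketch of those three approaches side by side. The sample
    line and the tag 'ABC' are made up for illustration; only the
    column number comes from the question.

```perl
use strict;
use warnings;

# Made-up 010 line: 1234 characters of prefix, then the tag at column 1235.
my $line = '010' . ('x' x 1231) . 'ABC' . " trailing data\n";

# 1. substr: the offset is the 1-based column minus one.
my $by_substr = substr($line, 1234, 3);

# 2. Regex: skip 1234 characters, capture the next three.
my ($by_regex) = $line =~ /^.{1234}(.{3})/;

# 3. split into single characters, then take an array slice.
my @chars    = split //, $line;
my $by_split = join '', @chars[1234 .. 1236];

print "$by_substr $by_regex $by_split\n";
```

    For a fixed column, substr is probably the cheapest of the three;
    the split version builds a list with one element per character.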

    > - How do I switch between output files?


    again many ways, among them plain old if/else

    > What considerations should I make for working with 3-4 GB files?


    depends on your OS, probably.
    if there is a problem, just split the file

    >
    > Any help, or examples will be appreciated

    again, what have you done (or planned) and
    what exactly is your problem?

    if you want us to do the program for you, just say so.

    gnari
    gnari, Jan 5, 2004
    #2

  3. In article <>,
    seansan <> wrote:
    :I was thinking of opening 2 output files. I want to loop through the
    :file and, depending on the 3 characters at the 1235th column (or
    :text position) of the 010 line, print to either file A or file B.
    :How do I accomplish this?
    :- How do I read the 3-character identifier at column 1235?

    If you already have the line read into a string, then
    use substr $string, 1234, 3
    (substr offsets start at 0, so column 1235 is offset 1234).
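
    As a sketch, a small wrapper that takes the 1-based column and
    guards against short lines. The helper name field_at_column and
    the sample line are hypothetical, not something from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: return $len characters starting at 1-based column $col,
# or nothing (undef in scalar context) if the line is too short.
sub field_at_column {
    my ($line, $col, $len) = @_;
    return if length($line) < $col - 1 + $len;
    return substr($line, $col - 1, $len);
}

# Made-up 010 line: 1234 characters of prefix, then the tag "ABC".
my $line = '010' . ('x' x 1231) . 'ABC' . "rest\n";
my $tag  = field_at_column($line, 1235, 3);
print "$tag\n";
```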

    :- How do I switch between output files? (Remember, I have to write
    :several lines to file A or B, until the next 010 line is encountered.)

    Switching between output files:

    $ perldoc -f print
    =item print FILEHANDLE LIST

    Prints a string or a comma-separated list of strings. Returns TRUE
    if successful. FILEHANDLE may be a scalar variable name, in which case
    the variable contains the name of or a reference to the filehandle, thus
    introducing one level of indirection.
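
    A sketch of that indirection: keep the "current" handle in a scalar
    and change it whenever a 010 line comes by. The in-memory buffers
    and the routing rule (matching the word "second") are placeholders
    so the example runs without real files.

```perl
use strict;
use warnings;

# In-memory "files" so the sketch does not touch the disk.
my ($buf_a, $buf_b) = ('', '');
open my $out_a, '>', \$buf_a or die "Can't open buffer A: $!";
open my $out_b, '>', \$buf_b or die "Can't open buffer B: $!";

my $current = $out_a;    # handle that receives the following lines
for my $line ("010 first\n", "020 detail\n", "010 second\n", "030 detail\n") {
    if ($line =~ /^010/) {
        # Placeholder rule; a real program would look at the tag at column 1235.
        $current = $line =~ /second/ ? $out_b : $out_a;
    }
    print {$current} $line;    # block form makes the indirect handle explicit
}
close $_ for $out_a, $out_b;

print "A:\n$buf_a", "B:\n$buf_b";
```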


    :What considerations should I make for working with 3-4 GB files?

    If you are just doing linear processing you should be okay, provided
    your filesystem supports files that are large enough.

    If, though, you need to skip around in the file, you need
    to use 'seek' and 'tell' (or sysseek instead of either),
    and that can be a problem because on many unix systems the
    underlying system calls take file offsets that are *signed*
    32-bit numbers, which run out at 2 GB.
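
    One way to check what a particular perl was built with is the
    Config module. A sketch, assuming the usual meaning of the
    uselargefiles and lseeksize entries; what they report depends on
    the build and the platform.

```perl
use strict;
use warnings;
use Config;

# 'define' here means perl was built with large file support.
my $large = defined $Config{uselargefiles} && $Config{uselargefiles} eq 'define';

# Size in bytes of the file offset type used for seeking.
my $seek_bytes = $Config{lseeksize} || 0;

printf "large file support: %s, file offset size: %d bytes\n",
    $large ? 'yes' : 'no', $seek_bytes;
```

    An offset size of 8 bytes suggests seek and tell can address
    positions beyond 2 GB, provided the filesystem allows such files.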


    Other than that... the usual tricks. e.g., if your filesystem
    supports "holes" and you are writing bunches of binary zeroes,
    use seek to position to the new location rather than
    writing the zeroes: systems that support holes often do not
    convert blocks of zeroes to holes, and instead require
    repositioning to accomplish it. This isn't a trick specific
    to very large files, but it's hard to put a large hole in a
    small file ;-)
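
    A sketch of that seek trick on a temporary file. The 1 MB gap is an
    arbitrary choice, and whether the gap really ends up as a hole
    depends on the filesystem.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;

my $gap = 1024 * 1024;                        # skip 1 MB instead of writing zeroes
seek($fh, $gap, 0) or die "Can't seek: $!";   # 0 means "from start of file"
print {$fh} "x"    or die "Can't write: $!";
close $fh          or die "Can't close: $!";

my $size   = -s $path;
my $blocks = (stat $path)[12];                # allocated blocks, where the OS reports them
printf "logical size: %d bytes, allocated blocks: %s\n",
    $size, (defined $blocks && length $blocks) ? $blocks : 'n/a';
```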
    --
    Contents: 100% recycled post-consumer statements.
    Walter Roberson, Jan 5, 2004
    #3
  4. seansan

    Anno Siegel Guest

    seansan <> wrote in comp.lang.perl.misc:
    > Hi,
    >
    >
    > I have been asked to investigate how to split a large file in Perl.
    > My question is as follows.
    >
    > I have a large file that is built up of data chunks (record sets).
    > Every new record set starts with /^010/ and continues for some lines
    > (the number varies) until we find the next '010' record set. Finding
    > these records doesn't seem so difficult, but other aspects that I am
    > not familiar with fog my mind.
    >
    > I was thinking of opening 2 output files. I want to loop through the
    > file and, depending on the 3 characters at the 1235th column (or
    > text position) of the 010 line, print to either file A or file B.
    > How do I accomplish this?
    > - How do I read the 3-character identifier at column 1235?
    > - How do I switch between output files? (Remember, I have to write
    >   several lines to file A or B, until the next 010 line is
    >   encountered.)
    > - What considerations should I make for working with 3-4 GB files?



    First off, make "\n010" the input record separator. Then each "line"
    will essentially contain one chunk of data.

    Then loop over the chunks, determine the output file for each, and
    print it out.

    There will be a certain skew since each chunk contains the initial bit
    of the *following* record (if any). There will also be a spurious record
    before the first one. The code below tries to take that into account, but
    these things are *never* correct on the first try, so get yourself a
    smallish test file and debug it. Untested:

    open my $in, '<', $infile or die "Can't read $infile: $!";
    open my $out1, '>', $outfile1 or die "Can't create $outfile1: $!";
    open my $out2, '>', $outfile2 or die "Can't create $outfile2: $!";

    $/ = "\n010";
    <$in>; # discard spurious "first" record
    while ( <$in> ) {
        # there are length( $/) characters missing from the beginning
        my $tag = substr( $_, 1235 - length $/, 3);
        # decide which output file to use (pseudocode)
        my $out = $tag =~ /.../ ? $out1 : $out2;
        print $out $/, $_; # add missing record separator
    }
    # add final linefeeds
    print $out1, "\n";
    print $out2, "\n";
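
    To see the skew concretely, here is a sketch that runs the same
    record separator over a tiny in-memory input (the sample data is
    made up) and collects the chunks that come back.

```perl
use strict;
use warnings;

# Made-up input that starts directly with a 010 line.
my $data = "010 A1\n020 a\n010 B2\n020 b\n030 b\n";
open my $in, '<', \$data or die "Can't open in-memory input: $!";

my @chunks;
{
    local $/ = "\n010";            # record separator, restored at end of block
    while (my $chunk = <$in>) {
        chomp $chunk;              # strips a trailing "\n010" if there is one
        push @chunks, $chunk;
    }
}
close $in;

for my $chunk (@chunks) {
    (my $shown = $chunk) =~ s/\n/\\n/g;   # make newlines visible
    print "[$shown]\n";
}
```

    On this input the first chunk is "010 A1\n020 a", which is a real
    record set with its 010 still attached, and the second chunk has
    lost its leading 010. So whether the first read should be thrown
    away seems to depend on what, if anything, precedes the first 010
    line in the file. Worth checking on the test file.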

    Anno
    Anno Siegel, Jan 5, 2004
    #4
  5. seansan

    Anno Siegel Guest

    seansan <> wrote in comp.lang.perl.misc:
    > Hi,
    >
    >
    > I have been asked to investigate how to split a large file in Perl.
    > My question is as follows.
    >
    > I have a large file that is built up of data chunks (record sets).
    > Every new record set starts with /^010/ and continues for some lines
    > (the number varies) until we find the next '010' record set. Finding
    > these records doesn't seem so difficult, but other aspects that I am
    > not familiar with fog my mind.
    >
    > I was thinking of opening 2 output files. I want to loop through the
    > file and, depending on the 3 characters at the 1235th column (or
    > text position) of the 010 line, print to either file A or file B.
    > How do I accomplish this?
    > - How do I read the 3-character identifier at column 1235?
    > - How do I switch between output files? (Remember, I have to write
    >   several lines to file A or B, until the next 010 line is
    >   encountered.)
    > - What considerations should I make for working with 3-4 GB files?



    First off, make "\n010" the input record separator. Then each "line"
    will essentially contain one chunk of data.

    Then loop over the chunks, determine the output file for each, and
    print it out.

    There will be a certain skew since each chunk contains the initial bit
    of the *following* record (if any). There will also be a spurious record
    before the first one. The code below tries to take that into account, but
    these things are *never* correct on the first try, so get yourself a
    smallish test file and debug it. Untested:

    open my $in, '<', $infile or die "Can't read $infile: $!";
    open my $out1, '>', $outfile1 or die "Can't create $outfile1: $!";
    open my $out2, '>', $outfile2 or die "Can't create $outfile2: $!";

    $/ = "\n010";
    <$in>; # discard spurious "first" record
    while ( <$in> ) {
        chomp; # remove record separator
        # there are length( $/) characters missing from the beginning
        my $tag = substr( $_, 1235 - length $/, 3);
        # decide which output file to use (pseudocode)
        my $out = $tag =~ /.../ ? $out1 : $out2;
        print $out $/, $_; # add missing record separator to previous entry
    }
    # add final linefeeds
    print $out1, "\n";
    print $out2, "\n";
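
    For comparison, a sketch of a plain line-by-line variant that leaves
    $/ alone, so there is no skew to correct for. The sub name
    split_records and the routing rule (tag 'AAA' goes to the first
    output, everything else to the second) are assumptions for
    illustration; the real rule is not given in the thread. It only
    holds one line in memory at a time, so the file size should not
    matter much.

```perl
use strict;
use warnings;

# Hypothetical routing: tag 'AAA' goes to $out_a, anything else to $out_b.
sub split_records {
    my ($in, $out_a, $out_b) = @_;
    my $out;                                   # handle for the current record set
    while (my $line = <$in>) {
        if ($line =~ /^010/) {
            # 1-based column 1235 is offset 1234; short lines get an empty tag.
            my $tag = length($line) >= 1237 ? substr($line, 1234, 3) : '';
            $out = $tag eq 'AAA' ? $out_a : $out_b;
        }
        next unless $out;                      # skip anything before the first 010 line
        print {$out} $line or die "Write failed: $!";
    }
    return;
}

# Small in-memory demonstration with made-up records.
my $pad   = 'x' x 1231;
my $input = "010${pad}AAA\n020 a\n010${pad}BBB\n020 b\n030 b\n";
my ($buf_a, $buf_b) = ('', '');
open my $in,    '<', \$input or die "Can't open input: $!";
open my $out_a, '>', \$buf_a or die "Can't open buffer A: $!";
open my $out_b, '>', \$buf_b or die "Can't open buffer B: $!";
split_records($in, $out_a, $out_b);
close $_ for $in, $out_a, $out_b;

printf "A got %d lines, B got %d lines\n",
    ($buf_a =~ tr/\n//), ($buf_b =~ tr/\n//);
```

    With real files, the three in-memory opens would be replaced by
    opens on the input and output file names.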

    Anno
    Anno Siegel, Jan 5, 2004
    #5
  6. seansan

    gnari Guest

    "Anno Siegel" <-berlin.de> wrote in message
    news:btbp2f$n7v$-Berlin.DE...
    > seansan <> wrote in comp.lang.perl.misc:


    [snipped problem and proposed solution]

    > # add final linefeeds
    > print $out1, "\n";
    > print $out2, "\n";
    >


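    skip the commas after the filehandles, i.e. print $out1 "\n";
    with the comma, $out1 is just another item in the list to print
    and the output goes to the currently selected handle.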
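
    A sketch of the difference, using in-memory buffers (an assumption
    for the demo) and select to capture what would otherwise land on
    STDOUT:

```perl
use strict;
use warnings;

my ($file_buf, $captured) = ('', '');
open my $out, '>', \$file_buf or die "Can't open buffer: $!";
open my $cap, '>', \$captured or die "Can't open buffer: $!";

print $out "no comma\n";         # goes to $out
print {$out} "block form\n";     # same, with the handle written as a block

my $previous = select $cap;      # capture what would normally reach STDOUT
print $out, "\n";                # comma: $out is now just an item in the list
select $previous;

close $_ for $out, $cap;
print "file buffer:\n$file_buf";
print "default handle got: $captured";
```

    The captured text is the stringified handle, something like
    GLOB(0x...), followed by the newline; nothing from that third print
    reaches the file buffer.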

    gnari
    gnari, Jan 5, 2004
    #6
  7. seansan

    Anno Siegel Guest

    gnari <> wrote in comp.lang.perl.misc:
    > "Anno Siegel" <-berlin.de> wrote in message
    > news:btbp2f$n7v$-Berlin.DE...
    > > seansan <> wrote in comp.lang.perl.misc:

    >
    > [snipped problem and proposed solution]
    >
    > > # add final linefeeds
    > > print $out1, "\n";
    > > print $out2, "\n";
    > >


    Ugh, yes. Thanks.

    Anno
    Anno Siegel, Jan 5, 2004
    #7