processing large files

Discussion in 'Perl Misc' started by Andreja, Apr 6, 2006.

  1. Andreja

    Andreja Guest

    I'm wondering if anyone in the group has experience with handling large
    files. If you are a consultant with proven capability to handle a
    project as (roughly) summarized below please contact me at your
    convenience.

    Here's what I have:
    1. File A: fixed length 830 bytes/row, 193 million rows
    2. File B: fixed length 760 bytes/row, 122 million rows
    3. File C: 10 bytes/row, 50 million rows
    4. File D: fixed length 11 bytes/row, 15 million rows
    5. File E: 4 bytes/row, 50,000 rows

    Here's what I need to do:
    1. merge files A and B, dedupe and create output file - expected to be
    +/- 210 million rows
    2. append files C, D and E to this output file
    3. Collapse this output file into a smaller file (essentially group
    individual records with certain common characteristics into one record)
    4. create final output, tab del, ascii

    Any insight as to feasibility is much appreciated, in particular also
    with regard to hardware requirement considerations.

    Thank you all.
     
    Andreja, Apr 6, 2006
    #1

  2. Paul Lalli

    Paul Lalli Guest

    Andreja wrote:
    > I'm wondering if anyone in the group has experience with handling large
    > files. If you are a consultant with proven capability to handle a
    > project as (roughly) summarized below please contact me at your
    > convenience.


    You will get better responses at http://jobs.perl.org

    This Usenet newsgroup is for helping people with their existing Perl
    programmers, not for seeking or offering jobs.

    Paul Lalli
     
    Paul Lalli, Apr 6, 2006
    #2

  3. Paul Lalli <> wrote:

    > This Usenet newsgroup is for helping people with their existing Perl
    > programmers,

    ^^^
    ^^^

    I have this Perl programmer that I'd like some help with, he just
    spends all his time posting to Usenet instead of writing Perl code,
    and his feet smell bad.

    Can you guys help me with _that_ ?


    (Sorry!)


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Apr 6, 2006
    #3
  4. Paul Lalli

    Paul Lalli Guest

    Tad McClellan wrote:
    > Paul Lalli <> wrote:
    >
    > > This Usenet newsgroup is for helping people with their existing Perl
    > > programmers,

    > ^^^
    > ^^^
    >
    > I have this Perl programmer that I'd like some help with, he just
    > spends all his time posting to Usenet instead of writing Perl code,
    > and his feet smell bad.
    >
    > Can you guys help me with _that_ ?
    >
    > (Sorry!)


    ..... it was a long day.

    My bad.

    Paul Lalli
     
    Paul Lalli, Apr 6, 2006
    #4
  5. Xho

    Guest

    "Andreja" <> wrote:
    > I'm wondering if anyone in the group has experience with handling large
    > files.


    Yes, I do.

    > If you are a consultant with proven capability to handle a
    > project as (roughly) summarized below please contact me at your
    > convenience.


    I find it convenient to contact you by responding to your post. But this
    isn't a job group, so I'll contact you with advice rather than
    solicitation.

    > Here's what I have:
    > 1. File A: fixed length 830 bytes/row, 193 million rows
    > 2. File B: fixed length 760 bytes/row, 122 million rows
    > 3. File C: 10 bytes/row, 50 million rows
    > 4. File D: fixed length 11 bytes/row, 15 million rows
    > 5. File E: 4 bytes/row, 50,000 rows
    >
    > Here's what I need to do:
    > 1. merge files A and B, dedupe and create output file - expected to be
    > +/- 210 million rows


    Since the fixed lengths are not the same, you cannot have any duplicates
    under the obvious interpretation. Once you define precisely what you
    mean, you could use Perl to canonicalize each file, then use OS tools to
    sort and merge, and either OS tools or Perl to dedupe.
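
    For instance, here is a minimal sketch of the canonicalize step,
    assuming (hypothetically) that the dedupe key is the first 20 bytes of
    each fixed-length record and that rows carry no newline of their own:

    #!/usr/bin/perl
    # Sketch only: turn fixed-length records into one sortable line each.
    # The 20-byte key field and the record layout are assumptions, not
    # anything specified in the original post.
    use strict;
    use warnings;

    my ($file, $reclen) = @ARGV;               # e.g. canon.pl fileA 830
    open my $fh, '<:raw', $file or die "open $file: $!";

    my $rec;
    while (read($fh, $rec, $reclen) == $reclen) {
        my $key = substr $rec, 0, 20;          # hypothetical dedupe key
        # hex-encode so the record is newline-safe for line-based sort tools
        print $key, "\t", unpack("H*", $rec), "\n";
    }

    Something like "perl canon.pl fileA 830 | sort -u > A.sorted" (and the
    same for file B) would then leave two sorted, deduped files that
    "sort -mu" can merge while dropping cross-file duplicates.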


    > 2. append files C, D and E to this output file


    Again, appending fixed-record-length files doesn't make much sense when
    they have different record lengths. But ignoring that:

    cat C D E >> merged

    > 3. Collapse this output file into a smaller file (essentially group
    > individual records with certain common characteristics into one record)


    Use OS sort tools to do the grouping, after using Perl to massage the data
    such that the OS sort tools will be able to group by sorting. Then use
    Perl to read the grouped data and do whatever it is supposed to do.
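
    As one possible shape for that last pass, here is a minimal sketch that
    collapses consecutive sorted lines sharing a (hypothetical) tab-separated
    key in the first column; the real grouping rule and aggregation are
    whatever the project actually needs, a count is just a stand-in:

    #!/usr/bin/perl
    # Sketch only: input is assumed to be already sorted by the group key.
    use strict;
    use warnings;

    my ($prev_key, $count);
    while (my $line = <STDIN>) {
        chomp $line;
        my ($key) = split /\t/, $line;         # hypothetical key column
        if (defined $prev_key && $key ne $prev_key) {
            print "$prev_key\t$count\n";       # one output record per group
            $count = 0;
        }
        $prev_key = $key;
        $count++;
    }
    print "$prev_key\t$count\n" if defined $prev_key;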

    How large would the largest individual group be?

    > 4. create final output, tab del, ascii


    OK.

    > Any insight as to feasibility is much appreciated, in particular also
    > with regard to hardware requirement considerations.


    Quite feasible. Since you are not going to fit everything in memory
    anyway, memory is not a major concern, except perhaps at the last step of
    dealing with the records after they are grouped. 1 GB should probably be
    plenty. Obviously, you will need a lot of disk scratch space. I think 3
    times the total uncompressed size of all the files would be enough
    (files A and B alone are roughly 160 GB + 93 GB, so figure on the order
    of 750 GB of scratch). You can make do with a lot less if the data is
    highly compressible, but that makes things much more challenging. And you
    will need a system with large file support.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Apr 6, 2006
    #5
  6. robic0

    robic0 Guest

    On 6 Apr 2006 11:42:03 -0700, "Andreja" <> wrote:

    >I'm wondering if anyone in the group has experience with handling large
    >files. If you are a consultant with proven capability to handle a
    >project as (roughly) summarized below please contact me at your
    >convenience.
    >
    >Here's what I have:
    >1. File A: fixed length 830 bytes/row, 193 million rows
    >2. File B: fixed length 760 bytes/row, 122 million rows
    >3. File C: 10 bytes/row, 50 million rows
    >4. File D: fixed length 11 bytes/row, 15 million rows
    >5. File E: 4 bytes/row, 50,000 rows
    >
    >Here's what I need to do:
    >1. merge files A and B, dedupe and create output file - expected to be
    >+/- 210 million rows
    >2. append files C, D and E to this output file
    >3. Collapse this output file into a smaller file (essentially group
    >individual records with certain common characteristics into one record)
    >4. create final output, tab del, ascii
    >
    >Any insight as to feasibility is much appreciated, in particular also
    >with regard to hardware requirement considerations.
    >
    >Thank you all.


    I find it hard to believe you would have amassed a 160+ gigabyte file
    (193 million x 830 bytes) in A, for instance. There would never be a need
    for this unless it came from an unintelligent data logging system
    (received back from space after 10 years).

    I don't know what you mean by A and B needing to be deduped after a
    merge, and I don't know what you consider a dupe. If they are records in
    a database, then a SELECT DISTINCT A.field(1..n) etc. should do it.

    If, say, you want to remove dups within A, whether on certain fields or
    on all fields combined as one string, on a per-record basis, then you
    will have to create a separate index for each of files A and B. Then use
    those indexes to merge.

    Of course, creating an index requires a location (pointer) to the
    actual data, so the data itself is not moved. But since the data is not
    in memory, what gets stored in the index has to be the absolute file
    position of the data.

    As the index is created and grows, a binary search is used on insertion
    so that in-file duplicates are never stored. That takes care of sorting
    the file and removing duplicates.

    When creating an index the data is not actually moved around; however,
    accessing the data for sort comparisons requires seeking to the file
    position and reading the record in during the binary search. This results
    in constant thrashing of the hard drive and would take an incredibly long
    time, since the same data (albeit via a binary search rather than a
    linear one) will need to be brought in time and time again.

    In this case, because the files are so big, even buffering won't help;
    actually, it would hurt.

    To get around this problem, create 24 un-indexed data files, 1 for each
    letter of the alphabet. Each line in a letter file contains the file
    position of a record in file A that starts with that letter. At this
    point, the indexes will be small enough to be created in memory; however,
    the index is now twice indirect to the data in file A. This is no problem
    whatsoever, as this scheme could go multiple levels of indirection if
    necessary, but this should do it.

    Open the 24 letter files, open file A. Sequentially read each record in A
    and write its file position (imagine it would be a big int) to the
    appropriate letter file. If you have a 'square' file of a given line
    length, then just store the line number and let the absolute file
    position for that line be calculated. If the lines differ in size,
    another scheme would be to factor out a common denominator, reducing the
    size of the letter file.
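
    A minimal sketch of that bucketing pass, assuming fixed 830-byte records
    with no separate line terminator (the file name 'fileA' and the catch-all
    '_' bucket are illustrative only):

    #!/usr/bin/perl
    # Sketch only: bucket file A's records by first character, storing just
    # the record number so the byte offset can be recomputed as 830 * n.
    use strict;
    use warnings;

    my $reclen = 830;
    open my $fh_a, '<:raw', 'fileA' or die "open fileA: $!";

    my %bucket;                                 # first letter -> output handle
    my ($rec, $recno) = (undef, 0);
    while (read($fh_a, $rec, $reclen) == $reclen) {
        my $letter = lc substr($rec, 0, 1);
        $letter = '_' if $letter !~ /[a-z]/;    # catch-all bucket
        unless ($bucket{$letter}) {
            open $bucket{$letter}, '>>', "letter.$letter"
                or die "open letter.$letter: $!";
        }
        print { $bucket{$letter} } "$recno\n";  # record number only
        $recno++;
    }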

    Now create a separate sort index for each of the 24 letter data files.
    These 'index' files contain positions into the letter files, and the
    'letter' files contain positions into file A.

    What this does is drastically reduce redundant access to file A during
    the binary searches.

    At this point you could reconstruct file A using the indexes, to allow it
    to be either viewed in sorted order or written out to a new file.
    However, it would be better to index file B in the same fashion, then
    walk the indexes for files A and B sequentially to construct either
    a merged index per letter, or a merged file altogether (without dups).
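
    One way that final sequential pass could look, as a minimal sketch: merge
    two already-sorted, already-deduped record streams and drop cross-file
    duplicates. The iterator arguments are hypothetical; in practice they
    would read the next record of A or B through its letter/index files:

    # Sketch only: classic two-way merge of sorted streams, keeping one copy
    # of any record that appears in both. $next_a/$next_b are assumed to be
    # closures returning the next record string, or undef at end of stream.
    use strict;
    use warnings;

    sub merge_dedupe {
        my ($next_a, $next_b, $out) = @_;
        my $ra = $next_a->();
        my $rb = $next_b->();
        while (defined $ra or defined $rb) {
            if (!defined $rb or (defined $ra and $ra lt $rb)) {
                print {$out} $ra;  $ra = $next_a->();
            }
            elsif (!defined $ra or $rb lt $ra) {
                print {$out} $rb;  $rb = $next_b->();
            }
            else {                                  # equal: duplicate record
                print {$out} $ra;
                $ra = $next_a->();  $rb = $next_b->();
            }
        }
    }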

    File A sorted-data traversal example. Assume letter 'a' has its index and
    file positions already read into a couple of arrays:

    # file A is open in binary mode on the handle $FH_A (opened elsewhere)
    foreach my $ndxto_a (@A_ndx_a)
    {
        my $A_fpos = 830 * $A_letter_a[$ndxto_a];
        seek $FH_A, $A_fpos, 0;            # seek to $A_fpos
        read $FH_A, my $record, 830;       # read 830 bytes
    }

    The CR/LF will have to be taken into account when seeking, so the real
    record length is maybe 832 or so.

    That about handles sorting and in-file removal of dups.
    To do the other steps, a further breakdown of the data will be necessary,
    similar but probably with a different strategy.

    Creating the @A_ndx_xxx indexes will need a bit of creativity in Perl,
    since it doesn't have memmove as in C. This would be needed to 'insert'
    the letter-array index into the sorted index array. This only applies if
    you would be doing a binary search for the insertion.
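
    In Perl the usual stand-in for memmove is splice, so a minimal sketch of
    that insertion could look like this (the comparator $cmp is a
    hypothetical callback that seeks into file A and compares the two records
    behind the index entries):

    # Sketch only: binary-search insertion into a sorted index array.
    use strict;
    use warnings;

    sub insert_sorted {
        my ($ndx, $new, $cmp) = @_;        # $ndx: arrayref of index entries
        my ($lo, $hi) = (0, scalar @$ndx);
        while ($lo < $hi) {
            my $mid = int(($lo + $hi) / 2);
            my $c   = $cmp->($ndx->[$mid], $new);
            return if $c == 0;             # duplicate record: do not insert
            if ($c < 0) { $lo = $mid + 1 } else { $hi = $mid }
        }
        splice @$ndx, $lo, 0, $new;        # shift the tail right and insert
    }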

    An alternative to building the index with binary-search insertion would
    be to use Perl's sort function on the letter array directly, making it
    the index. This would do away with the extra step. It's hard to tell
    which would be faster.

    To make an index out of the letter array, create the letter file as
    above, read it into an array, and use Perl's sort function with a custom
    compare subroutine. The file access on file A would be the same.

    For instance:

    @A_letter_a = sort special @letter_a;

    sub special {
        my ($A_fpos1, $A_fpos2) = (830 * $a, 830 * $b);
        seek $FH_A, $A_fpos1, 0;  read $FH_A, my $data1, 830;
        seek $FH_A, $A_fpos2, 0;  read $FH_A, my $data2, 830;
        return $data1 cmp $data2;
    }

    Then, as above, to access the sorted data, with a slight change:

    foreach my $fline (@A_letter_a)
    {
        my $A_fpos = 830 * $fline;
        seek $FH_A, $A_fpos, 0;            # seek to $A_fpos
        read $FH_A, my $record, 830;       # read 830 bytes
    }

    I would try this last sort method first: make the letter file itself into
    a sorted index. The best way is to make a test case on one letter
    and see how it goes. If it's too slow, a separately built (binary search)
    index of the letter file may be necessary.
     
    robic0, Apr 7, 2006
    #6
  7. Tad McClellan wrote:
    > Paul Lalli <> wrote:
    >
    >>This Usenet newsgroup is for helping people with their existing Perl
    >>programmers,

    > ^^^
    > ^^^
    >
    > I have this Perl programmer that I'd like some help with, he just
    > spends all his time posting to Usenet instead of writing Perl code,
    > and his feet smell bad.
    >
    > Can you guys help me with _that_ ?


    Help you with your smelly feet? Sorry, no. :)


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Apr 7, 2006
    #7
  8. Andreja wrote:
    > I'm wondering if anyone in the group has experience with handling large
    > files. If you are a consultant with proven capability to handle a
    > project as (roughly) summarized below please contact me at your
    > convenience.


    Have you tried posting this on the IEEE Job Site:

    http://careers.ieee.org/



    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Apr 7, 2006
    #8
  9. At 2006-04-06 08:46PM, robic0 <> wrote:
    > To get around this problem create 24 un-indexed data files. 1 for
    > each letter of the alphabet.


    No wonder it's so hard to understand you sometimes.

    --
    Glenn Jackman
    Ulterior Designer
     
    Glenn Jackman, Apr 7, 2006
    #9
  10. Glenn Jackman wrote:
    > At 2006-04-06 08:46PM, robic0 <> wrote:
    > > To get around this problem create 24 un-indexed data files. 1 for
    > > each letter of the alphabet.

    >
    > No wonder it's so hard to understand you sometimes.



    lmao!
     
    it_says_BALLS_on_your forehead, Apr 7, 2006
    #10
  11. robic0

    robic0 Guest

    On 7 Apr 2006 17:33:26 GMT, Glenn Jackman <> wrote:

    >At 2006-04-06 08:46PM, robic0 <> wrote:
    >> To get around this problem create 24 un-indexed data files. 1 for
    >> each letter of the alphabet.

    >
    >No wonder it's so hard to understand you sometimes.


    Oh yeah, one for y,z also.
    Good catch.
     
    robic0, Apr 7, 2006
    #11
  12. robic0

    robic0 Guest

    On Thu, 06 Apr 2006 17:46:48 -0700, robic0 wrote:

    >On 6 Apr 2006 11:42:03 -0700, "Andreja" <> wrote:
    >
    >>I'm wondering if anyone in the group has experience with handling large
    >>files.
    >
    [snip]


    Just a follow-up with some possible strategies to consider that would
    speed things up on whatever this is, data-mining or whatever.
    Btw, it was pointed out to me that there are 26 letters in the alphabet
    and NOT 24 as stated (no, I'm not brain dead yet, pretty close).
    I'm almost sure the number of letter arrays won't be 26 though (see below).

    CREATING THE LETTER FILES
    - Try not to read the entire line; use seek, then read just the first
      character (this applies if it is a record as well).
    - Store the file position (or the calculated line number) in the
      particular letter array.
    - Create an additional 26 files if being case-sensitive.
    - This process should consume 193 million x 4 bytes ~= 770 megabytes in
      total letter files for file A. The goal is to move individual letter
      files into memory for processing. Ideally that is 770 megabytes divided
      by 26 ~= 30 megabytes or less per file. This is manageable. The problem
      comes when an individual letter file is very big. Depending on the
      data, the skew would tend toward 's' and the more-used letters.

    JOINING SMALL LETTER FILES
    In this step, it is assumed that each letter file's size is less than the
    sum of all the letter file sizes divided by 26 (see below if a file is
    bigger). This step will actually slow the process down, but it will give
    a roughly equal run time for each letter group. A sketch of the join pass
    follows the list.
    - Join/append alphabetically sequential files into a single file, up to
      the imposed file-size limit of ~30 megabytes.
    - Sort this file as if it were a single letter file.
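
    A minimal sketch of that join pass, assuming bucket files named letter.a
    .. letter.z and the hypothetical ~30 MB cap; the cat call is just the
    quickest way to append whole files:

    # Sketch only: group alphabetically adjacent letter buckets so each
    # combined bucket stays under a size cap.
    use strict;
    use warnings;

    my $cap   = 30 * 1024 * 1024;               # ~30 MB, an assumption
    my @files = grep { -e $_ } map { "letter.$_" } 'a' .. 'z';

    my @group;
    my ($bytes, $n) = (0, 0);
    for my $f (@files) {
        my $size = -s $f;
        if (@group && $bytes + $size > $cap) {
            system("cat @group > combined.$n") == 0 or die "cat failed";
            @group = ();
            $bytes = 0;
            $n++;
        }
        push @group, $f;
        $bytes += $size;
    }
    if (@group) {
        system("cat @group > combined.$n") == 0 or die "cat failed";
    }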

    SPLITTING LARGE LETTER FILES
    This presents the biggest problem.
    There are solutions to this as well.
    Hopefully this step is not necessary.
     
    robic0, Apr 8, 2006
    #12
  13. robic0

    robic0 Guest

    On Fri, 07 Apr 2006 17:04:35 -0700, robic0 wrote:

    >On Thu, 06 Apr 2006 17:46:48 -0700, robic0 wrote:

    [snip]
    >SPLITTING LARGE LETTER FILES
    > This presents the biggest problem.
    > There are solutions to this as well.
    > Hopefully this step is not necessary.
    >

    More on 'Splitting large letter files':
    I didn't want to get into the statistics of letter combinations (like
    'st'), but they can be used analytically in an automated algo.

    Statistics should arm you with enough a priori knowledge of the
    distribution curve to split the files that come out large. For instance,
    if you have a letter such as 's' that might produce a large file, even if
    you only know this after a semi-automatic pass, all that's necessary is
    to create another 26 sub-letter files based on the next character,
    e.g. sa, st, so, ... etc. This could all be done programmatically, the
    entire thing. Statistics could supply some of the a priori knowledge such
    an algo needs. There is no doubt, however, that this is entirely possible.
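
    A minimal sketch of splitting one oversized bucket on the second
    character, assuming the bucket file 'letter.s' holds record numbers into
    an 830-byte/record file A (file names are illustrative only):

    # Sketch only: split bucket "letter.s" into sub-buckets letter.sa ..
    # letter.sz keyed on each record's second character.
    use strict;
    use warnings;

    my $reclen = 830;
    open my $fh_a, '<:raw', 'fileA'    or die "open fileA: $!";
    open my $fh_s, '<',     'letter.s' or die "open letter.s: $!";

    my %sub;                                   # second letter -> output handle
    while (my $recno = <$fh_s>) {
        chomp $recno;
        seek $fh_a, $recno * $reclen, 0;
        read $fh_a, my $rec, $reclen;
        my $second = lc substr($rec, 1, 1);
        $second = '_' if $second !~ /[a-z]/;   # catch-all sub-bucket
        unless ($sub{$second}) {
            open $sub{$second}, '>>', "letter.s$second"
                or die "open letter.s$second: $!";
        }
        print { $sub{$second} } "$recno\n";
    }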

    I guess now that the world is geared for such large data sets, absolutely
    anything can be done as far as data reduction. Back when we had 64K of
    RAM in the old days, 'large' didn't stop anything either.

    Don't think you have come up with something that is not possible; this
    group is read in America! Given the knowledge displayed here, leaving it
    without a response would lead me to believe you're a crackpot
    bullshit/phishing artist.
     
    robic0, Apr 10, 2006
    #13
