Parsing/sorting big file problem

Discussion in 'Perl Misc' started by mcvallet@hotmail.com, Feb 24, 2006.

  1. Guest

    Hi,
    I am writing a program that parses a 370 MB file. As long as I keep the
    number below 1000 in this portion:
    # basically tells me how far I should keep reading the file
    if ($ligne =~ m/^.*1000>>>(\w+).*/){
    $stop= 1;
    }
    it works, but as soon as I increase the number (the maximum being 2225,
    so I am not even reading half of the file), the program stops responding.
    Does anybody have a suggestion for this?
    Thank you,


    ##############################################################################"
    $#complete = 4000000;

    open(OUTPUTFILE, $outPut)
    || die "cannot open file";

    #variable initialisation
    my $countTotPositive = 0;
    my $countTotNegative = 0;
    my $stop= 0;
    my $countTotProt = 0;
    my @start = times();


    while(($ligne = <OUTPUTFILE> ) && $stop == 0){
    #identifying the protein being compared
    if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){
    #the next commented lignes are here for test purposes
    if ($ligne =~ m/^.*1200>>>(\w+).*/){
    $stop= 1;
    }
    $protName1 = $2;
    $protName1 =~ s/_//g;
    $count = 0;
    }
    #parsing the results
    else{
    $_=$ligne ;
    my $evalue= 0;
    /^\s?(\w+).*\s+\(\s*(\d+)\)\W+(\d+)\W+(\d*)\.?(\d*)\W+(\d*)\.?(\d*)e?\+?(\d{1,2})$/so;
    my $protName2=$1;
    my $nbAa=$2;
    my $eval3=$3;
    my $eval4=$4;
    my $eval5=$5;
    $eval[0]="$6";
    $eval[1]=$7;
    my $eval8=$8;
    $protName2 =~ s/_//g;
    #finding out what is the evalue for this result
    if ($ligne =~ m/e\+(\d{2,2})$/so){
    $evalue = $eval[0].".".@eval;
    for ($i = 0; $i < $eval8; $i++){
    $evalue = $evalue * 10;
    }
    }else{
    if ($eval[0] =~ m/^0/){
    $evalue = $eval[0].".".$eval[1].$eval8;
    }else{
    $evalue = $eval[0].$eval[1].$eval8;
    }
    }

    @sortedCouple = sort($protName1,$protName2);

    if ($complete{"$sortedCouple[0]-$sortedCouple[1]"}[0]
    || $sortedCouple[0] =~ m/$sortedCouple[1]/i){

    $evalue2 = $evalue;
    #modifying the evalue 1 if the identical couple
    if($sortedCouple[0] =~ m/$sortedCouple[1]/i){
    $evalue1 = $evalue;
    $identical =1;
    $countTotPositive++;
    }else{
    $evalue1 = $complete{"$sortedCouple[0]-$sortedCouple[1]"}[0];
    $identical =$complete{"$sortedCouple[0]-$sortedCouple[1]"}[1];
    }
    $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
    $protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
    $count++;
    }
    # temporaly saving the partial results
    else{
    $class1 = $classes{$protName1};
    $class2 = $classes{$protName2};
    $identical = ( $class1=~ m/$class2/ ? 1 : 0);
    if ($identical == 1){
    $countTotPositive++;
    }else{
    $countTotNegative++;
    }
    $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$evalue,
    $identical];
    }

    }

    }
    close OUTPUTFILE;
    #variable initialisation
    $countPositive = 0;
    $countNegative = 0;
    foreach $complete (sort{$complete{$a}[2]<=> $complete{$b}[2]} keys
    %complete) {
    if ($complete{$complete}[3] == 1){
    $countPositive++;
    }else{
    $countNegative++;
    }
    $newLigne =
    $complete{$complete}[0]."\t".$complete{$complete}[1]."\t".$complete{$complete}[2]."\t".$complete{$complete}[3]."\t".$countPositive/$countTotPositive."\t".$countNegative/$countTotNegative."\t".$complete{$complete}[4]."\t".$complete{$complete}[5]."\n";
    push @results,$newLigne;

    }

    @end = times();
    # ============= Analyse results

    print "Reading and parsing file took ",$end[0]-$start[0]," cpu
    seconds\n";

    # create the output document
    print "\n";
    @start = times();
    open (F,">results/5out.test");
    print F "@results";
    close F;
    @end = times();
    # ============= Analyse results

    print "Writting the file results/5out.test",$end[0]-$start[0]," cpu
    seconds\n";


    }
    ##############################################################################""
     
    , Feb 24, 2006
    #1

  2. wrote:
    > I am coding a program that parses a file 370Mb. As long as I keep this
    > number less than a 1000 in this portion :
    > # basicly tells me until when i should continue to read the file)
    > if ($ligne =~ m/^.*1000>>>(\w+).*/){
    > $stop= 1;
    > }
    > it works, but as soon as I increase the number (the max number being
    > 2225) so I am not even reading 1/2 of it, the program does not respond.
    > Does anybody have a suggestion for this ?
    > thank you,
    >
    >
    > ##############################################################################"
    > $#complete = 4000000;


    You are expanding the array @complete to contain 4,000,001 elements but it
    doesn't look like you are using that array anywhere. Perhaps it is causing
    your problem?


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Feb 24, 2006
    #2

  3. Guest

    The only thing I know is that the array will contain 2225*2225 = 4 950
    625 and I thought I was using this array here
    $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
    $protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
    Did I mix up the $ and @ ?

    Furthermore, at the beginning I was not expanding the array to this
    size, but it did not work then either, which is why I tried expanding
    the array.

    mc
     
    , Feb 24, 2006
    #3
  4. wrote:
    > The only thing I know is that the array will contain 2225*2225 = 4 950
    > 625 and I thought I was using this array here
    > $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,


    That is using the hash %complete, not the array @complete.


    John
    --
    use Perl;
    program
    fulfillment
     
    John W. Krahn, Feb 24, 2006
    #4
  5. MSG Guest

    wrote:
    > The only thing I know is that the array will contain 2225*2225 = 4 950
    > 625 and I thought I was using this array here
    > $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
    > $protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
    > Did I mix up the $ and @ ?
    >
    > Furthermore, at the beginning I was not expanding the array to this
    > size, but it was not working either this is why I tried to expand the
    > array.
    >
    > mc


    Where are 'use strict' and 'use warnings'?!
    You can catch a lot of problems simply by using those, such as your
    mixing up $complete{ } and $#complete (hash / array).
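
    To make the hash/array point concrete, here is a minimal sketch. The
    key 'a-b' and the sizes are made up for illustration. Under 'use strict'
    both variables have to be declared, which makes it visible that
    @complete and %complete are two unrelated variables. Assigning to
    keys(%complete) is the hash counterpart of pre-extending an array,
    although pre-sizing does not reduce the memory the data itself needs.

```perl
use strict;
use warnings;

my @complete;                 # an array
my %complete;                 # a hash: a different variable with the same name

$#complete = 9;               # pre-extends the ARRAY to 10 (undef) elements
keys(%complete) = 16;         # asks Perl to pre-allocate buckets for the HASH

$complete{'a-b'} = [99, 1];   # curly braces: this stores into the hash

my $array_size = scalar @complete;        # 10
my $hash_size  = scalar keys %complete;   # 1
print "array: $array_size, hash: $hash_size\n";
```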
     
    MSG, Feb 24, 2006
    #5
  6. wrote:
    > Hi,


    Hello,
    first of all: I think you are parsing output of some sequence comparison
    program. Maybe you could describe in more detail what you are trying to
    do? Your code is long, incomplete, with messy indentation and
    practically uncommented, so it is hard to see what you are doing. For
    example, what about the %classes hash? Where does it come from, where is
    it defined?

    > 2225) so I am not even reading 1/2 of it, the program does not respond.
    > Does anybody have a suggestion for this ?
    > thank you,


    Hm. From my experience with large protein data sets -- looks like your
    program exhausts all of the memory. A couple of suggestions:

    1) As far as I can tell, you do the following: you first parse the search
    results (I assume these are search results) and evaluate them at the
    same time, then you sort them according to e-value, then you save them
    in a file. You can do the following:

    - first do the parsing, and save the data on the fly to a temporary
    file

    - then open the temporary file, make the evaluation, sort the
    results, remove redundant entries, etc.

    - how long are the protein names? Maybe that is the problem? If you
    have hundreds of thousands of fasta-style descriptions, using them
    for a hash table in Perl (your "%complete" hash) may be very
    inefficient. Try to use only short ids.

    - if everything else fails, instead of spending weeks on correcting
    your program (and there is, methinks, a lot to correct), try to get
    your hands on a machine with more memory or a better OS and run
    your calculations there.

    - clean up your code, comment it, post it again here.

    2) if I am correct in my assumption and you are writing a parser for
    blast or ssearch or the results of a similar program, why don't you
    use Bioperl?

    (snip the code fragment)

    j.

    --
    ------------ January Weiner 3 -------------------------------------
    Division of Bioinformatics, University of Muenster
     
    January Weiner, Feb 24, 2006
    #6
  7. wrote:
    > The only thing I know is that the array will contain 2225*2225 = 4 950
    > 625 and I thought I was using this array here
    > $complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,


    this is a hash. When you write $blah{foo}, you access the hash %blah and
    get the value stored for the key 'foo'.

    > $protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
    > Did I mix up the $ and @ ?


    you mixed up the % and the @.

    However, I think that your problem is rather the size of your data. You
    have a hash with 5 million elements, right? Try to roughly estimate how
    much memory this will take. You need to store 5 million keys, right? Each
    key being at least some 10 characters, right? Not to mention the arrays
    that you store in the hash, correct?

    1) Make the hash keys as short as possible.

    2) Maybe instead of using protein names as keys, number the proteins
    in the results file (first protein name = 0, second protein name = 1,
    etc.). And instead of using a hash, use a two-dimensional array:

    my $matrix = [ ] ;

    while( <INPUT_FILE> ) {
        ... # do your stuff

        my ($prot_a, $prot_b) ; # these will be numerical IDs, and not names

        if($prot_a > $prot_b) { # sort
            ($prot_a, $prot_b) = ($prot_b, $prot_a) ;
        }

        $result = [ ] ;
        ... # do some more stuff
        # fill up $result

        # store the $result in the matrix
        $matrix->[$prot_a][$prot_b] = $result ;
    }
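
    A runnable version of this idea might look like the sketch below. The
    helper names id_for and add_evalue are hypothetical, the protein names
    and e-values are sample data, and it assumes that only the sum of the
    two e-values per pair needs to be kept.

```perl
use strict;
use warnings;

my (%id_of, @name_of);      # name -> small integer ID, and ID -> name

# Hypothetical helper: hand out IDs 0, 1, 2, ... in order of first appearance.
sub id_for {
    my ($name) = @_;
    unless (exists $id_of{$name}) {
        $id_of{$name} = scalar @name_of;
        push @name_of, $name;
    }
    return $id_of{$name};
}

my @matrix;                 # $matrix[$lo][$hi] holds the summed e-values

# Hypothetical helper: add one e-value to the cell of the unordered pair.
sub add_evalue {
    my ($name_a, $name_b, $evalue) = @_;
    my ($lo, $hi) = sort { $a <=> $b } id_for($name_a), id_for($name_b);
    $matrix[$lo][$hi] += $evalue;
}

add_evalue('d1tima', '1dqzB0', 99);     # a compared with b
add_evalue('1dqzB0', 'd1tima', 120);    # b compared with a, same cell
print "$name_of[0]-$name_of[1]\t$matrix[0][1]\n";
```

    Only the two name tables hold strings. The matrix itself is indexed by
    small integers, so no long hash keys are stored per pair.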

    j.

    --
    ------------ January Weiner 3 ---------------------+---------------
    Division of Bioinformatics, University of Muenster
     
    January Weiner, Feb 24, 2006
    #7
  8. Guest

    the entire code is not here, but you were correct, I was not using them.
    thanks,
    mc
     
    , Feb 24, 2006
    #8
  9. Guest


    > first of all: I think you are parsing output of some sequence comparison
    > program.

    Exactly.

    > Maybe you could describe in more detail what you are trying to
    > do? Your code is long, incomplete, with messy indentation and
    > practically uncommented, so it is hard to see what you are doing.

    Sorry.

    > For example, what about the %classes hash? Where does it come from,
    > where is it defined?

    %classes is a hash that contains the structural family (class) of each
    protein. It is built at the beginning of my code, which I did not post
    because it works correctly.



    > 1) As far as I can tell, you do the following: you first parse the search
    > results (I assume these are search results) and evaluate them at the
    > same time, then you sort them according to e-value, then you save them
    > in a file. You can do the following:
    > - first do the parsing, and save the data on the fly to a temporary
    > file


    Not exactly. The results are already pre-parsed, but they still contain
    things that are not necessary. The file looks a bit like this:
    1>>> d1tima_ 244 fragments - 244 aa
    1dqzB0 ( 277) 4276 20.6 99
    1hbnC0 ( 244) 4193 20.4 1e+02
    1cxpD0 ( 463) 4140 20.3 2e+02
    ......
    2225>>> another protein
    the last 2225 results....

    > - first do the parsing, and save the data on the fly to a temporary
    > file
    > - then open the temporary file, make the evaluation, sort the
    > results, remove redundant etc.
    > - how long are the protein names? Maybe that is the problem? If you
    > have hundreds of thousands of fasta-style descriptions, using them
    > for a hash table in Perl (your "%complete" hash) may be very
    > inefficient. Try to use only short ids.

    5 letters long

    > - if everything else fails, instead of spending weeks on correcting
    > your program (and there is, methinks, a lot to correct), try to get
    > your hands on a machine with more memory or a better OS and run
    > your calculations there.
    > - clean up your code, comment it, post it again here.

    ok
    thanks again,
    mc
     
    , Feb 24, 2006
    #9
  10. Guest

    > Maybe you could describe in more detail what you are trying to
    > do?

    I want to get all the couples a-b and the sum of their e-values, eval_ab
    + eval_ba, and sort the results according to that sum.
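
    As a minimal sketch of that goal: the block below accumulates one sum
    per unordered pair and sorts by it. The sample lines only imitate the
    format shown earlier in this thread (the second header and its result
    line are invented), and the two patterns are assumptions about that
    format, not a tested parser for the real file.

```perl
use strict;
use warnings;

# Sample lines imitating the format described in the thread (illustrative values).
my @lines = (
    "1>>> d1tima_ 244 fragments - 244 aa",
    "1dqzB0 ( 277) 4276 20.6 99",
    "1hbnC0 ( 244) 4193 20.4 1e+02",
    "2>>> 1dqzB0 277 fragments - 277 aa",
    "d1tima_ ( 244) 4276 20.6 1.5",
);

my %sum;        # "a-b" => eval_ab + eval_ba
my $query;      # protein named on the most recent ">>>" header line

for my $line (@lines) {
    if ($line =~ /^\d+>>>\s*(\w+)/) {
        ($query = $1) =~ tr/_//d;       # header line: remember the query name
        next;
    }
    # Assumed layout: name, (length), score, another number, e-value last.
    next unless $line =~ /^\s*(\w+)\s+\(\s*\d+\)\s+\d+\s+[\d.]+\s+(\S+)\s*$/;
    my ($hit, $evalue) = ($1, $2);
    $hit =~ tr/_//d;
    my $pair = join '-', sort($query, $hit);   # same key for a-b and b-a
    $sum{$pair} += $evalue;                    # "1e+02" is read as 100
}

my @order = sort { $sum{$a} <=> $sum{$b} } keys %sum;
print "$_\t$sum{$_}\n" for @order;
```

    Storing a single number per pair, rather than a six-element array,
    also keeps each hash entry much smaller.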
     
    , Feb 24, 2006
    #10
  11. <> wrote:
    > Hi,
    > I am coding a program that parses a file 370Mb. As long as I keep this
    > number less than a 1000 in this portion :
    > # basicly tells me until when i should continue to read the file)
    > if ($ligne =~ m/^.*1000>>>(\w+).*/){
    > $stop= 1;
    > }
    > it works, but as soon as I increase the number (the max number being
    > 2225) so I am not even reading 1/2 of it, the program does not respond.
    > Does anybody have a suggestion for this ?
    > thank you,

    [ snip ]
    >
    >
    > if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){
    > #the next commented lignes are here for test purposes
    > if ($ligne =~ m/^.*1200>>>(\w+).*/){
    > $stop= 1;
    > }


    I think that the problem is in your regexps. A leading or trailing
    ".*" is almost always a mistake. It says "match 0 or more of
    any single character" (not exactly, but pretty much). Because it is
    greedy, it first grabs the whole rest of the line and then gives
    characters back one at a time until the rest of the pattern matches.

    Doing that at both the beginning and end of the line can lead to an
    enormous amount of backtracking. You could try adding a non-greedy
    qualifier ("?") after the ".*", or better yet, just drop the "^.*" and
    the trailing ".*" entirely, since they always match and thus don't
    change the overall outcome of the attempted match.
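
    One way to write the header test without any ".*" is sketched below.
    It assumes that the counter is the first thing on a header line, as in
    the sample posted earlier in the thread. The $limit variable and the
    sample lines are made up for illustration.

```perl
use strict;
use warnings;

my $limit = 1200;   # hypothetical cut-off, standing in for the hard-coded number
my @seen;

my @lines = (
    "1199>>> d1aaa_ 100 fragments - 100 aa",
    "1200>>> d1bbb_ 120 fragments - 120 aa",
    "1201>>> d1ccc_ 90 fragments - 90 aa",
);

for my $line (@lines) {
    # Anchored at the start, with no leading or trailing ".*" to backtrack over.
    if ($line =~ /^\s*(\d+)>>>\s*(\w+)/) {
        my ($index, $name) = ($1, $2);  # copy the captures right after the match
        last if $index >= $limit;       # numeric comparison, then leave the loop
        push @seen, $name;
    }
}
print "@seen\n";
```

    Comparing the captured counter numerically also avoids a side effect
    of the original test, where a pattern such as /^.*200>>>/ would match
    "1200>>>" as well.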


    Mike

    --
    Michael Zawrotny
    Institute of Molecular Biophysics
    Florida State University | email:
    Tallahassee, FL 32306-4380 | phone: (850) 644-0069
     
    Michael Zawrotny, Feb 24, 2006
    #11
  12. <> wrote:

    > I am coding a program that parses a file 370Mb. As long as I keep this
    > number less than a 1000 in this portion :
    > # basicly tells me until when i should continue to read the file)
    > if ($ligne =~ m/^.*1000>>>(\w+).*/){
    > $stop= 1;
    > }
    > it works, but as soon as I increase the number



    There is NO number in your pattern.

    The "1000" is a string, not a number.


    > $#complete = 4000000;



    You can avoid getting fingerprints on the screen (from counting zeros):

    $#complete = 4_000_000;



    > open(OUTPUTFILE, $outPut)
    > || die "cannot open file";



    You are opening OUTPUTFILE for *input*.

    That is a pretty strange choice of filehandle name...

    You should include the $! variable in your die message.


    > while(($ligne = <OUTPUTFILE> ) && $stop == 0){



    You don't need the $stop flag if you simply last() out of the
    while loop at the appropriate place.


    > #identifying the protein being compared
    > if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){

                             ^^^^^^

    The "(\d*)+" part of your pattern makes no sense to me.

    Did you mean (\d+) instead?


    > #the next commented lignes are here for test purposes



    The next lines are not "commented"...


    > if ($ligne =~ m/^.*1200>>>(\w+).*/){
    > $stop= 1;



    last; # exit the while loop, avoid the problem immediately below


    > }
    > $protName1 = $2;



    If that pattern matches, then it will wipe out $2 from the
    earlier pattern match, and you will store an undef into $protName1.

    The dollar-digit variables are set/reset at each successful pattern match.


    > $protName1 =~ s/_//g;



    Regexes are for strings. tr/// is for characters.

    $protName1 =~ tr/_//d;


    > /^\s?(\w+).*\s+\(\s*(\d+)\)\W+(\d+)\W+(\d*)\.?(\d*)\W+(\d*)\.?(\d*)e?\+?(\d{1,2})$/so;
    > my $protName2=$1;



    You should *never* use the dollar-digit variables unless you
    have first ensured that the pattern match *succeeded*.


    > my $eval3=$3;
    > my $eval4=$4;
    > my $eval5=$5;



    Sequentially named variables very often indicate that there is
    a better choice of data structure, such as an array rather than
    a bunch of independent scalars.
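
    Both points can be handled at once by matching in list context, so the
    captures are only used when the match succeeded and they land in named
    variables rather than $1 .. $8. This is a sketch: parse_hit is a
    hypothetical helper, and the field layout in the pattern is an
    assumption based on the sample lines in this thread.

```perl
use strict;
use warnings;

# Hypothetical field layout for a result line: name, (length), score,
# a second number, and the e-value as the last field.
sub parse_hit {
    my ($line) = @_;
    my @fields = $line =~ /^\s*(\w+)\s+\(\s*(\d+)\)\s+(\d+)\s+([\d.]+)\s+(\S+)\s*$/
        or return;              # no match: return an empty list
    return @fields;
}

for my $line ("1dqzB0 ( 277) 4276 20.6 99", "not a result line") {
    if (my ($name, $length, $score, $number, $evalue) = parse_hit($line)) {
        print "$name: length $length, e-value $evalue\n";
    } else {
        print "skipped: $line\n";
    }
}
```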


    > $eval[0]="$6";



    What were you hoping that those double quotes would do for you?

    perldoc -q vars


    > #finding out what is the evalue for this result
    > if ($ligne =~ m/e\+(\d{2,2})$/so){



    You should not throw modifiers on the end willy-nilly like that.

    Add modifiers when they will make a difference, and that difference
    is what you want to happen.


    m//s changes the meaning of dot (.), it has no effect when there
    is no dot in your pattern.

    m//o is used when you have variables in your pattern, it has
    no effect when there are no variables in your pattern.

    if ($ligne =~ m/e\+(\d{2})$/){
    or
    if ($ligne =~ m/e\+(\d\d)$/){

    Is probably easier to read and understand.



    > for ($i = 0; $i < $eval8; $i++){
    > $evalue = $evalue * 10;
    > }



    $evalue *= 10 for 1 .. $eval8; # replaces that entire for-loop
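
    It may be possible to drop the exponent handling altogether, because
    Perl already understands exponent notation when a string is used as a
    number. A small sketch, assuming the e-value is captured as one whole
    field such as "1e+02" (the sample values are taken from the lines
    posted in this thread):

```perl
use strict;
use warnings;

my @texts  = ("99", "1e+02", "2e+02");
my @values = map { $_ + 0 } @texts;     # numeric context does the conversion

print "$texts[$_] -> $values[$_]\n" for 0 .. $#texts;
```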



    > $newLigne =
    > $complete{$complete}[0]."\t".$complete{$complete}[1]."\t".$complete{$complete}[2]."\t".$complete{$complete}[3]."\t".$countPositive/$countTotPositive."\t".$countNegative/$countTotNegative."\t".$complete{$complete}[4]."\t".$complete{$complete}[5]."\n";



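    That is simply too horrid to look upon.

    This should do the same thing without making you scream. Array slices
    pick out the stored fields, and the two ratios go between elements 3
    and 4, as in your original:

    my $rec = $complete{$complete};
    $newLigne = join("\t", @$rec[0 .. 3],
    $countPositive / $countTotPositive,
    $countNegative / $countTotNegative,
    @$rec[4, 5]) . "\n";


    > push @results,$newLigne;



    You don't even need the $newLigne temporary variable, you can push
    the result of that join() directly:

    push @results, join("\t", @$rec[0 .. 3],
    $countPositive / $countTotPositive,
    $countNegative / $countTotNegative,
    @$rec[4, 5]) . "\n";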



    > open (F,">results/5out.test");



    You should always, yes *always*, check the return value from open():

    open (F,">results/5out.test") or
    die "could not open 'results/5out.test' $!";
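
    A sketch of the same check with a lexical filehandle and the
    three-argument form of open(). The helper write_lines is hypothetical,
    and a temporary file stands in for results/5out.test so that the
    example runs anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write lines to $path, reporting the OS error on failure.
sub write_lines {
    my ($path, @lines) = @_;
    open(my $out, '>', $path)
        or die "could not open '$path' for writing: $!";
    print {$out} @lines;
    close($out)
        or die "could not close '$path': $!";
    return scalar @lines;
}

# A temporary file stands in for results/5out.test here.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
close $tmp_fh;

my $written = write_lines($tmp_path, "a\tb\t1.5\n", "c\td\t2.5\n");
print "wrote $written lines to $tmp_path\n";
```

    Printing the list directly also avoids a side effect of
    print F "@results", which joins the elements with a space and so puts
    a leading space on every line after the first.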


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Feb 24, 2006
    #12
  13. wrote:
    > Hi,
    > I am coding a program that parses a file 370Mb. As long as I keep this
    > number less than a 1000 in this portion :
    > # basicly tells me until when i should continue to read the file)
    > if ($ligne =~ m/^.*1000>>>(\w+).*/){
    > $stop= 1;
    > }
    > it works, but as soon as I increase the number (the max number being
    > 2225) so I am not even reading 1/2 of it, the program does not respond.
    > Does anybody have a suggestion for this ?
    > thank you,
    > ...


    read the file in blocks, sort and save them in temp. files and finally
    perform a merge sort:

    Untested:

    use warnings;
    use strict;
    use Sort::Key 'keysort_inplace';
    use Sort::Key::Merger 'filekeymerger';
    use File::Temp ...;

    my @lines;
    my @tempfn;

    sub extract_sorting_key {
        # extract the key that has to be used for sorting
        # from $_, for instance:
        /foo: (\w+)/;
        $1
    }

    sub sort_and_write_block {
        &keysort_inplace(\&extract_sorting_key, \@lines);
        my ($fh, $filename) = File::Temp->new(...);
        print $fh $_ for @lines;
        close $fh;
        push @tempfn, $filename;
        @lines = ();
    }

    while (<>) {
        unless ($fh) {
            ($fh, $fn) = File::Temp->new(...);
        }
        sort_and_write_block() if @lines > 1000000
    }

    sort_and_write_block() if @lines;

    my $merger = &filekeymerger(\&extract_sorting_key, @tempfn);

    while (defined (my $line = $merger->())) {
        # your lines arrive sorted here,
        # do whatever you need with them!
        ...
    }



    Cheers,

    - Salva
     
    Salvador Fandino, Feb 25, 2006
    #13
  14. Salvador Fandino wrote:

    > ...
    > while (<>) {
    > unless ($fh) {
    > ($fh, $fn) = File::Temp->new(...);
    > }
    > sort_and_write_block() if @lines > 1000000
    > }


    oops, that should be...

    while(<>) {
        push @lines, $_;
        sort_and_write_block() if @lines > 1000000;
    }
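
    For readers without Sort::Key installed, the same block-sort-and-merge
    idea can be sketched with core modules only. This is not the code
    above: it swaps Sort::Key::Merger for a plain hand-written merge, the
    helper names are hypothetical, the sort key is assumed to be the
    numeric last field of each line, and the block size of 2 is only there
    to make the tiny sample data span several temporary files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumed key: the numeric last field of each line.
sub key_of { my ($line) = @_; return (split ' ', $line)[-1] + 0 }

# Sort one block in memory and write it to its own temporary file.
sub write_sorted_chunk {
    my ($lines) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    print {$fh} sort { key_of($a) <=> key_of($b) } @$lines;
    close $fh or die "close $name: $!";
    return $name;
}

# Merge the sorted files, passing each line to $emit in key order.
sub merge_chunks {
    my ($emit, @names) = @_;
    my @heads;                              # [key, line, filehandle] per file
    for my $name (@names) {
        open my $fh, '<', $name or die "open $name: $!";
        my $line = <$fh>;
        push @heads, [key_of($line), $line, $fh] if defined $line;
    }
    while (@heads) {
        # pick the file whose current line has the smallest key
        my ($min) = sort { $heads[$a][0] <=> $heads[$b][0] } 0 .. $#heads;
        my ($key, $line, $fh) = @{ $heads[$min] };
        $emit->($line);
        my $next = <$fh>;
        if (defined $next) { $heads[$min] = [key_of($next), $next, $fh] }
        else               { splice @heads, $min, 1 }
    }
}

my @input = map { "p$_ $_\n" } (5, 3, 9, 1, 7, 2);
my (@chunk, @files);
for my $line (@input) {
    push @chunk, $line;
    if (@chunk >= 2) { push @files, write_sorted_chunk(\@chunk); @chunk = () }
}
push @files, write_sorted_chunk(\@chunk) if @chunk;

my @sorted;
merge_chunks(sub { push @sorted, $_[0] }, @files);
print @sorted;
```

    Only one line per temporary file is held in memory during the merge,
    which is the property that makes this approach suitable for inputs
    larger than RAM.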


    Cheers,

    - Salva
     
    Salvador Fandino, Feb 25, 2006
    #14
  15. Guest

    thank you everybody,
    it seems to work now, not on my computer but on a bigger one, and
    it does not take very long either...
    thanks,
    mc
     
    , Feb 27, 2006
    #15
