Parsing/sorting big file problem

mcvallet

Hi,
I am writing a program that parses a 370 MB file. As long as I keep the
number in this portion below 1000:
# basically tells me when to stop reading the file
if ($ligne =~ m/^.*1000>>>(\w+).*/){
$stop= 1;
}
it works. But as soon as I increase the number (the maximum is 2225, so
I am not even reading half of the file), the program stops responding.
Does anybody have a suggestion?
Thank you,


##############################################################################"
$#complete = 4000000;

open(OUTPUTFILE, $outPut)
|| die "cannot open file";

#variable initialisation
my $countTotPositive = 0;
my $countTotNegative = 0;
my $stop= 0;
my $countTotProt = 0;
my @start = times();


while(($ligne = <OUTPUTFILE> ) && $stop == 0){
#identifying the protein being compared
if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){
#the next commented lignes are here for test purposes
if ($ligne =~ m/^.*1200>>>(\w+).*/){
$stop= 1;
}
$protName1 = $2;
$protName1 =~ s/_//g;
$count = 0;
}
#parsing the results
else{
$_=$ligne ;
my $evalue= 0;
/^\s?(\w+).*\s+\(\s*(\d+)\)\W+(\d+)\W+(\d*)\.?(\d*)\W+(\d*)\.?(\d*)e?\+?(\d{1,2})$/so;
my $protName2=$1;
my $nbAa=$2;
my $eval3=$3;
my $eval4=$4;
my $eval5=$5;
$eval[0]="$6";
$eval[1]=$7;
my $eval8=$8;
$protName2 =~ s/_//g;
#finding out what is the evalue for this result
if ($ligne =~ m/e\+(\d{2,2})$/so){
$evalue = $eval[0].".".@eval;
for ($i = 0; $i < $eval8; $i++){
$evalue = $evalue * 10;
}
}else{
if ($eval[0] =~ m/^0/){
$evalue = $eval[0].".".$eval[1].$eval8;
}else{
$evalue = $eval[0].$eval[1].$eval8;
}
}

@sortedCouple = sort($protName1,$protName2);

if ($complete{"$sortedCouple[0]-$sortedCouple[1]"}[0]
|| $sortedCouple[0] =~ m/$sortedCouple[1]/i){

$evalue2 = $evalue;
#modifying the evalue 1 if the identical couple
if($sortedCouple[0] =~ m/$sortedCouple[1]/i){
$evalue1 = $evalue;
$identical =1;
$countTotPositive++;
}else{
$evalue1 = $complete{"$sortedCouple[0]-$sortedCouple[1]"}[0];
$identical =$complete{"$sortedCouple[0]-$sortedCouple[1]"}[1];
}
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
$protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
$count++;
}
# temporaly saving the partial results
else{
$class1 = $classes{$protName1};
$class2 = $classes{$protName2};
$identical = ( $class1=~ m/$class2/ ? 1 : 0);
if ($identical == 1){
$countTotPositive++;
}else{
$countTotNegative++;
}
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$evalue,
$identical];
}

}

}
close OUTPUTFILE;
#variable initialisation
$countPositive = 0;
$countNegative = 0;
foreach $complete (sort{$complete{$a}[2]<=> $complete{$b}[2]} keys
%complete) {
if ($complete{$complete}[3] == 1){
$countPositive++;
}else{
$countNegative++;
}
$newLigne =
$complete{$complete}[0]."\t".$complete{$complete}[1]."\t".$complete{$complete}[2]."\t".$complete{$complete}[3]."\t".$countPositive/$countTotPositive."\t".$countNegative/$countTotNegative."\t".$complete{$complete}[4]."\t".$complete{$complete}[5]."\n";
push @results,$newLigne;

}

@end = times();
# ============= Analyse results

print "Reading and parsing file took ",$end[0]-$start[0]," cpu
seconds\n";

# creation du document
print "\n";
@start = times();
open (F,">results/5out.test");
print F "@results";
close F;
@end = times();
# ============= Analyse results

print "Writting the file results/5out.test",$end[0]-$start[0]," cpu
seconds\n";


}
##############################################################################""
 
John W. Krahn

I am coding a program that parses a file 370Mb. As long as I keep this
number less than a 1000 in this portion :
# basicly tells me until when i should continue to read the file)
if ($ligne =~ m/^.*1000>>>(\w+).*/){
$stop= 1;
}
it works, but as soon as I increase the number (the max number being
2225) so I am not even reading 1/2 of it, the program does not respond.
Does anybody have a suggestion for this ?
thank you,


##############################################################################"
$#complete = 4000000;

You are expanding the array @complete to contain 4,000,001 elements but it
doesn't look like you are using that array anywhere. Perhaps it is causing
your problem?


John
 
mcvallet

The only thing I know is that the array will contain 2225*2225 =
4,950,625 elements, and I thought I was using this array here:
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
$protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
Did I mix up the $ and @ ?

Furthermore, at the beginning I was not expanding the array to this
size, but it did not work then either, which is why I tried to expand
the array.

mc
 
John W. Krahn

The only thing I know is that the array will contain 2225*2225 = 4 950
625 and I thought I was using this array here
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,

That is using the hash %complete, not the array @complete.
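
To make the distinction concrete, here is a minimal sketch (the small sizes and the helper name `sizes` are made up for illustration). It shows that `$#complete` touches only the array, that curly braces address the hash, and that a hash is pre-sized by assigning to `keys`:

```perl
use strict;
use warnings;

my @complete;            # an array ...
my %complete;            # ... and an unrelated hash with the same name

$#complete = 9;          # grows the ARRAY to 10 (undef) elements
$complete{'a-b'} = [1];  # curly braces: stores into the HASH

# Hypothetical helper: report the size of each container.
sub sizes { return (scalar(@complete), scalar(keys %complete)) }

printf "array: %d elements, hash: %d keys\n", sizes();

# Pre-sizing a hash is done by assigning to keys(), not to $#name.
keys %complete = 1024;
```

Pre-sizing only reserves hash buckets; it does not reduce the memory that the stored keys and values need.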


John
 
MSG

The only thing I know is that the array will contain 2225*2225 = 4 950
625 and I thought I was using this array here
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,
$protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
Did I mix up the $ and @ ?

Furthermore, at the beginning I was not expanding the array to this
size, but it was not working either this is why I tried to expand the
array.

mc

Where are 'use strict' and 'use warnings'?!
You can catch a lot of problems simply by using those, such as your
mixing up $complete{ } and $#complete (hash vs. array).
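
As a small illustration of that point (a sketch, not the poster's code), the snippet below compiles a two-statement fragment under `use strict` in which only `%complete` is declared. The fragment is held in a string so that the compile error can be captured and printed instead of aborting the script:

```perl
use strict;
use warnings;

# Only %complete is declared, so the array @complete that
# $#complete refers to is an undeclared variable under strict.
my $snippet = 'use strict; my %complete; $#complete = 10; 1;';
my $ok      = eval $snippet;
my $error   = $ok ? '' : $@;

print $ok ? "compiled fine\n" : "strict refused it: $error";
```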
 
January Weiner


Hello,
first of all: I think you are parsing output of some sequence comparison
program. Maybe you could describe in more detail what you are trying to
do? Your code is long, incomplete, with messy indentation and
practically uncommented, so it is hard to see what you are doing. For
example, what about the %classes hash? Where does it come from, where is
it defined?

it works, but as soon as I increase the number (the max number being
2225) so I am not even reading 1/2 of it, the program does not respond.
Does anybody have a suggestion for this ?
thank you,

Hm. From my experience with large protein data sets -- looks like your
program exhausts all of the memory. A couple of suggestions:

1) As far as I can tell, you do the following: you first parse the search
results (I assume these are search results) and evaluate them at the
same time, then you sort them according to e-value, then you save them
in a file. You can do the following:

- first do the parsing, and save the data on the fly to a temporary
file

- then open the temporary file, make the evaluation, sort the
results, remove redundant etc.

- how long are the protein names? Maybe that is the problem? If you
have hundreds of thousands of fasta-style descriptions, using them
for a hash table in Perl (your "%complete" hash) may be very
inefficient. Try to use only short ids.

- if everything else fails, instead of spending weeks on correcting
your program (and there is, methinks, a lot to correct), try to get
your hands on a machine with more memory or a better OS and run
your calculations there.

- clean up your code, comment it, post it again here.

2) if I am correct in my assumption and you are writing a parser for
blast or ssearch or the results of a similar program, why don't you
use Bioperl?

(snip the code fragment)

j.
 
January Weiner

The only thing I know is that the array will contain 2225*2225 = 4 950
625 and I thought I was using this array here
$complete{"$sortedCouple[0]-$sortedCouple[1]"} = [$protName1,

This is a hash. When you write $blah{foo}, you access the hash %blah and
get the value stored for the key 'foo'.

$protName2, $evalue1 + $evalue2, $identical, $evalue1, $evalue2];
Did I mix up the $ and @ ?

You mixed up the % and the @.

However, I think that your problem is rather the size of your data. You
have a hash with 5 million elements, right? Try to roughly estimate how
much memory this will take. You need to store 5 million keys, right? Each
key being at least some 10 characters, right? Not to mention the arrays
that you store in the hash, correct?
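
A back-of-the-envelope version of that estimate can be written down as follows. The figure of 400 bytes per entry (key, hash slot, and a six-element array reference) is only a guess for illustration and has not been measured; the pair counts follow directly from the 2225 proteins mentioned in the thread:

```perl
use strict;
use warnings;

my $proteins = 2225;

# Ordered pairs (a, b): what the all-against-all input contains.
my $ordered = $proteins * $proteins;

# Unordered pairs including a-a: the number of distinct keys that
# the sorted "a-b" scheme can produce.
my $unordered = $proteins * ($proteins + 1) / 2;

# ASSUMPTION: rough per-entry cost in bytes, a guess and not a measurement.
my $bytes_per_entry = 400;

printf "%d ordered pairs, %d hash keys, roughly %.1f GB\n",
    $ordered, $unordered, $unordered * $bytes_per_entry / 1e9;
```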

1)Make the hash keys as short as possible.

2)Maybe instead of using protein names as keys, encode the file with
results (protein name1 = 0 ; protein name2 = 1 etc.). And instead of
using a hash, use a two-dimensional array:

my $matrix = [ ] ;

while( <INPUT_FILE> ) {
... # do your stuff

my ($prot_a, $prot_b) ; # these will be numerical IDs, and not names

if($prot_a > $prot_b) { # sort
($prot_a, $prot_b) = ($prot_b, $prot_a) ;
}

$result = [ ] ;
... # do some more stuff
# fill up $result

# store the $result in the matrix
$matrix->[$prot_a][$prot_b] = $result ;
}
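
One way to flesh out that outline, as a sketch only: the helper names `id_for` and `add_evalue` are invented here, and the e-values are made-up numbers. Protein names are turned into small integers on first sight, and each unordered pair shares one cell of the matrix:

```perl
use strict;
use warnings;

my %id_of;        # protein name -> small integer ID
my @name_of;      # ID -> protein name (to print results later)

# Hypothetical helper: hand out IDs in order of first appearance.
sub id_for {
    my ($name) = @_;
    unless (exists $id_of{$name}) {
        push @name_of, $name;
        $id_of{$name} = $#name_of;
    }
    return $id_of{$name};
}

my $matrix = [];

# Hypothetical helper: add one e-value to the cell of an unordered pair.
sub add_evalue {
    my ($name_a, $name_b, $evalue) = @_;
    my ($i, $j) = sort { $a <=> $b } id_for($name_a), id_for($name_b);
    $matrix->[$i][$j] += $evalue;
    return $matrix->[$i][$j];
}

add_evalue('d1tima', '1dqzB0', 99);
add_evalue('1dqzB0', 'd1tima', 100);   # reverse direction, same cell
print "$name_of[0] - $name_of[1]: $matrix->[0][1]\n";
```

Only the upper triangle of the matrix is ever filled, because the smaller ID is always used as the row index.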

j.
 
mcvallet

The entire code is not here, but you were correct: I was not using them.
thanks,
mc
 
mcvallet

first of all: I think you are parsing output of some sequence comparison
program.

Exactly.

Maybe you could describe in more detail what you are trying to
do? Your code is long, incomplete, with messy indentation and
practically uncommented, so it is hard to see what you are doing.

Sorry.

For example, what about the %classes hash? Where does it come from, where is
it defined?

%classes is a hash that holds the structural family (class) of each
protein. It is built at the beginning of my code, which I did not post
because it works correctly.

1) As far as I can tell, you do the following: you first parse the search
results (I assume these are search results) and evaluate them at the
same time, then you sort them according to e-value, then you save them
in a file. You can do the following:
- first do the parsing, and save the data on the fly to a temporary
file

Not exactly: the results are already pre-parsed, but they still contain
things that are not necessary. The file looks a bit like this:

1>>> d1tima_ 244 fragments - 244 aa
1dqzB0 ( 277) 4276 20.6 99
1hbnC0 ( 244) 4193 20.4 1e+02
1cxpD0 ( 463) 4140 20.3 2e+02
......
2225>>> another protein
the last 2225 results....

- then open the temporary file, make the evaluation, sort the
results, remove redundant etc.
- how long are the protein names? Maybe that is the problem? If you
have hundreds of thousands of fasta-style descriptions, using them
for a hash table in Perl (your "%complete" hash) may be very
inefficient. Try to use only short ids.

5 letters long.

- if everything else fails, instead of spending weeks on correcting
your program (and there is, methinks, a lot to correct), try to get
your hands on a machine with more memory or a better OS and run
your calculations there.
- clean up your code, comment it, post it again here.

OK.

Thanks again,
mc
 
mcvallet

Maybe you could describe in more detail what you are trying to
do?

I want to get all the couples a-b together with the sum of their
e-values, eval_ab + eval_ba, and sort the results by that sum.
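
In miniature, that goal could look like the sketch below. The `@hits` triples are made-up stand-ins for already parsed lines, and the names `%sum_for` and `@ranked` are chosen here for illustration:

```perl
use strict;
use warnings;

# Made-up (query, hit, e-value) triples standing in for parsed lines.
my @hits = (
    [ 'd1tima', '1dqzB0', 99  ],
    [ '1dqzB0', 'd1tima', 100 ],
    [ 'd1tima', '1hbnC0', 2   ],
    [ '1hbnC0', 'd1tima', 3   ],
);

my %sum_for;   # "a-b" (names sorted) -> eval_ab + eval_ba
for my $hit (@hits) {
    my ($query, $target, $evalue) = @$hit;
    my $key = join '-', sort($query, $target);
    $sum_for{$key} += $evalue;
}

# Couples ordered by the summed e-value, smallest first.
my @ranked = sort { $sum_for{$a} <=> $sum_for{$b} } keys %sum_for;
print "$_\t$sum_for{$_}\n" for @ranked;
```

Storing only the running sum per couple keeps each hash value a plain number instead of a six-element array, which also reduces memory use.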
 
Michael Zawrotny

Hi,
I am coding a program that parses a file 370Mb. As long as I keep this
number less than a 1000 in this portion :
# basicly tells me until when i should continue to read the file)
if ($ligne =~ m/^.*1000>>>(\w+).*/){
$stop= 1;
}
it works, but as soon as I increase the number (the max number being
2225) so I am not even reading 1/2 of it, the program does not respond.
Does anybody have a suggestion for this ?
thank you, [ snip ]


if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){
#the next commented lignes are here for test purposes
if ($ligne =~ m/^.*1200>>>(\w+).*/){
$stop= 1;
}

I think that the problem is in your regexps. A leading or trailing
".*" is almost always a mistake. It says "match 0 or more of
any single character" (except newline). Because it is greedy, it first
grabs as much of the line as it can and then gives characters back one
at a time until the rest of the pattern matches or every position has
been tried.

Doing that at both the beginning and end of the line can lead to a
large amount of backtracking on every line that does not match. You
could try adding a non-greedy qualifier ("?") after the ".*", or better
yet, drop the trailing ".*" entirely, since it always matches and thus
doesn't change the overall outcome of the attempted match. The leading
".*" can go too if the number really starts the line; otherwise drop
both the "^" and the ".*".
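
A sketch of a header match without any ".*", assuming (from the sample shown earlier in the thread) that a header line starts with the index followed by ">>>". The helper name `parse_header` is invented for this example. Capturing the index also allows a numeric comparison instead of a hard-coded "1200" in the pattern:

```perl
use strict;
use warnings;

# Hypothetical helper: return (index, protein name) for a header line
# such as "1>>> d1tima_ 244 fragments - 244 aa", or an empty list.
sub parse_header {
    my ($line) = @_;
    return $line =~ /^\s*(\d+)>>>\s*(\w+)/ ? ($1, $2) : ();
}

my ($index, $name) = parse_header('1200>>> d1tima_ 244 fragments - 244 aa');
print "stop here\n" if defined $index && $index >= 1200;   # numeric compare
```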


Mike
 
Tad McClellan

I am coding a program that parses a file 370Mb. As long as I keep this
number less than a 1000 in this portion :
# basicly tells me until when i should continue to read the file)
if ($ligne =~ m/^.*1000>>>(\w+).*/){
$stop= 1;
}
it works, but as soon as I increase the number


There is NO number in your pattern.

The "1000" is a string, not a number.

$#complete = 4000000;


You can avoid getting fingerprints on the screen (from counting zeros):

$#complete = 4_000_000;


open(OUTPUTFILE, $outPut)
|| die "cannot open file";


You are opening OUTPUTFILE for *input*.

That is a pretty strange choice of filehandle name...

You should include the $! variable in your die message.
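
For instance, with a lexical filehandle and the three-argument form of open (a sketch; the helper name and the file path are made up, and the path is presumed not to exist so that the message can be seen):

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading, or die saying why.
sub open_for_reading {
    my ($path) = @_;
    open(my $in, '<', $path)
        or die "cannot open '$path' for reading: $!\n";
    return $in;
}

# Demonstrate the message on a file that presumably does not exist.
my $in    = eval { open_for_reading('no/such/file.txt') };
my $error = $@;
print "open failed: $error" unless $in;
```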

while(($ligne = <OUTPUTFILE> ) && $stop == 0){


You don't need the $stop flag if you simply last() out of the
while loop at the appropriate place.
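
A minimal sketch of that loop shape, reading from an in-memory string instead of the real file. The data and the cut-off value 2 (playing the role of "1200") are made up:

```perl
use strict;
use warnings;

my $data = "1>>> first\nhit a\n2>>> second\nhit b\n3>>> third\nhit c\n";
open(my $in, '<', \$data) or die "cannot open in-memory file: $!";

my $limit = 2;      # hypothetical cut-off
my @seen;
while (my $line = <$in>) {
    if ($line =~ /^(\d+)>>>/) {
        last if $1 >= $limit;   # leave the loop; no flag variable needed
    }
    chomp $line;
    push @seen, $line;
}
close $in;
print "kept: @seen\n";
```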

#identifying the protein being compared
if ($ligne =~ m/^.+(\d*)+>>>\s*(\w+).*/){
                   ^^^^^^

That part of your pattern makes no sense to me.

Did you mean (\d+) instead?

#the next commented lignes are here for test purposes


The next lines are not "commented"...

if ($ligne =~ m/^.*1200>>>(\w+).*/){
$stop= 1;


last; # exit the while loop, avoid the problem immediately below

}
$protName1 = $2;


If that pattern matches, then it will wipe out $2 from the
earlier pattern match, and you will store an undef into $protName1.

The dollar-digit variables are set/reset at each successful pattern match.
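
A small demonstration of that effect (the sample line is taken from the file excerpt in the thread; the variable names are invented). The inner pattern succeeds and has only one capture group, so `$2` is undefined afterwards:

```perl
use strict;
use warnings;

my $line = '1200>>> d1tima_ 244 fragments - 244 aa';
my ($name_saved, $name_late);

if ($line =~ /^(\d+)>>>\s*(\w+)/) {
    $name_saved = $2;                 # copy the capture right away
    if ($line =~ /^1200>>>(\s*)/) {   # succeeds and has only ONE group
        $name_late = $2;              # $2 now belongs to this match: undef
    }
}
print "saved: $name_saved, late: ",
    defined $name_late ? $name_late : 'undef', "\n";
```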

$protName1 =~ s/_//g;


Regexes are for strings. tr/// is for characters.

$protName1 =~ tr/_//d;
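
A short sketch of both uses of tr/// on one of the names from the thread: with /d it deletes the listed characters, and without /d and with an empty replacement list it only counts them:

```perl
use strict;
use warnings;

my $name = 'd1tima_';
(my $clean = $name) =~ tr/_//d;       # delete every underscore from the copy
my $underscores = ($name =~ tr/_//);  # no /d: the string is unchanged, tr counts
print "$clean (underscores found: $underscores)\n";
```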

/^\s?(\w+).*\s+\(\s*(\d+)\)\W+(\d+)\W+(\d*)\.?(\d*)\W+(\d*)\.?(\d*)e?\+?(\d{1,2})$/so;
my $protName2=$1;


You should *never* use the dollar-digit variables unless you
have first ensured that the pattern match *succeeded*.
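
One common way to follow that rule is to take the captures from the match in list context, so that a failed match yields an empty list. The sketch below assumes each result sits on one line shaped like "1dqzB0 ( 277) 4276 20.6 99"; the helper name and the meaning given to each column are assumptions, not something stated in the thread:

```perl
use strict;
use warnings;

# Hypothetical helper. ASSUMED columns: name, length, score, bits, e-value.
sub parse_hit {
    my ($line) = @_;
    my @fields =
        $line =~ /^\s*(\w+)\s+\(\s*(\d+)\)\s+(\d+)\s+([\d.]+)\s+(\S+)\s*$/
        or return;          # no match: return an empty list
    return @fields;
}

my @hit = parse_hit('1hbnC0 ( 244) 4193 20.4 1e+02');
print "name=$hit[0] evalue=$hit[4]\n" if @hit;
```

Keeping the captures in one array also avoids a row of sequentially numbered scalars.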

my $eval3=$3;
my $eval4=$4;
my $eval5=$5;


Sequentially named variables very often indicate that there is
a better choice of data structure, such as an array rather than
a bunch of independent scalars.

$eval[0]="$6";


What were you hoping that those double quotes would do for you?

perldoc -q vars

#finding out what is the evalue for this result
if ($ligne =~ m/e\+(\d{2,2})$/so){


You should not throw modifiers on the end willy-nilly like that.

Add modifiers when they will make a difference, and that difference
is what you want to happen.


m//s changes the meaning of dot (.), it has no effect when there
is no dot in your pattern.

m//o is used when you have variables in your pattern, it has
no effect when there are no variables in your pattern.

if ($ligne =~ m/e\+(\d{2})$/){
or
if ($ligne =~ m/e\+(\d\d)$/){

Is probably easier to read and understand.


for ($i = 0; $i < $eval8; $i++){
$evalue = $evalue * 10;
}


$evalue *= 10 for 1 .. $eval8; # replaces that entire for loop
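
It may also be worth knowing that Perl understands exponent notation when a string is used as a number, so a captured field such as "1e+02" can be used numerically without rebuilding it by hand. A minimal sketch with values like those in the sample file:

```perl
use strict;
use warnings;

# Strings in the style of the e-value column; adding 0 forces numeric context.
my @raw     = ('99', '1e+02', '2e+02', '0.0015');
my @numeric = map { $_ + 0 } @raw;
print join(', ', @numeric), "\n";
```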


$newLigne =
$complete{$complete}[0]."\t".$complete{$complete}[1]."\t".$complete{$complete}[2]."\t".$complete{$complete}[3]."\t".$countPositive/$countTotPositive."\t".$countNegative/$countTotNegative."\t".$complete{$complete}[4]."\t".$complete{$complete}[5]."\n";


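That is simply too horrid to look upon.

A slice plus join says the same thing without making you scream. Note
that your line splices the two running ratios in between elements 3
and 4, so a plain join over the whole array would drop them:

$newLigne = join("\t",
@{ $complete{$complete} }[0 .. 3],
$countPositive / $countTotPositive,
$countNegative / $countTotNegative,
@{ $complete{$complete} }[4, 5]) . "\n";

push @results,$newLigne;


You don't even need the $newLigne temporary variable:

push @results, join("\t",
@{ $complete{$complete} }[0 .. 3],
$countPositive / $countTotPositive,
$countNegative / $countTotNegative,
@{ $complete{$complete} }[4, 5]) . "\n";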


open (F,">results/5out.test");


You should always, yes *always*, check the return value from open():

open (F,">results/5out.test") or
die "could not open 'results/5out.test' $!";
 
Salvador Fandino

Hi,
I am coding a program that parses a file 370Mb. As long as I keep this
number less than a 1000 in this portion :
# basicly tells me until when i should continue to read the file)
if ($ligne =~ m/^.*1000>>>(\w+).*/){
$stop= 1;
}
it works, but as soon as I increase the number (the max number being
2225) so I am not even reading 1/2 of it, the program does not respond.
Does anybody have a suggestion for this ?
thank you,
...

read the file in blocks, sort and save them in temp. files and finally
perform a merge sort:

Untested:

use warnings;
use strict;
use Sort::Key 'keysort_inplace';
use Sort::Key::Merger 'filekeymerger';
use File::Temp ...;

my @lines;
my @tempfn;

sub extract_key {
# extract the key that has to be used for sorting
# from $_, for instance:
/foo: (\w+)/;
$1
}

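sub sort_and_write_block {
&keysort_inplace(\&extract_key, \@lines);
# tempfile() returns (handle, name) in list context;
# File::Temp->new would return a single object instead
my ($fh, $filename) = File::Temp::tempfile(...);
print $fh $_ for @lines;
close $fh;
push @tempfn, $filename;
@lines = ();
}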

while (<>) {
unless ($fh) {
($fh, $fn) = File::Temp->new(...);
}
sort_and_write_block() if @lines > 1000000
}

sort_and_write_block() if @lines;

my $merger = &filekeymerger(\&extract_key, @tempfn);

while (defined (my $line = $merger->())) {
# your lines arrive sorted here,
# do whatever you need with them!
...
}



Cheers,

- Salva
 
Salvador Fandino

Salvador said:
...
while (<>) {
unless ($fh) {
($fh, $fn) = File::Temp->new(...);
}
sort_and_write_block() if @lines > 1000000
}

oops, that should be...

while(<>) {
push @lines, $_;
sort_and_write_block() if @lines > 1000000;
}


Cheers,

- Salva
 
mcvallet

Thank you, everybody.
It seems to work now, not on my computer but on a bigger one, and
it does not take very long either.
Thanks,
mc
 
