Cheez
Hi, I posted this to perl.beginners as well and will make sure
comments go to both groups.
I have a big file of 16-letter words that I am using as "bait" to
capture larger words in a raw data file. For each word, I loop through
all of the raw data to 1) find matches and 2) associate the matching
raw data with the word. I then go to the next line in the word list
and repeat.
hashsequence16.txt is the 16-letter word file (203MB)
rawdata.txt is the raw data file (93MB)
I have a counter in the code that tells me how long processing is
taking: about 9500 hours to complete. I definitely have time to
pursue other alternatives.
Scripting with Perl is a hobby and not a vocation, so I apologize in
advance for ugly code. Any suggestions or comments would be greatly
appreciated.
Thanks,
Dan
========================
use strict;
use warnings;

print "**fisher**\n";

my $flatfile = "newrawdata.txt";        # raw data, 95MB in size
my $datafile = "hashsequence16.txt";    # 16-letter words, 203MB in size
my $outfile  = "fishersearch.txt";

open(my $raw_fh,  '<', $flatfile) || die "Can't open '$flatfile': $!\n";
open(my $word_fh, '<', $datafile) || die "Can't open '$datafile': $!\n";
open(my $out_fh,  '>', $outfile)  || die "Can't open '$outfile': $!\n";

my @preparse = <$raw_fh>;
my @hashdata = <$word_fh>;
close($raw_fh);
close($word_fh);

chomp(@preparse, @hashdata);    # strip newlines once, not on every pass

my $total = @hashdata;          # word count, for the progress display
my $done  = 0;

for my $list1 (@hashdata) {
    # iterating through hash16 data
    $done++;
    if ($done % 10 == 0) {
        # line counter: words left / total words
        my $left = $total - $done;
        print "$left/$total\n";
        # this prints every 17 seconds
    }
    my ($line, $freq) = split(/\t/, $list1);
    next unless defined $line && length $line;    # skip blank lines
    for my $rawdata (@preparse) {
        # iterating through rawdata
        # index does a literal search and returns -1 when there is no
        # match, so a separate regex test is not needed
        my $first_pos = index($rawdata, $line);
        if ($first_pos >= 0) {
            # printing position and raw line to the new file
            print $out_fh "$first_pos\t$rawdata\n";
        }
    }
    print $out_fh "PROCESS\t$line\n";
    # printing hash16 word and "process"
}
close($out_fh);
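
For comparison, here is a rough sketch of one alternative to the
nested loop above. It is not tested against the real files. It assumes
that every bait word is exactly 16 characters long and that the word
list fits in memory as a hash. The find_hits helper and the sample
data are made up for illustration. Instead of scanning all of the raw
data once per word, it walks each raw line once and looks up every
16-character window in the hash:

```perl
use strict;
use warnings;

# Assumption: every bait word has exactly this many characters.
my $WORD_LEN = 16;

# Hypothetical helper: returns a hash ref mapping each bait word that
# was found to a list of "position<TAB>raw line" strings.
sub find_hits {
    my ($bait, $raw_lines) = @_;    # hash ref of words, array ref of lines
    my %hits;
    for my $raw (@$raw_lines) {
        my $last = length($raw) - $WORD_LEN;    # negative for short lines
        for my $pos (0 .. $last) {              # empty range if $last < 0
            my $window = substr($raw, $pos, $WORD_LEN);
            push @{ $hits{$window} }, "$pos\t$raw"
                if exists $bait->{$window};
        }
    }
    return \%hits;
}

# Made-up sample data, for illustration only.
my %bait = map { $_ => 1 } qw(ACGTACGTACGTACGT TTTTGGGGCCCCAAAA);
my @raw  = ('xxACGTACGTACGTACGTyy', 'no match here', 'TTTTGGGGCCCCAAAA');

my $hits = find_hits(\%bait, \@raw);
for my $word (sort keys %bait) {
    print "$_\n" for @{ $hits->{$word} || [] };
    print "PROCESS\t$word\n";
}
```

Unlike the loop above, this reports every position where a word
occurs in a line, not just the first one, and it holds all hits in
memory until the end, which may or may not be practical at this
data size.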