File size too big for perl processing

Cheez

Hi, I posted this to perl.beginners as well and will make sure
comments go to both groups.

I have a big file of 16-letter words that I am using as "bait" to
capture larger words in a raw data file. I loop through all of the
rawdata with a single word for 1) matches and 2) to associate the raw
data with the word. I then go to the next line in the word list and
repeat.

hashsequence16.txt is the 16-letter word file (203MB)
rawdata.txt is the raw data file (93MB)

I have a counter in the code to tell me how long it's taking to
process... 9500 hours or so to complete... I definitely have time to
pursue other alternatives.

Scripting with perl is a hobby and not a vocation so I apologize in
advance for ugly code. Any suggestions/comments would be greatly
appreciated.

Thanks,
Dan

========================

print "**fisher**";

$flatfile = "newrawdata.txt";
# 95MB in size

$datafile = "hashsequence16.txt";
# 203MB in size

my $filesize = -s "hashsequence16.txt";
# for use in processing time calculation

open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";
open (SEQFILE, ">fishersearch.txt") || die "Can't open '$seqparsed': $!
\n";

@preparse = <FILE>;
@hashdata = <FILE2>;

close(FILE);
close(FILE2);


for my $list1 (@hashdata) {
# iterating through hash16 data

$finish++;

if ($finish ==10 ) {
# line counter

$marker = $marker + $finish;

$finish =0;

$left = $filesize - $marker;

printf "$left\/$filesize\n";
# this prints every 17 seconds
}

($line, $freq) = split(/\t/, $list1);

for my $rawdata (@preparse) {
# iterating through rawdata

$rawdata=~ s/\n//;

if ($rawdata =~ m/$line/) {
# matching hash16 word with rawdata line

my $first_pos = index $rawdata,$line;

print SEQFILE "$first_pos\t$rawdata\n";
# printing to info to new file

}

}

print SEQFILE "PROCESS\t$line\n";
# printing hash16 word and "process"

}
 

Jim Gibson

Cheez said:
Hi, I posted this to perl.beginners as well and will make sure
comments go to both groups.

I have a big file of 16-letter words that I am using as "bait" to
capture larger words in a raw data file. I loop through all of the
rawdata with a single word for 1) matches and 2) to associate the raw
data with the word. I then go to the next line in the word list and
repeat.

hashsequence16.txt is the 16-letter word file (203MB)

Hmm. How many 16-letter words are in this file? I see from your code
that the file contains the word and a frequency count. Estimating
about 25 bytes per word, that represents roughly 8 million words.

rawdata.txt is the raw data file (93MB)

I have a counter in the code to tell me how long it's taking to
process... 9500 hours or so to complete... I definitely have time to
pursue other alternatives.

Scripting with perl is a hobby and not a vocation so I apologize in
advance for ugly code. Any suggestions/comments would be greatly
appreciated.

Thanks,
Dan

========================

You should have

use strict;
use warnings;

in your program. This is very important if you wish to get help from
this newsgroup.

print "**fisher**";

$flatfile = "newrawdata.txt";
# 95MB in size

$datafile = "hashsequence16.txt";
# 203MB in size

my $filesize = -s "hashsequence16.txt";
# for use in processing time calculation

open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";
open (SEQFILE, ">fishersearch.txt") || die "Can't open '$seqparsed': $!\n";

You should be using lexically-scoped file handle variables, the
3-argument version of open, and 'or' instead of '||'.
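For example (the handle names here are just illustrative):

open( my $raw_fh,  '<', $flatfile )          or die "Can't open '$flatfile': $!";
open( my $hash_fh, '<', $datafile )          or die "Can't open '$datafile': $!";
open( my $seq_fh,  '>', 'fishersearch.txt' ) or die "Can't open fishersearch.txt: $!";

Lexical handles close themselves when they go out of scope, and 'or'
binds more loosely than '||', so the die still applies to the open even
if you drop the parentheses around open's arguments.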
@preparse = <FILE>;
@hashdata = <FILE2>;

Well at least you have enough memory to read the files into memory.
That helps. If you apply the chomp operator to these arrays, you can
save yourself some repetitive processing later:

chomp(@preparse);
chomp(@hashdata);

close(FILE);
close(FILE2);


for my $list1 (@hashdata) {
# iterating through hash16 data


$finish++;

if ($finish ==10 ) {
# line counter

$marker = $marker + $finish;

$finish =0;

$left = $filesize - $marker;

printf "$left\/$filesize\n";
# this prints every 17 seconds
}

When you are asking for help, it is best to leave out irrelevant
details such as periodic printing statements. It doesn't help anybody
help you.

($line, $freq) = split(/\t/, $list1);

for my $rawdata (@preparse) {
# iterating through rawdata

$rawdata=~ s/\n//;

No need for this if you chomp the arrays after reading.

if ($rawdata =~ m/$line/) {
# matching hash16 word with rawdata line

my $first_pos = index $rawdata,$line;

You first use a regex to find if $line appears in $rawdata, then use
index to find out where it appears. Just test the return value from
index to see if the substring appears. It will be -1 if it does not.
This will give you a significant speed-up.
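For example, something along these lines:

my $first_pos = index $rawdata, $line;
if ($first_pos >= 0) {
    # index returns -1 when $line does not occur in $rawdata
    print SEQFILE "$first_pos\t$rawdata\n";
}

That way each line of raw data is scanned only once per word.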
print SEQFILE "$first_pos\t$rawdata\n";
# printing info to new file

}

}

print SEQFILE "PROCESS\t$line\n";
# printing hash16 word and "process"

}

You only make one pass through FILE2, so you can save some memory by
processing the contents of this file one line at a time, instead of
reading it into the @hashdata array. It looks like you could also swap
the order of the for loops and only make one pass through FILE,
instead, but that may take more memory.
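A rough sketch of the line-at-a-time version, keeping your handle and
variable names and assuming @preparse has already been chomped (it also
folds in the index-only test mentioned above):

while (my $list1 = <FILE2>) {
    chomp $list1;
    my ($line, $freq) = split /\t/, $list1;

    for my $rawdata (@preparse) {
        my $first_pos = index $rawdata, $line;
        print SEQFILE "$first_pos\t$rawdata\n" if $first_pos >= 0;
    }

    print SEQFILE "PROCESS\t$line\n";
}
close(FILE2);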

It is difficult to see why this program will take 9500 hours to run.
Make the above changes and try again. Without your data files or a look
at some sample data, it is difficult for anyone to really help you.
 

xhoster

Cheez said:
Hi, I posted this to perl.beginners as well and will make sure
comments go to both groups.

I have a big file of 16-letter words that I am using as "bait" to
capture larger words in a raw data file. I loop through all of the
rawdata with a single word for 1) matches and 2) to associate the raw
data with the word. I then go to the next line in the word list and
repeat.

hashsequence16.txt is the 16-letter word file (203MB)

How many lines? (it seems the 16-letter part is only the first column
of the file, so it is not simply 203MB / 17 bytes)

rawdata.txt is the raw data file (93MB)

How many lines is it?

I have a counter in the code to tell me how long it's taking to
process... 9500 hours or so to complete... I definitely have time to
pursue other alternatives.

Scripting with perl is a hobby and not a vocation so I apologize in
advance for ugly code.

As a hobbyist, you should have the leisure to make it less ugly, while
someone working under the clock might not!

open(FILE, "$flatfile") || die "Can't open '$flatfile': $!\n";
open(FILE2, "$datafile") || die "Can't open '$flatfile': $!\n";

Wrong variable in the die, $datafile not $flatfile

@preparse = <FILE>;
@hashdata = <FILE2>;

Do you have a lot of memory, or is your system swapping? If swapping,
that right there will slow it down dramatically. In this case, if you
close(FILE);
close(FILE2);

for my $list1 (@hashdata) {
$finish++;
if ($finish ==10 ) {
$marker = $marker + $finish;
$finish =0;
$left = $filesize - $marker;

$filesize is in bytes, while $marker is in lines. This isn't gonna give
meaningful information.
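If you want a progress figure that means something, compare lines with
lines, e.g. (just a sketch):

my $total = scalar @hashdata;                # total number of bait words, set once before the loop
print "$marker/$total words processed\n";    # inside the $finish == 10 block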

printf "$left\/$filesize\n";
# this prints every 17 seconds
}

($line, $freq) = split(/\t/, $list1);

for my $rawdata (@preparse) {
$rawdata=~ s/\n//;

This substitution only needs to be done once, not repeated for every
word in @hashdata. Put "chomp @preparse" outside of the loop.

if ($rawdata =~ m/$line/) {

In my test case, I had to add \Q before $line, otherwise the odd
special character in it caused regex syntax errors.
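That is, write the match as:

if ($rawdata =~ m/\Q$line/) {

\Q (quotemeta) makes any regex metacharacters in $line match as literal
text.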
my $first_pos = index $rawdata,$line;

On success, you are doing the search twice. If success is rare, then
of course this is not important speedwise. Get rid of one or the other,
I'd prefer to get rid of the regex and do only the index.


Anyway, I'd write it to load hashdata into a hash (surprise!), and then
probe a 16-byte sliding window of the raw data against that hash.

my %hashdata;
while (<FILE2>) {
    chomp;
    my ($t) = split /\t/;    # first (tab-delimited) column is the 16-letter word
    $hashdata{$t} = ();      # store it as a hash key so we can test it with exists()
}
close(FILE2);

my ($finish, $marker, $left);
while (my $rawdata = <FILE>) {
    chomp $rawdata;
    foreach (0 .. (length $rawdata) - 16) {
        # probe every 16-character window of this line against the hash of bait words
        if (exists $hashdata{ substr $rawdata, $_, 16 }) {
            print SEQFILE "$_\t$rawdata\n";
        }
    }
}

The whole thing takes about a minute on files of about the size you
specified.

Xho

 

Cheez

I just want to thank Sherm, Jim and Xho for their generous code
snippets and suggestions. One overarching theme is the use of "strict"
and "warnings" and I will certainly add those to each script.

Regarding the actual problem to be solved with perl, I initially used
Xho's hash-based script. Once I get this running I will report back
with the actual script. Again thanks for everyone's input and
patience. As a self-learner, these NGs are invaluable!

Thanks,
Dan
 

Cheez

Big and Blue said:
The subject implies that you have a problem that is producing an E2BIG
error (say a file > 2GB or, even, 2^63 bytes - that would be impressive).

Thanks for the comment. I think I lacked the precision to properly
describe my issue but with only a simple modification of my subject -
File Size too Big for Badly Written Perl Script - we're getting
closer ;)

Thanks,
Dan
 

xhoster

Big and Blue said:
The subject implies that you have a problem that is producing an E2BIG
error (say a file > 2GB or, even, 2^63 bytes - that would be impressive).

That was my first thought as well, but I still suspected the question would
be about something else, either memory or speed.

In fact you seem to have a slow algorithm that you expect can be
improved. That is something very different.

I'd say it is close enough. It isn't really reasonable to expect people
to know the answer before they post the question, and I'd say this subject
line is well above the median for the group.

Xho

 
