Reading chunks from file?


Bryan

Hi, I'm reading in a file in FASTA format:
>header DATADATADATA
DATADATA

>header
DATA

I have been doing this:
open (INFILE, "< $filename") or die "Cannot open [$filename] for read: $!\n";
undef $/;
my @chunks = split(/>/, <INFILE>);
$/ = "\n";
close INFILE;

This works, but this split loses the '>' from the header part of the
file, which I would rather keep for identifying header info later. So
first, why do I lose the '>' on this particular split, and is there
something I can do to keep it? Second, is there a better way to split
this file into chunks than the one I am using?

Thanks,
Bryan
 

Paul Lalli

Bryan said:
This works, but this split loses the '>' from the header part of the
file, which I would rather keep for identifying header info later. So
first, why do I lose the '>' on this particular split, and is there
something I can do to keep it?

Have you read the documentation for split? The answer to both questions
is found within.

perldoc -f split

Bryan said:
Second, is there a better way to split
this file into chunks than I am doing?

Do you need to store the whole file in memory at once? Might it be a
better idea to read one record at a time? Rather than undefining the
input record separator, maybe you want to set that variable to the actual
string which separates your records, and then read the file one record at
a time.

perldoc perlvar
for info on $/
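
A rough sketch of the record-at-a-time idea, assuming every record starts with '>' at the beginning of a line, so that "\n>" can serve as the separator. The sub name for_each_record and the sample data are made up for illustration, and an in-memory filehandle stands in for the real file:

```perl
use strict;
use warnings;

# Hypothetical helper: calls $callback once per record, with the
# leading '>' restored, holding only one record in memory at a time.
sub for_each_record {
    my ($fh, $callback) = @_;
    local $/ = "\n>";            # a record ends where the next "\n>" begins
    while (my $record = <$fh>) {
        chomp $record;           # strips the trailing "\n>" if it is there
        $record =~ s/^>//;       # only the first record still has its own '>'
        next unless $record =~ /\S/;   # skip whitespace-only pieces
        $callback->(">$record");
    }
    return;
}

# Made-up sample data read through an in-memory filehandle.
my $sample = ">header1\nDATADATADATA\nDATADATA\n\n>header2\nDATA\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!\n";

my @records;
for_each_record($fh, sub { push @records, $_[0] });
close $fh;

print "[$_]\n" for @records;
```

Because $/ is set with local inside the sub, the old value comes back automatically when the sub returns, so there is no need to reset it to "\n" by hand.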

Hope this helps,
Paul Lalli
 

ctcgag

Bryan said:
This works, but this split loses the '>' from the header part of the
file, which I would rather keep for identifying header info later. So
first, why do I lose the '>' on this particular split, and is there
something I can do to keep it?

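You lose the '>' because that is what split does: the text matched by
the pattern is treated as the delimiter and thrown away.

You could keep it by using a look-ahead assertion, which matches the
position just before each '>' without consuming it.

split /(?=>)/, <DATA>

If anything (such as whitespace) comes before the first '>', it will
show up as the first element. A zero-width match at the very start of
the string does not produce an empty leading field.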
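
To make the difference concrete, here is a small sketch comparing the two splits on a made-up string standing in for the slurped file contents (the variable names are only for illustration):

```perl
use strict;
use warnings;

# Made-up sample standing in for the slurped file contents.
my $contents = ">header1\nDATADATADATA\nDATADATA\n\n>header2\nDATA\n";

# Plain split: each matched '>' is discarded. The match at the start of
# the string has positive width, so an empty leading field is produced.
my @without = split />/, $contents;

# Look-ahead split: the pattern matches the empty string just before
# each '>', so nothing is consumed and every chunk keeps its '>'.
my @with = split /(?=>)/, $contents;

print scalar(@without), " chunks from split />/\n";
print scalar(@with), " chunks from split /(?=>)/\n";
print "[$_]\n" for @with;
```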

Bryan said:
Second, is there a better way to split
this file into chunks than I am doing?

If the file is big, it would probably be better not to slurp it all
at once. You could set $/ = '>', but then you would have a '>' at the
end of every record (except the last), and not one at the beginning of
every record. (You would also have a blank record as the first one read.)
This is kind of ugly, but what you gonna do?
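
A sketch of what that looks like in practice, and one way to tidy it up, using made-up sample data and an in-memory filehandle in place of the real file. The @raw array exists only so the untouched pieces can be inspected; note that this assumes '>' never appears anywhere except at the start of a header:

```perl
use strict;
use warnings;

# Made-up sample data read through an in-memory filehandle.
my $sample = ">header1\nDATA1\n>header2\nDATA2\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!\n";

my (@raw, @clean);
{
    local $/ = '>';                  # restored when this block ends
    while (my $record = <$fh>) {
        push @raw, $record;          # as read: separator still attached
        chomp(my $body = $record);   # chomp removes the trailing '>' (current $/)
        next if $body eq '';         # the very first read is just '>'
        push @clean, ">$body";       # put the '>' back at the front
    }
}
close $fh;

print "raw:   [$_]\n" for @raw;
print "clean: [$_]\n" for @clean;
```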

Xho
 

Brian McCauley

ctcgag said:
If the file is big, it would probably be better not to slurp it all
at once. You could set $/ = '>', but then you would have a '>' at the
end of every record (except the last), and not one at the beginning of
every record. (You would also have a blank record as the first one read.)
This is kind of ugly, but what you gonna do?

Perhaps File::Stream would help?

 
