Additional regex question

Bryan

I looked in the FAQ, which was helpful, but I still didn't figure out a clean
way to do something more advanced (for me):

I have a file with this kind of stuff in it:

+some identifying string 1
aaabbbbcccccdddd
eeeffffaaabcbcbaad
jjkalddd

+some identifying string 2
ggaadryyyyssaaad
ddddeeeakkkkalllla
asdfffff

I need to process the file and dump the results into a new file. The
file should be processed in the following manner:
1. Any line that starts with '+' should be untouched and dumped to the
new file
2. Any lines that are not empty should be joined with whatever lines are
not empty following them, up to the empty line.
3. The joined line needs to be searched for a pattern and then truncated
after the pattern.

So if my search string was (case insensitive) ddddeee, the output file
would look like this:
+some identifying string 1
aaabbbbcccccddddeee

+some identifying string 2
ggaadryyyyssaaadddddeee

Using index and substr, I can match and get the truncated version of the
joined string... but I am not sure how to loop over my file, and in some
cases use just one line and in others join lines.
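
For reference, the index/substr truncation mentioned here might look something like the sketch below. The sub name truncate_after is hypothetical, and the case-insensitive comparison is done by lowercasing both strings, which assumes plain ASCII data like the sample above.

```perl
use strict;
use warnings;

# Hypothetical helper: cut $text off right after the first occurrence
# of $pattern (compared case-insensitively). Returns $text unchanged
# when the pattern does not occur.
sub truncate_after {
    my ($text, $pattern) = @_;
    my $pos = index(lc($text), lc($pattern));
    return $text if $pos < 0;
    return substr($text, 0, $pos + length($pattern));
}

# The second record from the sample, already joined into one line.
print truncate_after('ggaadryyyyssaaadddddeeeakkkkallllaasdfffff', 'ddddeee'), "\n";
```

This only covers step 3; reading and joining the lines is a separate problem.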

I tried fiddling with $/ = "", or $/ = "+", but couldn't get what I wanted.

Suggestions appreciated,
B
 
Austin P. So (Hae Jin)

Bryan said:
I have a file with this kind of stuff in it:

+some identifying string 1
aaabbbbcccccdddd
eeeffffaaabcbcbaad
jjkalddd

+some identifying string 2
ggaadryyyyssaaad
ddddeeeakkkkalllla
asdfffff

Geez...are you trying to mask your homework problem?

This is a common thing for biological databases...best thing to do is to
convert things from FASTA format into a hash (or an array, I suppose) and
then play with that information as needed...

i.e. convert:
identifier1
sequence
sequence
sequence

identifier2
sequence
sequence
sequence

to:

$hash{$identifier1}=$sequence.$sequence.$sequence...
$hash{$identifier2}=$sequence.$sequence.$sequence...
....

Are you sure this isn't a homework problem or something? Seems like you
are asking everyone to manipulate FASTA sequences for you...

Oh well...in this case, use a flag to indicate a new sequence and read
the lines into a hash as above...
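
A minimal sketch of that flag-plus-hash idea, assuming the '+' header and blank-line layout from the question. The sub name read_records and the in-memory filehandle are only for illustration; a real script would open the actual data file instead.

```perl
use strict;
use warnings;

# Hypothetical helper: read "+header / sequence lines / blank line"
# records from a filehandle into a hash of header => joined sequence.
sub read_records {
    my ($fh) = @_;
    my (%seq, $current);            # $current is the "flag": the record being filled
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\+/) {       # a header line starts a new record
            $current = $line;
            $seq{$current} = '';
        }
        elsif ($line eq '') {       # a blank line ends the record
            undef $current;
        }
        elsif (defined $current) {  # sequence line: append to the current record
            $seq{$current} .= $line;
        }
    }
    return \%seq;
}

my $input = <<'END';
+some identifying string 1
aaabbbbcccccdddd
eeeffffaaabcbcbaad
jjkalddd

+some identifying string 2
ggaadryyyyssaaad
ddddeeeakkkkalllla
asdfffff
END

open my $fh, '<', \$input or die "Cannot open in-memory file: $!";
my $seq = read_records($fh);
close $fh;

print "$_\n$seq->{$_}\n\n" for sort keys %$seq;
```

Note that a hash does not remember the order of the records in the file; if that order matters, keep the headers in an array as well.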

Good luck...

Austin
 
Jay Tilton

: I have a file with this kind of stuff in it:
:
: +some identifying string 1
: aaabbbbcccccdddd
: eeeffffaaabcbcbaad
: jjkalddd
:
: +some identifying string 2
: ggaadryyyyssaaad
: ddddeeeakkkkalllla
: asdfffff
:
: I need to process the file and dump the results into a new file. The
: file should be processed in the following manner:
: 1. Any line that starts with '+' should be untouched and dumped to the
: new file
: 2. Any lines that are not empty should be joined with whatever lines are
: not empty following them, up to the empty line.
: 3. The joined line needs to be searched for a pattern and then truncated
: after the pattern.
:
: So if my search string was (case insensitive) ddddeee, the output file
: would look like this:
: +some identifying string 1
: aaabbbbcccccddddeee
:
: +some identifying string 2
: ggaadryyyyssaaadddddeee
:
: Using index and substr, I can match and get the truncated version of the
: joined string... but I am not sure how to loop over my file, and in some
: cases use just one line and in others join lines.

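#!perl
use warnings;
use strict;
{
    local $/ = '';    # paragraph mode: each read returns one blank-line-separated record
    while (<DATA>) {
        chomp;
        # Print each '+' line as-is and drop it from the list; join the rest.
        (my $lines =
            join '',
            grep !( /^\+/ && print "$_\n" ),
            split /\n/
        ) =~ s/(ddddeee).+/$1/i;    # truncate after the pattern, ignoring case
        print "$lines\n\n";
    }
}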
__DATA__
+some identifying string 1
aaabbbbcccccdddd
eeeffffaaabcbcbaad
jjkalddd

+some identifying string 2
ggaadryyyyssaaad
ddddeeeakkkkalllla
asdfffff
 
Tad McClellan

Bryan said:
I need to process the file and dump the results into a new file. The
file should be processed in the following manner:
1. Any line that starts with '+' should be untouched and dumped to the
new file
2. Any lines that are not empty should be joined with whatever lines are
not empty following them, up to the empty line.
3. The joined line needs to be searched for a pattern and then truncated
after the pattern.

Suggestions appreciated,


------------------------------------------
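#!/usr/bin/perl
use strict;
use warnings;

my $search = 'ddddeee';
local $/ = '';    # enable paragraph mode
while ( <DATA> ) {
    my($first, $rest) = split /\n/, $_, 2;    # header line, then everything else
    $rest =~ tr/\n//d;                        # join the remaining lines
    $rest =~ s/(\Q$search\E).*/$1/i;          # truncate after the pattern, ignoring case
    print "$first\n$rest\n\n";                # blank line between records
}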

__DATA__
+some identifying string 1
aaabbbbcccccdddd
eeeffffaaabcbcbaad
jjkalddd

+some identifying string 2
ggaadryyyyssaaad
ddddeeeakkkkalllla
asdfffff
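
Since the question asks for the result to go into a new file, the same paragraph-mode idea could be wrapped in a sub that takes an input and an output filehandle. This is only a sketch: the sub name process_records is hypothetical, and the in-memory filehandles stand in for real files opened with open.

```perl
use strict;
use warnings;

# Hypothetical helper: copy records from $in to $out, joining the
# sequence lines of each record and truncating after $search.
sub process_records {
    my ($in, $out, $search) = @_;
    local $/ = '';                          # paragraph mode, restored on return
    while (my $record = <$in>) {
        my ($first, $rest) = split /\n/, $record, 2;
        $rest = '' unless defined $rest;    # header-only record
        $rest =~ tr/\n//d;                  # join the sequence lines
        $rest =~ s/(\Q$search\E).*/$1/i;    # literal, case-insensitive match
        print {$out} "$first\n$rest\n\n";
    }
}

# Small made-up input, read from and written to in-memory strings.
my $input  = "+id 1\nAAdd\nddEEExx\nyy\n\n+id 2\nqqq\nrrr\n";
my $output = '';
open my $in,  '<', \$input  or die "Cannot read input: $!";
open my $out, '>', \$output or die "Cannot write output: $!";
process_records($in, $out, 'ddddeee');
close $in;
close $out;

print $output;
```

A record in which the pattern never occurs is written out joined but untruncated.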
 
