convert protein fasta stream into harsh table

Z

zhong.huang

hi,

Can anyone suggest me a simple way to convert multiple sequences fasta
(in Bio::SeqIO object) into harsh table (sequence annotation as key,
sequence as value)?

The fasta file looks like this:
gi|9049352|dbj|BAA99407.1| 3-methylcrotonyl-CoA carboxylase biotin-containing subunit [Homo sapiens] MAAASAVSVLLVAAERNRWHRLPSLLLPPRTWVWRQRTMKYTTATGRNITKVLIANRGEIACRVMRTAKKLGVQTVAVYSEADRNSMHVDMADEAYSIGPAPSQQSYLSMEKIIQVAKTSAAQAIHPGCGFLSENMEFAE

gi|4504067|ref|NP_002070.1| aspartate aminotransferase 1 [Homo sapiens]
MAPPSVFAEVPQAQPVLVFKLTADFREDPDPRKVNLGVGAYRTDDCHPWVLPVVKKVEQKIANDNSLNHEYLPILGLAEFRSCASRLALGD

I want to have the harsh table %seqharsh to hold sequences like this:

# my %seqharsh = ('seq1', MAAASAVSVL......',
# 'seq2', MAPPSVFAEVPQ......,);


My code is like this:


my $seqio = new Bio::SeqIO(-format => $format,
-file => $file);

while ( my $seq = $seqio->next_seq ) {
if( $seq->alphabet ne 'protein' ) {
confess("Skipping non protein sequence...");
next;
}

#write code here to assign each entry into harsh %seqharsh
my $seqharsh{$seq->primary_id} = $seq->seq();
bla bla bla


Thank you very much for your help!


zhong
 
S

Samwyse

hi,

Can anyone suggest me a simple way to convert multiple sequences fasta
(in Bio::SeqIO object) into harsh table (sequence annotation as key,
sequence as value)?

They are called HASH tables, not HARSH tables.
The fasta file looks like this:

gi|9049352|dbj|BAA99407.1| 3-methylcrotonyl-CoA carboxylase biotin-containing subunit [Homo sapiens]
MAAASAVSVLLVAAERNRWHRLPSLLLPPRTWVWRQRTMKYTTATGRNITKVLIANRGEIACRVMRTAKKLGVQTVAVYSEADRNSMHVDMADEAYSIGPAPSQQSYLSMEKIIQVAKTSAAQAIHPGCGFLSENMEFAE


gi|4504067|ref|NP_002070.1| aspartate aminotransferase 1 [Homo sapiens]

MAPPSVFAEVPQAQPVLVFKLTADFREDPDPRKVNLGVGAYRTDDCHPWVLPVVKKVEQKIANDNSLNHEYLPILGLAEFRSCASRLALGD

I want to have the harsh table %seqharsh to hold sequences like this:

# my %seqharsh = ('seq1', MAAASAVSVL......',
# 'seq2', MAPPSVFAEVPQ......,);

I'm not seeing where the 'seq1' and 'seq2' values are coming from in
your input. If I'm allowed to make up hash keys, the problem is pretty
simple.
My code is like this:


my $seqio = new Bio::SeqIO(-format => $format,
-file => $file);

my %seqharsh; # declare the hash table
while ( my $seq = $seqio->next_seq ) {
if( $seq->alphabet ne 'protein' ) {
confess("Skipping non protein sequence...");
next;
}

#write code here to assign each entry into harsh %seqharsh
my $seqharsh{$seq->primary_id} = $seq->seq();

You neither need nor want the 'my'; this will add items to the hash:
$seqharsh{$seq->primary_id} = $seq->seq;
 
K

Keith Keller

Can anyone suggest me a simple way to convert multiple sequences fasta
(in Bio::SeqIO object) into harsh table (sequence annotation as key,
sequence as value)?

It's a hash, not harsh (at least in English). I suppose it could be
a harsh hash.
I want to have the harsh table %seqharsh to hold sequences like this:

# my %seqharsh = ('seq1', MAAASAVSVL......',
# 'seq2', MAPPSVFAEVPQ......,);

It's not such great style to write it this way, even though
we know what you mean. Try this instead:

# my %seqharsh = ('seq1' => 'MAAASAVSVL......',
# 'seq2' => 'MAPPSVFAEVPQ......',);

Under use strict the quotes are required on the right-hand side.
The => behaves like , in this context, but looks nicer for humans.
My code is like this:


my $seqio = new Bio::SeqIO(-format => $format,
-file => $file);

while ( my $seq = $seqio->next_seq ) {
if( $seq->alphabet ne 'protein' ) {
confess("Skipping non protein sequence...");
next;
}

#write code here to assign each entry into harsh %seqharsh
my $seqharsh{$seq->primary_id} = $seq->seq();

This line will not do what you want. AFAIK you can't even my
a hash key/value. Remove the my and see if you get what you want.
If running under use strict, as you should, you will want to define
my %seqharsh (or my %seqhash) before the while loop begins.

If you have further troubles, you should be sure to post a
short but complete script that contains your difficulty, as
well as the expected and actual results, as described in the
posting guidelines for this newsgroup.

--keith
 
J

January Weiner

Can anyone suggest me a simple way to convert multiple sequences fasta
(in Bio::SeqIO object) into harsh table (sequence annotation as key,
sequence as value)?

Using Bioperl to read in fasta _is_ an overkill. Bioperl is painfully
slow. Of course, then you can read 1001 formats, check the sequence for
correctness etc etc, but if you need to run a bigger job, forget Bioperl.
The fasta file looks like this:
gi|9049352|dbj|BAA99407.1| 3-methylcrotonyl-CoA carboxylase biotin-containing subunit [Homo sapiens]
MAAASAVSVLLVAAERNRWHRLPSLLLPPRTWVWRQRTMKYTTATGRNITKVLIANRGEIACRVMRTAKKLGVQTVAVYSEADRNSMHVDMADEAYSIGPAPSQQSYLSMEKIIQVAKTSAAQAIHPGCGFLSENMEFAE

# here is a very simple code; it does not check whether the sequence is
# correct, but it is fast and does not depend on the Bioperl modules

my %seq_hash ;
my $cur_seq ;

while( <$infile> ) { # you need to open() you file earlier
chomp ;
next if( /^(#.*|\s*)$/ ) ; # skip comments and empty lines

if( /^>(.*)/ ) { # new sequence
$cur_seq = $1 ;
next ;
}

next unless( $cur_seq ) ; # skip lines that do not belong to a sequence

$seq_hash{ $cur_seq } .= $_ ;
}
#write code here to assign each entry into harsh %seqharsh
my $seqharsh{$seq->primary_id} = $seq->seq();
bla bla bla

(OK, now I see the problem)

you cannot write "my $blah{foo};". Remove the "my" and it should work
fine.

Please use Perl's warnings and strict pragmas ('use strict; use
warnings;').

j.

--
 
A

Anno Siegel

January Weiner said:
# here is a very simple code; it does not check whether the sequence is ^
# correct, but it is fast and does not depend on the Bioperl modules

Just a language-related note: "Code" in the sense of "program text"
is a mass noun that doesn't take an indefinite article and doesn't form
a plural.

Anno
 
J

January Weiner

Anno Siegel said:
Just a language-related note: "Code" in the sense of "program text"
is a mass noun that doesn't take an indefinite article and doesn't form
a plural.

Thanks for that! Corrected version: a piece of very simple code.

j.

P.S. What does Lublin do at the tu-berlin? :)

--
 
A

Anno Siegel

January Weiner said:
Thanks for that! Corrected version: a piece of very simple code.

Nicely applied.
P.S. What does Lublin do at the tu-berlin? :)

Nothing at all. I happen to use Polish place names when I get to name
a machine. There are quite a few short and euphonious ones when you
look on a map, enough for the foreseeable demand. It means I don't have
to come up with something original when I name a machine, but it's less
predictable than flowers.

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,766
Messages
2,569,569
Members
45,045
Latest member
DRCM

Latest Threads

Top