convert protein fasta stream into harsh table

Discussion in 'Perl Misc' started by zhong.huang@gmail.com, Feb 28, 2006.

  1. Guest

    hi,

    Can anyone suggest me a simple way to convert multiple sequences fasta
    (in Bio::SeqIO object) into harsh table (sequence annotation as key,
    sequence as value)?

    The fasta file looks like this:

    >gi|9049352|dbj|BAA99407.1| 3-methylcrotonyl-CoA carboxylase biotin-containing subunit [Homo sapiens]

    MAAASAVSVLLVAAERNRWHRLPSLLLPPRTWVWRQRTMKYTTATGRNITKVLIANRGEIACRVMRTAKKLGVQTVAVYSEADRNSMHVDMADEAYSIGPAPSQQSYLSMEKIIQVAKTSAAQAIHPGCGFLSENMEFAE

    >gi|4504067|ref|NP_002070.1| aspartate aminotransferase 1 [Homo sapiens]

    MAPPSVFAEVPQAQPVLVFKLTADFREDPDPRKVNLGVGAYRTDDCHPWVLPVVKKVEQKIANDNSLNHEYLPILGLAEFRSCASRLALGD

    I want to have the harsh table %seqharsh to hold sequences like this:

    # my %seqharsh = ('seq1', MAAASAVSVL......',
    # 'seq2', MAPPSVFAEVPQ......,);


    My code is like this:


    my $seqio = new Bio::SeqIO(-format => $format,
    -file => $file);

    while ( my $seq = $seqio->next_seq ) {
    if( $seq->alphabet ne 'protein' ) {
    confess("Skipping non protein sequence...");
    next;
    }

    #write code here to assign each entry into harsh %seqharsh
    my $seqharsh{$seq->primary_id} = $seq->seq();
    bla bla bla


    Thank you very much for your help!


    zhong
    , Feb 28, 2006
    #1
    1. Advertising

  2. Samwyse Guest

    wrote:
    > hi,
    >
    > Can anyone suggest me a simple way to convert multiple sequences fasta
    > (in Bio::SeqIO object) into harsh table (sequence annotation as key,
    > sequence as value)?


    They are called HASH tables, not HARSH tables.

    > The fasta file looks like this:
    >
    >
    >>gi|9049352|dbj|BAA99407.1| 3-methylcrotonyl-CoA carboxylase biotin-containing subunit [Homo sapiens]

    >
    > MAAASAVSVLLVAAERNRWHRLPSLLLPPRTWVWRQRTMKYTTATGRNITKVLIANRGEIACRVMRTAKKLGVQTVAVYSEADRNSMHVDMADEAYSIGPAPSQQSYLSMEKIIQVAKTSAAQAIHPGCGFLSENMEFAE
    >
    >
    >>gi|4504067|ref|NP_002070.1| aspartate aminotransferase 1 [Homo sapiens]

    >
    > MAPPSVFAEVPQAQPVLVFKLTADFREDPDPRKVNLGVGAYRTDDCHPWVLPVVKKVEQKIANDNSLNHEYLPILGLAEFRSCASRLALGD
    >
    > I want to have the harsh table %seqharsh to hold sequences like this:
    >
    > # my %seqharsh = ('seq1', MAAASAVSVL......',
    > # 'seq2', MAPPSVFAEVPQ......,);


    I'm not seeing where the 'seq1' and 'seq2' values are coming from in
    your input. If I'm allowed to make up hash keys, the problem is pretty
    simple.

    > My code is like this:
    >
    >
    > my $seqio = new Bio::SeqIO(-format => $format,
    > -file => $file);


    my %seqharsh; # declare the hash table

    > while ( my $seq = $seqio->next_seq ) {
    > if( $seq->alphabet ne 'protein' ) {
    > confess("Skipping non protein sequence...");
    > next;
    > }
    >
    > #write code here to assign each entry into harsh %seqharsh
    > my $seqharsh{$seq->primary_id} = $seq->seq();


    You neither need nor want the 'my'; this will add items to the hash:
    $seqharsh{$seq->primary_id} = $seq->seq;
    Samwyse, Feb 28, 2006
    #2
    1. Advertising

  3. Keith Keller Guest

    On 2006-02-28, <> wrote:
    >
    > Can anyone suggest me a simple way to convert multiple sequences fasta
    > (in Bio::SeqIO object) into harsh table (sequence annotation as key,
    > sequence as value)?


    It's a hash, not harsh (at least in English). I suppose it could be
    a harsh hash.

    > I want to have the harsh table %seqharsh to hold sequences like this:
    >
    > # my %seqharsh = ('seq1', MAAASAVSVL......',
    > # 'seq2', MAPPSVFAEVPQ......,);


    It's not such great style to write it this way, even though
    we know what you mean. Try this instead:

    # my %seqharsh = ('seq1' => 'MAAASAVSVL......',
    # 'seq2' => 'MAPPSVFAEVPQ......',);

    Under use strict the quotes are required on the right-hand side.
    The => behaves like , in this context, but looks nicer for humans.

    > My code is like this:
    >
    >
    > my $seqio = new Bio::SeqIO(-format => $format,
    > -file => $file);
    >
    > while ( my $seq = $seqio->next_seq ) {
    > if( $seq->alphabet ne 'protein' ) {
    > confess("Skipping non protein sequence...");
    > next;
    > }
    >
    > #write code here to assign each entry into harsh %seqharsh
    > my $seqharsh{$seq->primary_id} = $seq->seq();


    This line will not do what you want. AFAIK you can't even my
    a hash key/value. Remove the my and see if you get what you want.
    If running under use strict, as you should, you will want to define
    my %seqharsh (or my %seqhash) before the while loop begins.

    If you have further troubles, you should be sure to post a
    short but complete script that contains your difficulty, as
    well as the expected and actual results, as described in the
    posting guidelines for this newsgroup.

    --keith


    --
    -francisco.ca.us
    (try just my userid to email me)
    AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom
    see X- headers for PGP signature information
    Keith Keller, Feb 28, 2006
    #3
  4. wrote:
    > hi,


    > Can anyone suggest me a simple way to convert multiple sequences fasta
    > (in Bio::SeqIO object) into harsh table (sequence annotation as key,
    > sequence as value)?


    Using Bioperl to read in fasta _is_ an overkill. Bioperl is painfully
    slow. Of course, then you can read 1001 formats, check the sequence for
    correctness etc etc, but if you need to run a bigger job, forget Bioperl.

    > The fasta file looks like this:


    > >gi|9049352|dbj|BAA99407.1| 3-methylcrotonyl-CoA carboxylase biotin-containing subunit [Homo sapiens]

    > MAAASAVSVLLVAAERNRWHRLPSLLLPPRTWVWRQRTMKYTTATGRNITKVLIANRGEIACRVMRTAKKLGVQTVAVYSEADRNSMHVDMADEAYSIGPAPSQQSYLSMEKIIQVAKTSAAQAIHPGCGFLSENMEFAE


    # here is a very simple code; it does not check whether the sequence is
    # correct, but it is fast and does not depend on the Bioperl modules

    my %seq_hash ;
    my $cur_seq ;

    while( <$infile> ) { # you need to open() you file earlier
    chomp ;
    next if( /^(#.*|\s*)$/ ) ; # skip comments and empty lines

    if( /^>(.*)/ ) { # new sequence
    $cur_seq = $1 ;
    next ;
    }

    next unless( $cur_seq ) ; # skip lines that do not belong to a sequence

    $seq_hash{ $cur_seq } .= $_ ;
    }

    > #write code here to assign each entry into harsh %seqharsh
    > my $seqharsh{$seq->primary_id} = $seq->seq();
    > bla bla bla


    (OK, now I see the problem)

    you cannot write "my $blah{foo};". Remove the "my" and it should work
    fine.

    Please use Perl's warnings and strict pragmas ('use strict; use
    warnings;').

    j.

    --
    January Weiner, Feb 28, 2006
    #4
  5. Guest

  6. Guest

  7. Guest

  8. Anno Siegel Guest

    January Weiner <> wrote in comp.lang.perl.misc:

    > # here is a very simple code; it does not check whether the sequence is

    ^
    > # correct, but it is fast and does not depend on the Bioperl modules


    Just a language-related note: "Code" in the sense of "program text"
    is a mass noun that doesn't take an indefinite article and doesn't form
    a plural.

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
    Anno Siegel, Feb 28, 2006
    #8
  9. Anno Siegel <-berlin.de> wrote:
    > Just a language-related note: "Code" in the sense of "program text"
    > is a mass noun that doesn't take an indefinite article and doesn't form
    > a plural.


    Thanks for that! Corrected version: a piece of very simple code.

    j.

    P.S. What does Lublin do at the tu-berlin? :)

    --
    January Weiner, Mar 1, 2006
    #9
  10. Anno Siegel Guest

    January Weiner <> wrote in comp.lang.perl.misc:
    > Anno Siegel <-berlin.de> wrote:
    > > Just a language-related note: "Code" in the sense of "program text"
    > > is a mass noun that doesn't take an indefinite article and doesn't form
    > > a plural.

    >
    > Thanks for that! Corrected version: a piece of very simple code.


    Nicely applied.

    > P.S. What does Lublin do at the tu-berlin? :)


    Nothing at all. I happen to use Polish place names when I get to name
    a machine. There are quite a few short and euphonious ones when you
    look on a map, enough for the foreseeable demand. It means I don't have
    to come up with something original when I name a machine, but it's less
    predictable than flowers.

    Anno


    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
    Anno Siegel, Mar 1, 2006
    #10
    1. Advertising

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. Rasmusson, Lars
    Replies:
    1
    Views:
    755
    popov
    Apr 30, 2004
  2. Chris Lasher
    Replies:
    26
    Views:
    753
    Bengt Richter
    Jan 16, 2005
  3. Jeroen Ruigrok van der Werven

    Re: centre of mass of protein

    Jeroen Ruigrok van der Werven, Jan 9, 2008, in forum: Python
    Replies:
    0
    Views:
    513
    Jeroen Ruigrok van der Werven
    Jan 9, 2008
  4. PeroMHC

    concatenate fasta file

    PeroMHC, Feb 12, 2010, in forum: Python
    Replies:
    3
    Views:
    1,631
    Grant Edwards
    Feb 13, 2010
  5. Sally Wang
    Replies:
    5
    Views:
    342
    Phil Tomson
    Jan 22, 2004
Loading...

Share This Page