DNA String Compression For Storing in Data Structure

Discussion in 'C++' started by Gundala Viswanath, Jan 17, 2009.

  1. Hi all,

    I am new to C/C++. I am wondering if there is any
    existing implementation that compresses a string like this
    into a shorter format (e.g. base 64).

    AAAAAAAAAAAAGTCGCGCCGCCGCGGGGAGGAA

    The reason I want to do this is that there are ~10 million such
    tags that I want to process into a matrix. Therefore I need
    to compress each string to make it manageable.

    For example the implementation in R will give this:

    > seq2id("AAAAAAAAAAAAGTCGCGCCGCCGCGGGGAGGAA")

    [1] "IAAAAtmWWaooA

    The R code can be viewed here: http://dpaste.com/110009/

    But I am not sure how to implement this in C/C++.
    Thanks in advance.


    - GV
     
    Gundala Viswanath, Jan 17, 2009
    #1

  2. On Jan 17, 8:12 am, Gundala Viswanath wrote:
    > I am new to C/C++. I am wondering if there is any
    > existing implementation that compresses a string like this
    > into a shorter format (e.g. base 64).
    >
    > AAAAAAAAAAAAGTCGCGCCGCCGCGGGGAGGAA


    I am no expert in DNA, but I understand there are only 4 possible
    symbols: A, C, G, T. In that case 2 bits are enough to encode each
    symbol, which makes a 2-bit encoded sequence 4 times smaller than
    the equivalent string of 8-bit chars. A fixed 2 bits per symbol also
    makes it easy to index a sequence at random positions and to insert
    or extract symbols. I suggest you make a class that uses a
    vector<unsigned> to store the encoded symbol bits and give it a
    vector-like interface that exposes individual symbols as plain chars.
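
    A minimal sketch of that idea follows. The class name PackedDna and
    its member names are hypothetical, and it uses std::uint32_t rather
    than plain unsigned so that each word holds exactly 16 symbols. It
    assumes the input contains only the upper-case letters A, C, G and T
    and throws on anything else.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical class: stores a DNA sequence at 2 bits per base,
// 16 bases per 32-bit word, with vector-like read access.
class PackedDna {
public:
    PackedDna() = default;

    explicit PackedDna(const std::string& seq) {
        for (char c : seq) push_back(c);
    }

    // Append one base; throws std::invalid_argument if it is not A/C/G/T.
    void push_back(char base) {
        const std::uint32_t code = encode(base);  // validate before mutating
        const std::size_t word = size_ / kBasesPerWord;
        const unsigned shift =
            static_cast<unsigned>(2 * (size_ % kBasesPerWord));
        if (word == words_.size()) words_.push_back(0);
        words_[word] |= code << shift;
        ++size_;
    }

    // Read the base at position i as a plain char.
    char operator[](std::size_t i) const {
        assert(i < size_);
        const std::uint32_t w = words_[i / kBasesPerWord];
        const unsigned shift = static_cast<unsigned>(2 * (i % kBasesPerWord));
        return "ACGT"[(w >> shift) & 3u];
    }

    std::size_t size() const { return size_; }  // number of bases

    // Bytes used by the packed payload (excludes vector bookkeeping).
    std::size_t bytes() const { return words_.size() * sizeof(std::uint32_t); }

    // Decode the whole sequence back into a std::string.
    std::string str() const {
        std::string out;
        out.reserve(size_);
        for (std::size_t i = 0; i < size_; ++i) out.push_back((*this)[i]);
        return out;
    }

private:
    static constexpr std::size_t kBasesPerWord = 16;

    static std::uint32_t encode(char base) {
        switch (base) {
            case 'A': return 0;
            case 'C': return 1;
            case 'G': return 2;
            case 'T': return 3;
        }
        throw std::invalid_argument("base must be one of A, C, G, T");
    }

    std::vector<std::uint32_t> words_;
    std::size_t size_ = 0;
};
```

    Constructing PackedDna tag("ACGT") and reading tag[2] gives 'G', and
    tag.str() returns the original string. Each std::vector carries its
    own bookkeeping overhead, so with millions of short tags it may be
    worth packing all tags into one shared buffer instead of one object
    per tag.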
     
    Gert-Jan de Vos, Jan 17, 2009
    #2
