Organize large DNA txt files

Discussion in 'Python' started by thomasvangurp@gmail.com, Mar 20, 2009.

  1. Guest

    Dear Fellow programmers,

    I'm using Python scripts to organize some rather large datasets
    describing DNA variation. Information is read, processed and written
    to a file in sequential order, like this:
    1+
    1-
    2+
    2-

    etc. The files that I created contain positional information
    (nucleotide position) and some other info, like this:

    file 1+:
    --------------------------------------------
    1 73 0 1 0 0
    1 76 1 0 0 0
    1 77 0 1 0 0
    --------------------------------------------
    file 1-
    --------------------------------------------
    1 74 0 0 6 0
    1 78 0 0 4 0
    1 89 0 0 0 2

    Now the trick is that I want this:

    File 1+ AND File 1-
    --------------------------------------------
    1 73 0 1 0 0
    1 74 0 0 6 0
    1 76 1 0 0 0
    1 77 0 1 0 0
    1 78 0 0 4 0
    1 89 0 0 0 2
    -------------------------------------------

    So the information should be sorted by position. Right now I've
    written some very complicated scripts that read a number of lines from
    file 1- and 1+ and then combine the output. The problem is of course
    that the running number of file 1- can be lower than that of 1+,
    resulting in an incorrect order. Since both files are too large to load
    into a dictionary at once (both are 100 MB+), I need some sort of
    alternative that can quickly sort everything without crashing my PC.

    Your thoughts are appreciated..
    Kind regards,
    Thomas
     
    , Mar 20, 2009
    #1

  2. MRAB Guest

    Here's my attempt:

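    # input_1 and input_2 are open input files, each already sorted by
    # position (the second column); output is a file open for writing.
    line_1 = input_1.readline()
    line_2 = input_2.readline()
    # Merge while both files still have lines left.
    while line_1 and line_2:
        pos_1 = int(line_1.split(None, 2)[1])
        pos_2 = int(line_2.split(None, 2)[1])
        if pos_1 < pos_2:
            output.write(line_1)
            line_1 = input_1.readline()
        else:
            output.write(line_2)
            line_2 = input_2.readline()
    # One file is exhausted; copy whatever remains of the other.
    while line_1:
        output.write(line_1)
        line_1 = input_1.readline()
    while line_2:
        output.write(line_2)
        line_2 = input_2.readline()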
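
For comparison, the same two-way merge can probably be sketched with the standard library's heapq.merge, which reads its inputs lazily and keeps only one pending line per input in memory. This is a sketch under assumptions, not the code above: the names position and merge_by_position are made up for illustration, the in-memory StringIO objects stand in for the real files 1+ and 1-, each input is assumed to be already sorted by its second column, and the key argument needs Python 3.5 or later.

```python
import heapq
import io


def position(line):
    """Return the nucleotide position (second whitespace-separated column)."""
    return int(line.split(None, 2)[1])


def merge_by_position(inputs, output):
    """Merge already-sorted line iterables into output, ordered by position.

    heapq.merge is lazy, so only one pending line per input is held in memory.
    """
    for line in heapq.merge(*inputs, key=position):
        output.write(line)


# Hypothetical stand-ins for files 1+ and 1-; real code would pass open files.
plus = io.StringIO("1 73 0 1 0 0\n1 76 1 0 0 0\n1 77 0 1 0 0\n")
minus = io.StringIO("1 74 0 0 6 0\n1 78 0 0 4 0\n1 89 0 0 0 2\n")
merged = io.StringIO()

merge_by_position([plus, minus], merged)
print(merged.getvalue(), end="")
```

With real data, the StringIO objects would be replaced by files opened for reading and writing. The same function should also accept more than two inputs.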
     
    MRAB, Mar 20, 2009
    #2

  3. Guest

    Thanks,
    This works great!
    I did not know that it is possible to iterate through the lines of a
    file with a while loop whose condition depends on whether more lines
    are present.
     
    , Mar 20, 2009
    #3
  4. MRAB Guest

    It relies on file.readline() returning an empty string when it's at the
    end of the file (and that's the only time it does) and empty strings
    being treated as False by 'while' (and non-empty strings being treated
    as True). It's all in the docs! :)
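
A small sketch of that behaviour, using an in-memory io.StringIO as a stand-in for a real file (the sample contents are made up for illustration). It shows that a blank line inside the file comes back as "\n", which is non-empty and therefore true, so only the real end of file stops the loop:

```python
import io

# In-memory stand-in for a real file; note the blank line in the middle.
f = io.StringIO("1 73 0 1 0 0\n\n1 76 1 0 0 0\n")

lines = []
line = f.readline()
while line:                  # "" (end of file) is falsy, so the loop stops
    lines.append(line)
    line = f.readline()

print(lines)                 # the blank line appears as "\n", not ""
print(repr(f.readline()))    # at end of file, readline() keeps returning ''
```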
     
    MRAB, Mar 20, 2009
    #4
  5. Daniel Fetchinson Guest

    Have you considered using a lightweight database solution? SQLite is a
    really simple, zero-configuration, serverless database, and a Python
    binding for it comes with Python itself. I'd give it a try; it will
    simplify tasks like these a great deal.

    http://docs.python.org/library/sqlite3.html
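
A minimal sketch of what this might look like with the standard sqlite3 module. The table name variation and the column names (seq, pos, c1 to c4) are invented for illustration, since the thread does not say what the columns mean, and the sample rows are the ones from the original question. An in-memory database is used here; passing a file path to sqlite3.connect instead would keep the data on disk rather than in RAM.

```python
import sqlite3

# Sample rows from the question; the real input would be read line by line.
plus_rows = ["1 73 0 1 0 0", "1 76 1 0 0 0", "1 77 0 1 0 0"]
minus_rows = ["1 74 0 0 6 0", "1 78 0 0 4 0", "1 89 0 0 0 2"]

conn = sqlite3.connect(":memory:")  # a file path here would store it on disk
conn.execute(
    "CREATE TABLE variation "
    "(seq INTEGER, pos INTEGER, c1 INTEGER, c2 INTEGER, c3 INTEGER, c4 INTEGER)"
)


def load(lines):
    """Insert whitespace-separated integer rows into the (hypothetical) table."""
    rows = (tuple(int(x) for x in line.split()) for line in lines)
    conn.executemany("INSERT INTO variation VALUES (?, ?, ?, ?, ?, ?)", rows)


load(plus_rows)
load(minus_rows)
conn.commit()

# Let the database do the sorting.
ordered = conn.execute("SELECT * FROM variation ORDER BY seq, pos").fetchall()
for row in ordered:
    print(" ".join(str(x) for x in row))
```

Because executemany accepts an iterator, the rows can be streamed from an open file without building a full list first.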

    Cheers,
    Daniel


     
    Daniel Fetchinson, Mar 20, 2009
    #5
  6. Paul McGuire Guest

    I second Daniel's recommendation of using SQLite. Just as easily as
    you create output files 1+ and 1-, you can work with an SQLite database
    file. If you are worried that your database file may not be as easy
    to read as using Notepad on 1+ and 1-, you can download the freeware
    SQLite Database Browser (http://sqlitebrowser.sourceforge.net/); think
    of it as the Notepad for SQLite database files.

    -- Paul
     
    Paul McGuire, Mar 20, 2009
    #6