What strategy for random accession of records in massive FASTA file?

Discussion in 'Python' started by Chris Lasher, Jan 12, 2005.

  1. Chris Lasher

    Chris Lasher Guest

    Hello,
    I have a rather large (100+ MB) FASTA file from which I need to
    access records in a random order. The FASTA format is a standard format
    for storing molecular biological sequences. Each record contains a
    header line for describing the sequence that begins with a '>'
    (right-angle bracket) followed by lines that contain the actual
    sequence data. Three example FASTA records are below:

    >CW127_A01

    TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    GCATTAAACAT
    >CW127_A02

    TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
    >CW127_A03

    TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    GCATTAAACATTCCGCCTGGG
    ....
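    For illustration, records in this layout can be parsed into
    (header, sequence) pairs with a short generator. This is a minimal
    sketch, not code from any library; the function name and the sample
    data are invented:

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Toy input standing in for a real file object.
sample = [
    ">CW127_A01",
    "TGCAGTCGAACGAGAACG",
    "GCATTAAACAT",
    ">CW127_A02",
    "TATAGTTAATCTGCCC",
]
records = dict(parse_fasta(sample))
```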

    Since the file I'm working with contains tens of thousands of these
    records, I believe I need to find a way to hash this file such that I
    can retrieve the respective sequence more quickly than I could by
    parsing through the file request-by-request. However, I'm very new to
    Python and am still very low on the learning curve for programming and
    algorithms in general; while I'm certain there are ubiquitous
    algorithms for this type of problem, I don't know what they are or
    where to look for them. So I turn to the gurus and accost you for help
    once again. :) If you could help me figure out how to code a solution
    that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
    keep it in Python only, even though I know interaction with a
    relational database would provide the fastest method--the group I'm
    trying to write this for does not have access to a RDBMS.)
    Thanks very much in advance,
    Chris
    Chris Lasher, Jan 12, 2005
    #1

  2. Re: What strategy for random accession of records in massive FASTA file?

    Chris Lasher wrote:

    > Since the file I'm working with contains tens of thousands of these
    > records, I believe I need to find a way to hash this file such that I
    > can retrieve the respective sequence more quickly than I could by
    > parsing through the file request-by-request. However, I'm very new to
    > Python and am still very low on the learning curve for programming and
    > algorithms in general; while I'm certain there are ubiquitous
    > algorithms for this type of problem, I don't know what they are or
    > where to look for them. So I turn to the gurus and accost you for help
    > once again. :) If you could help me figure out how to code a solution
    > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
    > keep it in Python only, even though I know interaction with a
    > relational database would provide the fastest method--the group I'm
    > trying to write this for does not have access to a RDBMS.)


    keeping an index in memory might be reasonable. the following class
    creates an index file by scanning the FASTA file, and uses the "marshal"
    module to save it to disk. if the index file already exists, it's used as is.
    to regenerate the index, just remove the index file, and run the program
    again.

    import os, marshal

    class FASTA:

        def __init__(self, file):
            self.file = open(file)
            self.checkindex()

        def __getitem__(self, key):
            try:
                pos = self.index[key]
            except KeyError:
                raise IndexError("no such item")
            else:
                f = self.file
                f.seek(pos)
                header = f.readline()
                assert header.startswith(">" + key)
                data = []
                while 1:
                    line = f.readline()
                    if not line or line[0] == ">":
                        break
                    data.append(line)
                return data

        def checkindex(self):
            indexfile = self.file.name + ".index"
            try:
                self.index = marshal.load(open(indexfile, "rb"))
            except IOError:
                print "building index..."
                index = {}
                # scan the file
                f = self.file
                f.seek(0)
                while 1:
                    pos = f.tell()
                    line = f.readline()
                    if not line:
                        break
                    if line[0] == ">":
                        # save offset to header line
                        header = line[1:].strip()
                        index[header] = pos
                # save index to disk
                f = open(indexfile, "wb")
                marshal.dump(index, f)
                f.close()
                self.index = index

    db = FASTA("myfastafile.dat")
    print db["CW127_A02"]

    ['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG...

    tweak as necessary.

    </F>
    Fredrik Lundh, Jan 12, 2005
    #2

  3. Chris Lasher

    John Lenton Guest

    > If you could help me figure out how to code a solution
    > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
    > keep it in Python only, even though I know interaction with a
    > relational database would provide the fastest method--the group I'm
    > trying to write this for does not have access to a RDBMS.)

    You don't need a RDBMS; I'd put it in a DBM or CDB myself.
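    A hedged sketch of that approach with the stdlib `dbm` module (modern
    Python syntax; the paths and toy data below are invented for
    illustration), storing byte offsets keyed by header so a lookup can
    seek straight to the record:

```python
import dbm
import os
import tempfile

# Toy FASTA file (invented data) to index.
fasta_path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(fasta_path, "w") as f:
    f.write(">CW127_A01\nTGCAGTCGAA\n>CW127_A02\nGCATTAAACAT\n")

# Build the index once: header -> byte offset of its '>' line.
index_path = fasta_path + ".idx"
with open(fasta_path, "rb") as f, dbm.open(index_path, "c") as idx:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith(b">"):
            idx[line[1:].strip()] = str(pos).encode()

def fetch(name):
    """Random access: look up the offset, seek, read one record."""
    with dbm.open(index_path, "r") as idx:
        pos = int(idx[name.encode()])
    with open(fasta_path, "rb") as f:
        f.seek(pos)
        f.readline()  # skip the header line itself
        chunks = []
        for line in f:
            if line.startswith(b">"):
                break
            chunks.append(line.strip().decode())
    return "".join(chunks)
```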
    John Lenton, Jan 12, 2005
    #3
  4. Chris Lasher

    Larry Bates Guest

    Re: What strategy for random accession of records in massive FASTA file?

    You don't say how this will be used, but here goes:

    1) Read the records and put into dictionary with key
    of sequence (from header) and data being the sequence
    data. Use shelve to store the dictionary for subsequent
    runs (if load time is excessive).

    2) Take a look at Gadfly (gadfly.sourceforge.net). It
    provides you with Python SQL-like database and may be
    better solution if data is basically static and you
    do lots of processing.

    All depends on how you use the data.
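    Option 1 might look like this with `shelve` (a hedged sketch; the keys
    and sequences are toy stand-ins, and a real run would fill the shelf
    by parsing the FASTA file once):

```python
import os
import shelve
import tempfile

shelf_path = os.path.join(tempfile.mkdtemp(), "seqs")

# First run: parse the FASTA file once and shelve {header: sequence}.
# (Toy data stands in for the parsed records.)
with shelve.open(shelf_path) as db:
    db["CW127_A01"] = "TGCAGTCGAA"
    db["CW127_A02"] = "GCATTAAACAT"

# Subsequent runs: open the shelf and use it like a dict,
# without re-parsing the 100+ MB file.
with shelve.open(shelf_path) as db:
    seq = db["CW127_A02"]
```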

    Regards,
    Larry Bates
    Syscon, Inc.

    Chris Lasher wrote:
    > Hello,
    > I have a rather large (100+ MB) FASTA file from which I need to
    > access records in a random order. The FASTA format is a standard format
    > for storing molecular biological sequences. Each record contains a
    > header line for describing the sequence that begins with a '>'
    > (right-angle bracket) followed by lines that contain the actual
    > sequence data. Three example FASTA records are below:
    >
    >
    >>CW127_A01

    >
    > TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    > TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    > GCATTAAACAT
    >
    >>CW127_A02

    >
    > TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    > TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    > GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
    >
    >>CW127_A03

    >
    > TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    > TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    > GCATTAAACATTCCGCCTGGG
    > ...
    >
    > Since the file I'm working with contains tens of thousands of these
    > records, I believe I need to find a way to hash this file such that I
    > can retrieve the respective sequence more quickly than I could by
    > parsing through the file request-by-request. However, I'm very new to
    > Python and am still very low on the learning curve for programming and
    > algorithms in general; while I'm certain there are ubiquitous
    > algorithms for this type of problem, I don't know what they are or
    > where to look for them. So I turn to the gurus and accost you for help
    > once again. :) If you could help me figure out how to code a solution
    > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
    > keep it in Python only, even though I know interaction with a
    > relational database would provide the fastest method--the group I'm
    > trying to write this for does not have access to a RDBMS.)
    > Thanks very much in advance,
    > Chris
    >
    Larry Bates, Jan 12, 2005
    #4
  5. In article <>, Chris Lasher wrote:
    > Hello,
    > I have a rather large (100+ MB) FASTA file from which I need to
    > access records in a random order. The FASTA format is a standard format
    > for storing molecular biological sequences. Each record contains a
    > header line for describing the sequence that begins with a '>'
    > (right-angle bracket) followed by lines that contain the actual
    > sequence data. Three example FASTA records are below:


    Use biopython. They have dictionary-style classes which wrap FASTA files using indexes.

    http://www.biopython.org

    Dave
    David E. Konerding DSD staff, Jan 13, 2005
    #5
  6. Chris Lasher

    John Machin Guest

    Chris Lasher wrote:
    > Hello,
    > I have a rather large (100+ MB) FASTA file from which I need to
    > access records in a random order. The FASTA format is a standard format
    > for storing molecular biological sequences. Each record contains a
    > header line for describing the sequence that begins with a '>'
    > (right-angle bracket) followed by lines that contain the actual
    > sequence data. Three example FASTA records are below:
    >
    > >CW127_A01

    > TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    > TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    > GCATTAAACAT

    [snip]
    > Since the file I'm working with contains tens of thousands of these
    > records, I believe I need to find a way to hash this file such that I
    > can retrieve the respective sequence more quickly than I could by
    > parsing through the file request-by-request. However, I'm very new to
    > Python and am still very low on the learning curve for programming and
    > algorithms in general; while I'm certain there are ubiquitous
    > algorithms for this type of problem, I don't know what they are or
    > where to look for them. So I turn to the gurus and accost you for help
    > once again. :) If you could help me figure out how to code a solution
    > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
    > keep it in Python only, even though I know interaction with a
    > relational database would provide the fastest method--the group I'm
    > trying to write this for does not have access to a RDBMS.)
    > Thanks very much in advance,
    > Chris


    Before you get too carried away, how often do you want to do this and
    how grunty is the box you will be running on? Will the data be on a
    server? If the server is on a WAN or at the other end of a radio link
    between buildings, you definitely need an index so that you can access
    the data randomly!

    By way of example, to read all of a 157MB file into memory from a local
    (i.e. not networked) disk using readlines() takes less than 4 seconds
    on a 1.4Ghz Athlon processor (see below). The average new corporate
    desktop box is about twice as fast as that. Note that Windows Task
    Manager showed 100% CPU utilisation for both read() and readlines().

    My guess is that you don't need anything much fancier than the effbot's
    index method -- which by now you have probably found works straight out
    of the box and is more than fast enough for your needs.

    BTW, you need to clarify "don't have access to an RDBMS" ... surely
    this can only be due to someone stopping them from installing good free
    software freely available on the Internet.

    HTH,
    John

    C:\junk>python -m timeit -n 1 -r 6 "print
    len(file('bigfile.csv').read())"
    157581595
    157581595
    157581595
    157581595
    157581595
    157581595
    1 loops, best of 6: 3.3e+006 usec per loop

    C:\junk>python -m timeit -n 1 -r 6 "print
    len(file('bigfile.csv').readlines())"
    1118870
    1118870
    1118870
    1118870
    1118870
    1118870
    1 loops, best of 6: 3.57e+006 usec per loop
    John Machin, Jan 13, 2005
    #6
  7. On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <> wrote:

    >Hello,
    >I have a rather large (100+ MB) FASTA file from which I need to
    >access records in a random order. The FASTA format is a standard format
    >for storing molecular biological sequences. Each record contains a
    >header line for describing the sequence that begins with a '>'
    >(right-angle bracket) followed by lines that contain the actual
    >sequence data. Three example FASTA records are below:
    >

    Others have probably solved your basic problem, or pointed
    the way. I'm just curious.

    Given that the information content is 2 bits per character
    that is taking up 8 bits of storage, there must be a good reason
    for storing and/or transmitting them this way? I.e., it is easy
    to think up a count-prefixed compressed format packing 4:1 in
    subsequent data bytes (except for the last byte, which may have
    fewer than 4 2-bit codes).

    I'm wondering how the data is actually used once records are
    retrieved. (but I'm too lazy to explore the biopython.org link).
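    The count-prefixed 4:1 packing described above can be sketched in a
    few lines (a purely hypothetical format, not anything in actual use;
    a 4-byte length prefix lets the trailing partial byte decode
    unambiguously):

```python
# Hypothetical 2-bit packing: A, C, G, T -> 0..3, four bases per byte.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    out = bytearray(len(seq).to_bytes(4, "big"))  # length prefix
    byte = 0
    for i, base in enumerate(seq):
        byte = (byte << 2) | CODE[base]
        if i % 4 == 3:
            out.append(byte)
            byte = 0
    if len(seq) % 4:
        byte <<= 2 * (4 - len(seq) % 4)  # left-align the partial byte
        out.append(byte)
    return bytes(out)

def unpack(data):
    n = int.from_bytes(data[:4], "big")
    bases = []
    for byte in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:n])
```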

    >>CW127_A01

    >TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    >TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    >GCATTAAACAT
    >>CW127_A02

    >TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    >TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    >GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
    >>CW127_A03

    >TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
    >TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
    >GCATTAAACATTCCGCCTGGG
    >...



    Regards,
    Bengt Richter
    Bengt Richter, Jan 13, 2005
    #7
  8. Chris Lasher

    Chris Lasher Guest

    >Before you get too carried away, how often do you want to do this and
    >how grunty is the box you will be running on?


    Oops, I should have specified this. The script will only need to be run
    once every three or four months, when the sequences are updated. I'll
    be running it on boxes that are 3GHz/1GB RAM, but others may not be
    so fortunate, and so I'd like to keep that in mind.

    >BTW, you need to clarify "don't have access to an RDBMS" ... surely
    >this can only be due to someone stopping them from installing good
    >free software freely available on the Internet.


    I understand your and others' sentiment on this. I agree, the
    open-source database systems are wonderful. However, keeping it
    Python-only saves me hassle by only having to assist in instances where
    others need help downloading and installing Python. I suppose if I keep
    it in Python, I can even use Py2exe to generate an executable that
    wouldn't even require them to install Python. A solution using
    interaction with a database is much sexier, but, for the purposes of
    the script, seems unnecessary. However, I certainly appreciate the
    suggestions.

    >My guess is that you don't need anything much fancier than the
    >effbot's index method -- which by now you have probably found works
    >straight out of the box and is more than fast enough for your needs.


    I think Mr. Lundh's code will work very well for these purposes. Many
    thanks to him for posting it; he'll have full credit for that part of
    the code. Thanks very much to all who replied!

    Chris
    Chris Lasher, Jan 13, 2005
    #8
  9. Chris Lasher

    Chris Lasher Guest

    Thanks for your reply, Larry. I thought about this, but I'm worried the
    dictionary will consume a lot of resources. I think my 3GHz/1GB RAM box
    could handle the load fine, but I'm not sure about others' systems.
    Chris
    Chris Lasher, Jan 13, 2005
    #9
  10. Chris Lasher

    Chris Lasher Guest

    >Others have probably solved your basic problem, or pointed
    >the way. I'm just curious.


    >Given that the information content is 2 bits per character
    >that is taking up 8 bits of storage, there must be a good reason
    >for storing and/or transmitting them this way? I.e., it it easy
    >to think up a count-prefixed compressed format packing 4:1 in
    >subsequent data bytes (except for the last byte which have
    >less than 4 2-bit codes).


    My guess for the inefficiency in storage size is because it is
    human-readable, and because most in-silico molecular biology is just a
    bunch of fancy string algorithms. This is my limited view of these
    things at least.

    >I'm wondering how the data is actually used once records are
    >retrieved.


    This one I can answer. For my purposes, I'm just organizing the
    sequences at hand, but there are all sorts of things one could actually
    do with sequences: alignments, BLAST searches, gene annotations, etc.
    Chris Lasher, Jan 13, 2005
    #10
  11. Chris Lasher

    Jeff Shannon Guest

    Re: What strategy for random accession of records in massive FASTA file?

    Chris Lasher wrote:

    >>Given that the information content is 2 bits per character
    >>that is taking up 8 bits of storage, there must be a good reason
    >>for storing and/or transmitting them this way? I.e., it it easy
    >>to think up a count-prefixed compressed format packing 4:1 in
    >>subsequent data bytes (except for the last byte which have
    >>less than 4 2-bit codes).

    >
    > My guess for the inefficiency in storage size is because it is
    > human-readable, and because most in-silico molecular biology is just a
    > bunch of fancy string algorithms. This is my limited view of these
    > things at least.


    Yeah, that pretty much matches my guess (not that I'm involved in
    anything related to computational molecular biology or genetics).
    Given the current technology, the cost of the extra storage size is
    presumably lower than the cost of translating into/out of a packed
    format. Heck, hard drives cost less than $1/GB now.

    And besides, for long-term archiving purposes, I'd expect that zip et
    al on a character-stream would provide significantly better
    compression than a 4:1 packed format, and that zipping the packed
    format wouldn't be all that much more efficient than zipping the
    character stream.

    Jeff Shannon
    Technician/Programmer
    Credit International
    Jeff Shannon, Jan 13, 2005
    #11
  12. Chris Lasher

    Chris Lasher Guest

    >And besides, for long-term archiving purposes, I'd expect that zip et
    >al on a character-stream would provide significantly better
    >compression than a 4:1 packed format, and that zipping the packed
    >format wouldn't be all that much more efficient than zipping the
    >character stream.


    This 105MB FASTA file is 8.3 MB gzip-ed.
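    That ratio is easy to reproduce on repetitive sequence text with the
    stdlib alone (a toy demonstration; the repeated string below is far
    more redundant than real genomic data, so real files compress less):

```python
import gzip

# Invented, highly repetitive "sequence" text.
text = ("TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG" * 5000).encode()

packed_size = len(text) // 4          # what plain 4:1 packing would give
gzipped_size = len(gzip.compress(text))
```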
    Chris Lasher, Jan 13, 2005
    #12
  13. Chris Lasher

    Jeff Shannon Guest

    Re: What strategy for random accession of records in massive FASTA file?

    Chris Lasher wrote:

    >>And besides, for long-term archiving purposes, I'd expect that zip et
    >>al on a character-stream would provide significantly better
    >>compression than a 4:1 packed format, and that zipping the packed
    >>format wouldn't be all that much more efficient than zipping the
    >>character stream.

    >
    > This 105MB FASTA file is 8.3 MB gzip-ed.


    And a 4:1 packed-format file would be ~26MB. It'd be interesting to
    see how that packed-format file would compress, but I don't care
    enough to write a script to convert the FASTA file into a
    packed-format file to experiment with... ;)

    Short version, then, is that yes, size concerns (such as they may be)
    are outweighed by speed and conceptual simplicity (i.e. avoiding a
    huge mess of bit-masking every time a single base needs to be
    examined, or a human-(semi-)readable display is needed).

    (Plus, if this format might be used for RNA sequences as well as DNA
    sequences, you've got at least a fifth base to represent, which means
    you need at least three bits per base, which means only two bases per
    byte (or else base-encodings split across byte-boundaries).... That
    gets ugly real fast.)

    Jeff Shannon
    Technician/Programmer
    Credit International
    Jeff Shannon, Jan 14, 2005
    #13
  14. Chris Lasher

    Robert Kern Guest

    Re: What strategy for random accession of records in massive FASTA file?

    Jeff Shannon wrote:

    > (Plus, if this format might be used for RNA sequences as well as DNA
    > sequences, you've got at least a fifth base to represent, which means
    > you need at least three bits per base, which means only two bases per
    > byte (or else base-encodings split across byte-boundaries).... That gets
    > ugly real fast.)


    Not to mention all the IUPAC symbols for incompletely specified bases
    (e.g. R = A or G).

    http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

    --
    Robert Kern


    "In the fields of hell where the grass grows high
    Are the graves of dreams allowed to die."
    -- Richard Harter
    Robert Kern, Jan 14, 2005
    #14
  15. Chris Lasher

    John Lenton Guest

    Re: What strategy for random accession of records in massive FASTA file?

    On Thu, Jan 13, 2005 at 12:19:49AM +0100, Fredrik Lundh wrote:
    > Chris Lasher wrote:
    >
    > > Since the file I'm working with contains tens of thousands of these
    > > records, I believe I need to find a way to hash this file such that I
    > > can retrieve the respective sequence more quickly than I could by
    > > parsing through the file request-by-request. However, I'm very new to
    > > Python and am still very low on the learning curve for programming and
    > > algorithms in general; while I'm certain there are ubiquitous
    > > algorithms for this type of problem, I don't know what they are or
    > > where to look for them. So I turn to the gurus and accost you for help
    > > once again. :) If you could help me figure out how to code a solution
    > > that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
    > > keep it in Python only, even though I know interaction with a
    > > relational database would provide the fastest method--the group I'm
    > > trying to write this for does not have access to a RDBMS.)

    >
    > keeping an index in memory might be reasonable. the following class
    > creates an index file by scanning the FASTA file, and uses the "marshal"
    > module to save it to disk. if the index file already exists, it's used as is.
    > to regenerate the index, just remove the index file, and run the program
    > again.


    the problem caught my interest, and the way you used a non-mmaped file
    in a place where using mmap was pretty much obvious (IMVHO) bothered
    me enough to overcome my laziness. It didn't overcome it by *much*,
    mind you, so the following probably only works on python 2.3, on
    Linux, in Argentina, and with FASTA data that looks like the sample I
    was able to download.

    However, having said that, the sample I downloaded was one 46MiB file,
    and reading it in on my notebook was fast enough that I ripped out the
    saving/reloading of the index and just reindex it every time. Adding
    back the persistent index is trivial. [ time passes ] In fact, I just
    found a several-gigabyte fasta file, and I guess you'd want the index
    for that one; I put the code back in. And now I should really go to
    bed, because this is very interesting but won't pay the bills.

    import os, mmap, marshal
    from UserDict import DictMixin

    class FASTA(DictMixin):
        def __init__(self, filename):
            self.file = f = open(filename)
            fno = f.fileno()
            stat = os.fstat(fno)
            self.map = mmap.mmap(fno, stat.st_size, access=mmap.ACCESS_COPY)
            self.checkindex()

        def __getitem__(self, key):
            p0, pf = self.index[key]
            m = self.map
            return m[p0:pf]

        def keys(self):
            return self.index.keys()

        def __contains__(self, key):
            return self.index.__contains__(key)

        def checkindex(self):
            indexfile = self.file.name + ".idx"
            if os.path.exists(indexfile):
                # and os.stat(indexfile).st_mtime > os.stat(self.file.name).st_mtime:
                self.index = marshal.load(open(indexfile, "rb"))
            else:
                print 'generating index...'
                self.genindex()
                marshal.dump(self.index, open(indexfile, "wb"))
                print 'done.'

        def genindex(self):
            index = {}
            m = self.map
            last = None
            while 1:
                pos = m.find('>')
                if last is not None:
                    index[last] = (index[last], pos)
                if pos == -1:
                    break
                m.seek(pos)
                line = m.readline()
                pos = m.tell()
                # this is the bit that probably only works with FASTA
                # files like I was able to find on the 'net.
                sep = line.find(' ')
                if sep == -1:
                    name = line[1:].strip()
                else:
                    name = line[1:sep].strip()
                index[name] = pos
                last = name
            self.index = index

    db = FASTA("/home/john/tmp/uniprot_sprot.fasta")
    print db["104K_THEPA"]


    --
    John Lenton () -- Random fortune:
    "Those who believe in astrology are living in houses with foundations of
    Silly Putty."
    - Dennis Rawlins, astronomer

    John Lenton, Jan 14, 2005
    #15
  16. Chris Lasher

    Neil Benn Guest

    Re: What strategy for random accession of records in massive FASTA file?

    Jeff Shannon wrote:

    > Chris Lasher wrote:
    >
    >>> And besides, for long-term archiving purposes, I'd expect that zip et
    >>> al on a character-stream would provide significantly better
    >>> compression than a 4:1 packed format, and that zipping the packed
    >>> format wouldn't be all that much more efficient than zipping the
    >>> character stream.

    >>
    >>
    >> This 105MB FASTA file is 8.3 MB gzip-ed.

    >
    >
    > And a 4:1 packed-format file would be ~26MB. It'd be interesting to
    > see how that packed-format file would compress, but I don't care
    > enough to write a script to convert the FASTA file into a
    > packed-format file to experiment with... ;)
    >
    > Short version, then, is that yes, size concerns (such as they may be)
    > are outweighed by speed and conceptual simplicity (i.e. avoiding a
    > huge mess of bit-masking every time a single base needs to be
    > examined, or a human-(semi-)readable display is needed).
    >
    > (Plus, if this format might be used for RNA sequences as well as DNA
    > sequences, you've got at least a fifth base to represent, which means
    > you need at least three bits per base, which means only two bases per
    > byte (or else base-encodings split across byte-boundaries).... That
    > gets ugly real fast.)
    >
    > Jeff Shannon
    > Technician/Programmer
    > Credit International
    >

    Hello,

    Just to clear up a few things on the topic :

    If the file denotes DNA sequences there are five basic identifiers

    AGCT and X (where X means 'dunno!').

    If the file denotes RNA sequences, you will still only need five
    basic identifiers; the issue is that the T is replaced by a U.

    One very good way I have found to parse large files of this nature
    (I've done it with many a use case) is to write a SAX-style parser for
    the file. You can then register a content handler, receive events from
    the parser, and do whatever you like with them. Basically, using the
    SAX framework to read the files - if you write the parser carefully
    then you stream the files and remove old lines from memory, so you
    have a scalable solution (rather than keeping everything in memory).
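    SAX itself is XML machinery, but the handler idea translates directly
    to plain Python (a hedged sketch; the handler class and sample data
    are invented): the parser holds only the current record and fires a
    callback per completed record.

```python
class CountingHandler:
    """A toy 'content handler': called once per complete record."""
    def __init__(self):
        self.seen = []

    def record(self, header, sequence):
        self.seen.append((header, len(sequence)))

def stream_fasta(lines, handler):
    # Stream records one at a time; only the current record is in memory.
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                handler.record(header, "".join(chunks))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        handler.record(header, "".join(chunks))

h = CountingHandler()
stream_fasta([">A1", "ACGT", "TTTT", ">A2", "GG"], h)
```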

    As an aside, I would seriously consider parsing your files and
    putting this information in a small local db - it's really not much work
    to do and the 'pure' python thing is a misnomer, whichever persistence
    mechanism you use (file,DB,etching it on the floor with a small robot
    accepting logo commands,etc) is unlikely to be pure python.

    The advantage of putting it in a DB will show up later when you have
    fast and powerful retrieval capability.

    Cheers,

    Neil

    --

    Neil Benn
    Senior Automation Engineer
    Cenix BioScience
    BioInnovations Zentrum
    Tatzberg 47
    D-01307
    Dresden
    Germany

    Tel : +49 (0)351 4173 154
    e-mail :
    Cenix Website : http://www.cenix-bioscience.com
    Neil Benn, Jan 14, 2005
    #16
  17. Re: What strategy for random accession of records in massive FASTA file?

    On Thu, Jan 13, 2005 at 04:41:45PM -0800, Robert Kern wrote:
    >Jeff Shannon wrote:
    >
    >>(Plus, if this format might be used for RNA sequences as well as DNA
    >>sequences, you've got at least a fifth base to represent, which means
    >>you need at least three bits per base, which means only two bases per
    >>byte (or else base-encodings split across byte-boundaries).... That gets
    >>ugly real fast.)

    >
    >Not to mention all the IUPAC symbols for incompletely specified bases
    >(e.g. R = A or G).
    >
    >http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html


    Or, for those of us working with proteins as well, all the single letter codes for proteins:

    http://www.chem.qmul.ac.uk/iupac/AminoAcid/A2021.html

    lots more bits.

    I have a db with approx 3 million proteins in it and would not want to be using a pure python approach :)

    Michael
    Michael Maibaum, Jan 14, 2005
    #17
  18. Chris Lasher

    Chris Lasher Guest

    Forgive my ignorance, but what does using mmap do for the script? My
    guess is that it improves performance, but I'm not sure how. I read the
    module documentation and the module appears to be a way to read out
    information from memory (RAM maybe?).
    Chris Lasher, Jan 14, 2005
    #18
  19. Chris Lasher

    Paul Rubin Guest

    "Chris Lasher" <> writes:
    > Forgive my ignorance, but what does using mmap do for the script? My
    > guess is that it improves performance, but I'm not sure how. I read the
    > module documentation and the module appears to be a way to read out
    > information from memory (RAM maybe?).


    Mmap lets you treat a disk file as an array, so you can randomly
    access the bytes in the file without having to do seek operations.
    Just say a[234]='x' and you've changed byte 234 of the file to the
    letter x. It works through the OS's virtual memory system and the
    computer's MMU hardware, and so it has lower overhead than doing
    system calls for every access.
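    A minimal demonstration of that array-like access (the file path is a
    throwaway temp file invented for the example):

```python
import mmap
import os
import tempfile

# Create a small file to map.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wb") as f:
    f.write(b"hello world")

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)   # map the whole file
    first = m[0:5]                 # random access by index, no seek()
    m[0:5] = b"HELLO"              # in-place write goes back to the file
    m.flush()
    m.close()

with open(path, "rb") as f:
    contents = f.read()
```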
    Paul Rubin, Jan 14, 2005
    #19
  20. Chris Lasher

    Steve Holden Guest

    Re: What strategy for random accession of records in massive FASTA file?

    Bengt Richter wrote:

    > On 12 Jan 2005 14:46:07 -0800, "Chris Lasher" <> wrote:
    >

    [...]
    > Others have probably solved your basic problem, or pointed
    > the way. I'm just curious.
    >
    > Given that the information content is 2 bits per character
    > that is taking up 8 bits of storage, there must be a good reason
    > for storing and/or transmitting them this way? I.e., it it easy
    > to think up a count-prefixed compressed format packing 4:1 in
    > subsequent data bytes (except for the last byte which have
    > less than 4 2-bit codes).
    >
    > I'm wondering how the data is actually used once records are
    > retrieved. (but I'm too lazy to explore the biopython.org link).
    >

    Revealingly honest.

    Of course, adopting an encoding that only used two bits per base would
    make it impossible to use the re module to search for patterns in them,
    for example. So the work of continuously translating between
    representations might militate against more efficient representations.
    Or, of course, it might not :)

    it's-only-storage-ly y'rs - steve
    --
    Steve Holden http://www.holdenweb.com/
    Python Web Programming http://pydish.holdenweb.com/
    Holden Web LLC +1 703 861 4237 +1 800 494 3119
    Steve Holden, Jan 14, 2005
    #20
