What strategy for random accession of records in massive FASTA file?

S

Steve Holden

Jeff said:
And a 4:1 packed-format file would be ~26MB. It'd be interesting to see
how that packed-format file would compress, but I don't care enough to
write a script to convert the FASTA file into a packed-format file to
experiment with... ;)
If your compression algorithm's any good then both, when compressed,
should be approximately equal in size, since the size should be
determined by the information content rather than the representation.
Short version, then, is that yes, size concerns (such as they may be)
are outweighed by speed and conceptual simplicity (i.e. avoiding a huge
mess of bit-masking every time a single base needs to be examined, or a
human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That gets
ugly real fast.)
Right!

regards
Steve
 
R

Roy Smith

"Chris Lasher said:
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACATTCCGCCTGGG
...

Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request.

First, before embarking on any major project, take a look at
http://www.biopython.org/ to at least familiarize yourself with what
other people have done in the field.

The easiest thing I think would be to use the gdbm module. You can
write a simple parser to parse the FASTA file (or, I would imagine, find
one already written on biopython), and then store the data in a gdbm
map, using the tag lines as the keys and the sequences as the values.

Even for a Python neophyte, this should be a pretty simple project. The
most complex part might getting the gdbm module built if your copy of
Python doesn't already have it, but gdbm is so convenient, it's worth
the effort.
 
B

Bulba!

Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations
Cool!

Just say a[234]='x' and you've changed byte 234 of the file to the
letter x.

However.. however.. suppose this element located more or less
in the middle of an array occupies more space after changing it,
say 2 bytes instead of 1. Will flush() need to rewrite the half of
mmaped file just to add that one byte?

flush() definitely makes updating less of an issue, I'm just
curious about the cost of writing small changes scattered all
over the place back to the large file.



--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.
 
M

Michael Hoffman

Chris said:
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order.

I just came across this thread today and I don't understand why you are
trying to reinvent the wheel instead of using Biopython which already
has a solution to this problem, among others.

But actually I usually use formatdb, which comes with NCBI-BLAST to
create blastdb files that can also be used for BLAST.

mh5@ecs4a /data/blastdb/Users/mh5
$ python
Python 2.3.3 (#1, Jan 20 2004, 17:39:36) [C] on osf1V5
Type "help", "copyright", "credits" or "license" for more information.('lcl|004/04/m00404.peptide.faa ENSMUSG00000022297 peptide', 'MERSPFLLACILLPLVRGHSLFTCEPITVPRCMKMTYNMTFFPNLMGHYDQGIAAVEMGHFLHLANLECSPNIEMFLCQAFIPTCTEQIHVVLPCRKLCEKIVSDCKKLMDTFGIRWPEELECNRLPHCDDTVPVTSHPHTELSGPQKKSDQVPRDIGFWCPKHLRTSGDQGYRFLGIEQCAPPCPNMYFKSDELDFAKSFIGIVSIFCLCATLFTFLTFLIDVRRFRYPERPIIYYSVCYSIVSLMYFVGFLLGNSTACNKADEKLELGDTVVLGSKNKACSVVFMFLYFFTMAGTVWWVILTITWFLAAGRKWSCEAIEQKAVWFHAVAWGAPGFLTVMLLAMNKVEGDNISGVCFVGLYDLDASRYFVLLPLCLCVFVGLSLLLAGIISLNHVRQVIQHDGRNQEKLKKFMIRIGVFSGLYLVPLVTLLGCYVYELVNRITWEMTWFSDHCHQYRIPCPYQANPKARPELALFMIKYLMTLIVGISAVFWVGSKKTCTEWAGFFKRNRKRDPISESRRVLQESCEFFLKHNSKVKHKKKHGAPGPHRLKVISKSMGTSTGATTNHGTSAMAIADHDYLGQETSTEVHTSPEASVKEGRADRANTPSAKDRDCGESAGPSSKLSGNRNGRESRAGGLKERSNGSEGAPSEGRVSPKSSVPETGLIDCSTSQAASSPEPTSLKGSTSLPVHSASRARKEQGAGSHSDA')

tools2 has this in it:

class LightIterator(object):
def __init__(self, handle):
self._handle = handle
self._defline = None

def __iter__(self):
return self

def next(self):
lines = []
defline_old = self._defline

while 1:
line = self._handle.readline()
if not line:
if not defline_old and not lines:
raise StopIteration
if defline_old:
self._defline = None
break
elif line[0] == '>':
self._defline = line[1:].rstrip()
if defline_old or lines:
break
else:
defline_old = self._defline
else:
lines.append(line.rstrip())

return defline_old, ''.join(lines)

blastdb.py:

#!/usr/bin/env python
from __future__ import division

__version__ = "$Revision: 1.3 $"

"""
blastdb.py

access blastdb files
Copyright 2005 Michael Hoffman
License: GPL
"""

import os
import sys

try:
from poly import NamedTemporaryFile # http://www.ebi.ac.uk/~hoffman/software/poly/
except ImportError:
from tempfile import NamedTemporaryFile

FASTACMD_CMDLINE = "fastacmd -d %s -s %s -o %s"

class Database(object):
def __init__(self, filename):
self.filename = filename

def fetch_to_file(self, query, filename):
status = os.system(FASTACMD_CMDLINE % (self.filename, query, filename))
if status:
raise RuntimeError, "fastacmd returned %d" % os.WEXITSTATUS(status)

def fetch_to_tempfile(self, query):
temp_file = NamedTemporaryFile()
self.fetch_to_file(query, temp_file.name)
return temp_file
 
S

Steve Holden

Bulba! said:
Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations

Cool!


Just say a[234]='x' and you've changed byte 234 of the file to the
letter x.


However.. however.. suppose this element located more or less
in the middle of an array occupies more space after changing it,
say 2 bytes instead of 1. Will flush() need to rewrite the half of
mmaped file just to add that one byte?
Nope. If you try a[234] = 'banana' you'll get an error message. The mmap
protocol doesn't support insertion and deletion, only overwriting.

Of course, it's far too complicated to actually *try* this stuff before
pontificating [not]:
>>> import mmap
>>> f = file("/tmp/Xout.txt", "r+")
>>> mm = mmap.mmap(f.fileno(), 200)
>>> mm[1:10] 'elcome to'
>>> mm[1] = "banana"
Traceback (most recent call last):
File said:
>>> mm[1:10] = 'ishing ::'
>>> mm[1:10] 'ishing ::'
>>> mm[1:10] = 'a'
Traceback (most recent call last):
flush() definitely makes updating less of an issue, I'm just
curious about the cost of writing small changes scattered all
over the place back to the large file.
Some of this depends on whether the mmap is shared or private, of
course, but generally speaking you can ignore the overhead, and the
flush() calls will be automatic as long as you don't mix file and string
operations. The programming convenience is amazing.
--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.

So make it work :)

regards
Steve
 
C

Chris Lasher

Roy, thank you for your reply. I have BioPython installed on my box at
work and have been browsing through the code in there, some of which I
can follow, and most of which will take more time and experience for me
to do so. I have considered BioPython and databases, and have chosen to
forego those routes for the reasons above, here summarized: this script
has a limited scope of what I hope it will acheive, and it should be as
convenient as possible for the end-user to execute it (even at the
expense of convenience to me as the coder). Again, thanks for your
input, though. It's very helpful to me to be able to learn from other
perspectives that I wouldn't have seen from, myself.
 
B

Bengt Richter

Bulba! said:
Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations

Cool!


Just say a[234]='x' and you've changed byte 234 of the file to the
letter x.


However.. however.. suppose this element located more or less
in the middle of an array occupies more space after changing it,
say 2 bytes instead of 1. Will flush() need to rewrite the half of
mmaped file just to add that one byte?
I would wonder what mm.find('pattern') in the middle of a huge file
would do to the working set vs sequential reads as in my little toy
(which BTW is also happy to expand or contract old vs new replacement string
as it streams buffers file to file).
Nope. If you try a[234] = 'banana' you'll get an error message. The mmap
protocol doesn't support insertion and deletion, only overwriting.

Of course, it's far too complicated to actually *try* this stuff before
pontificating [not]:
import mmap
f = file("/tmp/Xout.txt", "r+")
mm = mmap.mmap(f.fileno(), 200)
mm[1:10] 'elcome to'
mm[1] = "banana"
Traceback (most recent call last):
File said:
mm[1:10] = 'ishing ::'
mm[1:10] 'ishing ::'
mm[1:10] = 'a'
Traceback (most recent call last):
flush() definitely makes updating less of an issue, I'm just
curious about the cost of writing small changes scattered all
over the place back to the large file.
Some of this depends on whether the mmap is shared or private, of
course, but generally speaking you can ignore the overhead, and the
flush() calls will be automatic as long as you don't mix file and string
operations. The programming convenience is amazing.
That part does look good, but will scanning a large file with find
cause massive swapouts, or is there some smart prioritization or
hidden sequential windowing that limits mmap's impact?
So make it work :)

Regards,
Bengt Richter
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,744
Messages
2,569,479
Members
44,899
Latest member
RodneyMcAu

Latest Threads

Top