What strategy for random accession of records in massive FASTA file?

Chris Lasher

Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:
>CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACAT
>CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACATTCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATAGACGG
>CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACATTCCGCCTGGG
....

Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to an RDBMS.)
Thanks very much in advance,
Chris
 
Fredrik Lundh

Chris said:
Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to an RDBMS.)

keeping an index in memory might be reasonable. the following class
creates an index file by scanning the FASTA file, and uses the "marshal"
module to save it to disk. if the index file already exists, it's used as is.
to regenerate the index, just remove the index file, and run the program
again.

import os, marshal

class FASTA:

    def __init__(self, file):
        self.file = open(file)
        self.checkindex()

    def __getitem__(self, key):
        try:
            pos = self.index[key]
        except KeyError:
            raise IndexError("no such item")
        else:
            f = self.file
            f.seek(pos)
            header = f.readline()
            # sanity check: we should be positioned at the right record
            assert header == ">" + key + "\n"
            data = []
            while 1:
                line = f.readline()
                if not line or line[0] == ">":
                    break
                data.append(line)
            return data

    def checkindex(self):
        indexfile = self.file.name + ".index"
        try:
            self.index = marshal.load(open(indexfile, "rb"))
        except IOError:
            print "building index..."
            index = {}
            # scan the file, recording the offset of each header line
            f = self.file
            f.seek(0)
            while 1:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line[0] == ">":
                    # save offset to header line
                    header = line[1:].strip()
                    index[header] = pos
            # save index to disk
            f = open(indexfile, "wb")
            marshal.dump(index, f)
            f.close()
            self.index = index

db = FASTA("myfastafile.dat")
print db["CW127_A02"]

['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG...

tweak as necessary.

</F>
 
John Lenton

If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to an RDBMS.)
You don't need an RDBMS; I'd put it in a DBM or CDB myself.
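
Something along these lines, say (an untested sketch using the standard
anydbm module and the myfastafile.dat name from elsewhere in the thread;
DBM values must be plain strings):

import anydbm

# build the DBM once: one key per FASTA header, the sequence as the value
db = anydbm.open("myfastafile.db", "c")
name, chunks = None, []
for line in open("myfastafile.dat"):
    if line[0] == ">":
        if name is not None:
            db[name] = "".join(chunks)
        name, chunks = line[1:].strip(), []
    else:
        chunks.append(line.strip())
if name is not None:
    db[name] = "".join(chunks)
db.close()

# later runs get random access without re-parsing the FASTA file
db = anydbm.open("myfastafile.db", "r")
print db["CW127_A02"]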
 
Larry Bates

You don't say how this will be used, but here goes:

1) Read the records into a dictionary with the sequence name
(from the header) as the key and the sequence data as the value.
Use shelve to store the dictionary for subsequent runs (if load
time is excessive).

2) Take a look at Gadfly (gadfly.sourceforge.net). It provides
a SQL-like database in pure Python and may be a better solution
if the data is basically static and you do lots of processing.

All depends on how you use the data.
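
A rough sketch of option 1 (untested, and assuming the myfastafile.dat
name used elsewhere in the thread):

import shelve

# step 1: parse the FASTA file into a plain header -> sequence dictionary
records = {}
name = None
for line in open("myfastafile.dat"):
    if line.startswith(">"):
        name = line[1:].strip()
        records[name] = ""
    elif name is not None:
        records[name] += line.strip()

# step 2: store the dictionary in a shelf so later runs skip the parse
shelf = shelve.open("myfastafile.shelf")
shelf.update(records)
shelf.close()

# subsequent runs:
shelf = shelve.open("myfastafile.shelf")
print shelf["CW127_A02"]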

Regards,
Larry Bates
Syscon, Inc.
 
David E. Konerding DSD staff

Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

Use biopython. They have dictionary-style classes which wrap FASTA files using indexes.

http://www.biopython.org
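
(For reference, Biopython releases newer than this thread expose this
directly as Bio.SeqIO.index, which scans the file once and then seeks to
and parses one record per lookup; a sketch, assuming a current install:)

from Bio import SeqIO

# dict-like view of the file: each lookup seeks and parses one record
records = SeqIO.index("myfastafile.dat", "fasta")
print records["CW127_A02"].seq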

Dave
 
John Machin

Chris said:
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:
>CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACAT
[snip]
Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to an RDBMS.)
Thanks very much in advance,
Chris

Before you get too carried away, how often do you want to do this and
how grunty is the box you will be running on? Will the data be on a
server? If the server is on a WAN or at the other end of a radio link
between buildings, you definitely need an index so that you can access
the data randomly!

By way of example, to read all of a 157MB file into memory from a local
(i.e. not networked) disk using readlines() takes less than 4 seconds
on a 1.4GHz Athlon processor (see below). The average new corporate
desktop box is about twice as fast as that. Note that Windows Task
Manager showed 100% CPU utilisation for both read() and readlines().

My guess is that you don't need anything much fancier than the effbot's
index method -- which by now you have probably found works straight out
of the box and is more than fast enough for your needs.

BTW, you need to clarify "don't have access to an RDBMS" ... surely
this can only be due to someone stopping them from installing good free
software freely available on the Internet.

HTH,
John

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').read())"
157581595
157581595
157581595
157581595
157581595
157581595
1 loops, best of 6: 3.3e+006 usec per loop

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').readlines())"
1118870
1118870
1118870
1118870
1118870
1118870
1 loops, best of 6: 3.57e+006 usec per loop
 
Bengt Richter

Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:
Others have probably solved your basic problem, or pointed
the way. I'm just curious.

Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., it is easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes (except for the last byte, which may have
fewer than four 2-bit codes).
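
For instance, an untested sketch of such a format (unambiguous A/C/G/T
only; the IUPAC ambiguity codes mentioned elsewhere in this thread would
need a wider code):

import struct

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    # 4-byte big-endian count prefix, then 4 bases packed per byte
    out = [struct.pack(">I", len(seq))]
    for i in range(0, len(seq), 4):
        group = seq[i:i+4]
        byte = 0
        for base in group:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(group))  # left-align a short final group
        out.append(chr(byte))
    return "".join(out)

def unpack(data):
    n = struct.unpack(">I", data[:4])[0]
    bases = []
    for ch in data[4:]:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(ord(ch) >> shift) & 3])
    return "".join(bases[:n])  # the count prefix trims the final byte

print unpack(pack("TGCAGTCGAACG")) == "TGCAGTCGAACG"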

I'm wondering how the data is actually used once records are
retrieved. (but I'm too lazy to explore the biopython.org link).
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGGGGGATAACAGTTGGAAACGACTGCTAATACCCCATA
GCATTAAACATTCCGCCTGGG
...


Regards,
Bengt Richter
 
Chris Lasher

Before you get too carried away, how often do you want to do this and
how grunty is the box you will be running on?

Oops, I should have specified this. The script will only need to be run
once every three or four months, when the sequences are updated. I'll
be running it on boxes that are 3GHz/1GB RAM, but others may not be
so fortunate, and so I'd like to keep that in mind.
BTW, you need to clarify "don't have access to an RDBMS" ... surely
this can only be due to someone stopping them from installing good
free software freely available on the Internet.

I understand your and others' sentiment on this. I agree, the
open-source database systems are wonderful. However, keeping it
Python-only saves me hassle by only having to assist in instances where
others need help downloading and installing Python. I suppose if I keep
it in Python, I can even use Py2exe to generate an executable that
wouldn't even require them to install Python. A solution using
interaction with a database is much sexier, but, for the purposes of
the script, seems unnecessary. However, I certainly appreciate the
suggestions.
My guess is that you don't need anything much fancier than the
effbot's index method -- which by now you have probably found works
straight out of the box and is more than fast enough for your needs.

I think Mr. Lundh's code will work very well for these purposes. Many
thanks to him for posting it; he'll have full credit for that part of
the code. Thanks very much to all who replied!

Chris
 
Chris Lasher

Thanks for your reply, Larry. I thought about this, but I'm worried the
dictionary will consume a lot of resources. I think my 3GHz/1GB RAM box
could handle the load fine, but I'm not sure about others' systems.
Chris
 
Chris Lasher

Others have probably solved your basic problem, or pointed
the way. I'm just curious.
Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., it is easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes (except for the last byte, which may have
fewer than four 2-bit codes).

My guess is that the storage inefficiency is because the format is
human-readable, and because most in-silico molecular biology is just a
bunch of fancy string algorithms. This is my limited view of these
things, at least.
I'm wondering how the data is actually used once records are
retrieved.

This one I can answer. For my purposes, I'm just organizing the
sequences at hand, but there are all sorts of things one could actually
do with sequences: alignments, BLAST searches, gene annotations, etc.
 
Jeff Shannon

Chris said:
My guess is that the storage inefficiency is because the format is
human-readable, and because most in-silico molecular biology is just a
bunch of fancy string algorithms. This is my limited view of these
things, at least.

Yeah, that pretty much matches my guess (not that I'm involved in
anything related to computational molecular biology or genetics).
Given the current technology, the cost of the extra storage size is
presumably lower than the cost of translating into/out of a packed
format. Heck, hard drives cost less than $1/GB now.

And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.

Jeff Shannon
Technician/Programmer
Credit International
 
Chris Lasher

And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.

This 105MB FASTA file is 8.3 MB gzip-ed.
 
Jeff Shannon

Chris said:
This 105MB FASTA file is 8.3 MB gzip-ed.

And a 4:1 packed-format file would be ~26MB. It'd be interesting to
see how that packed-format file would compress, but I don't care
enough to write a script to convert the FASTA file into a
packed-format file to experiment with... ;)

Short version, then, is that yes, size concerns (such as they may be)
are outweighed by speed and conceptual simplicity (i.e. avoiding a
huge mess of bit-masking every time a single base needs to be
examined, or a human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That
gets ugly real fast.)

Jeff Shannon
Technician/Programmer
Credit International
 
Robert Kern

Jeff said:
(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That gets
ugly real fast.)

Not to mention all the IUPAC symbols for incompletely specified bases
(e.g. R = A or G).

http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

--
Robert Kern

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
John Lenton

keeping an index in memory might be reasonable. the following class
creates an index file by scanning the FASTA file, and uses the "marshal"
module to save it to disk. if the index file already exists, it's used as is.
to regenerate the index, just remove the index file, and run the program
again.

the problem caught my interest, and the way you used a non-mmapped file
in a place where using mmap was pretty much obvious (IMVHO) bothered
me enough to overcome my laziness. It didn't overcome it by *much*,
mind you, so the following probably only works on python 2.3, on
Linux, in Argentina, and with FASTA data that looks like the sample I
was able to download.

However, having said that, the sample I downloaded was one 46MiB file,
and reading it in on my notebook was fast enough that I ripped out the
saving/reloading of the index and just reindexed it every time. Adding
back the persistent index is trivial. [ time passes ] In fact, I just
found a several-gigabyte fasta file, and I guess you'd want the index
for that one; I put the code back in. And now I should really go to
bed, because this is very interesting but won't pay the bills.

import os, mmap, marshal
from UserDict import DictMixin

class FASTA(DictMixin):

    def __init__(self, filename):
        self.file = f = open(filename)
        fno = f.fileno()
        stat = os.fstat(fno)
        self.map = mmap.mmap(fno, stat.st_size, access=mmap.ACCESS_COPY)
        self.checkindex()

    def __getitem__(self, key):
        p0, pf = self.index[key]
        m = self.map
        return m[p0:pf]

    def keys(self):
        return self.index.keys()

    def __contains__(self, key):
        return self.index.__contains__(key)

    def checkindex(self):
        indexfile = self.file.name + ".idx"
        if os.path.exists(indexfile):
            # and os.stat(indexfile).st_mtime > os.stat(self.file.name).st_mtime:
            self.index = marshal.load(open(indexfile, "rb"))
        else:
            print 'generating index...'
            self.genindex()
            marshal.dump(self.index, open(indexfile, "wb"))
            print 'done.'

    def genindex(self):
        index = {}
        m = self.map
        size = m.size()
        last = None
        while 1:
            pos = m.find('>')
            if pos == -1:
                # no more headers: the last record runs to end of file
                pos = size
            if last is not None:
                # close off the previous record at the next '>' (or EOF)
                index[last] = (index[last], pos)
            if pos == size:
                break
            m.seek(pos)
            line = m.readline()
            pos = m.tell()
            # this is the bit that probably only works with FASTA
            # files like I was able to find on the 'net.
            sep = line.find(' ')   # find(), not index(): no space is legal
            if sep == -1:
                name = line[1:].strip()
            else:
                name = line[1:sep].strip()
            index[name] = pos
            last = name
        self.index = index

db = FASTA("/home/john/tmp/uniprot_sprot.fasta")
print db["104K_THEPA"]


--
John Lenton ([email protected]) -- Random fortune:
"Those who believe in astrology are living in houses with foundations of
Silly Putty."
- Dennis Rawlins, astronomer

 
Neil Benn

Jeff said:
And a 4:1 packed-format file would be ~26MB. It'd be interesting to
see how that packed-format file would compress, but I don't care
enough to write a script to convert the FASTA file into a
packed-format file to experiment with... ;)

Short version, then, is that yes, size concerns (such as they may be)
are outweighed by speed and conceptual simplicity (i.e. avoiding a
huge mess of bit-masking every time a single base needs to be
examined, or a human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA
sequences, you've got at least a fifth base to represent, which means
you need at least three bits per base, which means only two bases per
byte (or else base-encodings split across byte-boundaries).... That
gets ugly real fast.)

Jeff Shannon
Technician/Programmer
Credit International
Hello,

Just to clear up a few things on the topic:

If the file contains DNA sequences, there are five basic identifiers:

AGCT and X (where X means 'dunno!').

If the file contains RNA sequences, you will still only need five
basic identifiers; the issue is that the T is replaced by a U.

One very good way I have found to parse large files of this nature
(I've done it with many a use case) is to write a SAX-style parser for
the file. You register a content handler, receive events from the
parser, and do whatever you like with them. If you write the parser
carefully, you stream the file and drop old lines from memory as you
go, which gives you a scalable solution rather than keeping everything
in memory.
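
A rough sketch of the idea (the handler interface here is invented for
illustration, not a real library API; only one record's worth of state
is ever in memory):

class FastaHandler:
    # override these; they play the role of a SAX ContentHandler
    def start_record(self, name):
        pass
    def sequence_line(self, line):
        pass
    def end_record(self):
        pass

def parse_fasta(filename, handler):
    in_record = 0
    for line in open(filename):
        line = line.rstrip()
        if line.startswith(">"):
            if in_record:
                handler.end_record()
            handler.start_record(line[1:])
            in_record = 1
        elif line and in_record:
            handler.sequence_line(line)
    if in_record:
        handler.end_record()

class LengthPrinter(FastaHandler):
    # example handler: report each sequence's length
    def start_record(self, name):
        self.name, self.length = name, 0
    def sequence_line(self, line):
        self.length += len(line)
    def end_record(self):
        print self.name, self.length

parse_fasta("myfastafile.dat", LengthPrinter())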

As an aside, I would seriously consider parsing your files and
putting the information in a small local DB - it's really not much work
to do, and the 'pure' Python thing is a misnomer: whichever persistence
mechanism you use (file, DB, etching it on the floor with a small robot
accepting Logo commands, etc.) is unlikely to be pure Python.

The advantage of putting it in a DB will show up later when you have
fast and powerful retrieval capability.

Cheers,

Neil

--

Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 47
D-01307
Dresden
Germany

Tel : +49 (0)351 4173 154
Cenix Website : http://www.cenix-bioscience.com
 
Chris Lasher

Forgive my ignorance, but what does using mmap do for the script? My
guess is that it improves performance, but I'm not sure how. I read the
module documentation, and the module appears to be a way to read
information out of memory (RAM, maybe?).
 
Paul Rubin

Chris Lasher said:
Forgive my ignorance, but what does using mmap do for the script? My
guess is that it improves performance, but I'm not sure how. I read the
module documentation, and the module appears to be a way to read
information out of memory (RAM, maybe?).

Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations.
Just say a[234]='x' and you've changed byte 234 of the file to the
letter x. It works through the OS's virtual memory system and the
computer's MMU hardware, and so it has lower overhead than doing
system calls for every access.
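
A tiny illustration (read-only, and assuming the myfastafile.dat name
from earlier in the thread):

import mmap, os

f = open("myfastafile.dat", "rb")
m = mmap.mmap(f.fileno(), os.fstat(f.fileno()).st_size,
              access=mmap.ACCESS_READ)
print m[0]                  # first byte of the file; should be '>'
print m[234:294]            # any slice of the file, no seek() bookkeeping
print m.find(">CW127_A03")  # mmaps also support substring searches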
 
Steve Holden

Bengt said:
[...]
Others have probably solved your basic problem, or pointed
the way. I'm just curious.

Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., it is easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes (except for the last byte, which may have
fewer than four 2-bit codes).

I'm wondering how the data is actually used once records are
retrieved. (but I'm too lazy to explore the biopython.org link).
Revealingly honest.

Of course, adopting an encoding that only used two bits per base would
make it impossible to use the re module to search for patterns in them,
for example. So the work of continuously translating between
representations might militate against more efficient representations.
Or, of course, it might not :)

it's-only-storage-ly y'rs - steve
 
