Building an index for large text files for fast access


Yi Xing

Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi
 

alex23

Yi said:
I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Hey Yi,

The standard library module 'libcache' does exactly what you're
considering implementing.

-alex23
 

Erik Max Francis

alex23 said:
The standard library module 'libcache' does exactly what you're
considering implementing.

I believe the module you're referring to is `linecache`.
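For reference, looking up a line with it is just:

import linecache

line = linecache.getline('huge_file.txt', 12345)  # line numbers are 1-based

(the file name and line number above are only examples).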
 

Neil Hodgson

Yi Xing:
Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python?

It depends on the size of the files and the amount of memory and
disk you may use. First suggestion would be an in-memory array.array of
64-bit integers (each made from 2 'I' entries), with each 64-bit integer
pointing to the start of a set of n lines. Then to find a particular
line number p you seek to the offset stored at a[p/n] and then read over
p%n lines. The factor 'n' is adjusted to fit your data into memory. If
this uses too much memory, or scanning the file to build the index each
time takes too much time, then you can use an index file with the same
layout instead.
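A minimal sketch of that idea (simplified to one array.array('L') entry
per block rather than the two-'I' packing; the function names and the
block factor n are just illustrative, and 'L' must be wide enough for
your offsets on your platform):

import array

def build_block_index(filename, n=16):
    # One offset per block of n lines: index[k] holds the byte
    # offset of line k * n.  Bigger n means less memory.
    index = array.array('L')
    offset = 0
    f = open(filename, 'rb')
    try:
        for lineno, line in enumerate(f):
            if lineno % n == 0:
                index.append(offset)
            offset += len(line)
    finally:
        f.close()
    return index

def get_line(filename, index, p, n=16):
    # Seek to the start of the block holding line p, then read
    # over p % n lines to reach it.
    f = open(filename, 'rb')
    try:
        f.seek(index[p // n])
        for _ in range(p % n + 1):
            line = f.readline()
        return line
    finally:
        f.close()

For the on-disk variant with the same layout, array.array's tofile()
and fromfile() methods will write and read the index directly.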

Neil
 

John Machin

Erik said:
I believe the module you're referring to is `linecache`.

and whatever its name is, it's not a goer: it reads the whole of each
file into memory. It was designed for stack traces. See docs. See
source code. See discussion when somebody with a name within a typo or
0 of the OP's asked an almost identical question very recently.

Here's a thought:

sqlite3 database, one table (line_number, file_offset, line_length),
make line_number the primary key, Bob's yer uncle.

Oh yeah, that battery's not included yet -- you'll need to download the
pysqlite2 module, and mutter strange oaths:
import sys

PY_VERSION = sys.version_info[:2]
if PY_VERSION >= (2, 5):
    import sqlite3
else:
    from pysqlite2 import dbapi2 as sqlite3
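
Something along these lines, perhaps (a rough sketch only, using the
table layout suggested above; the function names and file handling are
made up for illustration):

def build_index(data_filename, index_filename):
    # One row per line: line number, byte offset of its start, length.
    conn = sqlite3.connect(index_filename)
    conn.execute('CREATE TABLE lines (line_number INTEGER PRIMARY KEY, '
                 'file_offset INTEGER, line_length INTEGER)')
    def offsets(f):
        offset = 0
        for n, line in enumerate(f):
            yield n, offset, len(line)
            offset += len(line)
    f = open(data_filename, 'rb')
    try:
        conn.executemany('INSERT INTO lines VALUES (?, ?, ?)', offsets(f))
        conn.commit()
    finally:
        f.close()
        conn.close()

def fetch_line(data_filename, index_filename, n):
    # One indexed lookup, one seek, one read per request.
    conn = sqlite3.connect(index_filename)
    try:
        offset, length = conn.execute(
            'SELECT file_offset, line_length FROM lines '
            'WHERE line_number = ?', (n,)).fetchone()
    finally:
        conn.close()
    f = open(data_filename, 'rb')
    try:
        f.seek(offset)
        return f.read(length)
    finally:
        f.close()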

It would be a very good idea for the OP to give us a clue as to the
(min/max/average) number of (bytes per line, lines per file, bytes per
file) *and* the desired response time on what hardware & OS ... *and*
how long it takes to do this:

for line in file_handle:
    pass

Alternatively, if you have the disk space, you could have just two
columns in the DB table: (line_number, line).

Cheers,
John
 

Simon Forman

Yi said:
Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi

I had to do this for some large log files. I wrote one simple script
to generate the index file and another that used the index file to read
lines from the log file. Here are (slightly cleaned up for clarity)
the two scripts. (Note that they'll only work with files less than
4,294,967,296 bytes long. If your files are larger than that,
substitute 'Q' for 'L' in the struct formats.)

First, genoffsets.py
#!/usr/bin/env python
'''
Write the byte offset of each line.
'''
import fileinput
import struct
import sys

def f(n): return struct.pack('L', n)

def main():
    total = 0

    # Main processing..
    for n, line in enumerate(fileinput.input()):

        sys.stdout.write(f(total))

        total += len(line)

        # Status output.
        if not n % 1000:
            print >> sys.stderr, '%i lines processed' % n

    print >> sys.stderr, '%i lines processed' % (n + 1)


if __name__ == '__main__':
    main()


You use it (on Linux) like so:
cat large_file | ./genoffsets.py > index.dat

And here's the getline.py script:
#!/usr/bin/env python
'''
Usage: "getline.py <datafile> <indexfile> <num>"

Prints line num from datafile using indexfile.
'''
import struct
import sys

fmt = 'L'
fmt_size = struct.calcsize(fmt)


def F(n, fn):
    '''
    Return the byte offset of line n from index file fn.
    '''
    f = open(fn)

    try:
        f.seek(n * fmt_size)
        data = f.read(fmt_size)
    finally:
        f.close()

    return struct.unpack(fmt, data)[0]


def getline(n, data_file, index_file):
    '''
    Return line n from data file using index file.
    '''
    n = F(n, index_file)
    f = open(data_file)

    try:
        f.seek(n)
        data = f.readline()
    finally:
        f.close()

    return data


if __name__ == '__main__':
    dfn, ifn, lineno = sys.argv[-3:]
    n = int(lineno)
    print getline(n, dfn, ifn)
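
You'd invoke it like so (using the index.dat built above; 12345 is
whatever line number you're after):

./getline.py large_file index.dat 12345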



Hope this helps,
~Simon
 
