Building an index for large text files for fast access


Yi Xing

Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi
 

alex23

Yi said:
I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Hey Yi,

The standard library module 'libcache' does exactly what you're
considering implementing.

-alex23
 

Erik Max Francis

alex23 said:
The standard library module 'libcache' does exactly what you're
considering implementing.

I believe the module you're referring to is `linecache`.
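For reference, looking up a line with it is just:

import linecache

line = linecache.getline('huge_file.txt', 12345)  # line numbers are 1-based

(the file name and line number above are only examples).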
 

Neil Hodgson

Yi Xing:
Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python?

It depends on the size of the files and the amount of memory and
disk you may use. First suggestion would be an in-memory array.array of
64-bit integers (each made from 2 'I' entries), with each 64-bit integer
pointing to the start of a set of n lines. Then to find a particular
line number p you seek to the offset stored at a[p/n] and then read over
p%n lines. The factor 'n' is adjusted to fit your data into memory. If
this uses too much memory, or scanning the file to build the index each
time takes too much time, then you can use an index file with the same
layout instead.
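A minimal sketch of that idea (simplified to one array.array('L') entry
per block rather than the two-'I' packing; the function names and the
block factor n are just illustrative, and 'L' must be wide enough for
your offsets on your platform):

import array

def build_block_index(filename, n=16):
    # One offset per block of n lines: index[k] holds the byte
    # offset of line k * n.  Bigger n means less memory.
    index = array.array('L')
    offset = 0
    f = open(filename, 'rb')
    try:
        for lineno, line in enumerate(f):
            if lineno % n == 0:
                index.append(offset)
            offset += len(line)
    finally:
        f.close()
    return index

def get_line(filename, index, p, n=16):
    # Seek to the start of the block holding line p, then read
    # over p % n lines to reach it.
    f = open(filename, 'rb')
    try:
        f.seek(index[p // n])
        for _ in range(p % n + 1):
            line = f.readline()
        return line
    finally:
        f.close()

For the on-disk variant with the same layout, array.array's tofile()
and fromfile() methods will write and read the index directly.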

Neil
 

John Machin

Erik said:
I believe the module you're referring to is `linecache`.

and whatever its name is, it's not a goer: it reads the whole of each
file into memory. It was designed for stack traces. See docs. See
source code. See discussion when somebody with a name within a typo or
0 of the OP's asked an almost identical question very recently.

Here's a thought:

sqlite3 database, one table (line_number, file_offset, line_length),
make line_number the primary key, Bob's yer uncle.

Oh yeah, that battery's not included yet -- you'll need to download the
pysqlite2 module, and mutter strange oaths:
import sys

PY_VERSION = sys.version_info[:2]
if PY_VERSION >= (2, 5):
    import sqlite3
else:
    from pysqlite2 import dbapi2 as sqlite3
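
Something along these lines, perhaps (a rough sketch only, using the
table layout suggested above; the function names and file handling are
made up for illustration):

def build_index(data_filename, index_filename):
    # One row per line: line number, byte offset of its start, length.
    conn = sqlite3.connect(index_filename)
    conn.execute('CREATE TABLE lines (line_number INTEGER PRIMARY KEY, '
                 'file_offset INTEGER, line_length INTEGER)')
    def offsets(f):
        offset = 0
        for n, line in enumerate(f):
            yield n, offset, len(line)
            offset += len(line)
    f = open(data_filename, 'rb')
    try:
        conn.executemany('INSERT INTO lines VALUES (?, ?, ?)', offsets(f))
        conn.commit()
    finally:
        f.close()
        conn.close()

def fetch_line(data_filename, index_filename, n):
    # One indexed lookup, one seek, one read per request.
    conn = sqlite3.connect(index_filename)
    try:
        offset, length = conn.execute(
            'SELECT file_offset, line_length FROM lines '
            'WHERE line_number = ?', (n,)).fetchone()
    finally:
        conn.close()
    f = open(data_filename, 'rb')
    try:
        f.seek(offset)
        return f.read(length)
    finally:
        f.close()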

It would be a very good idea for the OP to give us a clue as to the
(min/max/average) number of (bytes per line, lines per file, bytes per
file) *and* the desired response time on what hardware & OS ... *and*
how long it takes to do this:

for line in file_handle:
    pass

Alternatively, if you have the disk space, you could have just two
columns in the DB table: (line_number, line).

Cheers,
John
 

Simon Forman

Yi said:
Hi,

I need to read specific lines of huge text files. Each time, I know
exactly which line(s) I want to read. readlines() or readline() in a
loop is just too slow. Since different lines have different size, I
cannot use seek(). So I am thinking of building an index for the file
for fast access. Can anybody give me some tips on how to do this in
Python? Thanks.

Yi

I had to do this for some large log files. I wrote one simple script
to generate the index file and another that used the index file to read
lines from the log file. Here are (slightly cleaned up for clarity)
the two scripts. (Note that they'll only work with files less than
4,294,967,296 bytes long. If your files are larger than that,
substitute 'Q' for 'L' in the struct formats.)

First, genoffsets.py
#!/usr/bin/env python
'''
Write the byte offset of each line.
'''
import fileinput
import struct
import sys

def f(n): return struct.pack('L', n)

def main():
    total = 0

    # Main processing..
    for n, line in enumerate(fileinput.input()):

        sys.stdout.write(f(total))

        total += len(line)

        # Status output.
        if not n % 1000:
            print >> sys.stderr, '%i lines processed' % n

    print >> sys.stderr, '%i lines processed' % (n + 1)


if __name__ == '__main__':
    main()


You use it (on Linux) like so:
cat large_file | ./genoffsets.py > index.dat

And here's the getline.py script:
#!/usr/bin/env python
'''
Usage: "getline.py <datafile> <indexfile> <num>"

Prints line num from datafile using indexfile.
'''
import struct
import sys

fmt = 'L'
fmt_size = struct.calcsize(fmt)


def F(n, fn):
    '''
    Return the byte offset of line n from index file fn.
    '''
    f = open(fn)

    try:
        f.seek(n * fmt_size)
        data = f.read(fmt_size)
    finally:
        f.close()

    return struct.unpack(fmt, data)[0]


def getline(n, data_file, index_file):
    '''
    Return line n from data file using index file.
    '''
    n = F(n, index_file)
    f = open(data_file)

    try:
        f.seek(n)
        data = f.readline()
    finally:
        f.close()

    return data


if __name__ == '__main__':
    dfn, ifn, lineno = sys.argv[-3:]
    n = int(lineno)
    print getline(n, dfn, ifn)
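
You'd invoke it like so (using the index.dat built above; 12345 is
whatever line number you're after):

./getline.py large_file index.dat 12345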



Hope this helps,
~Simon
 
