building an index for large text files for fast access

Discussion in 'Python' started by Yi Xing, Jul 25, 2006.

  1. Yi Xing

    Yi Xing Guest

    Hi,

    I need to read specific lines of huge text files. Each time, I know
    exactly which line(s) I want to read. readlines() or readline() in a
    loop is just too slow. Since different lines have different sizes, I
    cannot simply compute an offset and seek() to it. So I am thinking of
    building an index for the file for fast access. Can anybody give me
    some tips on how to do this in Python? Thanks.

    Yi
     
    Yi Xing, Jul 25, 2006
    #1

  2. alex23

    alex23 Guest

    Yi Xing wrote:
    > I need to read specific lines of huge text files. Each time, I know
    > exactly which line(s) I want to read. readlines() or readline() in a
    > loop is just too slow. Since different lines have different sizes, I
    > cannot simply compute an offset and seek() to it. So I am thinking of
    > building an index for the file for fast access. Can anybody give me
    > some tips on how to do this in Python? Thanks.


    Hey Yi,

    The standard library module 'libcache' does exactly what you're
    considering implementing.

    -alex23
     
    alex23, Jul 25, 2006
    #2

  3. Erik Max Francis

    Erik Max Francis Guest

    alex23 wrote:

    > The standard library module 'libcache' does exactly what you're
    > considering implementing.


    I believe the module you're referring to is `linecache`.
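
    For the record, a minimal use of it looks like this (a sketch;
    'huge.txt' stands in for the OP's file):

    import linecache
    # getline() takes a 1-based line number and returns '' past EOF.
    print linecache.getline('huge.txt', 12345)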

    --
    Erik Max Francis && && http://www.alcyone.com/max/
    San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
    It is from numberless diverse acts of courage and belief that human
    history is shaped. -- John F. Kennedy
     
    Erik Max Francis, Jul 25, 2006
    #3
  4. Neil Hodgson

    Neil Hodgson Guest

    Yi Xing:

    > Since different lines have different sizes, I cannot simply compute
    > an offset and seek() to it. So I am thinking of building an index
    > for the file for fast access. Can anybody give me some tips on how
    > to do this in Python?


    It depends on the size of the files and the amount of memory and
    disk you may use. A first suggestion would be an in-memory array.array
    of 64-bit integers, each built from two 'I' entries (array.array has
    no 64-bit type code), with each 64-bit integer pointing to the start
    of a set of n lines. Then to find a particular line number p you seek
    to the offset stored at a[p/n] and read over p%n lines. The factor n
    is adjusted to fit your data into memory. If this uses too much
    memory, or scanning the file to build the index each time takes too
    much time, then you can use an index file with the same layout
    instead.
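
    A sketch of that layout in Python (names are made up; n is the
    lines-per-entry factor described above, and each offset is stored as
    two 'I' halves since array.array has no 64-bit type code):

    import array

    def build_index(path, n):
        # Record the byte offset of every n-th line, low half then high half.
        a = array.array('I')
        f = open(path, 'rb')
        offset = 0
        for i, line in enumerate(f):
            if i % n == 0:
                a.append(offset & 0xFFFFFFFF)  # low 32 bits
                a.append(offset >> 32)         # high 32 bits
            offset += len(line)
        f.close()
        return a

    def fetch_line(path, a, p, n):
        # Seek to the start of line p's block, then read over p % n lines.
        offset = a[2 * (p // n)] | (a[2 * (p // n) + 1] << 32)
        f = open(path, 'rb')
        f.seek(offset)
        for _ in xrange(p % n + 1):
            line = f.readline()
        f.close()
        return line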

    Neil
     
    Neil Hodgson, Jul 25, 2006
    #4
  5. John Machin

    John Machin Guest

    Erik Max Francis wrote:
    > alex23 wrote:
    >
    > > The standard library module 'libcache' does exactly what you're
    > > considering implementing.

    >
    > I believe the module you're referring to is `linecache`.
    >


    And whatever its name is, it's not a goer: it reads the whole of each
    file into memory. It was designed for stack traces. See the docs. See
    the source code. See the discussion from when somebody with a name
    within a typo of the OP's asked an almost identical question very
    recently.

    Here's a thought:

    sqlite3 database, one table (line_number, file_offset, line_length),
    make line_number the primary key, Bob's yer uncle.

    Oh yeah, that battery's not included yet -- you'll need to download the
    pysqlite2 module, and mutter strange oaths:
    import sys
    PY_VERSION = sys.version_info[:2]
    if PY_VERSION >= (2, 5):
        import sqlite3
    else:
        from pysqlite2 import dbapi2 as sqlite3
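
    After that, something like this (a rough sketch; 'index.db' and
    'huge.txt' are made-up names):

    conn = sqlite3.connect('index.db')
    conn.execute('CREATE TABLE idx (line_number INTEGER PRIMARY KEY, '
                 'file_offset INTEGER, line_length INTEGER)')
    offset = 0
    f = open('huge.txt', 'rb')
    for i, line in enumerate(f):
        conn.execute('INSERT INTO idx VALUES (?, ?, ?)',
                     (i, offset, len(line)))
        offset += len(line)
    f.close()
    conn.commit()

    # Each lookup is then a primary-key fetch plus one seek:
    row = conn.execute('SELECT file_offset, line_length FROM idx '
                       'WHERE line_number = ?', (12345,)).fetchone()
    f = open('huge.txt', 'rb')
    f.seek(row[0])
    print f.read(row[1])
    f.close()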

    It would be a very good idea for the OP to give us a clue as to the
    (min/max/average) number of (bytes per line, lines per file, bytes per
    file) *and* the desired response time on what hardware & OS ... *and*
    how long it takes to do this:
    for line in file_handle:
        pass

    Alternatively, if you have the disk space, you could have just two
    columns in the DB table: (line_number, line).

    Cheers,
    John
     
    John Machin, Jul 25, 2006
    #5
  6. alex23

    alex23 Guest

    Erik Max Francis wrote:
    > I believe the module you're referring to is `linecache`.


    Thanks, Erik. That's what I get for posting on my way out of work :)

    -alex23
     
    alex23, Jul 26, 2006
    #6
  7. Simon Forman

    Simon Forman Guest

    Yi Xing wrote:
    > Hi,
    >
    > I need to read specific lines of huge text files. Each time, I know
    > exactly which line(s) I want to read. readlines() or readline() in a
    > loop is just too slow. Since different lines have different sizes, I
    > cannot simply compute an offset and seek() to it. So I am thinking of
    > building an index for the file for fast access. Can anybody give me
    > some tips on how to do this in Python? Thanks.
    >
    > Yi


    I had to do this for some large log files. I wrote one simple script
    to generate the index file and another that used the index file to
    read lines from the log file. Here are the two scripts (slightly
    cleaned up for clarity). Note that they'll only work with files less
    than 4,294,967,296 bytes long; if your files are larger than that,
    substitute 'Q' for 'L' in the struct formats.

    First, genoffsets.py
    #!/usr/bin/env python
    '''
    Write the byte offset of each line.
    '''
    import fileinput
    import struct
    import sys

    def f(n): return struct.pack('L', n)

    def main():
        total = 0

        # Main processing..
        for n, line in enumerate(fileinput.input()):

            sys.stdout.write(f(total))

            total += len(line)

            # Status output.
            if not n % 1000:
                print >> sys.stderr, '%i lines processed' % n

        print >> sys.stderr, '%i lines processed' % (n + 1)


    if __name__ == '__main__':
        main()


    You use it (on linux) like so:
    cat large_file | ./genoffsets.py > index.dat

    And here's the getline.py script:
    #!/usr/bin/env python
    '''
    Usage: "getline.py <datafile> <indexfile> <num>"

    Prints line num from datafile using indexfile.
    '''
    import struct
    import sys

    fmt = 'L'
    fmt_size = struct.calcsize(fmt)


    def F(n, fn):
        '''
        Return the byte offset of line n from index file fn.
        '''
        f = open(fn)

        try:
            f.seek(n * fmt_size)
            data = f.read(fmt_size)
        finally:
            f.close()

        return struct.unpack(fmt, data)[0]


    def getline(n, data_file, index_file):
        '''
        Return line n from data file using index file.
        '''
        n = F(n, index_file)
        f = open(data_file)

        try:
            f.seek(n)
            data = f.readline()
        finally:
            f.close()

        return data


    if __name__ == '__main__':
        dfn, ifn, lineno = sys.argv[-3:]
        n = int(lineno)
        print getline(n, dfn, ifn)
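
    You use it like so:
    ./getline.py large_file index.dat 12345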



    Hope this helps,
    ~Simon
     
    Simon Forman, Jul 26, 2006
    #7
