Reading a text file backwards

Jay

I have a very large text file (being read by a CGI script on a web server),
and I get memory errors when I try to read the whole file into a list of
strings. The problem is, I want to read the file backwards, starting with
the last line.

Previously, I did:

myfile = open('myfile.txt', 'r')
mylines = myfile.readlines()
myfile.close()
for line in range(len(mylines)-1, -1, -1):
    # do something with mylines[line]

This, however, caused a MemoryError, so I want to do something like:

myfile = open('myfile.txt', 'r')
for line in myfile:
    # do something with line
myfile.close()

Only, I want to iterate backwards, starting with the last line of the file.
Can anybody suggest a simple way of doing this? Do I need to jump around
with myfile.seek() and use myfile.readline() ?
 
Rick Holbert

Jay,

Try this:

myfile = open('myfile.txt', 'r')
mylines = myfile.readlines()
myfile.close()
mylines.reverse()

Rick
 
Andrew Dalke

Jay said:
Only, I want to iterate backwards, starting with the last line of the file.
Can anybody suggest a simple way of doing this? Do I need to jump around
with myfile.seek() and use myfile.readline() ?

Python Cookbook has a recipe. Or two.

http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/276149
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/120686

I've not looked at them to judge the quality.

Another approach is to read the lines forwards and save
each line's starting offset. Then iterate backwards
through the offsets, seeking to each one and reading a line.

def find_offsets(infile):
    offsets = []
    offset = 0
    for line in infile:
        offsets.append(offset)
        offset += len(line)
    return offsets

def iter_backwards(infile):
    # make sure it's seekable and at the start
    infile.seek(0)
    offsets = find_offsets(infile)
    for offset in offsets[::-1]:
        infile.seek(offset)
        yield infile.readline()

for line in iter_backwards(open("spam.py")):
    print(repr(line))

This won't work on MS Windows, because text mode converts
'\r\n' to '\n', so len(line) no longer matches the number
of bytes on disk and the offsets drift. You would instead
need something like

def find_offsets(infile):
    offsets = []
    while 1:
        offset = infile.tell()
        if not infile.readline():
            break
        offsets.append(offset)
    return offsets


Just submitted this solution to the cookbook.
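
For reference, here is a minimal Python 3 sketch that combines the tell()-based find_offsets with iter_backwards. It assumes the file is opened in binary mode, so offsets are plain byte positions and no newline translation takes place; the io.BytesIO sample data is only an illustration. Note that the offsets list still holds one integer per line, which is normally far smaller than the lines themselves.

```python
import io

def find_offsets(infile):
    """Return the byte offset at which each line of infile starts."""
    offsets = []
    while True:
        offset = infile.tell()
        if not infile.readline():
            break  # end of file reached
        offsets.append(offset)
    return offsets

def iter_backwards(infile):
    """Yield the lines of a seekable binary file, last line first."""
    infile.seek(0)
    offsets = find_offsets(infile)
    for offset in reversed(offsets):
        infile.seek(offset)
        yield infile.readline()

# Illustrative data with Windows-style line endings (an assumption).
sample = io.BytesIO(b"one\r\ntwo\r\nthree")
for line in iter_backwards(sample):
    print(repr(line))
```

With a real file you would pass open('myfile.txt', 'rb') instead of the BytesIO object.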

Andrew
 
Daniel Yoo

: Jay,

: Try this:

: myfile = open('myfile.txt', 'r')
: mylines = myfile.readlines()
: myfile.close()
: mylines.reverse()


Hi Rick,

But this probably won't work for Jay: he's running into memory issues
because the file's too large to hold in memory at once. The point is
to avoid readlines().

Here's a generator that iterates backwards across a file. We
first record the file position of each newline, and then go
through those offsets in reverse.

###

def backfileiter(myfile):
    """Iterates the lines of a file, but in reverse order."""
    myfile.seek(0)
    offsets = _getLineOffsets(myfile)
    myfile.seek(0)
    offsets.reverse()
    for i in offsets:
        myfile.seek(i+1)
        yield myfile.readline()

def _getLineOffsets(myfile):
    """Return a list of offsets where newlines are located."""
    offsets = [-1]
    i = 0
    while True:
        byte = myfile.read(1)
        if not byte:
            break
        elif byte == '\n':
            offsets.append(i)
        i += 1
    return offsets
###



For example:

###
.... hello world
.... this
.... is a
.... test""")
....
'test'
'is a\n'
'this\n'
'hello world\n'
'\n'
###
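
The session above is missing its opening lines. A self-contained Python 3 sketch that reproduces the same output is given here, assuming the input was an in-memory file holding the text shown. The use of io.StringIO and the name `sample` are my assumptions, not part of the original post.

```python
import io

def _getLineOffsets(myfile):
    """Return a list of offsets where newlines are located."""
    offsets = [-1]  # sentinel so the first line starts at offset -1 + 1 == 0
    i = 0
    while True:
        char = myfile.read(1)
        if not char:
            break
        elif char == '\n':
            offsets.append(i)
        i += 1
    return offsets

def backfileiter(myfile):
    """Iterate over the lines of a file in reverse order."""
    myfile.seek(0)
    offsets = _getLineOffsets(myfile)
    offsets.reverse()
    for i in offsets:
        myfile.seek(i + 1)  # the line starts just after the newline
        yield myfile.readline()

# Assumed sample input: the text begins with a newline, as the output suggests.
sample = io.StringIO("\nhello world\nthis\nis a\ntest")
for line in backfileiter(sample):
    print(repr(line))
```

If the input ends with a newline, this approach also yields one empty string first, because the final newline offset points at the end of the file.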


Hope this helps!
 
Graham Fawcett

It's just shifting the burden perhaps, but if you're on a Unix system
you should be able to use tac(1) to reverse your file a bit faster:

import os
for line in os.popen('tac myfile.txt'):
    # do something with the line
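
The same idea can be sketched with the subprocess module, which avoids passing the filename through a shell. The helper names iter_command_lines and tac_lines are hypothetical, and the sketch assumes a `tac` executable (GNU coreutils) is on the PATH.

```python
import subprocess

def iter_command_lines(args):
    """Yield a command's stdout one line at a time, without storing it all."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        for line in proc.stdout:
            yield line
    finally:
        # Runs on normal completion and when the caller stops early.
        proc.stdout.close()
        proc.wait()

def tac_lines(path):
    """Yield the lines of path in reverse order, assuming `tac` is installed."""
    return iter_command_lines(["tac", path])
```

Usage would look like `for line in tac_lines('myfile.txt'): ...`, with the reversal done by the external program rather than in Python.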
 
Andrew Dalke

Graham said:
It's just shifting the burden perhaps, but if you're on a Unix system
you should be able to use tac(1) to reverse your file a bit faster:

Huh. Hadn't heard of that one. It's not installed
on my OS X box. It's on my FreeBSD account as gtac.
Ah, but it is available on a Linux account.

Andrew
 
Jeremy Bowers

It's just shifting the burden perhaps, but if you're on a Unix system
you should be able to use tac(1) to reverse your file a bit faster:

import os
for line in os.popen('tac myfile.txt'):
    # do something with the line

It probably isn't shifting the burden; they probably do it right.

Doing it right involves reading the file in chunks backwards, and scanning
backwards for newlines, but getting it right when lines cross boundaries,
while perhaps not *hard*, is exactly the kind of tricky programming it is
best to do once... preferably somebody else's once. :)

This way you don't read the file twice, as the first time can take a while.
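
A rough Python 3 sketch of that chunked approach follows, in case tac is not available. It is an illustration under stated assumptions, not what tac itself does internally: the function name reverse_lines and the default chunk size are my own choices, the file is read in binary mode, and lines are split on b"\n" only. The partial line at the front of each chunk is carried over and joined with the next chunk read, which is how lines that cross chunk boundaries are handled.

```python
import os

def reverse_lines(path, chunk_size=8192):
    """Yield the lines of a file as bytes, last line first, reading once."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        buf = b""  # line (possibly incomplete) at the front of what we have read
        while pos > 0:
            read_size = min(chunk_size, pos)
            pos -= read_size
            f.seek(pos)
            buf = f.read(read_size) + buf
            first_nl = buf.find(b"\n")
            if first_nl == -1:
                continue  # no line boundary seen yet, keep reading backwards
            # Everything after the first newline consists of complete lines.
            complete = buf[first_nl + 1:]
            buf = buf[:first_nl + 1]
            parts = complete.split(b"\n")
            lines = [p + b"\n" for p in parts[:-1]]
            if parts[-1]:
                lines.append(parts[-1])  # last line of file, no trailing newline
            for line in reversed(lines):
                yield line
        if buf:
            yield buf  # the first line of the file
```

Memory use is bounded by the chunk size plus the longest line, so a very long single line is still held in memory in full.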
 
Paul Rubin

Andrew Dalke said:
Huh. Hadn't heard of that one. It's not installed
on my OS X box. It's on my FreeBSD account as gtac.
Ah, but it is available on a Linux account.

You can try tail(1).
 
