question on using tarfile to read a *.tar.gzip file

m_ahlenius

Hi,

I have a relatively large number of *.tar.gzip files to
process. With the py module tarfile, I see that I can access and
extract them, one at a time to a temporary dir, but that of course
takes time.

All that I need to do is to read the first and last lines of each file
and then move on to the next one. I am not changing anything in these
files - just reading. The file lines are not fixed lengths either,
which makes it a bit more fun.

Is there a way to do this, without decompressing each file to a temp
dir? Like is there a method using some tarfile interface adapter to
read a compressed file? Otherwise I'll just access each file, extract
it, grab the 1st and last lines and then delete the temp file.

thx

'mark
 
Tim Chase

> Is there a way to do this, without decompressing each file to a temp
> dir? Like is there a method using some tarfile interface adapter to
> read a compressed file? Otherwise I'll just access each file, extract
> it, grab the 1st and last lines and then delete the temp file.

I think you're looking for the extractfile() method of the
TarFile object:

   from glob import glob
   import tarfile

   for fname in glob('*.tgz'):
       print(fname)
       # 'r:gz' reads a gzip-compressed tar without unpacking it to disk
       with tarfile.open(fname, 'r:gz') as tf:
           for ti in tf:
               print(' %s' % ti.name)
               f = tf.extractfile(ti)
               if f is None:  # directories, links etc. carry no data
                   continue
               # None if the member is empty; also covers one-line members
               first_line = last_line = next(f, None)
               for last_line in f:  # scan to the end, keeping the last line
                   pass
               f.close()
               print('  First line: %r' % first_line)
               print('  Last line: %r' % last_line)

If you just want the first & last lines and don't want to scan the
entire file (as the for-loop above does), it's a little more
complex. The file-like object returned by extractfile() is
documented as supporting seek(), so you can skip to the end and
then read backwards until you have enough lines. I wrote a
"get the last line of a large file using seeks from the EOF"
function, which you can find at [1]. It reads backwards in chunks
(if needed) until it has one full line, and should handle the odd
edge cases: $BUFFER_SIZE containing more or less than a full line,
a one-line file, and other odd/annoying cases. Hope it helps.
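
To make the seek-from-EOF idea concrete, here is a minimal sketch for Python 3. It is not the function posted at [1]; the names last_line, first_and_last_lines and chunk_size are made up for illustration. One caveat: as far as I know, gzip offers no real random access, so seek() on a member of a .tgz is emulated by decompressing. This avoids the temp dir and the line-by-line loop, but may not be faster than a plain scan, so it is worth timing on real data.

```python
import os
import tarfile


def last_line(f, chunk_size=4096):
    """Return the last line of a seekable binary file object.

    Reads backwards from EOF in chunks until a full line is in hand.
    A single trailing newline does not count as starting a new line.
    Returns b'' for an empty file.
    """
    f.seek(0, os.SEEK_END)
    pos = f.tell()
    buf = b''
    while pos > 0:
        step = min(chunk_size, pos)
        pos -= step
        f.seek(pos)
        buf = f.read(step) + buf
        # Search everything except the final byte, so that a newline
        # terminating the file is not mistaken for a line boundary.
        cut = buf.rfind(b'\n', 0, len(buf) - 1)
        if cut != -1:
            return buf[cut + 1:]
    return buf  # the whole file is one line (or empty)


def first_and_last_lines(archive_path):
    """Yield (member name, first line, last line) for each regular file."""
    with tarfile.open(archive_path, 'r:gz') as tf:
        for member in tf:
            f = tf.extractfile(member)
            if f is None:  # directories, links etc. carry no data
                continue
            with f:
                first = f.readline()
                yield member.name, first, last_line(f)
```

Looping over first_and_last_lines('logs.tgz') (a placeholder file name) gives one tuple per file in the archive. Lines come back as bytes, so decode them if you need text.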

-tkc

[1] http://mail.python.org/pipermail/python-list/2009-January/1186176.html
 
m_ahlenius

Thanks Tim - this was very helpful. Just learning about tarfile.

'mark
 
