question on using tarfile to read a *.tar.gzip file

Discussion in 'Python' started by m_ahlenius, Feb 7, 2010.

  1. m_ahlenius

    m_ahlenius Guest

    Hi,

    I have a number of relatively large *.tar.gzip files to
    process. With the Python tarfile module, I see that I can access
    and extract them one at a time to a temporary dir, but that of
    course takes time.

    All that I need to do is to read the first and last lines of each file
    and then move on to the next one. I am not changing anything in these
    files - just reading. The file lines are not fixed lengths either,
    which makes it a bit more fun.

    Is there a way to do this, without decompressing each file to a temp
    dir? Like is there a method using some tarfile interface adapter to
    read a compressed file? Otherwise I'll just access each file, extract
    it, grab the 1st and last lines and then delete the temp file.

    thx

    'mark
     
    m_ahlenius, Feb 7, 2010
    #1

  2. Tim Chase

    Tim Chase Guest

    > Is there a way to do this, without decompressing each file to a temp
    > dir? Like is there a method using some tarfile interface adapter to
    > read a compressed file? Otherwise I'll just access each file, extract
    > it, grab the 1st and last lines and then delete the temp file.


    I think you're looking for the extractfile() method of the
    TarFile object:

       from glob import glob
       from tarfile import TarFile
       for fname in glob('*.tgz'):
         print fname
         tf = TarFile.gzopen(fname)
         for ti in tf:
           print ' %s' % ti.name
           f = tf.extractfile(ti)
           if not f: continue
           fi = iter(f) # f doesn't natively support next()
           first_line = fi.next()
           for line in fi: pass
           f.close()
           print "  First line: %r" % first_line
           print "  Last line: %r" % line
         tf.close()
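
For readers on Python 3, a rough equivalent of the loop above might look like the sketch below. The function name first_and_last_lines is made up for illustration. It assumes a gzip-compressed archive and returns lines as bytes, because extractfile() hands back a binary file object. Like the original, it still scans every member from start to finish.

```python
import tarfile


def first_and_last_lines(archive_path):
    """Yield (member_name, first_line, last_line) for each regular file.

    Hypothetical helper: lines come back as bytes, and both values are
    None for an empty member.
    """
    with tarfile.open(archive_path, "r:gz") as tf:
        for member in tf:
            if not member.isfile():      # skip directories, links, devices
                continue
            first = last = None
            with tf.extractfile(member) as f:
                for line in f:           # streams the member, no temp dir
                    if first is None:
                        first = line
                    last = line
            yield member.name, first, last
```

To process several archives, this could be called once per name returned by glob('*.tgz'), as in the loop above. A one-line member simply gets the same value for both the first and the last line.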

    If you just want the first & last lines, it's a little more
    complex if you don't want to scan the entire file (like I do with
    the for-loop), but the file-like object returned by extractfile()
    is documented as supporting seek() so you can skip to the end and
    then read backwards until you have sufficient lines. I wrote a
    "get the last line of a large file using seeks from the EOF"
    function which you can find at [1] which should handle the odd
    edge cases of $BUFFER_SIZE containing more or less than a full
    line and then reading backwards in chunks (if needed) until you
    have one full line, handling a one-line file, and other
    odd/annoying edge-cases. Hope it helps.
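
The seek-from-the-end idea described above can be sketched roughly as follows, in Python 3. This is not the function from [1]; the name last_line and the chunk_size parameter are assumptions made for illustration. It works on any seekable binary file object and ignores trailing blank lines. One caveat: a gzip stream can only be decompressed forwards, so on a .tar.gz archive the seeks themselves may still cost a pass over the compressed data.

```python
import io
import os


def last_line(f, chunk_size=4096):
    """Return the last non-empty line of a seekable binary file object.

    Hypothetical helper: reads backwards from EOF in chunk_size blocks
    and returns the line without its trailing newline (b"" if empty).
    """
    f.seek(0, os.SEEK_END)
    pos = f.tell()
    buf = b""
    while pos > 0:
        step = min(chunk_size, pos)
        pos -= step
        f.seek(pos)
        buf = f.read(step) + buf
        # Drop trailing newlines, then look for the newline that
        # precedes the last line.
        stripped = buf.rstrip(b"\r\n")
        idx = stripped.rfind(b"\n")
        if idx != -1:
            return stripped[idx + 1:]
    # Reached the start of the file: it holds at most one line.
    return buf.rstrip(b"\r\n")


if __name__ == "__main__":
    demo = io.BytesIO(b"first\nmiddle\nlast\n")
    print(last_line(demo, chunk_size=4))
```

With tarfile, the object returned by tf.extractfile(member) could be passed in as f. The first line is cheap by comparison: seek back to 0 and call readline().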

    -tkc

    [1]
    http://mail.python.org/pipermail/python-list/2009-January/1186176.html
     
    Tim Chase, Feb 7, 2010
    #2

  3. m_ahlenius

    m_ahlenius Guest


    Thanks Tim - this was very helpful. Just learning about tarfile.

    'mark
     
    m_ahlenius, Feb 8, 2010
    #3
