Problem with tarfile module to open *.tar.gz files - unreliable ?

Discussion in 'Python' started by m_ahlenius, Aug 20, 2010.

  1. m_ahlenius

    m_ahlenius Guest

    Hi,

    I am relatively new to doing serious work in Python. I am using it to
    access a large number of log files. Some of the logs get corrupted
    and I need to detect that when processing them. This code seems to
    work for quite a few of the logs (all the same structure). It also
    correctly identifies some corrupt logs, but then it identifies others
    as being corrupt when they are not.

    example error msg from below code:

    Could not open the log file: '/disk/7-29-04-02-01.console.log.tar.gz'
    Exception: CRC check failed 0x8967e931 != 0x4e5f1036L

    When I manually examine the supposed corrupt log file and use
    "tar -xzvof /disk/7-29-04-02-01.console.log.tar.gz " on it, it opens
    just fine.

    Is there anything wrong with how I am using this module? (extra code
    removed for clarity)

    if tarfile.is_tarfile( file ):
        try:
            xf = tarfile.open( file, "r:gz" )
            for locFile in xf:
                logfile = xf.extractfile( locFile )
                validFileFlag = True
                # iterate through each log file, grab the first and the last lines
                lines = iter( logfile )
                firstLine = lines.next()
                for nextLine in lines:
                    ....
                        continue

                logfile.close()
                ...
            xf.close()
        except Exception, e:
            validFileFlag = False
            msg = "\nCould not open the log file: " + repr(file) + " Exception: " + str(e) + "\n"
    else:
        validFileFlag = False
        lTime = extractFileNameTime( file )
        msg = ">>>>>>> Warning " + file + " is NOT a valid tar archive\n"
        print msg
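
For readers on Python 3, the same check can be sketched roughly as below. This is a minimal sketch, not the poster's code: the names `first_and_last_lines` and `check_archive` are hypothetical, and the list of exceptions caught is an assumption about what a damaged .tar.gz may raise. Catching specific exception types, rather than a bare `Exception`, keeps the gzip-level failures (CRC, truncation) distinguishable from tar-level ones in the log.

```python
import tarfile
import zlib


def first_and_last_lines(archive_path):
    """Yield (member_name, first_line, last_line) for each regular file."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links, and so on
            stream = archive.extractfile(member)
            first = last = None
            for line in stream:  # lines are bytes objects
                if first is None:
                    first = line
                last = line
            yield member.name, first, last


def check_archive(archive_path):
    """Return (True, results) or (False, description of the error)."""
    try:
        return True, list(first_and_last_lines(archive_path))
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        # The exception class name hints at which layer rejected the file.
        return False, "%s: %s" % (type(exc).__name__, exc)
```

Reading every member to its end forces the data to be decompressed, so a problem anywhere in a member is likely to surface here rather than later in processing.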
     
    m_ahlenius, Aug 20, 2010
    #1

  2. m_ahlenius

    Dave Angel Guest

    I haven't used tarfile, but this feels like a problem with the Win/Unix
    line endings. I'm going to assume you're running on Windows, which
    could trigger the problem I'm going to describe.

    You use 'file' to hold something, but don't show us what. In fact, it's
    a lousy name, since it's already a Python builtin. But if it's holding
    a file object that you've opened separately, then you need to change
    that open call to use mode 'rb'.

    The problem, if I've guessed right, is that occasionally you'll
    accidentally encounter a 0d0a sequence in the middle of the (binary)
    compressed data. If you're on Windows, and use the default 'r' mode,
    it'll be changed into a single 0a byte, corrupting the checksum and
    eventually the contents.

    DaveA
     
    Dave Angel, Aug 20, 2010
    #2

  3. m_ahlenius

    m_ahlenius Guest

    Hi,

    Thanks for the comments - I'll change the variable name.

    I am running this on Linux, so I don't think it's a Windows issue. If
    that's the case, is the 0d0a sequence still an issue?

    'mark
     
    m_ahlenius, Aug 20, 2010
    #3
  4. m_ahlenius

    m_ahlenius Guest

    Oh, and what's currently stored in the file variable is just the
    unopened pathname of the target file I want to open.
     
    m_ahlenius, Aug 20, 2010
    #4
  5. m_ahlenius

    Dave Angel Guest

    No, on Linux, there should be no such problem. And I have to assume
    that if you pass the filename as a string, the library would use 'rb'
    anyway. The problem only arises if you pass a fileobj AND are on Windows.
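
To illustrate the distinction, here is a minimal Python 3 sketch of the two ways of handing an archive to tarfile. The helper names `names_from_path` and `names_from_fileobj` are hypothetical; the point is only that a file object passed via `fileobj=` should have been opened in binary mode ('rb'), whereas a path string leaves the opening to the library.

```python
import tarfile


def names_from_path(path):
    """Let tarfile open the file itself, given only the pathname."""
    with tarfile.open(path, "r:gz") as archive:
        return archive.getnames()


def names_from_fileobj(path):
    """Open the file ourselves, in binary mode, and pass the file object."""
    with open(path, "rb") as raw:  # 'rb', never the default text mode
        with tarfile.open(fileobj=raw, mode="r:gz") as archive:
            return archive.getnames()
```

Both calls should list the same members for a healthy archive.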

    Sorry I wasted your time, but nobody else had answered, and I hoped it
    might help.

    DaveA
     
    Dave Angel, Aug 20, 2010
    #5
  6. m_ahlenius

    Peter Otten Guest

    Random questions:

    What Python version are you using?
    If you have other Python versions around, do they exhibit the same problem?
    If you extract and recompress your data using the external tool, does the
    resulting file cause problems in Python, too?
    If so, can you reduce the data size and put a small demo online for others
    to experiment with?

    Peter
     
    Peter Otten, Aug 20, 2010
    #6
  7. m_ahlenius

    m_ahlenius Guest

    Hi Dave

    thanks for responding - you were not wasting my time but helping me to
    be aware of other potential issues.

    Appreciate it much.

    It's just weird that it works for most files and even finds corrupt
    ones, but some of the ones it marks as corrupt seem to be OK.

    thanks

    'mark
     
    m_ahlenius, Aug 20, 2010
    #7
  8. m_ahlenius

    m_ahlenius Guest

    Hi,

    I am using Python 2.6.5.

    Unfortunately I don't have other versions installed, so it's hard to
    test with a different version.

    As for the log compression, it's a bit hard to test. Right now I may
    process 100+ of these logs per night, and will get maybe 5 which are
    reported as corrupt (typically a bad CRC) and 2 which are reported as a
    bad tar archive. This morning I checked each of the 7 reported
    problem files by manually opening them with "tar -xzvof" and they were
    all indeed corrupt. Sigh.

    Unfortunately due to the nature of our business, I can't post the data
    files online, I hope you can understand. But I really appreciate your
    suggestions.

    The thing that gets me is that it seems to work just fine for most
    files, but then not others. Labeling normal files as corrupt hurts us
    as we then skip getting any log data from those files.

    appreciate all your help.

    'mark
     
    m_ahlenius, Aug 20, 2010
    #8
  9. m_ahlenius

    Peter Otten Guest

    m_ahlenius wrote:

    > I am using Python 2.6.5.
    >
    > Unfortunately I don't have other versions installed, so it's hard to
    > test with a different version.
    >
    > As for the log compression, it's a bit hard to test. Right now I may
    > process 100+ of these logs per night, and will get maybe 5 which are
    > reported as corrupt (typically a bad CRC) and 2 which are reported as a
    > bad tar archive. This morning I checked each of the 7 reported
    > problem files by manually opening them with "tar -xzvof" and they were
    > all indeed corrupt. Sigh.


    So many corrupted files? I'd say you have to address the problem with your
    infrastructure first.

    > Unfortunately due to the nature of our business, I can't post the data
    > files online, I hope you can understand. But I really appreciate your
    > suggestions.
    >
    > The thing that gets me is that it seems to work just fine for most
    > files, but then not others. Labeling normal files as corrupt hurts us
    > as we then skip getting any log data from those files.
    >
    > appreciate all your help.


    I've written an autocorruption script,

    import sys
    import subprocess
    import tarfile

    def process(source, dest, data):
        # Flip every bit of the archive in turn and look for a variant that
        # tarfile rejects but the external tar still accepts.
        for pos in range(len(data)):
            for bit in range(8):
                new_data = data[:pos] + chr(ord(data[pos]) ^ (1<<bit)) + data[pos+1:]
                assert len(data) == len(new_data)
                out = open(dest, "wb")  # binary mode: the archive is not text
                out.write(new_data)
                out.close()
                try:
                    t = tarfile.open(dest)
                    for f in t:
                        t.extractfile(f)
                except Exception, e:
                    if 0 == subprocess.call(["tar", "-xf", dest]):
                        return pos, bit

    if __name__ == "__main__":
        source, dest = sys.argv[1:]
        data = open(source, "rb").read()
        print process(source, dest, data)

    and I can indeed construct an archive that is rejected by tarfile, but not
    by tar. My working hypothesis is that the python library is a bit stricter
    in what it accepts...
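
Since the reported error is a CRC failure, one way to narrow down where the two tools disagree is to test the gzip layer on its own, without involving tar parsing at all. The following is a minimal Python 3 sketch under that assumption; `gzip_layer_ok` is a hypothetical helper name, and the exceptions caught are the ones the gzip and zlib modules are known to raise for bad CRCs, truncated streams, and damaged compressed data.

```python
import gzip
import zlib


def gzip_layer_ok(path, chunk_size=1 << 16):
    """Decompress the whole stream; return (True, None) or (False, reason)."""
    try:
        with gzip.open(path, "rb") as stream:
            # Read to the very end so the trailing CRC and length get checked.
            while stream.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error) as exc:
        return False, "%s: %s" % (type(exc).__name__, exc)
    return True, None
```

If this check passes but tarfile still rejects the archive, the disagreement is more likely in the tar headers than in the compression layer.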

    Peter
     
    Peter Otten, Aug 20, 2010
    #9
  10. m_ahlenius

    m_ahlenius Guest

    Thanks - that's cool.

    A friend of mine suggested that he's seen similar behaviour when
    he uses Perl on these types of files while the OS (Unix) has not
    finished writing them. We have an rsync process which syncs our
    servers for these files, and they come down somewhat randomly. So it's
    conceivable, I think, that this process could be trying to open a file
    as it's being written. I know it sounds like a stretch, but my guess is
    that it's a possibility. I could verify that by comparing the timestamps
    of the errors in my log with the mod time on the original file.
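
The timestamp idea could also be turned into a guard before opening each archive. This is only a heuristic sketch in Python 3: `seems_settled` is a hypothetical name, and the 60-second threshold is an arbitrary assumption, not a measured value. A file whose modification time is very recent is treated as possibly still being written and can be retried later instead of being marked corrupt.

```python
import os
import time


def seems_settled(path, min_age_seconds=60.0, now=None):
    """Return True if the file was last modified at least min_age_seconds ago."""
    if now is None:
        now = time.time()
    age = now - os.path.getmtime(path)
    return age >= min_age_seconds
```

The optional `now` argument exists so the same function can be used to compare a file's mod time against the timestamp of an error already recorded in the log.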

    'mark
     
    m_ahlenius, Aug 21, 2010
    #10