Help with python-list archives

random joe · Jan 5, 2012

Hi. I am new to Python and wanted to search the python-list archives
for answers to my many questions, but I can't seem to get the archive
files to uncompress. What gives? From what I understand they are
gzip files, so I assumed the gzip module would work, but no! The best I
could do was to get a ton of Chinese characters using gzip and
zlib.decompress(). I would like to be courteous and search for my
answers before asking so as not to waste anyone's time. Does anyone
know how to uncompress these files into a readable text form?

Miki Tebeka · Jan 5, 2012

Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)

Also, can you give an example of the code and an input file?

random joe · Jan 5, 2012

Is the Google groups search not good enough?

That works, but I would like to run some regexes and set up some
defaults.

Also, can you give an example of the code and an input file?

Sure. Take the most recent file as an example: "2012-January.txt.gz".
If you use the Python doc example, this is the result. If I use "r" or
"rb", the result is the same.

import gzip
f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
data = f1.read()
data[:100]

'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'

f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
data = f2.read()
data[:100]

'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'

The docs and Google provide no clear answer. I even tried 7-Zip and
ended up with nothing but gibberish characters. There must be levels
of compression or something. Why could they not simply use the tar
format? Is there anywhere else one can download the archives?

Ian Kelly · Jan 5, 2012

Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)

My experience with the Google groups search (and Google groups in
general) in the past has been terrible. If you're looking for a
specific thread, it can actually be quite hard to find.

Ian Kelly · Jan 5, 2012

Interesting. I tried this on a Linux system using both gunzip and
your code, and both worked fine to extract that file. I also tried
your code on a Windows system, and I get the same result that you do.
This appears to be a bug in the gzip module under Windows.

I think there may be something peculiar about the archive files that
the module is not handling correctly. If I gunzip the file locally
and then gzip it again before trying to open it in Python, then
everything seems to be fine.

random joe · Jan 5, 2012

That is interesting. I wonder if anyone else has had the same issue?

Just to be thorough, I tried to uncompress using both Python 2.x and
3.x, and the results are unreadable text files in both cases. I have no
idea what the problem could be, especially without some way to compare
my files to the gunzip'ed files on a Linux machine.

MRAB · Jan 5, 2012

I've found that if I gunzip it twice (gunzip it and then gunzip the
result) using the gzip module I get the text file.

random joe · Jan 5, 2012

I've found that if I gunzip it twice (gunzip it and then gunzip the
result) using the gzip module I get the text file.

On a Windows machine? If so, can you post a code snippet, please?
Thanks.

MRAB · Jan 5, 2012

On a Windows machine? If so, can you post a code snippet, please?
Thanks.

import gzip

in_file = gzip.open(r"C:\2012-January.txt.gz")
out_file = open(r"C:\2012-January.txt.tmp", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

in_file = gzip.open(r"C:\2012-January.txt.tmp")
out_file = open(r"C:\2012-January.txt", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

random joe · Jan 5, 2012

EXCELLENT! Thanks.

This works; however, there is one more tiny hiccup. The text has lost
all significant indentation and newlines. Was this intended, or is this
a result of another bug?

random joe · Jan 5, 2012

Never mind. Notepad was the problem. After using a real editor, the text
is displayed correctly! Thanks for the help, everyone!

PS: I wonder why no one has added a note to the Python-list archives
to advise people about the bug?

Chris Angelico · Jan 5, 2012

This works; however, there is one more tiny hiccup. The text has lost
all significant indentation and newlines. Was this intended, or is this
a result of another bug?

I'm seeing it as plain text, with proper newlines. There's no
indentation as it just runs straight through, top-to-bottom; but you
should be able to see line breaks. Check your mail reader in case
something's getting botched there.

ChrisA

Chris Angelico · Jan 5, 2012

Never mind. Notepad was the problem. After using a real editor, the text
is displayed correctly! Thanks for the help, everyone!

.... or that could be your problem

ChrisA
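
For anyone who still wants to read the extracted file in Notepad: the
missing line breaks are most likely because the archive uses bare LF line
endings, which older versions of Notepad do not display as newlines. That
cause is an assumption, not something confirmed in this thread. A minimal
sketch of a conversion helper (lf_to_crlf is a hypothetical name) might
look like this:

```python
def lf_to_crlf(data):
    """Convert bare LF line endings to CRLF, leaving existing CRLF intact.

    `data` is the raw bytes of the extracted text file.
    """
    # Normalise any existing CRLF to LF first so they are not doubled,
    # then expand every LF to CRLF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Small in-memory demonstration with mixed line endings.
sample = b"From: someone\nSubject: test\r\n\nbody\n"
converted = lf_to_crlf(sample)
```

The converted bytes can then be written out with open(path, "wb") so that
Python does not translate the line endings a second time.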

Ian Kelly · Jan 6, 2012

import gzip

in_file = gzip.open(r"C:\2012-January.txt.gz")
out_file = open(r"C:\2012-January.txt.tmp", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

in_file = gzip.open(r"C:\2012-January.txt.tmp")
out_file = open(r"C:\2012-January.txt", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

One could also avoid creating the intermediate file by using a
StringIO to keep it in memory instead:

import gzip
from cStringIO import StringIO

in_file = gzip.open('2012-January.txt.gz')
tmp_file = StringIO(in_file.read())
in_file.close()
in_file = gzip.GzipFile(fileobj=tmp_file)
out_file = open('2012-January.txt', 'wb')
out_file.write(in_file.read())
in_file.close()
out_file.close()

Sadly, GzipFile won't read directly from another GzipFile instance
(ValueError: Seek from end not supported), so some sort of
intermediate is necessary.
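
On Python 3, where cStringIO no longer exists, the same in-memory idea can
be sketched with gzip.decompress() (available since Python 3.2). The helper
below is hypothetical: gunzip_fully and max_layers are names made up for
this example, not part of the gzip module. It keeps decompressing while the
data still starts with the gzip magic bytes, so it should cope with both
singly and doubly gzipped downloads.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_fully(data, max_layers=5):
    """Decompress `data` repeatedly while it still looks like gzip.

    Returns (payload, layers), where `layers` is how many times the data
    was decompressed. `max_layers` is an arbitrary safety limit.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers


# In-memory demonstration: build a double-gzipped payload and unwrap it.
original = b"From: someone\n\nhello\n"
double_gzipped = gzip.compress(gzip.compress(original))
text, layers = gunzip_fully(double_gzipped)
```

For a real archive, read the downloaded file with open(path, "rb").read(),
pass the bytes to gunzip_fully(), and write the result with open(out, "wb").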

Ian Kelly · Jan 6, 2012

PS: I wonder why no one has added a note to the Python-list archives
to advise people about the bug?

Probably nobody has noticed it until now. It seems to be a quirk of
the archive files that they are double-gzipped, and most people
probably just use gunzip or gzcat (or a higher-level tool that invokes
those) to extract them, which seems to be smart enough to handle it.

random joe · Jan 6, 2012

One could also avoid creating the intermediate file by using a
StringIO to keep it in memory instead:

Yes, StringIO is perfect for this. Many thanks to all who replied.

Anssi Saari · Jan 10, 2012

Ian Kelly said:
Probably nobody has noticed it until now. It seems to be a quirk of
the archive files that they are double-gzipped...

Interesting, but I don't think the files are actually double-gzipped. If
I download
http://mail.python.org/pipermail/python-list/2012-January.txt.gz with
wget in Cygwin or Unix, the file is 226753 bytes and singly gzipped.

However, if I download the same file with Firefox in Windows, then it's
226782 bytes and double gzipped. So maybe it's something in the browser
or server setup?
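
One way to check which variant a given download is: look at the gzip header
of the data, and again at the data after one round of decompression. The
sketch below reads the fixed header fields defined by the gzip format
(RFC 1952). The function name gzip_header_info is hypothetical, and the
sample bytes are the start of the output quoted earlier in this thread,
with the wrapped lines joined.

```python
import struct


def gzip_header_info(data):
    """Return basic gzip header fields, or None if `data` is not gzip."""
    if len(data) < 10 or data[:2] != b"\x1f\x8b":
        return None
    # Bytes 2-7: compression method, flags, modification time (little-endian).
    method, flags, mtime = struct.unpack("<BBI", data[2:8])
    name = None
    if flags & 0x08:  # FNAME: a NUL-terminated original file name follows
        pos = 10
        if flags & 0x04:  # FEXTRA: skip the 2-byte length and its payload
            (xlen,) = struct.unpack("<H", data[pos:pos + 2])
            pos += 2 + xlen
        end = data.index(b"\x00", pos)
        name = data[pos:end].decode("latin-1")
    return {"method": method, "mtime": mtime, "name": name}


# Start of the data returned by the first gzip.open() call in the thread.
once_decompressed = (
    b"\x1f\x8b\x08\x08x\n\x05O\x02\xff"
    b"/srv/mailman/archives/private/python-list/2012-January.txt\x00\xec"
)
info = gzip_header_info(once_decompressed)
```

If gzip_header_info() still returns a dict for data that has already been
decompressed once, the download was double-gzipped; if it returns None, the
data is already the plain text archive.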
