Help with python-list archives

random joe

Hi. I am new to Python and wanted to search the python-list archives
for answers to my many questions, but I can't seem to get the archive
files to uncompress. What gives? From what I understand they are
gzip files, so I assumed the gzip module would work, but no! The best I
could do was to get a ton of Chinese characters using gzip and
zlib.decompress(). I would like to be courteous and search for my
answers before asking so as not to waste anyone's time. Does anyone
know how to uncompress these files into a readable text form?
 
Miki Tebeka

Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)

Also, can you give an example of the code and an input file?
 
random joe

Is the Google groups search not good enough?

That works, but I would like to do some regexes and set up some
defaults.
Also, can you give an example of the code and an input file?

Sure. Take the most recent file as an example: "2012 - January.txt.gz".
If you use the Python doc example, this is the result. Whether I use "r"
or "rb", the result is the same.
>>> import gzip
>>> f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>> data = f1.read()
>>> data[:100]
'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>> f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>> data = f2.read()
>>> data[:100]
'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'

The docs and Google provide no clear answer. I even tried 7-Zip and
ended up with nothing but gibberish characters. There must be levels
of compression or something. Why could they not simply use the tar
format? Is there anywhere else one can download the archives?
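
Once an archive has been decompressed to text, the regex searching mentioned in this post could look roughly like the sketch below. It assumes the archive text carries mbox-style "Subject:" header lines; the function name find_subjects and the sample text are hypothetical and only for illustration.

```python
import re

def find_subjects(archive_text, pattern):
    """Return Subject header values that match the given regex (case-insensitive)."""
    # Assumption: each message has a header line starting with "Subject: ".
    subject_re = re.compile(r"^Subject: (.*)$", re.MULTILINE)
    wanted = re.compile(pattern, re.IGNORECASE)
    return [s for s in subject_re.findall(archive_text) if wanted.search(s)]

# Tiny made-up sample standing in for the decompressed archive text.
sample = (
    "From joe at example.com  Mon Jan  2 10:00:00 2012\n"
    "Subject: Help with python-list archives\n"
    "\n"
    "body text\n"
    "From ann at example.com  Mon Jan  2 11:00:00 2012\n"
    "Subject: Unrelated question\n"
)
print(find_subjects(sample, r"archives?"))
```

For a real file, the text would be read from the decompressed archive and passed in place of the sample.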
 
Ian Kelly

Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)

My experience with the Google groups search (and Google groups in
general) in the past has been terrible. If you're looking for a
specific thread, it can actually be quite hard to find.
 
Ian Kelly

Interesting. I tried this on a Linux system using both gunzip and
your code, and both worked fine to extract that file. I also tried
your code on a Windows system, and I get the same result that you do.
This appears to be a bug in the gzip module under Windows.

I think there may be something peculiar about the archive files that
the module is not handling correctly. If I gunzip the file locally
and then gzip it again before trying to open it in Python, then
everything seems to be fine.
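
One clue is already visible in the output posted earlier in the thread: the supposedly decompressed data begins with the bytes \x1f\x8b, which is the gzip magic number, followed by a header that embeds the original file name. That suggests the result of the first decompression is itself a gzip stream. A minimal check, where the helper name looks_gzipped is hypothetical:

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def looks_gzipped(data):
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC

# The first bytes of the output shown in the thread.
first_bytes = b"\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/"
print(looks_gzipped(first_bytes))
print(looks_gzipped(b"From joe at example.com"))
```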
 
random joe

That is interesting. I wonder if anyone else has had the same issue?

Just to be thorough, I tried to uncompress using both Python 2.x and
3.x, and the results are unreadable text files in both cases. I have no
idea what the problem could be, especially without some way to compare
my files to the gunzip'ed files on a Linux machine.
 
MRAB

I've found that if I gunzip it twice (gunzip it and then gunzip the
result) using the gzip module I get the text file.
 
random joe

I've found that if I gunzip it twice (gunzip it and then gunzip the
result) using the gzip module I get the text file.

On a Windows machine? If so, can you post a code snippet, please?
Thanks
 
MRAB

On a Windows machine? If so, can you post a code snippet, please?
Thanks

import gzip

in_file = gzip.open(r"C:\2012-January.txt.gz")
out_file = open(r"C:\2012-January.txt.tmp", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

in_file = gzip.open(r"C:\2012-January.txt.tmp")
out_file = open(r"C:\2012-January.txt", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()
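
The same two-pass approach can be sketched with context managers, so the files are closed even if a read fails, and with shutil.copyfileobj so the whole archive is not held in memory at once. The helper name gunzip_file is hypothetical, and the paths in the comments are the ones used above.

```python
import gzip
import shutil

def gunzip_file(src_path, dst_path):
    """Decompress one gzip layer from src_path into dst_path."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks

# Usage for a double-gzipped archive (two passes):
# gunzip_file(r"C:\2012-January.txt.gz", r"C:\2012-January.txt.tmp")
# gunzip_file(r"C:\2012-January.txt.tmp", r"C:\2012-January.txt")
```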
 
random joe

EXCELLENT! Thanks.

This works; however, there is one more tiny hiccup. The text has lost
all significant indentation and newlines. Was this intended, or is this a
result of another bug?
 
random joe

Never mind. Notepad was the problem. After using a real editor the text
is displayed correctly! Thanks for the help, everyone!

PS: I wonder why no one has added a note to the Python-list archives
to advise people about the bug?
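
A likely explanation for the Notepad behaviour (an assumption; the thread does not confirm it) is that the archive uses Unix-style LF line endings, which older versions of Notepad did not display as line breaks. If the file needs to open cleanly in such an editor, the line endings could be converted along these lines, where to_crlf is a hypothetical helper:

```python
def to_crlf(text):
    """Convert LF line endings to CRLF without doubling existing CRLF pairs."""
    # Normalise CRLF to LF first, then expand every LF to CRLF.
    return text.replace("\r\n", "\n").replace("\n", "\r\n")

print(repr(to_crlf("line one\nline two\r\nline three")))
```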
 
Chris Angelico

This works; however, there is one more tiny hiccup. The text has lost
all significant indentation and newlines. Was this intended, or is this a
result of another bug?

I'm seeing it as plain text, with proper newlines. There's no
indentation as it just runs straight through, top-to-bottom; but you
should be able to see line breaks. Check your mail reader in case
something's getting botched there.

ChrisA
 
Chris Angelico

Never mind. Notepad was the problem. After using a real editor the text
is displayed correctly! Thanks for the help, everyone!

.... or that could be your problem :)

ChrisA
 
Ian Kelly

import gzip

in_file = gzip.open(r"C:\2012-January.txt.gz")
out_file = open(r"C:\2012-January.txt.tmp", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

in_file = gzip.open(r"C:\2012-January.txt.tmp")
out_file = open(r"C:\2012-January.txt", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

One could also avoid creating the intermediate file by using a
StringIO to keep it in memory instead:

import gzip
from cStringIO import StringIO

in_file = gzip.open('2012-January.txt.gz')
tmp_file = StringIO(in_file.read())
in_file.close()
in_file = gzip.GzipFile(fileobj=tmp_file)
out_file = open('2012-January.txt', 'wb')
out_file.write(in_file.read())
in_file.close()
out_file.close()

Sadly, GzipFile won't read directly from another GzipFile instance
(ValueError: Seek from end not supported), so some sort of
intermediate is necessary.
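
The code above targets Python 2, where cStringIO is available. On Python 3 that module no longer exists, and an equivalent sketch would use io.BytesIO as the in-memory intermediate. The function name double_gunzip is hypothetical.

```python
import gzip
import io

def double_gunzip(raw_bytes):
    """Decompress data that has been gzipped twice, entirely in memory."""
    # First layer: wrap the raw bytes in a file-like object for GzipFile.
    with gzip.GzipFile(fileobj=io.BytesIO(raw_bytes)) as outer:
        once = outer.read()
    # Second layer: the intermediate result is itself a gzip stream.
    with gzip.GzipFile(fileobj=io.BytesIO(once)) as inner:
        return inner.read()

# Usage, with the file name from the thread:
# with open("2012-January.txt.gz", "rb") as f:
#     text_bytes = double_gunzip(f.read())
```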
 
Ian Kelly

PS: I wonder why no one has added a note to the Python-list archives
to advise people about the bug?

Probably nobody has noticed it until now. It seems to be a quirk of
the archive files that they are double-gzipped, and most people
probably just use gunzip or gzcat (or a higher-level tool that invokes
those) to extract them, which seems to be smart enough to handle it.
 
random joe

One could also avoid creating the intermediate file by using a
StringIO to keep it in memory instead:

Yes StringIO is perfect for this. Many thanks to all who replied.
 
Anssi Saari

Ian Kelly said:
Probably nobody has noticed it until now. It seems to be a quirk of
the archive files that they are double-gzipped...

Interesting, but I don't think the files are actually double-gzipped. If
I download
http://mail.python.org/pipermail/python-list/2012-January.txt.gz with
wget in Cygwin or Unix, the file is 226753 bytes and singly gzipped.

However, if I download the same file with Firefox in Windows, then it's
226782 bytes and double gzipped. So maybe it's something in the browser
or server setup?
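
Since the number of gzip layers seems to depend on how the file was downloaded, one defensive option is to keep decompressing while the data still starts with the gzip magic number. This is only a sketch for Python 3; the name gunzip_all and the max_layers safety limit are assumptions, not anything the archive or the gzip module prescribes.

```python
import gzip

def gunzip_all(data, max_layers=5):
    """Strip gzip layers until the data no longer looks gzipped.

    Returns the final bytes and the number of layers removed.
    """
    layers = 0
    # b"\x1f\x8b" is the gzip magic number; plain mail text will not start with it.
    while data[:2] == b"\x1f\x8b" and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers

# Usage, with the file name from the thread:
# with open("2012-January.txt.gz", "rb") as f:
#     text_bytes, layers = gunzip_all(f.read())
```

With this approach the same code handles both the singly gzipped wget download and the double-gzipped browser download.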
 
