distinction between unzipping bytes and unzipping a file

webcomm

Hi,
In python, is there a distinction between unzipping bytes and
unzipping a binary file to which those bytes have been written?

The following code is, I think, an example of writing bytes to a file
and then unzipping...

decoded = base64.b64decode(datum)
# datum is a base64-encoded string of data downloaded from a web service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."

If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.

I find that I am able to unzip the resulting data.zip using the unix
unzip command, but the file inside contains some FFFD characters, as
described in this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
I don't know if the unwanted characters might be the result of my
trying to write and unzip a file, rather than unzipping the bytes.
The file does contain a semblance of what I ultimately want -- it's
not all garbage.

Apologies if it's not appropriate to start a new thread for this. It
just seems like a different topic than how to deal with the resulting
FFFD characters.

Thanks for your help,
Ryan
 
webcomm

decoded = base64.b64decode(datum)
# datum is a base64-encoded string of data downloaded from a web service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

Sorry, that code is not what I meant to paste. This is what I
intended...

decoded = base64.b64decode(datum)
# datum is a base64-encoded string of data downloaded from a web service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = popen("unzip data.zip")
 
Steve Holden

webcomm said:
Hi,
In python, is there a distinction between unzipping bytes and
unzipping a binary file to which those bytes have been written?

The following code is, I think, an example of writing bytes to a file
and then unzipping...

decoded = base64.b64decode(datum)
# datum is a base64-encoded string of data downloaded from a web service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."
Not terribly useful advice, but one presumes he, she, or it was trying to
be helpful.
If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.
Well, what you have done appears pretty wrong to me, but let's take a
look. What's datum? You appear to be treating it as base64-encoded data;
is that correct? Have you examined it?

f = open('data.zip', 'wb')

opens the file data.zip for writing in binary. Not as a zip file, you
understand, just as a regular file. I suspect here you really needed

f = zipfile.ZipFile('data.zip', 'w')

Now, of course, you need to remember what zipfiles contain. Which is
other files. So the data you *write* to the zipfile has to be associated
with a filename in the archive. Of course you don't have the data in a
file, you have it in a string, so you would use

f.writestr("somefile.dat", decoded)
f.close()

You have now written a zip file containing a single "somefile.dat" file
with the decoded base64 data in it. Open it with Winzip or one of its
buddies and see if anyone barfs.
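
As a rough sketch of the writestr() approach described above (written for Python 3; the payload and the in-memory buffer are stand-ins I have assumed, and a path such as 'data.zip' could be passed instead of the buffer). Note that this only makes sense when the bytes are raw content to be archived. If `decoded` is already a complete zip archive, as later posts in this thread suggest, this would wrap one archive inside another.

```python
import io
import zipfile

# Hypothetical payload standing in for the decoded data.
payload = b"example payload"

# Build a zip archive in memory; a filename would work here as well.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    # Every entry in an archive needs a member name.
    archive.writestr("somefile.dat", payload)

# Re-open the archive for reading and check what it contains.
buffer.seek(0)
with zipfile.ZipFile(buffer, "r") as archive:
    member_names = archive.namelist()
    round_tripped = archive.read("somefile.dat")
```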

I find that I am able to unzip the resulting data.zip using the unix
unzip command, but the file inside contains some FFFD characters, as
described in this thread...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
I don't know if the unwanted characters might be the result of my
trying to write and unzip a file, rather than unzipping the bytes.
The file does contain a semblance of what I ultimately want -- it's
not all garbage.
But it's certainly not a zip file.
Apologies if it's not appropriate to start a new thread for this. It
just seems like a different topic than how to deal with the resulting
FFFD characters.
Don't worry about it.

regards
Steve
 
MRAB

webcomm said:
Hi,
In python, is there a distinction between unzipping bytes and
unzipping a binary file to which those bytes have been written?
Python's zipfile module can only read and write zip files; it can't
compress or decompress data as a bytestring.
The following code is, I think, an example of writing bytes to a file
and then unzipping...

decoded = base64.b64decode(datum)
# datum is a base64-encoded string of data downloaded from a web service
f = open('data.zip', 'wb')
f.write(decoded)
f.close()
x = zipfile.ZipFile('data.zip', 'r')

After looking at the preceding code, the provider of the web service
gave me this advice...
"Instead of trying to create a file, take the unzipped bytes and get a
Unicode string of text from it."

If so, I'm not sure how to do what he's suggesting, or if it's really
different from what I've done.
If what you've been given is data which has been zipped and then base-64
encoded, then I can't see what you might be doing wrong.
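
On the original question of bytes versus a file: zipfile.ZipFile accepts a seekable file-like object as well as a filename, so the decoded bytes can be wrapped in io.BytesIO and read as an archive without touching the disk. A minimal sketch, written for Python 3; the function name and the sample data are made up for illustration and are not the web service's real response:

```python
import base64
import io
import zipfile


def unzip_base64_payload(datum):
    """Return {member name: raw bytes} for a base64-encoded zip archive."""
    decoded = base64.b64decode(datum)
    # Wrap the bytes in a file-like object instead of writing data.zip.
    with zipfile.ZipFile(io.BytesIO(decoded), "r") as archive:
        return {name: archive.read(name) for name in archive.namelist()}


# Build a stand-in for the web service response (hypothetical data).
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _archive:
    _archive.writestr("data", "<Registration/>".encode("utf-16-le"))
sample_datum = base64.b64encode(_buffer.getvalue())

members = unzip_base64_payload(sample_datum)
```

The result should be the same bytes either way; going through a temporary file and going through BytesIO differ only in where the archive lives while it is being read.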
 
webcomm

Not terribly useful advice, but one presumes he, she, or it was trying to
be helpful.


Well, what you have done appears pretty wrong to me, but let's take a
look. What's datum? You appear to be treating it as base64-encoded data;
is that correct? Have you examined it?

It's data that has been compressed then base64 encoded by the web
service. I'm supposed to download it, then decode, then unzip. They
provide a C# example of how to do this on page 13 of
http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf

If you have a minute, see also this thread...
http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4
 
Chris Mellon

It's data that has been compressed then base64 encoded by the web
service. I'm supposed to download it, then decode, then unzip. They
provide a C# example of how to do this on page 13 of
http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf

If you have a minute, see also this thread...
http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4

When they say "zip", they're talking about a zlib compressed stream of
bytes, not a zip archive.

You want to base64 decode the data, then zlib decompress it, then
finally interpret it as (I think) UTF-16, as that's what Windows
usually means when it says "Unicode".

decoded = base64.b64decode(datum)
decompressed = zlib.decompress(decoded)
result = decompressed.decode('utf-16')
 
Chris Mellon

When they say "zip", they're talking about a zlib compressed stream of
bytes, not a zip archive.

You want to base64 decode the data, then zlib decompress it, then
finally interpret it as (I think) UTF-16, as that's what Windows
usually means when it says "Unicode".

decoded = base64.b64decode(datum)
decompressed = zlib.decompress(decoded)
result = decompressed.decode('utf-16')


And of course as *soon* as I write that, I read the appendix on the
documentation in full and turn out to be wrong. Ignore me *sigh*.

It would really help if you could post a sample file somewhere.
 
webcomm

It would really help if you could post a sample file somewhere.

Here's a sample with some dummy data from the web service:
http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it. It looks
like valid XML. But if I return it to my browser with python+django,
there are bad characters every other character.

If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here...
http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#

If I unzip it like this...
getzip('data.zip', ignoreable=30000)
...using the function at...
http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.
 
John Machin

Here's a sample with some dummy data from the web service: http://webcomm.webfactional.com/htdocs/data.zip

That's the zip created in this line of my code...
f = open('data.zip', 'wb')

Your original problem is identical to that already reported by Chris
Mellon (gratuitous \0 bytes appended to the real archive contents).
Here's the output of the diagnostic gadget that I posted a few minutes
ago:
...........................................................
C:\downloads>python zip_susser_v2.py data.zip
archive size is 1092
FileHeader at 0
CentralDir at 844
EndArchive at 894
using posEndArchive = 894
endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
signature : 'PK\x05\x06'
this_disk_num : 0
central_dir_disk_num : 0
central_dir_this_disk_num_entries : 1
central_dir_overall_num_entries : 1
central_dir_size : 50
central_dir_offset : 844
comment_size : 0

expected_comment_size: 0
actual_comment_size: 176
comment is all spaces: False
comment is all '\0': True
comment (first 100 bytes):
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x00\x00\x00\x00\x00'
....................................
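
The numbers above add up: the end-of-archive record starts at 894 and is 22 bytes long without its comment, and 1092 - (894 + 22) = 176, which is the count of stray '\0' bytes. One way to drop such trailing bytes before handing the data to zipfile could look like the sketch below. It is written for Python 3, the function name is made up, and it assumes the last 'PK\x05\x06' signature in the data belongs to the real end record, which would not hold if the same four bytes happened to occur in a comment or in the trailing junk.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"  # end-of-central-directory record marker
EOCD_FIXED_SIZE = 22            # size of that record without its comment


def trim_after_end_record(data):
    """Drop any bytes that follow the end-of-central-directory record."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0 or pos + EOCD_FIXED_SIZE > len(data):
        raise ValueError("no complete end-of-central-directory record found")
    # The declared comment length is the last 2 bytes of the fixed record.
    (comment_size,) = struct.unpack_from("<H", data, pos + 20)
    return data[:pos + EOCD_FIXED_SIZE + comment_size]


# Demonstration on a small archive built in memory (hypothetical data).
_buffer = io.BytesIO()
with zipfile.ZipFile(_buffer, "w") as _archive:
    _archive.writestr("data", b"dummy")
clean = _buffer.getvalue()
padded = clean + b"\x00" * 176   # mimic the 176 stray NUL bytes
trimmed = trim_after_end_record(padded)
```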

If I open the file it contains as unicode in my text editor (EditPlus)
on Windows XP, there is ostensibly nothing wrong with it.  It looks
like valid XML.

Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
God^H^H^HGates intended:
>>> buff = open('data', 'rb').read()
>>> buff[:100]
'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
>>> buff[:100].decode('utf_16_le')
u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
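
For anyone following along in Python 3, here is a small sketch of the same decode step, plus one way to pass the text onward without the NUL bytes. The sample bytes are built in the script rather than read from the real 'data' file, and re-encoding as UTF-8 is my assumption about what a browser response would want, not something stated in this thread.

```python
# Stand-in for the first bytes of the extracted 'data' file.
raw = "<Registration><BalanceDue>0.0000</BalanceDue>".encode("utf-16-le")

# Every other byte is NUL, as in the dump above.
has_nul_bytes = b"\x00" in raw

# Decode the bytes into text using the encoding they were written in.
text = raw.decode("utf-16-le")

# Re-encode for output; ASCII characters take one byte each in UTF-8,
# so the interleaved NUL bytes disappear.
utf8_bytes = text.encode("utf-8")
```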

 But if I return it to my browser with python+django,
there are bad characters every other character

Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.
If I unzip it like this...
popen("unzip data.zip")
...then the bad characters are 'FFFD' characters as described and
pictured here... http://groups.google.com/group/comp.lang.python/browse_thread/thread/...

Yup, you've somehow pushed your utf_16_le-encoded data through some
decoder that doesn't like '\x00' and is replacing it with U+FFFD whose
name is (funnily enough) REPLACEMENT CHARACTER and whose meaning is
"big fat Unicode version of the question mark".
If I unzip it like this...
getzip('data.zip', ignoreable=30000)
...using the function at... http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
...then the bad characters are \x00 characters.

Hmmm ... shouldn't make a difference how you extracted 'data' from
'data.zip'.

Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

Cheers,
John
 
webcomm

Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
God^H^H^HGates intended:
>>> buff = open('data', 'rb').read()
>>> buff[:100]
'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
>>> buff[:100].decode('utf_16_le')
u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'

There it is. Thanks.




Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.

I did stop and consider what code to show. I tried to show only the
code that seemed relevant, as there are sometimes complaints on this
and other groups when someone shows more than the relevant code. You
solved my problem with decode('utf_16_le'). I can't find any
description of that encoding on the WWW... and I thought *everything*
was on the WWW. :)

I didn't know the data was utf_16_le-encoded because I'm getting it
from a service. I don't even know if *they* know what encoding they
used. I'm not sure how you knew what the encoding was.
Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

Probably wouldn't hurt, though reading that HOWTO wouldn't have given
me the encoding, I don't think.

-Ryan
 
John Machin

Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
God^H^H^HGates intended:
>>> buff = open('data', 'rb').read()
>>> buff[:100]
'<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
>>> buff[:100].decode('utf_16_le')
u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'

There it is.  Thanks.
Please consider that we might have difficulty guessing what "return it
to my browser with python+django" means. Show actual code.

I did stop and consider what code to show.  I tried to show only the
code that seemed relevant, as there are sometimes complaints on this
and other groups when someone shows more than the relevant code.  You
solved my problem with decode('utf_16_le').  I can't find any
description of that encoding on the WWW... and I thought *everything*
was on the WWW.  :)

Try searching using the official name UTF-16LE ... looks like a blind
spot in the approximate matching algorithm(s) used by the search
engine(s) that you tried :-(
I didn't know the data was utf_16_le-encoded because I'm getting it
from a service.  I don't even know if *they* know what encoding they
used.  I'm not sure how you knew what the encoding was.

Actually looked at the raw data. Pattern appeared to be an alternation
of 1 "meaningful" byte and one zero ('\x00') byte: => UTF16*. No BOM
('\xFE\xFF' or '\xFF\xFE') at start of file: => UTF16-?E. First byte
is meaningful: => UTF16-LE.
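
The same eyeball test can be written down as a rough heuristic. This is only a sketch for Python 3 with a made-up function name; it assumes the text is mostly ASCII or Latin-1, since other scripts do not produce the "one meaningful byte, one zero byte" pattern.

```python
import codecs


def guess_utf16_variant(raw, sample_size=100):
    """Guess a UTF-16 codec name from raw bytes, or return None."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # the BOM tells the codec which byte order to use
    sample = raw[:sample_size]
    sample = sample[:len(sample) - len(sample) % 2]  # whole 2-byte units
    if not sample:
        return None
    even, odd = sample[0::2], sample[1::2]
    # Meaningful byte first, zero byte second: little-endian.
    if odd.count(b"\x00") == len(odd) and even.count(b"\x00") == 0:
        return "utf-16-le"
    # Zero byte first, meaningful byte second: big-endian.
    if even.count(b"\x00") == len(even) and odd.count(b"\x00") == 0:
        return "utf-16-be"
    return None


guess_le = guess_utf16_variant("<Registration>".encode("utf-16-le"))
guess_be = guess_utf16_variant("<Registration>".encode("utf-16-be"))
guess_bom = guess_utf16_variant("<Registration>".encode("utf-16"))
guess_none = guess_utf16_variant(b"plain ascii")
```

A None result just means the pattern was not found; it says nothing about what the encoding actually is.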
Probably wouldn't hurt,

Definitely won't hurt. Could even help.
though reading that HOWTO wouldn't have given
me the encoding, I don't think.

It wasn't intended to give you the encoding. Just read it.

Cheers,
John
 
