distinction between unzipping bytes and unzipping a file

Discussion in 'Python' started by webcomm, Jan 9, 2009.

  1. webcomm

    webcomm Guest

    Hi,
    In python, is there a distinction between unzipping bytes and
    unzipping a binary file to which those bytes have been written?

    The following code is, I think, an example of writing bytes to a file
    and then unzipping...

    import base64
    import zipfile

    # datum is a base64-encoded string of data downloaded from a web service
    decoded = base64.b64decode(datum)
    f = open('data.zip', 'wb')
    f.write(decoded)
    f.close()
    x = zipfile.ZipFile('data.zip', 'r')

    After looking at the preceding code, the provider of the web service
    gave me this advice...
    "Instead of trying to create a file, take the unzipped bytes and get a
    Unicode string of text from it."

    If so, I'm not sure how to do what he's suggesting, or if it's really
    different from what I've done.

    I find that I am able to unzip the resulting data.zip using the unix
    unzip command, but the file inside contains some FFFD characters, as
    described in this thread...
    http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
    I don't know if the unwanted characters might be the result of my
    trying to write and unzip a file, rather than unzipping the bytes.
    The file does contain a semblance of what I ultimately want -- it's
    not all garbage.

    Apologies if it's not appropriate to start a new thread for this. It
    just seems like a different topic than how to deal with the resulting
    FFFD characters.

    Thanks for your help,
    Ryan
     
    webcomm, Jan 9, 2009
    #1

  2. webcomm

    webcomm Guest

    On Jan 9, 2:49 pm, webcomm <> wrote:
    > decoded = base64.b64decode(datum)
    > #datum is a base64 encoded string of data downloaded from a web
    > service
    > f = open('data.zip', 'wb')
    > f.write(decoded)
    > f.close()
    > x = zipfile.ZipFile('data.zip', 'r')


    Sorry, that code is not what I meant to paste. This is what I
    intended...

    import base64
    import os

    # datum is a base64-encoded string of data downloaded from a web service
    decoded = base64.b64decode(datum)
    f = open('data.zip', 'wb')
    f.write(decoded)
    f.close()
    x = os.popen("unzip data.zip")
     
    webcomm, Jan 9, 2009
    #2

  3. webcomm

    Steve Holden Guest

    webcomm wrote:
    > Hi,
    > In python, is there a distinction between unzipping bytes and
    > unzipping a binary file to which those bytes have been written?
    >
    > The following code is, I think, an example of writing bytes to a file
    > and then unzipping...
    >
    > decoded = base64.b64decode(datum)
    > #datum is a base64 encoded string of data downloaded from a web
    > service
    > f = open('data.zip', 'wb')
    > f.write(decoded)
    > f.close()
    > x = zipfile.ZipFile('data.zip', 'r')
    >
    > After looking at the preceding code, the provider of the web service
    > gave me this advice...
    > "Instead of trying to create a file, take the unzipped bytes and get a
    > Unicode string of text from it."
    >

    Not terribly useful advice, but one presumes he, she, or it was trying
    to be helpful.

    > If so, I'm not sure how to do what he's suggesting, or if it's really
    > different from what I've done.
    >

    Well, what you have done appears pretty wrong to me, but let's take a
    look. What's datum? You appear to be treating it as base64-encoded data;
    is that correct? Have you examined it?

    f = open('data.zip', 'wb')

    opens the file data.zip for writing in binary. Not as a zip file, you
    understand, just as a regular file. I suspect here you really needed

    f = zipfile.ZipFile('data.zip', 'w')

    Now, of course, you need to remember what zipfiles contain. Which is
    other files. So the data you *write* to the zipfile has to be associated
    with a filename in the archive. Of course you don't have the data in a
    file, you have it in a string, so you would use

    f.writestr("somefile.dat", decoded)
    f.close()

    You have now written a zip file containing a single "somefile.dat" file
    with the decoded base64 data in it. Open it with Winzip or one of its
    buddies and see if anyone barfs.
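
    A minimal sketch of the writestr approach described above, assuming
    Python 3 and building the archive in memory instead of on disk. The
    function name make_archive and the member name are made up for
    illustration. Note that this wraps the payload in a *new* archive,
    which is only what you want if the payload is not already an archive
    itself.

```python
import io
import zipfile


def make_archive(member_name, payload):
    """Return the bytes of a new zip archive holding one member."""
    buf = io.BytesIO()
    # ZipFile accepts any seekable file-like object, not just a filename.
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(member_name, payload)
    return buf.getvalue()


archive_bytes = make_archive("somefile.dat", b"decoded payload goes here")
```

    The resulting bytes can be written to a file opened with 'wb' and
    inspected with any unzip tool.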


    > I find that I am able to unzip the resulting data.zip using the unix
    > unzip command, but the file inside contains some FFFD characters, as
    > described in this thread...
    > http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
    > I don't know if the unwanted characters might be the result of my
    > trying to write and unzip a file, rather than unzipping the bytes.
    > The file does contain a semblance of what I ultimately want -- it's
    > not all garbage.
    >

    But it's certainly not a zip file.

    > Apologies if it's not appropriate to start a new thread for this. It
    > just seems like a different topic than how to deal with the resulting
    > FFFD characters.
    >

    Don't worry about it.

    regards
    Steve
    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
     
    Steve Holden, Jan 9, 2009
    #3
  4. webcomm

    MRAB Guest

    webcomm wrote:
    > Hi,
    > In python, is there a distinction between unzipping bytes and
    > unzipping a binary file to which those bytes have been written?
    >

    Python's zipfile module can only read and write zip files; it can't
    compress or decompress data as a bytestring.
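
    That said, zipfile.ZipFile does accept a file-like object in place of
    a filename, so the archive never has to touch the disk. A rough
    sketch, assuming Python 3 and that the payload really is a
    base64-encoded zip archive; the function name unzip_base64_payload is
    hypothetical.

```python
import base64
import io
import zipfile


def unzip_base64_payload(datum):
    """Return {member name: raw bytes} for a base64-encoded zip archive."""
    decoded = base64.b64decode(datum)
    # io.BytesIO makes the bytes look like an open binary file.
    with zipfile.ZipFile(io.BytesIO(decoded), "r") as archive:
        return {name: archive.read(name) for name in archive.namelist()}
```

    From zipfile's point of view there is no difference between this and
    writing the bytes to data.zip first; the member contents come back as
    the same bytes either way.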

    > The following code is, I think, an example of writing bytes to a file
    > and then unzipping...
    >
    > decoded = base64.b64decode(datum)
    > #datum is a base64 encoded string of data downloaded from a web
    > service
    > f = open('data.zip', 'wb')
    > f.write(decoded)
    > f.close()
    > x = zipfile.ZipFile('data.zip', 'r')
    >
    > After looking at the preceding code, the provider of the web service
    > gave me this advice...
    > "Instead of trying to create a file, take the unzipped bytes and get a
    > Unicode string of text from it."
    >
    > If so, I'm not sure how to do what he's suggesting, or if it's really
    > different from what I've done.
    >

    If what you've been given is data which has been zipped and then base-64
    encoded, then I can't see what you might be doing wrong.

    > I find that I am able to unzip the resulting data.zip using the unix
    > unzip command, but the file inside contains some FFFD characters, as
    > described in this thread...
    > http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#
    > I don't know if the unwanted characters might be the result of my
    > trying to write and unzip a file, rather than unzipping the bytes.
    > The file does contain a semblance of what I ultimately want -- it's
    > not all garbage.
    >
    > Apologies if it's not appropriate to start a new thread for this. It
    > just seems like a different topic than how to deal with the resulting
    > FFFD characters.
    >
     
    MRAB, Jan 9, 2009
    #4
  5. webcomm

    webcomm Guest

    On Jan 9, 3:15 pm, Steve Holden <> wrote:
    > webcomm wrote:
    > > Hi,
    > > In python, is there a distinction between unzipping bytes and
    > > unzipping a binary file to which those bytes have been written?

    >
    > > The following code is, I think, an example of writing bytes to a file
    > > and then unzipping...

    >
    > > decoded = base64.b64decode(datum)
    > > #datum is a base64 encoded string of data downloaded from a web
    > > service
    > > f = open('data.zip', 'wb')
    > > f.write(decoded)
    > > f.close()
    > > x = zipfile.ZipFile('data.zip', 'r')

    >
    > > After looking at the preceding code, the provider of the web service
    > > gave me this advice...
    > > "Instead of trying to create a file, take the unzipped bytes and get a
    > > Unicode string of text from it."

    >
    > Not terribly useful advice, but one presumes he she or it was trying to
    > be helpful.
    >
    > > If so, I'm not sure how to do what he's suggesting, or if it's really
    > > different from what I've done.

    >
    > Well, what you have done appears pretty wrong to me, but let's take a
    > look. What's datum? You appear to be treating it as base64-encoded data;
    > is that correct? Have you examined it?


    It's data that has been compressed then base64 encoded by the web
    service. I'm supposed to download it, then decode, then unzip. They
    provide a C# example of how to do this on page 13 of
    http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf

    If you have a minute, see also this thread...
    http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4
     
    webcomm, Jan 9, 2009
    #5
  6. webcomm

    Chris Mellon Guest

    On Fri, Jan 9, 2009 at 2:32 PM, webcomm <> wrote:
    > On Jan 9, 3:15 pm, Steve Holden <> wrote:
    >> webcomm wrote:
    >> > Hi,
    >> > In python, is there a distinction between unzipping bytes and
    >> > unzipping a binary file to which those bytes have been written?

    >>
    >> > The following code is, I think, an example of writing bytes to a file
    >> > and then unzipping...

    >>
    >> > decoded = base64.b64decode(datum)
    >> > #datum is a base64 encoded string of data downloaded from a web
    >> > service
    >> > f = open('data.zip', 'wb')
    >> > f.write(decoded)
    >> > f.close()
    >> > x = zipfile.ZipFile('data.zip', 'r')

    >>
    >> > After looking at the preceding code, the provider of the web service
    >> > gave me this advice...
    >> > "Instead of trying to create a file, take the unzipped bytes and get a
    >> > Unicode string of text from it."

    >>
    >> Not terribly useful advice, but one presumes he she or it was trying to
    >> be helpful.
    >>
    >> > If so, I'm not sure how to do what he's suggesting, or if it's really
    >> > different from what I've done.

    >>
    >> Well, what you have done appears pretty wrong to me, but let's take a
    >> look. What's datum? You appear to be treating it as base64-encoded data;
    >> is that correct? Have you examined it?

    >
    > It's data that has been compressed then base64 encoded by the web
    > service. I'm supposed to download it, then decode, then unzip. They
    > provide a C# example of how to do this on page 13 of
    > http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf
    >
    > If you have a minute, see also this thread...
    > http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4
    >


    When they say "zip", they're talking about a zlib compressed stream of
    bytes, not a zip archive.

    You want to base64 decode the data, then zlib decompress it, then
    finally interpret it as (I think) UTF-16, as that's what Windows
    usually means when it says "Unicode".

    decoded = base64.b64decode(datum)
    decompressed = zlib.decompress(decoded)
    result = decompressed.decode('utf-16')
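
    Rather than guessing which of the two formats the service sends, the
    first few bytes of the decoded data can be inspected. A small sketch,
    assuming Python 3; sniff_container is a made-up helper name. It relies
    on the zip local file header signature (PK\x03\x04) and the zlib
    header rule from RFC 1950 (compression method 8, and the first two
    bytes read as a big-endian number are a multiple of 31).

```python
import zlib


def sniff_container(raw):
    """Guess whether raw bytes look like a zip archive or a zlib stream."""
    # A non-empty zip archive starts with a local file header signature.
    # (An empty archive starts with PK\x05\x06 and is not covered here.)
    if raw[:4] == b"PK\x03\x04":
        return "zip"
    # zlib header: low nibble of byte 0 is 8 (deflate) and the two-byte
    # header, read big-endian, is divisible by 31.
    if len(raw) >= 2 and raw[0] & 0x0F == 8 and (raw[0] * 256 + raw[1]) % 31 == 0:
        return "zlib"
    return "unknown"


print(sniff_container(zlib.compress(b"hello")))
```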
     
    Chris Mellon, Jan 9, 2009
    #6
  7. webcomm

    Chris Mellon Guest

    On Fri, Jan 9, 2009 at 3:08 PM, Chris Mellon <> wrote:
    > On Fri, Jan 9, 2009 at 2:32 PM, webcomm <> wrote:
    >> On Jan 9, 3:15 pm, Steve Holden <> wrote:
    >>> webcomm wrote:
    >>> > Hi,
    >>> > In python, is there a distinction between unzipping bytes and
    >>> > unzipping a binary file to which those bytes have been written?
    >>>
    >>> > The following code is, I think, an example of writing bytes to a file
    >>> > and then unzipping...
    >>>
    >>> > decoded = base64.b64decode(datum)
    >>> > #datum is a base64 encoded string of data downloaded from a web
    >>> > service
    >>> > f = open('data.zip', 'wb')
    >>> > f.write(decoded)
    >>> > f.close()
    >>> > x = zipfile.ZipFile('data.zip', 'r')
    >>>
    >>> > After looking at the preceding code, the provider of the web service
    >>> > gave me this advice...
    >>> > "Instead of trying to create a file, take the unzipped bytes and get a
    >>> > Unicode string of text from it."
    >>>
    >>> Not terribly useful advice, but one presumes he she or it was trying to
    >>> be helpful.
    >>>
    >>> > If so, I'm not sure how to do what he's suggesting, or if it's really
    >>> > different from what I've done.
    >>>
    >>> Well, what you have done appears pretty wrong to me, but let's take a
    >>> look. What's datum? You appear to be treating it as base64-encoded data;
    >>> is that correct? Have you examined it?

    >>
    >> It's data that has been compressed then base64 encoded by the web
    >> service. I'm supposed to download it, then decode, then unzip. They
    >> provide a C# example of how to do this on page 13 of
    >> http://forums.regonline.com/forums/docs/RegOnlineWebServices.pdf
    >>
    >> If you have a minute, see also this thread...
    >> http://groups.google.com/group/comp...7dd4?hl=en&lnk=gst&q=webcomm#5b9eceeee3e77dd4
    >>

    >
    > When they say "zip", they're talking about a zlib compressed stream of
    > bytes, not a zip archive.
    >
    > You want to base64 decode the data, then zlib decompress it, then
    > finally interpret it as (I think) UTF-16, as that's what Windows
    > usually means when it says "Unicode".
    >
    > decoded = base64.b64decode(datum)
    > decompressed = zlib.decompress(decoded)
    > result = decompressed.decode('utf-16')
    >



    And of course as *soon* as I write that, I read the appendix on the
    documentation in full and turn out to be wrong. Ignore me *sigh*.

    It would really help if you could post a sample file somewhere.
     
    Chris Mellon, Jan 9, 2009
    #7
  8. webcomm

    webcomm Guest

    On Jan 9, 4:12 pm, "Chris Mellon" <> wrote:
    > It would really help if you could post a sample file somewhere.


    Here's a sample with some dummy data from the web service:
    http://webcomm.webfactional.com/htdocs/data.zip

    That's the zip created in this line of my code...
    f = open('data.zip', 'wb')

    If I open the file it contains as unicode in my text editor (EditPlus)
    on Windows XP, there is ostensibly nothing wrong with it. It looks
    like valid XML. But if I return it to my browser with python+django,
    there are bad characters every other character.

    If I unzip it like this...
    popen("unzip data.zip")
    ....then the bad characters are 'FFFD' characters as described and
    pictured here...
    http://groups.google.com/group/comp.lang.python/browse_thread/thread/4f57abea978cc0bf?hl=en#

    If I unzip it like this...
    getzip('data.zip', ignoreable=30000)
    ....using the function at...
    http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
    ....then the bad characters are \x00 characters.
     
    webcomm, Jan 9, 2009
    #8
  9. webcomm

    John Machin Guest

    On Jan 10, 8:56 am, webcomm <> wrote:
    > On Jan 9, 4:12 pm, "Chris Mellon" <> wrote:
    >
    > > It would really help if you could post a sample file somewhere.

    >
    > Here's a sample with some dummy data from the web service:
    > http://webcomm.webfactional.com/htdocs/data.zip
    >
    > That's the zip created in this line of my code...
    > f = open('data.zip', 'wb')


    Your original problem is identical to that already reported by Chris
    Mellon (gratuitous \0 bytes appended to the real archive contents).
    Here's the output of the diagnostic gadget that I posted a few minutes
    ago:
    ...........................................................
    C:\downloads>python zip_susser_v2.py data.zip
    archive size is 1092
    FileHeader at 0
    CentralDir at 844
    EndArchive at 894
    using posEndArchive = 894
    endArchive: ('PK\x05\x06', 0, 0, 1, 1, 50, 844, 0)
    signature : 'PK\x05\x06'
    this_disk_num : 0
    central_dir_disk_num : 0
    central_dir_this_disk_num_entries : 1
    central_dir_overall_num_entries : 1
    central_dir_size : 50
    central_dir_offset : 844
    comment_size : 0

    expected_comment_size: 0
    actual_comment_size: 176
    comment is all spaces: False
    comment is all '\0': True
    comment (first 100 bytes):
    '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00
    \x00\x00\x00\x00\x00\x00\x00'
    ....................................
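
    For what it's worth, the padding can be cut off before the archive is
    handed to zipfile. This is only a sketch, assuming Python 3 and that
    the junk sits after the end-of-central-directory record as in the
    output above; strip_trailing_padding is a hypothetical helper, not
    part of the diagnostic script. The record is 22 bytes long, and its
    last two bytes hold the declared comment length.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"   # end-of-central-directory marker
EOCD_SIZE = 22                   # fixed part of the record, in bytes


def strip_trailing_padding(raw):
    """Drop anything after the end-of-central-directory record."""
    pos = raw.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    # The declared comment length is a little-endian uint16 at offset 20.
    (comment_len,) = struct.unpack("<H", raw[pos + 20:pos + 22])
    return raw[:pos + EOCD_SIZE + comment_len]
```

    With the numbers above this would keep 894 + 22 + 0 = 916 bytes and
    discard the remaining 176.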


    >
    > If I open the file it contains as unicode in my text editor (EditPlus)
    > on Windows XP, there is ostensibly nothing wrong with it.  It looks
    > like valid XML.


    Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
    God^H^H^HGates intended:

    >>> buff = open('data', 'rb').read()
    >>> buff[:100]
    '<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
    >>> buff[:100].decode('utf_16_le')
    u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
    >>>
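
    The same effect in Python 3 terms, as a sketch: the sample text is
    taken from the output above, and the variable names are made up.
    Decoding the bytes with a single-byte codec leaves a NUL after every
    character, which is the "bad character every other character"
    symptom; decoding as UTF-16LE gives clean text.

```python
# Sample text from the REPL session above, re-encoded the way the
# service appears to send it (UTF-16LE, no BOM).
text = "<Registration><BalanceDue>0.0000</BalanceDue>"
raw = text.encode("utf-16-le")

as_latin1 = raw.decode("latin-1")    # wrong codec: NUL after each character
as_utf16 = raw.decode("utf-16-le")   # right codec: clean text

print(repr(as_latin1[:8]))
print(as_utf16)
```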



    >  But if I return it to my browser with python+django,
    > there are bad characters every other character


    Please consider that we might have difficulty guessing what "return it
    to my browser with python+django" means. Show actual code.

    >
    > If I unzip it like this...
    > popen("unzip data.zip")
    > ...then the bad characters are 'FFFD' characters as described and
    > pictured here...http://groups.google.com/group/comp.lang.python/browse_thread/thread/...


    Yup, you've somehow pushed your utf_16_le-encoded data through some
    decoder that doesn't like '\x00' and is replacing it with U+FFFD whose
    name is (funnily enough) REPLACEMENT CHARACTER and whose meaning is
    "big fat Unicode version of the question mark".

    >
    > If I unzip it like this...
    > getzip('data.zip', ignoreable=30000)
    > ...using the function at...http://groups.google.com/group/comp.lang.python/msg/c2008e48368c6543
    > ...then the bad characters are \x00 characters.


    Hmmm ... shouldn't make a difference how you extracted 'data' from
    'data.zip'.

    Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

    Cheers,
    John
     
    John Machin, Jan 9, 2009
    #9
  10. webcomm

    webcomm Guest

    On Jan 9, 6:07 pm, John Machin <> wrote:
    > Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
    > God^H^H^HGates intended:
    >
    > >>> buff = open('data', 'rb').read()
    > >>> buff[:100]
    > '<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
    > >>> buff[:100].decode('utf_16_le')


    There it is. Thanks.

    > u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'
    >
    >
    >
    > >  But if I return it to my browser with python+django,
    > > there are bad characters every other character

    >
    > Please consider that we might have difficulty guessing what "return it
    > to my browser with python+django" means. Show actual code.


    I did stop and consider what code to show. I tried to show only the
    code that seemed relevant, as there are sometimes complaints on this
    and other groups when someone shows more than the relevant code. You
    solved my problem with decode('utf_16_le'). I can't find any
    description of that encoding on the WWW... and I thought *everything*
    was on the WWW. :)

    I didn't know the data was utf_16_le-encoded because I'm getting it
    from a service. I don't even know if *they* know what encoding they
    used. I'm not sure how you knew what the encoding was.

    > Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html


    Probably wouldn't hurt, though reading that HOWTO wouldn't have given
    me the encoding, I don't think.

    -Ryan


    > Cheers,
    > John
     
    webcomm, Jan 10, 2009
    #10
  11. webcomm

    John Machin Guest

    On Jan 11, 6:15 am, webcomm <> wrote:
    > On Jan 9, 6:07 pm, John Machin <> wrote:
    >
    > > Yup, it looks like it's encoded in utf_16_le, i.e. no BOM as
    > > God^H^H^HGates intended:

    >
    > > >>> buff = open('data', 'rb').read()
    > > >>> buff[:100]
    > > '<\x00R\x00e\x00g\x00i\x00s\x00t\x00r\x00a\x00t\x00i\x00o\x00n\x00>\x00<\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x000\x00.\x000\x000\x000\x000\x00<\x00/\x00B\x00a\x00l\x00a\x00n\x00c\x00e\x00D\x00u\x00e\x00>\x00<\x00S\x00t\x00a\x00t\x00'
    > > >>> buff[:100].decode('utf_16_le')

    >
    > There it is.  Thanks.
    >
    > > u'<Registration><BalanceDue>0.0000</BalanceDue><Stat'

    >
    > > >  But if I return it to my browser with python+django,
    > > > there are bad characters every other character

    >
    > > Please consider that we might have difficulty guessing what "return it
    > > to my browser with python+django" means. Show actual code.

    >
    > I did stop and consider what code to show.  I tried to show only the
    > code that seemed relevant, as there are sometimes complaints on this
    > and other groups when someone shows more than the relevant code.  You
    > solved my problem with decode('utf_16_le').  I can't find any
    > description of that encoding on the WWW... and I thought *everything*
    > was on the WWW.  :)


    Try searching using the official name UTF-16LE ... looks like a blind
    spot in the approximate matching algorithm(s) used by the search
    engine(s) that you tried :-(

    > I didn't know the data was utf_16_le-encoded because I'm getting it
    > from a service.  I don't even know if *they* know what encoding they
    > used.  I'm not sure how you knew what the encoding was.


    Actually looked at the raw data. Pattern appeared to be an alternation
    of 1 "meaningful" byte and one zero ('\x00') byte: => UTF16*. No BOM
    ('\xFE\xFF' or '\xFF\xFE') at start of file: => UTF16-?E. First byte
    is meaningful: => UTF16-LE.
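
    The eyeball test above can be roughed out in code. This is a heuristic
    sketch only, assuming Python 3 and mostly-ASCII content such as the
    XML here; guess_utf16_variant and the 0.8 threshold are arbitrary
    choices for illustration, not a general encoding detector.

```python
import codecs


def guess_utf16_variant(raw, sample_size=200):
    """Guess a UTF-16 flavour from the pattern of zero bytes, or None."""
    # A BOM at the start settles the byte order; the plain "utf-16"
    # codec reads it and strips it.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    sample = raw[:sample_size]
    pairs = len(sample) // 2
    if pairs == 0:
        return None
    even_zeros = sample[0::2].count(b"\x00")  # zeros in 1st byte of each pair
    odd_zeros = sample[1::2].count(b"\x00")   # zeros in 2nd byte of each pair
    # Meaningful byte first, zero byte second: little-endian.
    if even_zeros == 0 and odd_zeros >= 0.8 * pairs:
        return "utf-16-le"
    # Zero byte first, meaningful byte second: big-endian.
    if odd_zeros == 0 and even_zeros >= 0.8 * pairs:
        return "utf-16-be"
    return None
```

    Text with many non-Latin characters would not show the alternating
    zero pattern, so a None result here says nothing either way.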

    > > Please consider reading the Unicode HOWTO at http://docs.python.org/howto/unicode.html

    >
    > Probably wouldn't hurt,


    Definitely won't hurt. Could even help.

    > though reading that HOWTO wouldn't have given
    > me the encoding, I don't think.


    It wasn't intended to give you the encoding. Just read it.

    Cheers,
    John
     
    John Machin, Jan 10, 2009
    #11
