Reading HTTP response body that is gzip'd *and* in UTF-8

Discussion in 'Perl Misc' started by James Marshall, Aug 4, 2006.

  1. I'm writing an HTTP client that handles gzip'd content as well as UTF-8
    text, including when a response body is both gzip'd and in UTF-8.

    I'm newish to both compression and PerlIO layers, so I'd like a second
    opinion from someone who knows them better than I do. Does the code below
    look correct? The goal is to end up with the uncompressed body in $body,
    and interpreted as UTF-8 if identified as such by "charset".

    I appreciate not wanting to use utf8::upgrade() ; is there a better way to
    handle it in this case, or is this one of those cases where it's
    legitimately needed?

    Finally, does anyone know if Compress::Zlib::memGzip() handles UTF-8 input
    correctly, or do I need to "utf8::downgrade($body)" before compressing it?

    =======================================================

    use Compress::Zlib ;

    # Assume S is the socket, and $is_gzipped and $is_utf8 are set correctly
    # from the HTTP response headers, which have just been read from S.

    if ($is_gzipped) {
        $body= &read_full_body(S) ;
        $body= Compress::Zlib::memGunzip($body) ;
        if ($is_utf8) {
            utf8::upgrade($body) ;
        }
    } else { # not gzip'd
        if ($is_utf8) {
            binmode(S, ':encoding(utf8)') ;
        }
        $body= &read_full_body(S) ;
    }

    # $body should now contain response body in workable format.
     
    James Marshall, Aug 4, 2006

  2. Ben Morrow

    Quoth James Marshall:
    > I'm writing an HTTP client that handles gzip'd content as well as UTF-8
    > text, including when a response body is both gzip'd and in UTF-8.
    >
    > I'm newish to both compression and PerlIO layers, so I'd like a second
    > opinion from someone who knows them better than I do. Does the code below
    > look correct? The goal is to end up with the uncompressed body in $body,
    > and interpreted as UTF-8 if identified as such by "charset".
    >
    > I appreciate not wanting to use utf8::upgrade() ; is there a better way to
    > handle it in this case, or is this one of those cases where it's
    > legitimately needed?


    It's never (IMHO) legitimately needed. The only possible case is where
    some XS code has messed something up. utf8::upgrade doesn't change
    anything that's visible at the Perl level. All it does is change how
    perl represents the data internally, but you don't care about that.
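
    As a minimal sketch of that difference (the variable names here are
    illustrative only, not from the original code): utf8::upgrade leaves
    the string's characters untouched, while utf8::decode reinterprets the
    bytes as UTF-8.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for U+00E9 (e with acute accent), as read off the wire.
my $bytes = "\xC3\xA9";

# utf8::upgrade changes only the internal representation: Perl code
# still sees the same two characters (0xC3, 0xA9).
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $same_as_before = ($upgraded eq $bytes) ? 1 : 0;
my $upgraded_len   = length $upgraded;

# utf8::decode interprets the bytes as UTF-8 and yields one character.
my $decoded = $bytes;
utf8::decode($decoded);
my $decoded_len = length $decoded;
my $decoded_ord = ord $decoded;

printf "upgraded: length %d, unchanged: %d\n", $upgraded_len, $same_as_before;
printf "decoded:  length %d, code point U+%04X\n", $decoded_len, $decoded_ord;
```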

    > Finally, does anyone know if Compress::Zlib::memGzip() handles UTF-8 input
    > correctly, or do I need to "utf8::downgrade($body)" before compressing it?


    I don't know what Compress::Zlib does about it, but it's pretty
    meaningless to apply gzip to a stream of characters. It's not defined on
    characters, it's defined on bytes; so you need to convert from
    characters to bytes. This is Encode::encode, or the :encoding layer on
    output.
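
    A minimal sketch of that order of operations, assuming UTF-8 as the
    target charset and using made-up sample text: encode characters to
    bytes first, then gzip the bytes. The copy before memGunzip is a
    precaution, since the Compress::Zlib documentation says that function
    destroys the contents of the buffer it is given.

```perl
use strict;
use warnings;
use Encode ();
use Compress::Zlib ();

# A character string containing one non-ASCII character (U+00E9).
my $text = "caf\x{E9} au lait";

# Characters -> bytes. The charset used here should match the one
# advertised in the Content-Type header.
my $bytes = Encode::encode('UTF-8', $text);

# Bytes -> gzip'd bytes.
my $gzipped = Compress::Zlib::memGzip($bytes);
defined $gzipped or die "memGzip failed";

# Round trip on a copy to check nothing was lost: gunzip, then decode.
my $copy       = $gzipped;
my $gunzipped  = Compress::Zlib::memGunzip($copy);
defined $gunzipped or die "memGunzip failed";
my $round_trip = Encode::decode('UTF-8', $gunzipped);

print "characters: ", length($text),  "\n";
print "bytes:      ", length($bytes), "\n";
print "round trip ok\n" if $round_trip eq $text;
```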

    >
    > =======================================================
    >
    > use Compress::Zlib ;
    >
    > # Assume S is the socket, and $is_gzipped and $is_utf8 are set correctly
    > # from the HTTP response headers, which have just been read from S.
    >
    > if ($is_gzipped) {
    > $body= &read_full_body(S) ;


    Don't call subs with & unless you know why.

    > $body= Compress::Zlib::memGunzip($body) ;


    Is there a good reason why you're not using PerlIO::gzip? (I've never
    used it, so there may be some reason it doesn't work I'm not aware
    of...?)

    > if ($is_utf8) {
    > utf8::upgrade($body) ;


    This should be either

    utf8::decode($body);

    or (preferably)

    $body = Encode::decode(utf8 => $body);

    (and then you can handle all the other charsets the same way: just pass
    Encode the value of the charset MIME parameter).
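
    One way this could look for the read side, as a sketch only:
    decode_body is a hypothetical helper, and its $is_gzipped and $charset
    arguments are assumed to have been parsed from the Content-Encoding and
    Content-Type headers already.

```perl
use strict;
use warnings;
use Encode ();
use Compress::Zlib ();

# Hypothetical helper: turn raw body bytes into a character string.
sub decode_body {
    my ($raw, $is_gzipped, $charset) = @_;

    if ($is_gzipped) {
        $raw = Compress::Zlib::memGunzip($raw);   # $raw is our own copy
        defined $raw or die "gunzip failed";
    }

    # No charset given: leave the body as bytes.
    return $raw unless defined $charset;

    # Encode::decode croaks if it does not recognise the charset name.
    return Encode::decode($charset, $raw);
}

# Simulate the same text arriving gzip'd in two different charsets.
my $latin1_bytes = "caf\xE9";
my $utf8_bytes   = "caf\xC3\xA9";

my $from_latin1 = decode_body(Compress::Zlib::memGzip($latin1_bytes), 1, 'ISO-8859-1');
my $from_utf8   = decode_body(Compress::Zlib::memGzip($utf8_bytes),   1, 'UTF-8');

print "same text from both charsets\n" if $from_latin1 eq $from_utf8;
```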

    I would have written this more like

    # push gzip first: closer to the outside world

    #              +---------+      +-----+
    # program <--> |:encoding| <--> |:gzip| <--> socket
    #              +---------+      +-----+

    if ($is_gzipped) {
        binmode S, ':gzip';    # needs PerlIO::gzip installed
    }

    my $charset = get_charset();
    binmode S, ":encoding($charset)";

    # read data... when you've finished, and want the next request,
    # pop only the layers that were actually pushed:
    binmode S, $is_gzipped ? ':pop:pop' : ':pop';

    and then just read or write characters.

    Ben

    --
    The Earth is degenerating these days. Bribery and corruption abound.
    Children no longer mind their parents, every man wants to write a book,
    and it is evident that the end of the world is fast approaching.
    Assyrian stone tablet, c.2800 BC
     
    Ben Morrow, Aug 5, 2006

  3. Thanks, Ben! I was able to get it working with utf8::encode() and
    utf8::decode(). I appreciate the cleanliness of your other solutions, but
    at this time they won't quite fit the situation (e.g. I need to know the
    Content-Length I'm sending out in an HTTP response.) I'm also a little
    concerned that some of these features (utf8::decode(), ":pop") are
    documented as experimental, but hopefully they've been around long enough
    by now to lose that status. I'd love to have this program work in Perl
    5.6.1 (since its users may have zero control over how their server is
    configured), but that may be asking too much.
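
    To illustrate the Content-Length constraint, here is a rough sketch
    (build_response and the header values are hypothetical, not taken from
    the actual program): the length that goes in the header is that of the
    final gzip'd byte string, so the whole payload has to be in memory
    before the headers are written.

```perl
use strict;
use warnings;
use Encode ();
use Compress::Zlib ();

# Hypothetical helper: build headers and payload for a gzip'd UTF-8 body.
sub build_response {
    my ($text) = @_;

    my $bytes   = Encode::encode('UTF-8', $text);      # characters -> bytes
    my $payload = Compress::Zlib::memGzip($bytes);     # bytes -> gzip'd bytes
    defined $payload or die "memGzip failed";

    # Content-Length counts the bytes actually sent: the compressed payload.
    my $headers = join "\r\n",
        'HTTP/1.1 200 OK',
        'Content-Type: text/plain; charset=UTF-8',
        'Content-Encoding: gzip',
        'Content-Length: ' . length($payload),
        '', '';

    return ($headers, $payload);
}

my ($headers, $payload) = build_response("caf\x{E9}");
print $headers;
```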

    The reason I'm not using PerlIO::gzip is mostly just ignorance. The
    0.17 version number is kinda low, but maybe I shouldn't worry about that.

    So thank you for your helpful and educational post, it clarified some
    things.

    Cheers,
    James
    .............................................................................
    James Marshall Berkeley, CA @}-'-,--
    "Teach people what you know."
    .............................................................................


     
    James Marshall, Aug 7, 2006
