Reading HTTP response body that is gzip'd *and* in UTF-8

James Marshall

I'm writing an HTTP client that handles gzip'd content as well as UTF-8
text, including when a response body is both gzip'd and in UTF-8.

I'm newish to both compression and PerlIO layers, so I'd like a second
opinion from someone who knows them better than I do. Does the code below
look correct? The goal is to end up with the uncompressed body in $body,
and interpreted as UTF-8 if identified as such by "charset".

I appreciate not wanting to use utf8::upgrade() ; is there a better way to
handle it in this case, or is this one of those cases where it's
legitimately needed?

Finally, does anyone know if Compress::Zlib::memGzip() handles UTF-8 input
correctly, or do I need to "utf8::downgrade($body)" before compressing it?

=======================================================

use Compress::Zlib ;

# Assume S is the socket, and $is_gzipped and $is_utf8 are set correctly
# from the HTTP response headers, which have just been read from S.

if ($is_gzipped) {
    $body= &read_full_body(S) ;
    $body= Compress::Zlib::memGunzip($body) ;
    if ($is_utf8) {
        utf8::upgrade($body) ;
    }
} else {   # not gzip'd
    if ($is_utf8) {
        binmode(S, ':encoding(utf8)') ;
    }
    $body= &read_full_body(S) ;
}

# $body should now contain response body in workable format.
 
Ben Morrow

Quoth (e-mail address removed):
> I'm writing an HTTP client that handles gzip'd content as well as UTF-8
> text, including when a response body is both gzip'd and in UTF-8.
>
> I'm newish to both compression and PerlIO layers, so I'd like a second
> opinion from someone who knows them better than I do. Does the code below
> look correct? The goal is to end up with the uncompressed body in $body,
> and interpreted as UTF-8 if identified as such by "charset".
>
> I appreciate not wanting to use utf8::upgrade() ; is there a better way to
> handle it in this case, or is this one of those cases where it's
> legitimately needed?

It's never (IMHO) legitimately needed. The only possible case is where
some XS code has messed something up. utf8::upgrade doesn't change
anything that's visible at the Perl level. All it does is change how
perl represents the data internally, but you don't care about that.
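
To make that concrete, here is a minimal sketch of the difference between utf8::upgrade and utf8::decode. The sample bytes are an assumption chosen for illustration (the two-byte UTF-8 encoding of "é"); they are not from the original code.

```perl
use strict;
use warnings;

# "é" as two raw UTF-8 bytes, as they might arrive off the socket.
my $bytes = "\xC3\xA9";

# utf8::upgrade only changes the internal storage format.
my $upgraded = $bytes;
utf8::upgrade($upgraded);
my $same         = ($upgraded eq $bytes) ? 1 : 0;   # still the same string
my $len_upgraded = length $upgraded;                # still two characters

# utf8::decode interprets the bytes as UTF-8 and yields characters.
my $decoded = $bytes;
utf8::decode($decoded);
my $len_decoded = length $decoded;                  # one character, U+00E9

print "upgraded: $len_upgraded chars, decoded: $len_decoded char\n";
```

In other words, upgrading leaves you with the same two "characters" you started with, which is why it does not help with interpreting a UTF-8 body.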

> Finally, does anyone know if Compress::Zlib::memGzip() handles UTF-8 input
> correctly, or do I need to "utf8::downgrade($body)" before compressing it?

I don't know what Compress::Zlib does about it, but it's pretty
meaningless to apply gzip to a stream of characters. gzip is not defined on
characters, it's defined on bytes; so you need to convert from
characters to bytes first. This is Encode::encode, or the :encoding layer on
output.
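
A rough sketch of that order of operations, assuming the body is held in memory and that UTF-8 is the target charset. The sample text and variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();
use Compress::Zlib ();

# A character string (contains non-ASCII and a non-Latin-1 character).
my $text = "na\x{EF}ve caf\x{E9} \x{263A}";

# Characters -> bytes first, then compress the bytes.
my $bytes = Encode::encode('UTF-8', $text);
my $gz    = Compress::Zlib::memGzip($bytes);
defined $gz or die "memGzip failed\n";

# Reverse direction: uncompress to bytes, then bytes -> characters.
# A copy is passed in case memGunzip consumes its input buffer.
my $copy       = $gz;
my $round_trip = Encode::decode('UTF-8', Compress::Zlib::memGunzip($copy));

print "round trip ok\n" if $round_trip eq $text;
```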

> =======================================================
>
> use Compress::Zlib ;
>
> # Assume S is the socket, and $is_gzipped and $is_utf8 are set correctly
> # from the HTTP response headers, which have just been read from S.
>
> if ($is_gzipped) {
>     $body= &read_full_body(S) ;

Don't call subs with & unless you know why.

>     $body= Compress::Zlib::memGunzip($body) ;

Is there a good reason why you're not using PerlIO::gzip? (I've never
used it, so there may be some reason it doesn't work I'm not aware
of...?)

>     if ($is_utf8) {
>         utf8::upgrade($body) ;

This should be either

utf8::decode($body);

or (preferably)

$body = Encode::decode(utf8 => $body);

(and then you can handle all the other charsets the same way: just pass
Encode the value of the charset MIME parameter).
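
As a sketch of what "pass Encode the charset" could look like: decode_body below is a hypothetical helper, not part of the original code, and it assumes the charset string has already been parsed out of the Content-Type header.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: turn raw body bytes into characters using the
# charset named in the response headers.
sub decode_body {
    my ($raw, $charset) = @_;
    return $raw unless defined $charset;        # no charset: leave as bytes
    my $enc = Encode::find_encoding($charset)
        or die "Unknown charset: $charset\n";
    return $enc->decode($raw);
}

# The same text arriving in two different encodings.
my $utf8_text   = decode_body("caf\xC3\xA9", 'UTF-8');
my $latin1_text = decode_body("caf\xE9",     'ISO-8859-1');

print "same text\n" if $utf8_text eq $latin1_text;
```

Looking the encoding up with Encode::find_encoding first gives you a place to handle a charset name that Encode does not recognise.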

I would have written this more like

# push gzip first: closer to the outside world

#              +---------+      +-----+
# program <--> |:encoding| <--> |:gzip| <--> socket
#              +---------+      +-----+

if (is_gzip) {
    binmode S, ':gzip';
}

my $charset = get_charset;
binmode S, ":encoding($charset)";

# read data... when you've finished, and want the next request:
# (one :pop per layer pushed above, so only one if :gzip was not pushed)
binmode S, ':pop:pop';

and then just read or write characters.
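
A small sketch of the "just read characters" part, using an in-memory filehandle as a stand-in for the socket so it can run on its own. The :gzip layer is left out here because PerlIO::gzip is a separate CPAN module; the sample bytes are an assumption for illustration.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes standing in for what the socket would deliver.
my $raw = "caf\xC3\xA9\n";

open my $fh, '<', \$raw or die "open: $!";
binmode $fh, ':encoding(UTF-8)';     # decode on the way in

my $line = <$fh>;                    # already characters, not bytes
chomp $line;
close $fh;

print "read ", length($line), " characters\n";
```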

Ben
 
James Marshall

Thanks, Ben! I was able to get it working with utf8::encode() and
utf8::decode(). I appreciate the cleanliness of your other solutions, but
at this time they won't quite fit the situation (e.g. I need to know the
Content-Length: I'm sending out in an HTTP response.) I'm also a little
concerned that some of these features (utf8::decode(), ":pop") are
documented as experimental, but hopefully they've been around long enough
by now to lose that condition. I'd love to have this program work in Perl
5.6.1 (since its users may have zero control over how their server is
configured), but that may be asking too much.
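
For the Content-Length point, a minimal sketch assuming the whole body is built in memory before sending. The header values and sample text are illustrative only. The length has to be taken from the final bytes (after encoding and compressing), not from the character string.

```perl
use strict;
use warnings;
use Encode ();
use Compress::Zlib ();

my $text  = "caf\x{E9} \x{263A}";               # 6 characters
my $bytes = Encode::encode('UTF-8', $text);     # 9 bytes once encoded
my $gz    = Compress::Zlib::memGzip($bytes);
defined $gz or die "memGzip failed\n";

# Content-Length counts the bytes actually sent on the wire.
my $header = join "\r\n",
    'HTTP/1.1 200 OK',
    'Content-Type: text/plain; charset=UTF-8',
    'Content-Encoding: gzip',
    'Content-Length: ' . length($gz),
    '', '';

printf "chars=%d bytes=%d\n", length($text), length($bytes);
```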

The reason I'm not using PerlIO::gzip is mostly just ignorance. The
0.17 version number is kinda low, but maybe I shouldn't worry about that.

So thank you for your helpful and educational post; it clarified some
things.

Cheers,
James
.............................................................................
James Marshall (e-mail address removed) Berkeley, CA @}-'-,--
"Teach people what you know."
.............................................................................


On Sat, 5 Aug 2006, Ben Morrow wrote:

[Full quote of Ben Morrow's message, reproduced above, snipped]
 
