Mechanize Help

Jeremy Woertink

I checked the page's response headers, and this is what I got back:
=> {"cache-control"=>"private", "connection"=>"close",
"p3p"=>"policyref=\"http://p3p.yahoo.com/w3c/p3p.xml\", CP=\"CAO DSP COR
CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi
PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV\"",
"date"=>"Thu, 19 Nov 2009 04:39:19 GMT", "content-type"=>"text/html",
"content-encoding"=>"gzip", "set-cookie"=>"B=c4mf1f55g9ivn&b=3&s=9v;
expires=Tue, 02-Jun-2037 20:00:00 GMT; path=/; domain=.yahoo.com"}


I'm getting an encoding error when writing out the contents of this
page. The content-encoding shows gzip. Does anyone know a way I can tell
Mechanize to use a different encoding when parsing a page? Or possibly
another way I can do this?

Thanks,


~Jeremy
 
John W Higgins

Morning Jeremy,

You are getting the encoding error because you aren't dealing with a plain
string in this case but rather a gzipped one. You could technically go to the
request object and tell it you won't accept gzip-encoded responses, but I
personally find that distasteful: gzipping the pages saves everyone
bandwidth, and if we're scraping data we shouldn't be a nuisance (IMO). What
you want to do is use the Zlib::Inflate class to convert the response to a
regular string
(http://www.ruby-doc.org/stdlib/libdoc/zlib/rdoc/classes/Zlib/Inflate.html#M001974).
That should solve your problem.
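
A minimal sketch of decompressing a gzip-encoded body with Ruby's standard
zlib library. It assumes the raw, still-compressed bytes are available as a
string; the helper names gunzip_body and gzip_sample and the sample HTML are
hypothetical, not part of Mechanize. It uses Zlib::GzipReader rather than a
bare Zlib::Inflate, because gzip data carries its own header that a default
Inflate instance does not expect.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: decompress a gzip-encoded response body.
def gunzip_body(raw)
  Zlib::GzipReader.new(StringIO.new(raw)).read
end

# Hypothetical helper: build gzip-compressed bytes so the sketch
# can be run without a network request.
def gzip_sample(text)
  io = StringIO.new(String.new) # binary, mutable buffer
  gz = Zlib::GzipWriter.new(io)
  gz.write(text)
  gz.close
  io.string
end

compressed = gzip_sample("<html><body>hello</body></html>")
puts gunzip_body(compressed)
```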

John
 
Jeremy Woertink

Hmmm, interesting. I didn't think about that. I'm not familiar with the
class, though. I tried it and got an error:

Zlib::DataError: incorrect header check
from (irb):84:in `inflate'
from (irb):84
from :0
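
One possible reading of this error, offered as an assumption rather than a
diagnosis: Zlib::Inflate with default settings expects a zlib-wrapped stream,
so it raises Zlib::DataError both for gzip-wrapped data and for text that is
not compressed at all (for example, a body that has already been decoded).
The sketch below checks for the two gzip magic bytes before decompressing;
maybe_gunzip and inflate_error? are hypothetical helper names.

```ruby
require "zlib"
require "stringio"

# The first two bytes of every gzip stream.
GZIP_MAGIC = "\x1F\x8B".b

# Hypothetical helper: gunzip only when the body really looks like gzip;
# otherwise return it unchanged.
def maybe_gunzip(body)
  bytes = body.b
  return body unless bytes.start_with?(GZIP_MAGIC)
  Zlib::GzipReader.new(StringIO.new(bytes)).read
end

# Hypothetical helper: true when a default Zlib::Inflate rejects the data.
def inflate_error?(data)
  Zlib::Inflate.inflate(data)
  false
rescue Zlib::DataError
  true
end

plain = "<pre>already decoded</pre>"
puts inflate_error?(plain) # plain HTML is not a zlib stream
puts maybe_gunzip(plain)   # returned as-is
```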


I did notice that when I print out the string, it contains a lot of "\240"
sequences, like:

"<pre><a
href=\"javascript:document.f5.SLID.value='F99';%20document.f5.submit();\"
title=\"Select\" onmouseover=\"window.status='Select'; return true;\"
onmouseout=\"window.status='';\">DIV</a>\240id\240<a
href=\"javascript:document.f5.SLID.value='F100';%20document.f5.submit();\"
title=\"Select\" onmouseover=\"window.status='Select'; return true;\"
onmouseout=\"window.status='';\">\...."

I think these are where I'm getting messed up. Does anyone know a good
site that lists these characters? I think \240 might be a tab character,
but I want to check it against some list just to see.
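
For what it's worth, "\240" is an octal escape for byte 160 (0xA0), whereas a
tab is byte 9. In ISO-8859-1, byte 0xA0 is the non-breaking space. The sketch
below assumes the page is ISO-8859-1, which the headers above do not actually
state, and transcodes it to UTF-8; latin1_to_utf8 is a hypothetical helper
name.

```ruby
# "\240" is octal notation for a single byte with value 160 (0xA0).
p "\240".b.bytes # byte value of the escape
p "\t".bytes     # byte value of a tab, for comparison

# Hypothetical helper: treat the raw bytes as ISO-8859-1 (an assumption
# about the page) and transcode them to UTF-8.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

converted = latin1_to_utf8("DIV\240id".b)
p converted.valid_encoding?
```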

Thanks,

~Jeremy
 
