Nokogiri help


Jeremy Woertink

I keep getting this error
"encoding error : output conversion failed due to conv error, bytes 0xA0
0x69 0x64 0xC2
I/O error : encoder error"

whenever I try to append my html string to Nokogiri::HTML.

When I write that doc to a file, some of the spaces, or possibly letters,
show up as weird-looking characters.

Here is my (partial) output in irb:
...
templates = templates_page.search('/html/body/table[3]/tr[2]/td[2]//a')
=> <a href="CAyxDAAO?e=footer">footer</a><a
href="CAyxDAAO?e=has-sub-sections">has-sub-sections</a><a
href="CAyxDAAO?e=header">header</a><a href="CAyxDAAO?e=home">home</a><a
href="CAyxDAAO?e=item-page">item-page</a><a
href="CAyxDAAO?e=justpro-888-519-5878">justpro-888-519-5878</a><a
href="CAyxDAAO?e=left-nav">left-nav</a><a
=> <pre><a
href="javascript:document.f3.SLID.value='F16';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">DIV</a>?id?<a
href="javascript:document.f3.SLID.value='F17';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">"footer"</a>
<a
href="javascript:document.f3.SLID.value='F18';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">DIV</a>?id?<a
href="javascript:document.f3.SLID.value='F19';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">"footer-icons"</a>
<a
href="javascript:document.f3.SLID.value='F20';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">IMG</a>?src?<a
href="javascript:document.f3.SLID.value='F21';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">"/lib/yhst-72759769340912/yahoo.gif"</a>
<a
href="javascript:document.f3.SLID.value='F22';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">IMG</a>?src?<a
href="javascript:document.f3.SLID.value='F23';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">"/lib/yhst-72759769340912/secure.gif"</a>
<a
href="javascript:document.f3.SLID.value='F24';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">DIV</a>?id?<a
href="javascript:document.f3.SLID.value='F25';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">"copyright"</a>
<a
href="javascript:document.f3.SLID.value='F26';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">TEXT</a>?<a
href="javascript:document.f3.SLID.value='F27';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">@copyright</a>
<a
href="javascript:document.f3.SLID.value='F28';%20document.f3.submit();"
title="Select" onmouseover="window.status='Select'; return true;"
onmouseout="window.status='';">LINEBREAK said:
t = template_body.to_html.gsub(/[;|\s][a-zA-Z-]+[&|\s]/m) { |match|
if match != " return "
val = %{|#{match.scan(/[a-zA-Z-]+/).first}}
match.gsub(/[a-zA-Z-]+/, val)
end
}
=> "<pre><a
href=\"javascript:document.f3.SLID.value='F16';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">DIV</a>\240id\240<a
href=\"javascript:document.f3.SLID.value='F17';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">\"footer\"</a>\n <a
href=\"javascript:document.f3.SLID.value='F18';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">DIV</a>\240id\240<a
href=\"javascript:document.f3.SLID.value='F19';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">\"footer-icons\"</a>\n <a
href=\"javascript:document.f3.SLID.value='F20';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">IMG</a>\240src\240<a
href=\"javascript:document.f3.SLID.value='F21';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">\"/lib/yhst-72759769340912/yahoo.gif\"</a>\n
<a
href=\"javascript:document.f3.SLID.value='F22';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">IMG</a>\240src\240<a
href=\"javascript:document.f3.SLID.value='F23';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">\"/lib/yhst-72759769340912/secure.gif\"</a>\n
<a
href=\"javascript:document.f3.SLID.value='F24';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">DIV</a>\240id\240<a
href=\"javascript:document.f3.SLID.value='F25';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">\"copyright\"</a>\n <a
href=\"javascript:document.f3.SLID.value='F26';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">TEXT</a>\240<a
href=\"javascript:document.f3.SLID.value='F27';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
onmouseout=\"window.status='';\">@copyright</a>\n <a
href=\"javascript:document.f3.SLID.value='F28';%20document.f3.submit();\"
title=\"Select\" onmouseover=\"window.status='Select';true;\"
#{t}
eohtml
encoding error : output conversion failed due to conv error, bytes
0xA0 0x69 0x64 0xC2
I/O error : encoder error
=>


Is there something I should do differently?

Thanks,

~Jeremy Woertink
 

Aaron Patterson

Jeremy said:
I keep getting this error
"encoding error : output conversion failed due to conv error, bytes 0xA0
0x69 0x64 0xC2
I/O error : encoder error"

This is most definitely an encoding problem with the source document.

If the source document hasn't declared an encoding in the meta tags,
then libxml2 must guess the encoding of the document. Sometimes it gets
it wrong, and it looks like you've found one of those times.

I suggest attempting to parse the document outside Mechanize. Check the
encoding returned in the server headers, and use that when parsing.
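
A minimal sketch of the header check, assuming Ruby 1.9 or later. The helper name `charset_from_content_type` is made up for illustration; it only looks at the value of the Content-Type header:

```ruby
# Hypothetical helper: pull the charset parameter out of a Content-Type
# header value. Returns nil when the server did not declare one.
def charset_from_content_type(value)
  return nil if value.nil?
  match = value.match(/charset\s*=\s*"?([^\s;"]+)/i)
  match && match[1]
end

charset_from_content_type('text/html; charset=ISO-8859-1') # => "ISO-8859-1"
charset_from_content_type('text/html')                     # => nil
```

If this returns nil, the server is not declaring a charset and you have to look in the document itself or guess.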

Check the actual document source for an encoding, and try that.
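
A rough way to do that check in code, assuming Ruby 2.0 or later. `charset_from_meta` is a hypothetical helper, and the 1024-byte limit is an arbitrary assumption; it scans the raw bytes with a regex, so it can run before the real encoding is known:

```ruby
# Hypothetical helper: look for a charset declaration in a <meta> tag near
# the top of the raw HTML. Returns nil when none is found.
def charset_from_meta(html, limit = 1024)
  # Work on a binary copy of the first bytes so invalid sequences
  # cannot raise while matching.
  head = html.byteslice(0, limit).b
  match = head.match(/<meta[^>]*charset\s*=\s*["']?([A-Za-z0-9_.:-]+)/i)
  match && match[1]
end
```

A page with no meta declaration returns nil, which means falling back to the headers or to a guess.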

You may also need to make an educated guess. For example, some people will
create documents containing UTF-8 characters, but then declare the document as
using ISO-8859-1 encoding. :-(
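
One way to make that guess in code, assuming Ruby 2.0 or later: check whether the bytes are valid UTF-8 and, if not, treat them as ISO-8859-1. `to_utf8` is a hypothetical helper and ISO-8859-1 is only an assumed fallback. A lone 0xA0 byte like the one in the error is a non-breaking space in ISO-8859-1 but is not valid UTF-8 on its own:

```ruby
# Hypothetical helper: return a valid UTF-8 copy of the raw bytes.
# The fallback encoding is a guess, not something the page declares.
def to_utf8(raw, fallback = Encoding::ISO_8859_1)
  utf8 = raw.dup.force_encoding(Encoding::UTF_8)
  return utf8 if utf8.valid_encoding?

  # Not valid UTF-8: reinterpret the bytes in the fallback encoding
  # and transcode them to UTF-8.
  raw.dup.force_encoding(fallback).encode(Encoding::UTF_8)
end

fixed = to_utf8("\xA0id\xA0".b)
puts fixed.valid_encoding? # prints true
```

The cleaned-up string can then be handed to the parser with an explicit encoding, for example `Nokogiri::HTML(fixed, nil, 'UTF-8')`.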
 

Jeremy Woertink

I checked out the page response, and this is what I got back
=> {"cache-control"=>"private", "connection"=>"close",
"p3p"=>"policyref=\"http://p3p.yahoo.com/w3c/p3p.xml\", CP=\"CAO DSP COR
CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi
PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV\"",
"date"=>"Thu, 19 Nov 2009 04:39:19 GMT", "content-type"=>"text/html",
"content-encoding"=>"gzip", "set-cookie"=>"B=c4mf1f55g9ivn&b=3&s=9v;
expires=Tue, 02-Jun-2037 20:00:00 GMT; path=/; domain=.yahoo.com"}


so, where content-encoding is gzip, is this what "should" be UTF-8?

I just updated my libxml2 as well so I'm using libxml2 @2.7.3_0
(active). Is there an attribute I can set somewhere that will allow me
to parse the page using the gzip encoding?

Thanks for the help man!

~Jeremy
 

Aaron Patterson

Jeremy said:
I checked out the page response, and this is what I got back

=> {"cache-control"=>"private", "connection"=>"close",
"p3p"=>"policyref=\"http://p3p.yahoo.com/w3c/p3p.xml\", CP=\"CAO DSP COR
CUR ADM DEV TAI PSA PSD IVAi IVDi CONi TELo OTPi OUR DELi SAMi OTRi UNRi
PUBi IND PHY ONL UNI PUR FIN COM NAV INT DEM CNT STA POL HEA PRE GOV\"",
"date"=>"Thu, 19 Nov 2009 04:39:19 GMT", "content-type"=>"text/html",
"content-encoding"=>"gzip", "set-cookie"=>"B=c4mf1f55g9ivn&b=3&s=9v;
expires=Tue, 02-Jun-2037 20:00:00 GMT; path=/; domain=.yahoo.com"}


so, where content-encoding is gzip, is this what "should" be UTF-8?

No. That means they are just not specifying a character encoding. Was
there one in the HTML document itself?

Jeremy said:
I just updated my libxml2 as well so I'm using libxml2 @2.7.3_0
(active). Is there an attribute I can set somewhere that will allow me
to parse the page using the gzip encoding?

No. It should be unzipped before sending to the parser.
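
Content-Encoding: gzip only describes how the bytes were compressed on the wire; it says nothing about the character set. Many HTTP clients undo it for you, but if you end up with the compressed body, a sketch of the unzip step could look like this. `decoded_body` is a hypothetical helper, and `headers` is assumed to be a hash with lower-cased keys like the one quoted above:

```ruby
require 'zlib'
require 'stringio'

# Hypothetical helper: return the response body with gzip
# Content-Encoding undone, or the body unchanged if it was not gzipped.
def decoded_body(headers, body)
  if headers['content-encoding'].to_s.downcase == 'gzip'
    Zlib::GzipReader.new(StringIO.new(body)).read
  else
    body
  end
end
```

The character encoding still has to be worked out separately after this step.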

Jeremy said:
Thanks for the help man!

No problem. :)
 

Jeremy Woertink

Aaron said:
On Thu, Nov 19, 2009 at 01:54:47PM +0900, Jeremy Woertink wrote:

No. That means they are just not specifying a character encoding? Was
there one in the HTML document itself?

No, there's just a page full of crap >.< For example... here is the
first line when you view source

<html><html><head><title>Yahoo! Store Editor</title></head><body
bgcolor=ffffff link=0000e8 vlink=0000e8><!--body_unload--><table


Oh yeah! 2 HTML tags!!

Ok, well I got a bit of a start at least.

Thanks.

~Jeremy
 
