Nokogiri not getting html body sometimes

Jarmo Pertman · May 20, 2009

I'm using Mechanize to fetch an IMDb page and then Nokogiri's
Node#search method to extract some information from it, but I've
stumbled onto one special case where #search doesn't work properly.
All the other pages I've tried so far work as expected.

It seems that some special characters are causing trouble for
Nokogiri: when I printed the document itself, the output contained
only half of the <head> tag and no <body> tag at all!

Anyway, here is the code snippet, which I'd expect to output "false"
four times. Instead, it outputs false, false, true, false. Try it with
some other IMDb URL and it works fine.

require 'mechanize'

# Fetch the page with a browser-like User-Agent.
mech = WWW::Mechanize.new { |agent| agent.user_agent_alias = 'Windows Mozilla' }
mech.get("http://www.imdb.com/title/tt1092016/") do |page|
  puts page.search("/html").empty?       # false
  puts page.search("/html/head").empty?  # false
  puts page.search("/html/body").empty?  # true, although false is expected
  puts page.body.empty?                  # false
end

What could be the problem?

I'm using ruby 1.8.6 (2007-09-24 patchlevel 111) [i386-mswin32]

Lui Core · May 21, 2009

I think you'd better set the encoding first.

mech.get("http://www.imdb.com/title/tt1092016/") do |page|
  page.encoding = 'ISO-8859-1'
  # ... the rest of your code
end
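
One plausible reason the explicit encoding helps: a byte that is an ordinary character in ISO-8859-1 can be an invalid sequence when the same data is read as UTF-8, and a parser may stop or drop content at that point. The sketch below shows only that byte-level effect in plain Ruby, without Mechanize or Nokogiri. It assumes Ruby 2.0 or later (String#b and String#force_encoding are not available on 1.8.6), and the sample bytes are hypothetical, not taken from the IMDb page.

```ruby
# Hypothetical sample: 0xE9 is "é" in ISO-8859-1, but on its own it is
# not a valid UTF-8 sequence.
raw = "<title>Am\xE9lie</title>".b   # raw bytes, tagged as binary

# Interpreting the bytes as UTF-8 yields a string with invalid encoding.
as_utf8 = raw.dup.force_encoding('UTF-8')
puts as_utf8.valid_encoding?     # false

# Interpreting the same bytes as ISO-8859-1 is valid.
as_latin1 = raw.dup.force_encoding('ISO-8859-1')
puts as_latin1.valid_encoding?   # true

# Once the source encoding is known, the text converts cleanly to UTF-8.
fixed = as_latin1.encode('UTF-8')
puts fixed.valid_encoding?       # true
```

Whether this is exactly what happened with the page above is an assumption; the thread only confirms that setting page.encoding made the body appear.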

Jarmo Pertman · May 21, 2009

Thank you! It did the trick.

Best regards,
Jarmo
