Firefox HTML, my downloaded HTML and Firebug HTML are different?

Adam Akhtar

Hi, I'm a relatively new Rubyist and programmer in general. I'm currently
reading Everyday Scripting and trying out web scraping, using Amazon as a
target.

To work out suitable regular expressions, I first just viewed the page
source in Firefox. Shortly after, I found Firebug. I noticed that there
were some differences between the source shown by Firefox and the source
shown by Firebug: each seems to contain code that the other lacks.
Some of my regular expressions would work but they definitely matched
in the Firefox source view. I know because I copied the source into a
regex editor, applied my regex, and it highlighted the match. This made
me question the HTML I was grabbing, so I saved it to a text file from my
code. When I viewed the text file, it too was different from the Firefox
source, which is why my regexes were not matching.

For the moment, assume my regexes are right and that I'm more concerned
with why there are differences. Can anyone explain why this is
happening? Which version is the real source HTML?

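Here is my code:

require 'open-uri'

# Download the page and return its body as a string.
def get_web_page_text(a_url)
  # Ruby 2.5+ spells this URI.open; older versions used a bare open(a_url).
  URI.open(a_url).read
end

html = get_web_page_text('http://www.amazon.com/gp/product/0974514055')

# Save exactly what the server sent so it can be compared with the browser's view.
File.open('html.txt', 'w') do |out|
  out << html
end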
 
Adam Akhtar

Some of my regular expressions would work but they definitely matched
in the Firefox source view. I know because I copied the source into a
regex editor, applied my regex, and it highlighted the match.

should read:

Some of my regular expressions would NOT work ...
 
Michael Morin

The "view source" function of Firefox shows you the source code of the
page as it was downloaded from the server. Firebug is more
sophisticated: it shows you the DOM tree of the document. JavaScript
can alter the DOM tree (which is essentially what AJAX-driven pages do),
so you might be seeing the DOM tree after it has been modified by some
JavaScript code.

--
Michael Morin
Guide to Ruby
http://ruby.about.com/
Become an About.com Guide: beaguide.about.com
About.com is part of the New York Times Company
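
One way to test this explanation against a particular page is to take fragments copied from Firebug and see whether they occur in the HTML the server actually sent. The following is only a sketch: `missing_from_source`, `squish` and the sample markup are made up for illustration, and whitespace is collapsed before comparing on the assumption that the two views may format it differently.

```ruby
# Collapse runs of whitespace so formatting differences do not matter.
def squish(text)
  text.gsub(/\s+/, ' ').strip
end

# Return the fragments (e.g. copied from Firebug) that do NOT occur in the
# raw HTML. Hypothetical helper, not part of any library.
def missing_from_source(raw_html, fragments)
  source = squish(raw_html)
  fragments.reject { |fragment| source.include?(squish(fragment)) }
end

# Invented sample data standing in for the downloaded page.
raw_html  = "<div id=\"price\">\n  <span>Loading</span>\n</div>"
fragments = ['<span>Loading</span>', '<span>$19.99</span>']

# Anything listed here was probably added or changed after download.
p missing_from_source(raw_html, fragments)
```

A fragment reported as missing is a hint that a script (or the browser's own clean-up) produced it, so a regex written against it will not match the downloaded text.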
 
Thomas Bl.

Hello there.

I don't know why the HTML differs, but I think the clients must send
different data to the server in their requests. Maybe one of them has
cookies for the page and the other has none, or maybe the Accept header
is different. You can check what your Firefox sends using Tamper Data:
https://addons.mozilla.org/en-US/firefox/addon/966 .

Now if you want your own requests from Ruby to be more flexible, use
Net::HTTP instead.

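require 'net/http'

Net::HTTP.start("www.amazon.com") do |http|
  # Request headers; change these to mimic what your browser sends.
  header = {
    "Accept"     => "*/*",
    "User-Agent" => "MyRubyProgram"
  }

  # Send the headers with the request.
  response = http.get("/", header)

  p response.code      # status code
  p response.message   # status message
  p response.to_hash   # response headers
  p response.body      # the HTML exactly as the server sent it
end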
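
If the server really does vary its output by cookies or headers, it can help to build the request separately and inspect it before sending anything. This is a rough sketch under that assumption: `browser_like_request` is a hypothetical helper, and the header and cookie values are placeholders to be replaced with whatever your browser actually sends.

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a GET request carrying browser-like headers.
# The default values are placeholders; copy real ones from your browser.
def browser_like_request(url, extra_headers = {})
  request = Net::HTTP::Get.new(URI(url))
  defaults = {
    'User-Agent' => 'Mozilla/5.0 (placeholder)',
    'Accept'     => 'text/html,application/xhtml+xml'
  }
  defaults.merge(extra_headers).each { |name, value| request[name] = value }
  request
end

request = browser_like_request('http://www.amazon.com/gp/product/0974514055',
                               'Cookie' => 'session-id=placeholder')

# Inspect what would be sent before making any network call.
request.each_header { |name, value| puts "#{name}: #{value}" }
```

To send it, pass the request to `http.request(request)` inside a `Net::HTTP.start` block like the one above.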
 
Thomas Wieczorek

Every browser cleans up invalid markup. Each one has a different way
to do it. Firefox, for example, adds to every <table> a <tbody>, when
it doesn't exist. Firebug shows you the cleaned up source.
I once had to download a website because its markup was so bad, and I
searched for the table entry by hand. It had a path like
"\html\body\table\tr\td\tr\center\font\b\font". Quite annoying, but it
sped up the scraping.

You could try the hpricot gem to get data from websites if the regexes
become too complex.
 
Adam Akhtar

Thank you everyone for your help so far.

I tackled the problem by not viewing the Firefox source or Firebug's.
Instead, I just saved and viewed the HTML downloaded via open(a_url) with
open-uri. I wasn't sure whether this did any `tidy up` like Firefox or
Firebug, but after various tries I could be confident that what it
downloaded was the real deal.

That meant, though, that I'd have to view the HTML in something like
Notepad, and it isn't easy to read. I really wish I could rely on the code
Firebug displays, as it makes it so easy to find the areas you need to
restrict your searches to.

As a side note, is there a plugin or library that sits between your
code and the target web page? The plugin would `clean` the code as it is
downloaded, perhaps in an identical way to Firebug.

The upside would be that it would be easier to grab what you want, as
there would be a more regular structure. The downside, I guess, would be
longer run times. Just a thought though.

I used hpricot a while ago and it wasn't so great on badly designed
web pages, so I ended up resorting to regexps. But if I find a nice
website I think I'll give it another try!
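
On the point about the raw download being hard to read in Notepad: one possible workaround is to let Ruby list the lines that match a pattern, with their line numbers, so you only open the file at the interesting spot. A minimal sketch, assuming the page has already been saved; `matching_lines` is a made-up helper and the markup below is an invented stand-in for the saved file.

```ruby
# Return [line_number, line] pairs for every line of the text that matches
# the pattern, so the relevant region can be found without reading it all.
def matching_lines(text, pattern)
  text.each_line.with_index(1)
      .select { |line, _number| line =~ pattern }
      .map { |line, number| [number, line.strip] }
end

# Stand-in for File.read('html.txt'); the markup is invented.
html = <<~HTML
  <html>
  <body>
  <span class="price">$19.99</span>
  </body>
  </html>
HTML

matching_lines(html, /class="price"/).each do |number, line|
  puts "#{number}: #{line}"
end
```

Because this runs against the downloaded text itself, any pattern that finds a line here is a pattern that will also match in the scraper.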
 
Phlip

Adam said:
As a side note, is there a plugin or library that sits between your
code and the target web page? The plugin would `clean` the code as it is
downloaded, perhaps in an identical way to Firebug.

Why advertise your HTML is sloppy?

At work, we use assert_xpath, assert_tidy, and LibXML in all our functional
tests. They scream bloody murder if we have a single ill-formed ID. Then we
clean up our html.erb and keep going.
 
Florian Gilcher

Every browser cleans up invalid markup. Each one has a different way
to do it. Firefox, for example, adds to every <table> a <tbody>, when
it doesn't exist. Firebug shows you the cleaned up source.

Actually, if the table has no header, no footer and only one body, the
tbody tag is not required but is implicitly assumed.

So it always exists in the DOM displayed by Firebug (as it is added and
thus exists), but it does not exist when you work on the document with a
tool that does not build the DOM beforehand (the source viewer).

Regards,
Florian Gilcher
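
To make the tbody point concrete, here is a small sketch that removes the implied tags from markup copied out of Firebug so it can be compared with the raw source. It is a regex-based illustration only, not a real HTML parser; `strip_tbody` and the two sample strings are invented for the example.

```ruby
# Remove <tbody> and </tbody> tags, keeping everything inside them.
# Deliberately naive: an illustration, not an HTML parser.
def strip_tbody(html)
  html.gsub(%r{</?tbody\b[^>]*>}i, '')
end

# Invented samples: the table as sent by the server and as shown in the DOM.
raw_source = '<table><tr><td>Price</td></tr></table>'
dom_view   = '<table><tbody><tr><td>Price</td></tr></tbody></table>'

puts strip_tbody(dom_view) == raw_source  # prints true
```

The same idea applies to regexes and paths worked out in Firebug: leave the tbody step out, or make it optional, before using them on the downloaded HTML.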
 
