firefox html, my downloaded html and firebug html different?

Adam Akhtar · Aug 16, 2008

Hi, I'm a relatively new Rubyist and programmer in general. I'm
currently reading Everyday Scripting and trying out web scraping, using
Amazon as a target.

To determine suitable regular expressions, I first just viewed the page
source via Firefox. Shortly after, I found Firebug. I noticed that there
were some differences between the source shown by Firefox and the source
shown by Firebug: each seems to contain code that the other lacks.
Some of my regular expressions would work but they definitely matched
in the Firefox source view; I know because I copied the source into a
regex editor, applied my regex, and it highlighted the match. This made
me question the HTML I was grabbing, so I saved it to a text file from
my code. When I viewed the text file, it too was different from the
Firefox source, which is why my regexes were not matching.

For the moment, assume my regexes are right and that I'm more concerned
with why there are differences. Can anyone explain why this is
happening? Which version is the real source HTML?

Here is my code

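require 'open-uri'

# Download a page and return its body as a string.
# URI.open replaces the bare open(a_url) on current Ruby versions.
def get_web_page_text(a_url)
  URI.open(a_url).read
end

html = get_web_page_text('http://www.amazon.com/gp/product/0974514055')

# Save the downloaded HTML so it can be compared with the browser's view.
File.open('html.txt', 'w') do |out|
  out << html
end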

Adam Akhtar · Aug 16, 2008

Some of my regular expressions would work but they definitely matched
in the Firefox source view; I know because I copied the source into a
regex editor, applied my regex, and it highlighted the match.

Should read:

Some of my regular expressions would NOT work ....

Michael Morin · Aug 16, 2008

The "view source" function of Firefox shows you the source code of the
page as it was downloaded from the server. Firebug is more
sophisticated: it shows you the DOM tree of the document. JavaScript
can alter the DOM tree (which is essentially what AJAX does), so you
might be seeing the DOM tree after it's been modified by some JavaScript
code.
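
If the point is to see exactly where two copies diverge, a small
comparison can help. The following is only a sketch: the helper
first_difference and the file name view_source.txt are hypothetical and
not from the thread; html.txt is the file written by the script in the
original question.

```ruby
# Return the 1-based number of the first line where two texts differ,
# or nil when they are identical.
def first_difference(text_a, text_b)
  lines_a = text_a.lines.map(&:chomp)
  lines_b = text_b.lines.map(&:chomp)
  longest = [lines_a.length, lines_b.length].max
  (0...longest).each do |i|
    return i + 1 if lines_a[i] != lines_b[i]
  end
  nil
end

# Hypothetical file names: one copy pasted from View Source,
# one written by the download script.
if File.exist?('view_source.txt') && File.exist?('html.txt')
  line = first_difference(File.read('view_source.txt'), File.read('html.txt'))
  puts line ? "First difference at line #{line}" : 'The two copies are identical'
end
```

A text copied from the Firebug panel can be compared the same way; it
would be expected to differ wherever scripts or the parser changed the
tree.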

--
Michael Morin
Guide to Ruby
http://ruby.about.com/

Thomas Bl. · Aug 16, 2008

Hello there.

I don't know why the HTML differs, but I think the browsers must send
some different data to the server in their requests. Maybe you have some
cookies for the page in one of them, and don't have any in the other, or
maybe the Accept header is different. You can check what your Firefox
sends using Tamper Data:
https://addons.mozilla.org/en-US/firefox/addon/966 .

Now if you want your own requests from Ruby to be more flexible, use
Net::HTTP instead.

require 'net/http'

Net::HTTP.start("www.amazon.com") do |http|
  header = {
    "Accept" => "*/*",
    "User-Agent" => "MyRubyProgram"
  }
  h, b = *http.get("/")
  p h
  p b

  p h.code
  p h.message
  p h.to_hash
end

Thomas Bl. · Aug 16, 2008

Thomas said:
header=
{
"Accept"=>"*/*",
"User-Agent"=>"MyRubyProgram",
}
h,b=*http.get("/")

Sorry, it should be h,b=*http.get("/",header) of course.
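
On current Ruby versions, http.get returns a single response object, so
the body is read from it with .body instead of being unpacked from a
pair. Below is a rough sketch of the corrected call in that style; the
helper names build_request and fetch are made up for illustration, and
the real request is left commented out because it needs network access.

```ruby
require 'net/http'

# Build a GET request carrying custom headers, without sending it.
def build_request(path, headers)
  Net::HTTP::Get.new(path, headers)
end

# Send the request and return the response object.
# The body is read with response.body rather than unpacked from an array.
def fetch(host, path, headers)
  Net::HTTP.start(host) do |http|
    http.request(build_request(path, headers))
  end
end

header = {
  'Accept' => '*/*',
  'User-Agent' => 'MyRubyProgram'
}

request = build_request('/', header)
p request['User-Agent'] # the header that will be sent

# Uncomment to perform the real request (needs network access):
# response = fetch('www.amazon.com', '/', header)
# p response.code
# p response.body[0, 200]
```

Building the request separately makes it possible to inspect the headers
before anything goes over the wire, which is handy when comparing them
with what the browser sends.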

Thomas Wieczorek · Aug 16, 2008

Every browser cleans up invalid markup. Each one has a different way
to do it. Firefox, for example, adds to every <table> a <tbody>, when
it doesn't exist. Firebug shows you the cleaned up source.
I once had to download a website because its markup was so crappy, and
I searched for the table entry by hand. It had a path like
"\html\body\table\tr\td\tr\center\font\b\font". Quite annoying, but it
sped up the scraping.

You could try the hpricot gem to get data from websites if the regexes
become too complex.
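
To illustrate the <tbody> point above, here is a small sketch of how a
regex written against the Firebug view can miss the markup the server
actually sent. The two HTML strings and all names below are invented for
the example; they are not taken from the Amazon page.

```ruby
# Markup as a server might send it: no <tbody>.
RAW_HTML = '<table><tr><td>$9.99</td></tr></table>'

# The same table as a DOM inspector would display it.
DOM_HTML = '<table><tbody><tr><td>$9.99</td></tr></tbody></table>'

# Pattern written while looking at the DOM view: requires <tbody>.
STRICT = %r{<table>\s*<tbody>\s*<tr>\s*<td>(.*?)</td>}m

# Pattern that treats <tbody> as optional, so it matches both forms.
TOLERANT = %r{<table>\s*(?:<tbody>\s*)?<tr>\s*<td>(.*?)</td>}m

# Return the text of the first table cell, or nil when nothing matches.
def first_cell(html, pattern)
  match = html.match(pattern)
  match && match[1]
end

p first_cell(RAW_HTML, STRICT)
p first_cell(DOM_HTML, STRICT)
p first_cell(RAW_HTML, TOLERANT)
p first_cell(DOM_HTML, TOLERANT)
```

The strict pattern finds nothing in the raw markup, while the tolerant
one works on both. Writing patterns against the downloaded text, or
making inserted elements optional, avoids this kind of mismatch.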

Adam Akhtar · Aug 16, 2008

Thank you everyone for your help so far.

I tackled the problem by not viewing the Firefox source or Firebug's.
Instead, I just saved and viewed the HTML downloaded via open(a_url)
with open-uri. I wasn't sure whether this did any `tidy up` like Firefox
or Firebug, but after various tries I could be confident that what it
downloaded was the real deal.

That meant, though, that I'd have to view the HTML in something like
Notepad, and it ain't easy to read. I really wish I could rely on the
code Firebug displays, as it's so easy to find the areas you need to
restrict your searches to.

As a side note, is there a plugin or library that sits between your
code and the target web page? The plugin would `clean` the code as it's
downloaded, perhaps in an identical way to Firebug.

The upside would be that it would be easier to grab what you want, as
there would be a more regular structure; the downside, I guess, would be
longer run times. Just a thought, though.

I used hpricot a while ago and it wasn't so great on badly designed
web pages, so I ended up resorting to regexps. But if I find a nice
website I think I'll give it another try!

Phlip · Aug 16, 2008

Adam said:
As a side note, is there a plugin or library that sits between your
code and the target web page? The plugin would `clean` the code as it's
downloaded, perhaps in an identical way to Firebug.

Why advertise your HTML is sloppy?

At work, we use assert_xpath, assert_tidy, and LibXML in all our functional
tests. They scream bloody murder if we have a single ill-formed ID. Then we
clean up our html.erb and keep going.

Adam Akhtar · Aug 16, 2008

Why advertise your HTML is sloppy?

Hi Phlip,

It's not my HTML though; it's a third party's website that I'm scraping,
so I can't fix the HTML.

Florian Gilcher · Aug 16, 2008

Every browser cleans up invalid markup. Each one has a different way
to do it. Firefox, for example, adds to every <table> a <tbody>, when
it doesn't exist. Firebug shows you the cleaned up source.

Actually, if the table has no header, no footer and only one body, the
tbody tag is not required but implicitly assumed.

So it always exists in the DOM displayed by Firebug (as it is added
and thus existing), but it does not exist when you work on the document
with a tool that does not build the DOM beforehand (such as the source
viewer).

Regards,
Florian Gilcher
