A
Adam Akhtar
Hi Im a relatively new rubyist and programmer in general and currently
reading Everyday scripting and trying out webscraping using amazon as a
target.
To determine suitable regular expressions i first just viewed the page
source via firefox. Shortly after i found firebug. I noticed that there
were some differences in the source code between firefoxs source code
and firbugs. Firebug seems to add and maybe lack code and vice versa.
Some of my regular expressions would work but they definately mathched
in the firefox source view, I know because i copied the source into a
regex editor and applied my reg ex and it highlighted. So this made
question the html i was grabbing so i saved it to a text file in my
code. When i viewed the text file this too was different than the
firefox code hence why my reg exs were not matching
For the moment assume my regexs are right and that im more concerened
with why there are differences. Can anyone explain why this is
happening? Which version is the real source html???
Here is my code
def get_web_page_text(a_url)
page = open(a_url)
text = page.read
end
html = get_web_page_text('http://www.amazon.com/gp/product/0974514055')
File.open('html.txt','w') do |out|
out << html
end
reading Everyday scripting and trying out webscraping using amazon as a
target.
To determine suitable regular expressions i first just viewed the page
source via firefox. Shortly after i found firebug. I noticed that there
were some differences in the source code between firefoxs source code
and firbugs. Firebug seems to add and maybe lack code and vice versa.
Some of my regular expressions would work but they definately mathched
in the firefox source view, I know because i copied the source into a
regex editor and applied my reg ex and it highlighted. So this made
question the html i was grabbing so i saved it to a text file in my
code. When i viewed the text file this too was different than the
firefox code hence why my reg exs were not matching
For the moment assume my regexs are right and that im more concerened
with why there are differences. Can anyone explain why this is
happening? Which version is the real source html???
Here is my code
def get_web_page_text(a_url)
page = open(a_url)
text = page.read
end
html = get_web_page_text('http://www.amazon.com/gp/product/0974514055')
File.open('html.txt','w') do |out|
out << html
end