Simple screen scraper using scrAPI

doog

I'm a Ruby novice. Does anyone have an example of a simple screen
scraper in Ruby that uses scrAPI (and works on Mac OS X)?

All I need it to do is:
1) Go to a specified web page
2) Use a CSS selector to grab and print out any section of the page

It does not need to find links on the page or crawl.

I tried the eBay example at
http://blog.labnotes.org/2006/07/11/scraping-with-style-scrapi-toolkit-for-ruby/
and have tried the recommended "require" and Tidy.path statements,
but couldn't find a combination that works.

-Doug
 
doog

Thanks so much. Parsing a web page is sufficient, and would
be a great starting point.

-Doug
 
Peter Szinek

doog said:
I'm a Ruby novice. Does anyone have an example of a simple screen
scraper in Ruby that uses scrAPI (and works on Mac OS X)?

Though I don't seem to understand the intensity of the holy war Paul is
leading against anything that is not hand-coded on the fly, this time I
will have to agree with him: the request 'I would like to write a screen
scraper in scrAPI (or Hpricot, or xxx)' is not always the right way.
Screen scraping can be very tedious and complex, and it really
depends on the input page, the type of actions you would like to
perform (is fetching the page trivial? do you need to navigate, i.e.
fill forms, click links? how complex is the parsing?), the quality you
would like to achieve, robustness (i.e. if the underlying page changes, the
scraper should still perform well) and another 10k things. Some time ago
I wrote a small article on this:

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails/

It is a bit outdated now (I am planning to beef it up with FireWatir,
Hpricot and other sections) but it can help you as a starting point.

Conclusion: which tool to use, and how, depends on the page and the task at
hand. I suggest that if you have a concrete problem, drop us a mail and
we will figure something out.

Cheers,
Peter

__
http://www.rubyrailways.com
 
Zouplaz

Hi Paul! Instead of talking about scrAPI, would you tell us about the magic
that was inside GraForth?


As a teen, I was a fan of the bass player of the metal band Iron Maiden, and
of YOU ;-)
Kidding a little, but not that much...


Sorry for going off-topic!
 
Zouplaz

On 29/11/2006 18:01, Paul Lutus said:
Ha! A reference to a different era, a distant voice. :)

We should not forget those times (and I didn't live through the 70s; that was
certainly something else) even if there's no VW van anywhere.

Art is a performance and performance comes from constraints...

Paul Lutus said:
For the other readers, GraForth was a Forth embodiment I cooked up about 25
years ago, at a time when most things were written in assembly. It
supported a kind of graphics that would be embarrassingly crude by modern
standards.

Hey! I remember a demo of GraForth showing a 3D rotating cube (maybe
color filled). Not that bad for the single MHz of my Apple IIc (no, I
didn't have that wonderful II+, I was a little late).

Paul Lutus said:
It was basically a way to get around the fact that there were almost no
high-level languages, and none that mere mortals could either afford or
support with the small HDD and RAM sizes of the era.

Did you use any cross-compilation systems to code GraForth or
AppleWriter? (You know, the kind of system that most early-80s teen
geeks dreamed of having access to.)

Paul Lutus said:
I'm glad to see you had your priorities straight. :)

:))
 
Doug Kramer

Thanks to this group for helping me get various screen scrapers
up and running. I had made a couple of silly typos that held me
back. It will take me a weekend or so of spare time to digest what
I have and actually write the code I want.

Thanks Peter, Paul and Alvim :)

Thanks Alvim, for the pointer to Hpricot -- I've got the demo script
working and will study it.

Thanks Peter for your article -- I had read it before, but re-reading
it at this point helps quite a bit.

Thanks Paul for the code below. I know regex, so it will just be
a matter of me learning the flow/expression syntax.

-Doug

Paul said:
doog said:
Thanks so much. Parsing a web page is sufficient, and would
be a great starting point.

Okay, here is a simple parser in ordinary Ruby. It will give you some ideas
about what is involved in parsing.

There are many libraries that do much more than this script does. Some of
them have steep learning curves, and many offer exotic ways to acquire
particular kinds of content.

This is a simple parser that returns an array containing all the table
content in the target Web page. I wrote it earlier today for someone who
wanted to scrape a yahoo.com financial page, which explains the target
page, something easy to change:

------------------------------------------------

#!/usr/bin/ruby -w

require 'net/http'

# read the page data

http = Net::HTTP.new('finance.yahoo.com', 80)
# Net::HTTP#get returns a response object; the HTML is in its body
page = http.get('/q?s=IBM').body

# BEGIN processing HTML

# return the content of every <tag>...</tag> pair found in data
def parse_html(data, tag)
  data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

out_tables = []
table_data = parse_html(page, "table")
table_data.each do |table|
  out_rows = []
  row_data = parse_html(table, "tr")
  row_data.each do |row|
    out_cells = parse_html(row, "td")
    out_cells.each do |cell|
      # strip any markup left inside the cell
      cell.gsub!(%r{<.*?>}, "")
    end
    out_rows << out_cells
  end
  out_tables << out_rows
end

# END processing HTML

# examine the result

# print an indexed, indented listing of a nested array, skipping empty items
def parse_nested_array(array, tab = 0)
  n = 0
  array.each do |item|
    if item.size > 0
      puts "#{"\t" * tab}[#{n}] {"
      if item.is_a?(Array)
        parse_nested_array(item, tab + 1)
      else
        puts "#{"\t" * (tab + 1)}#{item}"
      end
      puts "#{"\t" * tab}}"
    end
    n += 1
  end
end

parse_nested_array(out_tables)

------------------------------------------------

This program emits an indexed, indented listing of the table content that it
extracted, so you can then customize it by acquiring particular table cells
through use of the provided index numbers.
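
As a rough illustration of that last step, the sketch below applies the same regex-based extraction to a small inline HTML string and then picks out one cell by its index numbers. The sample markup is made up for this example (it is not the Yahoo page), and `sample_html` and `extract_tables` are hypothetical names:

```ruby
# Hypothetical sample page, standing in for the downloaded HTML.
sample_html = <<~HTML
  <table>
    <tr><td>Symbol</td><td><b>IBM</b></td></tr>
    <tr><td>Last Trade</td><td>91.50</td></tr>
  </table>
HTML

# Same regex approach as the script above: content of every <tag>...</tag> pair.
def parse_html(data, tag)
  data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

# Build tables -> rows -> cells, stripping any markup left inside a cell.
def extract_tables(page)
  parse_html(page, "table").map do |table|
    parse_html(table, "tr").map do |row|
      parse_html(row, "td").map { |cell| cell.gsub(%r{<.*?>}, "") }
    end
  end
end

tables = extract_tables(sample_html)

# Table 0, row 1, cell 1: the "[0] { [1] { [1] {" path in the listing.
last_trade = tables[0][1][1]
puts last_trade   # prints 91.50
```

On a real page the three index numbers would be read off the indented listing first, and they would need to be rechecked whenever the page layout changes.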

It should work with any Web page that has the interesting content embedded
in tables, and whose syntax is reliable.

The primary value of this program is to show you how easy it is to scrape
pages using Ruby, and give you a starting point you can customize to meet
your own requirements.
 
Bill Kelly

Paul Lutus said:
def parse_html(data,tag)
  return data.scan(%r{<#{tag}\s*.*?>(.*?)</#{tag}>}im).flatten
end

Are `<' and `>' characters legal inside quoted attribute values?

E.g. <img alt="a>b" src="inequality.gif">

Also, is the closing tag allowed to have whitespace between the
tag name and the ending bracket?

E.g. </body >


The latter would be trivial to accommodate with a \s*, obviously;
but the former would be a shade trickier (though certainly still
possible with a regexp).
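
One possible way to cover both cases is sketched here. It is an assumption about how such a pattern could look, not a tested replacement for Paul's function. The opening-tag part consumes quoted attribute values as whole units, and the end tag allows optional whitespace before the closing bracket. `tolerant_parse_html` is a hypothetical name:

```ruby
# Hypothetical variant of parse_html that tolerates ">" inside quoted
# attribute values and whitespace in the end tag (e.g. "</td >").
def tolerant_parse_html(data, tag)
  name = Regexp.escape(tag)
  pattern = %r{
    <#{name}\b                      # opening tag name, as a whole word
    (?:"[^"]*"|'[^']*'|[^'">])*     # attributes; quoted values are skipped whole
    >
    (.*?)                           # element content (non-greedy)
    </#{name}\s*>                   # end tag, optional trailing whitespace
  }imx
  data.scan(pattern).flatten
end

html = %q{<td title="a>b">first</td><td>second</td >}
p tolerant_parse_html(html, "td")   # prints ["first", "second"]
```

This still breaks on nested elements of the same name, unquoted oddities, comments and script blocks, so it only narrows the gap to a real HTML parser.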


There's a lot of foul, cruel, and bad-tempered HTML out there
in the wild. Depending on the needs of the Original Poster,
death could await a simplistic HTML lexer with nasty big pointy
teeth.

TIM: I warned you! But did you listen to me? Oh, no, you knew it
all, didn't you? Oh, it's just a harmless little markup language,
isn't it? Well, it's always the same, I always--
ARTHUR: Oh, shut up!
TIM: --But do they listen to me?--
ARTHUR: Right!
TIM: -Oh, no--
KNIGHTS: Charge!


All in fun,

Bill
 
dblack

Hi --

Not syntactically correct, but the question might be "will it happen?" In
which case the answer is "probably".

I believe it is actually legal. In the XML 1.1 spec, an end-tag is:

ETag ::= '</' Name S? '>'

(where S is any non-zero amount of whitespace, and ? indicates zero or
one occurrence of that), and if I'm reading the ISO 8879 roadmap in Neil
Bradley's "Concise SGML Companion" correctly, it's legal in SGML
generally.
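
Read as a Ruby regular expression, that production might look like the sketch below. The `name` pattern is a deliberately simplified, ASCII-only assumption; the real XML Name production admits many more characters.

```ruby
# Simplified, ASCII-only stand-in for the XML Name production (an assumption).
name = /[A-Za-z_:][A-Za-z0-9_:.\-]*/

# ETag ::= '</' Name S? '>'   where S is one or more of space, tab, CR, LF,
# so "S?" is equivalent to zero or more whitespace characters.
etag = %r{\A</(#{name})[ \t\r\n]*>\z}

["</body>", "</body >", "</ body>"].each do |candidate|
  puts "#{candidate.inspect}: #{etag.match?(candidate)}"
end
```

The first two candidates match and the third does not, since the grammar allows whitespace only after the name, not before it.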


David

--
David A. Black | (e-mail address removed)
Author of "Ruby for Rails" [1] | Ruby/Rails training & consultancy [3]
DABlog (DAB's Weblog) [2] | Co-director, Ruby Central, Inc. [4]
[1] http://www.manning.com/black | [3] http://www.rubypowerandlight.com
[2] http://dablog.rubypal.com | [4] http://www.rubycentral.org
 
