Bruce D'Arcus
Hi,
I'm trying to get my feet wet with Ruby by tackling a manageable, but
real, issue I'd like to solve.
I'm an academic, and I subscribe to the RSS feeds of some journals I read.
However, the feeds are really bad: they contain only lists of authors
and titles (with no markup), and links to the issue URLs.
So, I want a script that takes those feeds, goes to the issue pages,
grabs the links for the articles, and then from there extracts author
and title information.
For some reason I don't understand, the fragment below works, except
that the author attribute is always blank. The problem is not with my
regular expression pattern.
Could someone explain what I'm doing wrong?
Bruce
# journals is an array of RSS feed URLs and titles
journals.each do |journal|
  open(journal[1]) do |http|
    response = http.read
    result = RSS::Parser.parse(response, false)
    # grab first issue URL listed from each journal
    issue_url = result.items[0].link
    # regular expression patterns to use below
    article_page = /<a href="(.*?)">Article Description<\/a>/
    title_match = /<span class="article-title">(.*?)<\/span>/
    author_match = /<strong>Author:<\/strong><\/td><td class="rightcol">(.*?)</
    articles = open(issue_url)
    # find each article URL by screen-scraping
    articles.read.scan(article_page).each do |url|
      article_url = "#{base_url}#{url}"
      open(article_url) do |article|
        # screen-scrape for article author and title
        title = article.read.scan(title_match)
        # for whatever reason, author never returns anything
        author = article.read.scan(author_match)
        # create new article object
        list.append(Article.new(title, author, article_url))
      end
    end
  end
end
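
A likely explanation, sketched here under assumptions rather than tested against the real journal pages: read consumes the stream, so a second read on the same handle returns an empty string, and scanning an empty string finds nothing. The snippet below reproduces that with a StringIO standing in for the opened article page. The sample markup and the variable names (html, body, title_twice and so on) are made up for illustration and are not part of the original script.

```ruby
require 'stringio'

# Hypothetical stand-in for one article page (markup invented for the demo).
html = '<span class="article-title">A Title</span>' \
       '<strong>Author:</strong></td><td class="rightcol">A. Author</td>'

title_match  = /<span class="article-title">(.*?)<\/span>/
author_match = /<strong>Author:<\/strong><\/td><td class="rightcol">(.*?)</

# Reading twice: the first read consumes everything, the second returns "".
io = StringIO.new(html)
title_twice  = io.read.scan(title_match)
author_twice = io.read.scan(author_match)   # scans an empty string

# Reading once: keep the page text in a variable and scan it as often as needed.
io = StringIO.new(html)
body = io.read
title_once  = body.scan(title_match)
author_once = body.scan(author_match)

p author_twice
p author_once
```

If that is the cause, storing the result of article.read in a variable inside the innermost block and scanning that variable for both title and author should be enough. Calling article.rewind between the two reads would be another option.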