newbie read.scan (?) question

Discussion in 'Ruby' started by Bruce D'Arcus, Jun 6, 2005.

  1. Hi,

    I'm trying to get my feet wet with Ruby by tackling a manageable, but
    real, issue I'd like to solve.

    I'm an academic, and subscribe to some RSS feeds of journals I read.
    However, the feeds are really bad, and only contain lists of authors
    and titles (with no markup), and links to the issue urls.

    So, I want a script that takes those feeds, goes to the issue pages,
    grabs the links for the articles, and then from there extracts author
    and title information.

    For some reason I don't understand, the fragment below all works,
    except that the author attribute is always blank. The problem is not
    with my regular expression pattern.

    Could someone explain what I'm doing wrong?

    Bruce

    # journals is an array of rss feed urls and titles
    journals.each do |journal|
      open(journal[1]) do |http|
        response = http.read
        result = RSS::Parser.parse(response, false)

        # grab first issue url listed from each journal
        issue_url = result.items[0].link

        # regular expression patterns to use below
        article_page = /<a href="(.*?)">Article Description<\/a>/
        title_match = /<span class="article-title">(.*?)<\/span>/
        author_match = /<strong>Author:<\/strong><\/td><td class="rightcol">(.*?)</

        articles = open(issue_url)
        # find each article url by screen-scraping
        articles.read.scan(article_page).each do |url|
          article_url = "#{base_url}#{url}"
          open(article_url) do |article|
            # screen-scrape for article author and title
            title = article.read.scan(title_match)
            # for whatever reason, author never returns anything
            author = article.read.scan(author_match)
            # create new article object
            list.append(Article.new(title, author, article_url))
          end
        end
      end
    end
    Bruce D'Arcus, Jun 6, 2005
    #1

  2. Bruce D'Arcus wrote:
    > For some reason I don't understand, the below fragment all works,
    > except for the author attribute is always blank. The problem is not
    > with my regular expression pattern.
    >
    > Could someone explain what I'm doing wrong?


    Hi Bruce,

    I don't know which libraries you're using, but could it be that you can
    only read once from article, like reading from a file?

    Instead of

    > open(article_url) do |article|
    > # screen-scrap for article author and title
    > title = article.read.scan(title_match)
    > # for whatever reason, author never returns anything
    > author = article.read.scan(author_match)


    try something like

    open(article_url) do |article|
      # screen-scrape for article author and title
      article_text = article.read
      title = article_text.scan(title_match)
      author = article_text.scan(author_match)

    HTH

    Regards,
    Pit
    Pit Capitain, Jun 6, 2005
    #2

  3. article is a stream, and you are trying to read it twice; that doesn't
    work the way you expect. I guess the second article.read just returns "",
    so "".scan(...) returns nothing.
    Try the following:

    articles.read.scan(article_page).each do |url|
      article_url = "#{base_url}#{url}"
      open(article_url) do |article|
        articletxt = article.read
        # screen-scrape for article author and title
        title = articletxt.scan(title_match)
        author = articletxt.scan(author_match)
        # create new article object
        list.append(Article.new(title, author, article_url))
      end
    end



    Dominik
    Dominik Bathon, Jun 6, 2005
    #3
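    A minimal sketch of the behaviour described in the two replies above. It
    uses a StringIO from Ruby's standard library as a stand-in for the object
    that open yields, and the HTML string and the <b> author pattern are made
    up for illustration; the real pages will differ.

```ruby
require 'stringio'

# Stand-in for the handle yielded by open(article_url); content is hypothetical.
page = StringIO.new('<span class="article-title">A Title</span><b>An Author</b>')
title_match = /<span class="article-title">(.*?)<\/span>/

first  = page.read   # consumes the whole stream
second = page.read   # the stream is already at end-of-file, so this is ""

# Scanning the empty second read finds nothing, which is the "blank author" symptom.
title_from_second = second.scan(title_match)

# Reading once and reusing the string lets both scans see the full text.
page.rewind
text   = page.read
title  = text.scan(title_match)
author = text.scan(/<b>(.*?)<\/b>/)
```

    Keeping the page text in one local variable also avoids fetching or
    rewinding the stream a second time.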
  4. Yes, that solved the problem. I had a feeling it was something pretty
    simple.

    Thanks!

    Bruce
    Bruce D'Arcus, Jun 6, 2005
    #4
  5. One followup.

    Why if I dump my list of article objects to YAML, do I end up with
    this:

    - !ruby/object:Article
      author:
        -
          - "Hovorka, Alice J."
      title:
        -
          - "The (Re) Production of Gendered Positionality in Botswana's Commercial Urban Agriculture Sector"
      url: http://journals.ohiolink.edu/cgi-bin/sciserv.pl?collection=journals&amp;journal=00045608

    I'm referring to the fact that author and title content aren't
    represented the same as url (which is what I was expecting).

    I have these two classes:

    class Article

      include Journals

      attr_reader :title, :author, :description, :url
      def initialize(title, author, url)
        @title = title
        @author = author
        @url = url
      end

      def to_s
        "#@title, #@author"
      end

      def abstract
        #
      end

      def refer
        Journals::const_get(:BASE_URL) + "/" +
          @url + "&form=refer&file=file.txt"
      end

      def pdf
        Journals::const_get(:BASE_URL) + "/" +
          @url + "&form=pdf&file=file.pdf"
      end
    end

    class Articles
      #
      attr_reader :articles

      def initialize
        @articles = Array.new
      end

      def append(article)
        @articles.push(article)
        self
      end

      def [](index)
        @articles[index]
      end
    end

    .... and then:

    list = Articles.new

    ... and at the end:

    File.open("articles.yaml", "w") {|f| YAML.dump(list.articles, f)}

    Or is everything fine?

    Bruce
    Bruce D'Arcus, Jun 6, 2005
    #5
  6. Hi,

    Bruce D'Arcus wrote:
    > Why if I dump my list of article objects to YAML, do I end up with
    > this:
    >
    > - !ruby/object:Article
    > author:
    > -
    > - "Hovorka, Alice J."
    > title:
    > -
    > - "The (Re) Production of Gendered Positionality in Botswana's
    > Commercial Urban
    > Agriculture Sector"
    > url:
    > http://journals.ohiolink.edu/cgi-bin/sciserv.pl?collection=journals&amp;journal=00045608
    >
    > I'm referring to the fact that article and title content aren't
    > represented the same as url (which is what I was expecting).


    Because your author and title probably aren't strings as you expect them
    to be but rather arrays. You should try to puts @title.inspect somewhere
    to see what it is.
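
    A small illustration of the point above, using a made-up HTML snippet
    rather than the real journal page. When the pattern contains a capture
    group, String#scan returns an array of arrays, which would explain the
    extra nesting in the YAML dump. The two ways of getting a plain string
    shown here are suggestions, not something proposed in the thread.

```ruby
html = '<span class="article-title">Some Title</span>'   # hypothetical sample
title_match = /<span class="article-title">(.*?)<\/span>/

# One inner array per match, one element per capture group.
nested = html.scan(title_match)

# String#[] with a capture index returns the first match as a String (or nil).
first = html[title_match, 1]

# Flattening the scan result and taking the first element gives the same string.
flat = nested.flatten.first

p nested
p first
```

    Note that calling to_s on the nested array in current Ruby versions gives
    its inspect form, not the joined text, so extracting the string explicitly
    is the safer route.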



    Why create an Article class and an Articles class? You could make all
    the content of your Articles class also content of the Article class but
    at the class level instead of the instance level. So you just have to
    transform your @articles variable into @@articles and define your append
    and [] methods as self.append and self.[].

    Another thing: I don't think you need to use
    Journals::const_get(:BASE_URL). You could simply use Journals::BASE_URL.
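
    A quick sketch of the equivalence, assuming Journals is a module that
    defines BASE_URL (the module body and the URL value here are placeholders,
    not taken from the thread):

```ruby
module Journals
  BASE_URL = 'http://journals.example.org'   # placeholder value
end

# Dynamic lookup by name; mainly useful when the constant name is only
# known at run time.
via_const_get = Journals.const_get(:BASE_URL)

# Direct constant reference; the usual form when the name is fixed.
via_scope = Journals::BASE_URL

class Article
  include Journals

  def base
    BASE_URL   # found through the included module, no prefix needed
  end
end
```

    Because Article includes Journals, the bare BASE_URL inside its methods
    resolves to the same constant.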

    HTH

    Ghislain
    Ghislain Mary, Jun 6, 2005
    #6
  7. Ghislain Mary wrote:

    > Because your author and title probably aren't strings as you expect them
    > to be but rather arrays.


    Ah, right. Using scan returns an array. On this ...


    >
    > Why create an Article class and an Articles class?


    Because I'm *real* newbie! My only programming background is with
    XSLT. So I'm trying to also understand basic OO design in this
    example.

    > You could make all
    > the content of your Articles class also content of the Article class but
    > at the class level instead of the instance level. So you just have to
    > transform your @articles variable into @@articles and define your append
    > and [] methods as self.append and self.[].


    Can you give me an abbreviated example of how to actually do this?
    For example, how do I define @@articles under the Article class, and
    how would I then define the append method there?

    > Another thing: I don't think you need to use
    > Journals::const_get(:BASE_URL). You could simply use Journals::BASE_URL.


    Ah thanks. It took me awhile just to get that far!

    Bruce
    Bruce D'Arcus, Jun 6, 2005
    #7
  8. Bruce D'Arcus wrote:
    >>Why create an Article class and an Articles class?

    >
    >
    > Because I'm *real* newbie! My only programming background is with
    > XSLT. So I'm trying to also understand basic OO design in this
    > example.


    Welcome to the Ruby community ;-)
    I still consider myself a newbie too, and I don't often reply to
    posts on this list because I often think I am not able to contribute
    much to the discussions. But I learn a lot by reading what is
    happening here :)

    > Can you give me an abbreviated example of how to do actually do this?
    > For example, how do I define @@articles under the Article class, and
    > how would I then define the append method there.


    You could do something like:

    class Article

      include Journals

      attr_reader :title, :author, :description, :url

      # Create the Array containing the articles.
      @@articles = Array.new

      def initialize(title, author, url)
        @title, @author, @url = title, author, url

        # Add the new Article to the articles array.
        @@articles << self
      end

      def to_s
        "#@title, #@author"
      end

      def refer
        Journals::BASE_URL + "/" + @url + "&form=refer&file=file.txt"
      end

      def pdf
        Journals::BASE_URL + "/" + @url + "&form=pdf&file=file.pdf"
      end

      # Add a class method to get an Article by its index in the @@articles Array.
      def self.[](index)
        @@articles[index]
      end

      # Add a method to get the number of articles.
      # Call it how you want it to be called.
      def self.count
        @@articles.size
      end

    end

    Good luck,

    Ghislain
    Ghislain Mary, Jun 6, 2005
    #8
  9. Oh... I was forgetting.

    You don't even need an append method anymore since when you create a new
    Article it is automatically pushed into the @@articles Array.

    Ghislain
    Ghislain Mary, Jun 6, 2005
    #9
  10. On 06/06/05, Bruce D'Arcus wrote:
    > Ghislain Mary wrote:
    >
    > > Because your author and title probably aren't strings as you expect them
    > > to be but rather arrays.
    >
    > Ah, right. Using scan returns an array. On this ...
    >
    > > Why create an Article class and an Articles class?
    >
    > Because I'm *real* newbie! My only programming background is with
    > XSLT. So I'm trying to also understand basic OO design in this
    > example.
    >
    > > You could make all
    > > the content of your Articles class also content of the Article class but
    > > at the class level instead of the instance level. So you just have to
    > > transform your @articles variable into @@articles and define your append
    > > and [] methods as self.append and self.[].
    >
    > Can you give me an abbreviated example of how to actually do this?
    > For example, how do I define @@articles under the Article class, and
    > how would I then define the append method there.

    I have not followed this thread in depth, but I think it is a good
    idea to distinguish between a set of articles and an article. I don't
    see how you would benefit from mixing these two. If I understand the
    proposal correctly, you would no longer be able to maintain two
    independent sets of articles, because the ArticleSet would be part of
    the article class.

    Anyhow, here is how to define a class variable and class methods.

    class Klass
      @@foo = []

      def self.add(bar)
        @@foo << bar
      end

      def self.foo
        @@foo
      end
    end

    Klass.add(1)
    Klass.add(2)
    p Klass.foo

    good luck with ruby,

    Brian

    --
    http://ruby.brian-schroeder.de/

    Stringed instrument chords: http://chordlist.brian-schroeder.de/
    Brian Schröder, Jun 6, 2005
    #10
  11. OK, thanks!

    And now how do I then access the @@articles array? If before I had:

    list = Articles.new

    ... what would be the equivalent here?

    Bruce
    Bruce D'Arcus, Jun 6, 2005
    #11
  12. Brian Schröder wrote:

    > Anyhow, here is how to define a class variable and class methods.
    >
    > class Klass
    > @@foo = []
    >
    > def self.add(bar)
    > @@foo << bar
    > end
    >
    > def self.foo
    > @@foo
    > end
    > end
    >
    > Klass.add(1)
    > Klass.add(2)
    > p Klass.foo


    OK, am struggling with translating this to my example. Here's what
    I've done:

    articles.read.scan(article_page).each do |url|
      article_url = "#{base_url}#{url}"
      open(article_url) do |article|
        article_text = article.read
        title = article_text.scan(title_match).to_s
        author = article_text.scan(author_match).to_s
        puts "loading #{title} ...\n"
        a = Article.new(title, author, article_url)
        a.add
      end
    end

    .... and then:

    File.open("articles.yaml", "w") {|f| YAML.dump(p Article.articles, f)}

    But I get a "undefined method `add'" error. I have that part of the
    class defined like so:

    class Article
      include Journals

      @@articles = []

      attr_reader :title, :author, :url

      def initialize(title, author, url)
        @title = title
        @author = author
        @url = url
      end

      def self.add(article)
        @@articles << article
      end

      def self.articles
        @@articles
      end

      ...

    > good luck with ruby,


    Thanks!

    Bruce
    Bruce D'Arcus, Jun 6, 2005
    #12
  13. Bruce D'Arcus wrote:
    > OK, thanks!
    >
    > And now how do I then access the @@articles array? If before I had:
    >
    > list = Articles.new
    >
    > ... what would be the equivalent here?


    You can define the following:

    class Article

      def self.articles
        @@articles
      end

    end


    But in fact, as Brian said, it may not be a good idea to store the
    articles in the Article class. It depends on whether you want to be
    able to store several groups of articles or only one. I hadn't
    thought of it because of the way you asked. I understood that you were
    only handling one group of articles, but maybe that's not the case.
    However, it's a good opportunity to learn a little about class variables
    and class methods ;-)

    Ghislain
    Ghislain Mary, Jun 6, 2005
    #13
  14. Brian Schröder wrote:

    > I have not followed this thread in depth, but I think it is a good
    > idea to distinguish between a set of articles and an article. I don't
    > see how you would benefit from mixing these two. If I understand the
    > proposal correctly, you would no longer be able to maintain two
    > independent sets of articles, because the ArticleSet would be part of
    > the article class.


    And actually, I guess the bigger question is how you would deal with
    this then? Are you saying I was on the right track originally with my
    Articles class? Or would there be some other approach?

    Bruce
    Bruce D'Arcus, Jun 6, 2005
    #14
  15. On 6/6/2005, "Bruce D'Arcus" wrote:
    >Brian Schröder wrote:
    >
    >> Anyhow, here is how to define a class variable and class methods.
    >>
    >> class Klass
    >> @@foo = []
    >>
    >> def self.add(bar)
    >> @@foo << bar
    >> end
    >>
    >> def self.foo
    >> @@foo
    >> end
    >> end
    >>
    >> Klass.add(1)
    >> Klass.add(2)
    >> p Klass.foo
    >
    >OK, am struggling with translating this to my example. Here's what
    >I've done:
    >
    > articles.read.scan(article_page).each do |url|
    > article_url = "#{base_url}#{url}"
    > open(article_url) do |article|
    > article_text = article.read
    > title = article_text.scan(title_match).to_s
    > author = article_text.scan(author_match).to_s
    > puts "loading #{title} ...\n"
    > a = Article.new(title, author, article_url)
    > a.add
    > end


    add is a class method (see the definition of self.add, which is the
    same as saying Article.add), so you would want to call it like

    Article.add a # Need to pass the new article in.
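
    To make the distinction concrete, here is a cut-down sketch. The class is
    reduced to a single attribute and the title is invented; it is only meant
    to show where the "undefined method" error comes from.

```ruby
class Article
  @@articles = []

  attr_reader :title

  def initialize(title)
    @title = title
  end

  # Defined on the class itself, so it is called as Article.add(...).
  def self.add(article)
    @@articles << article
  end

  def self.articles
    @@articles
  end
end

a = Article.new('Example title')   # hypothetical title
Article.add(a)                     # works: add belongs to the class

# Instances have no add method, so this raises NoMethodError.
error = begin
  a.add
  nil
rescue NoMethodError => e
  e
end

puts error.class
```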



    E

    --
    template<typename duck>
    void quack(duck& d) { d.quack(); }
    ES, Jun 6, 2005
    #15
  16. On 07/06/05, Bruce D'Arcus wrote:
    > Brian Schröder wrote:
    >
    > > I have not followed this thread in depth, but I think it is a good
    > > idea to distinguish between a set of articles and an article. I don't
    > > see how you would benefit from mixing these two. If I understand the
    > > proposal correctly, you would no longer be able to maintain two
    > > independent sets of articles, because the ArticleSet would be part of
    > > the article class.
    >
    > And actually, I guess the bigger question is how you would deal with
    > this then? Are you saying I was on the right track originally with my
    > Articles class? Or would there be some other approach?
    >
    > Bruce


    Yes, I'd say you were on the right track. Even if for now you only use
    one set of articles (you called this class Articles), I'd say it is
    cleaner to have an explicit class, and it's more extensible than having
    the Article class contain all its instances.
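
    As a rough sketch of what that buys you: with a separate collection
    class, two sets of articles can exist side by side. The Article class is
    abbreviated, the sample data is invented, and mixing in Enumerable and
    aliasing << are additions for convenience, not part of the original
    Articles class.

```ruby
class Article
  attr_reader :title, :author, :url

  def initialize(title, author, url)
    @title, @author, @url = title, author, url
  end
end

class Articles
  include Enumerable   # assumption: gives map, count, select, etc. via each

  def initialize
    @articles = []
  end

  # Returns self so that calls can be chained.
  def append(article)
    @articles.push(article)
    self
  end
  alias_method :<<, :append

  def [](index)
    @articles[index]
  end

  def each(&block)
    @articles.each(&block)
  end
end

# Two independent sets, which a class variable on Article could not provide.
geography = Articles.new
sociology = Articles.new

geography << Article.new('Title A', 'Author A', 'a.html') <<
  Article.new('Title B', 'Author B', 'b.html')
sociology << Article.new('Title C', 'Author C', 'c.html')

puts geography.count
puts sociology.count
```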

    regards,

    Brian


    --
    http://ruby.brian-schroeder.de/

    Stringed instrument chords: http://chordlist.brian-schroeder.de/
    Brian Schröder, Jun 7, 2005
    #16
