[QUIZ] Gathering Ruby Quiz 2 Data (#189)

Discussion in 'Ruby' started by Daniel Moore, Jan 23, 2009.

  1. Daniel Moore

    Daniel Moore Guest

    Greetings!

    Welcome to the inaugural Ruby Quiz 3!

    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

    The three rules of Ruby Quiz:

    1. Please do not post any solutions or spoiler discussion for this
    quiz until 48 hours have elapsed from the time this message was
    sent.

    2. Support Ruby Quiz by submitting ideas and responses
    as often as you can! Visit: <http://rubyquiz.strd6.com>

    3. Enjoy!

    Suggestion: A [QUIZ] in the subject of emails about the problem
    helps everyone on Ruby Talk follow the discussion. Please reply to
    the original quiz message, if you can.

    -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

    ## Gathering Ruby Quiz 2 Data

    I'm building the new Ruby Quiz website and I need your help...

    This week's quiz involves gathering the existing Ruby Quiz 2 data from
    the Ruby Quiz website: <http://splatbang.com/rubyquiz/>

    Each quiz entry contains the following information:

    * id
    * title
    * description
    * summary

    There are also many quiz solutions that belong to each quiz. The quiz
    solutions have the following:

    * quiz_id
    * author
    * ruby_talk_reference
    * text

    Matthew has some advice for getting at the data:
    > If you start at <http://splatbang.com/rubyquiz/>, you'll see
    > the quiz list on the left are all links to the same quiz.rhtml file
    > (embedded Ruby), but with different id parameters. Those
    > parameters are the name of a subdirectory. So, for example,
    > take quiz #184, which has a link like this:
    >
    > <http://splatbang.com/rubyquiz/quiz.rhtml?id=184_Befunge>
    >
    > So there is a subdirectory called "184_Befunge". There
    > are basically three files in every directory:
    >
    > * quiz.txt -- the quiz description
    > * sols.txt -- a list of author names and the ruby-talk message # of the submission
    > * summ.txt -- the quiz summary
    >
    > Examples:
    > * <http://splatbang.com/rubyquiz/184_Befunge/quiz.txt>
    > * <http://splatbang.com/rubyquiz/184_Befunge/sols.txt>
    > * <http://splatbang.com/rubyquiz/184_Befunge/summ.txt>
    >


    Your program will collect and output this data as yaml (or your favorite data
    serialization standard; xml, json, etc.).
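
As a starting point, here is a minimal sketch of how the directory layout Matthew describes maps onto URLs and a YAML record. The helper names (`quiz_file_urls`, `parse_quiz_dir`) and the `files` key are made up for illustration, and deriving the title from the directory name is an assumption, not something the site guarantees. It makes no network requests.

```ruby
require 'yaml'

BASE_URL = 'http://splatbang.com/rubyquiz'

# Build the three per-quiz file URLs from a directory name such as "184_Befunge".
def quiz_file_urls(quiz_dir)
  %w[quiz sols summ].each_with_object({}) do |name, urls|
    urls[name] = "#{BASE_URL}/#{quiz_dir}/#{name}.txt"
  end
end

# Split "184_Befunge" into a numeric id and a readable title (assumed convention).
def parse_quiz_dir(quiz_dir)
  id, title = quiz_dir.split('_', 2)
  { 'id' => id.to_i, 'title' => title.to_s.tr('_', ' ') }
end

# One record, ready to be filled in with description, summary and solutions.
RECORD = parse_quiz_dir('184_Befunge').merge('files' => quiz_file_urls('184_Befunge'))
puts RECORD.to_yaml
```

Fetching the actual text is left out on purpose; any HTTP client could be plugged in once the URLs are known.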

    --
    -Daniel
     
    Daniel Moore, Jan 23, 2009
    #1

  2. Daniel Moore

    Robert Dober Guest

    Daniel in which time zone are you? What do you and the others think if
    we give our friends in GMT-x some more time? My suggestion would be to
    extend the spoiler period to something like Sunday 13h or 14h GMT.
    Actually I do not care about the Americans ;) I just sleep that long on WEs.
    Just 0.02€.
    Robert
     
    Robert Dober, Jan 23, 2009
    #2

  3. Daniel Moore

    Matthew Moss Guest

    > Daniel in which time zone are you? What do you and the others think if
    > we give our friends in GMT-x some more time? My suggestion would be to
    > extend the spoiler period to something like Sunday 13h or 14h GMT.
    > Actually I do not care about the Americans ;) I just sleep that long
    > on WEs.



    Are you suggesting that a duration of 48 hours varies in duration from
    time zone to time zone?

    :D

    *wink wink*
     
    Matthew Moss, Jan 23, 2009
    #3
  4. On Fri, Jan 23, 2009 at 2:05 PM, Andy Cooper <> wrote:
    >
    >> > Daniel in which time zone are you? What do you and the others think if
    >> > we give our friends in GMT-x some more time? My suggestion would be to
    >> > extend the spoiler period to something like Sunday 13h or 14h GMT.
    >> > Actually I do not care about the Americans ;) I just sleep that long
    >> > on WEs.
    >>
    >> Are you suggesting that a duration of 48 hours varies in duration from
    >> time zone to time zone?
    >
    > American dollars are not worth as much as the Euro, so I would guess
    > that is exactly what he is saying. I mean time IS money after all.


    Damn you Daniel! First day on the job and you've got your hand in my pocket! :)

    -greg

    --
    Technical Blaag at: http://blog.majesticseacreature.com
    Non-tech stuff at: http://metametta.blogspot.com
    "Ruby Best Practices" Book now in O'Reilly Roughcuts:
    http://rubybestpractices.com
     
    Gregory Brown, Jan 23, 2009
    #4
  5. Daniel Moore

    Daniel Moore Guest

    I'm not opposed to extending the no spoiler period to give everyone
    more of the weekend to contemplate. **So everyone, please no spoilers
    until Sun 14:00 GMT**. As always feel free to ask questions and post
    non-spoiler discussion any time.

    My local time is UTC-8, so I posted the quiz Thursday night right
    before going to bed, which works out well for my schedule.

    Open question to everyone: What day and time would you prefer to have
    the new quizzes posted and how long of a no-spoiler period do you
    prefer?

    --
    -Daniel
    http://strd6.com
     
    Daniel Moore, Jan 24, 2009
    #5
  6. Daniel Moore

    Guest

    What's the deadline btw? I am almost ready with the solution since the
    weekend, but have too much on my plate to finish it right now :)

    Cheers,
    Peter
    __
    http://www.rubyrailways.com

     
    , Jan 27, 2009
    #6
  7. On Mon, Jan 26, 2009 at 7:18 PM, <> wrote:
    > What's the deadline btw? I am almost ready with the solution since the
    > weekend, but have too much on my plate to finish it right now :)


    Historically there have been no deadlines that I know of, just that if
    you aren't reasonably timely, you won't have a shot at being mentioned
    in the summary. But at least when James ran it, you could certainly
    submit late solutions for the archives. I hope this tradition is
    continued, but you can always of course post here at any rate.

    -greg


    --
    Technical Blaag at: http://blog.majesticseacreature.com
    Non-tech stuff at: http://metametta.blogspot.com
    "Ruby Best Practices" Book now in O'Reilly Roughcuts:
    http://rubybestpractices.com
     
    Gregory Brown, Jan 27, 2009
    #7
  8. Daniel Moore

    Daniel Moore Guest

    Gregory is correct, there aren't any hard deadlines. However, if you
    post your solution by early Thursday then it stands a better chance to
    get into the quiz summary.

    --
    -Daniel
    http://strd6.com
     
    Daniel Moore, Jan 27, 2009
    #8
  9. Daniel Moore

    Guest

    > Greetings!
    >
    > Welcome to the inaugural Ruby Quiz 3!


    Here is my scRUBYt! and Nokogiri based solution:

    http://pastie.org/374542

    As far as I can tell (the script generates a single XML file of
    several MB, so it's not trivial to determine), it is working well and
    it's also complete.
    If you need the XML file, drop me a msg.

    A writeup will follow on my blog soon, will post a message here.

    Cheers,
    Peter
    ___
    http://www.rubyrailways.com
     
    , Jan 29, 2009
    #9
  10. Daniel Moore

    Daniel Moore Guest

    [SUMMARY] Gathering Ruby Quiz 2 Data (#189)

    This quiz was an exercise in Web Scraping
    [http://en.wikipedia.org/wiki/Web_scraping]. As more and more
    information becomes available on the internet, it is useful to have a
    programmatic way to access it. This can be done through web APIs, but
    not every website offers one, and those that do may not expose all of
    their information. Scraping may be against the terms of use for
    some sites, and smaller sites may suffer if large amounts of data are
    being pulled, so be sure to ask permission and be prudent!

    The one solution to this week's quiz comes from Peter Szinek, using
    scRUBYt [http://scrubyt.org/]. Despite being just over fifty lines
    long, there is a lot packed in here, so let's dive in.

    We begin by setting up a scRUBYt Extractor and pointing it at the
    main Ruby Quiz 2 page.

    # scrape the stuff with scRUBYt!
    data = Scrubyt::Extractor.define do
      fetch 'http://splatbang.com/rubyquiz/'

    The 'quiz' call sets up a node in the XML document for each element
    that matches the XPath. This yields all the links in the side area,
    that is, links to all the quizzes.

      quiz "//div[@id='side']/ol/li/a[1]" do
        link_url do
          quiz_id /id=(\d+)/
          quiz_link /id=(.+)/ do

    These next two sections download the description and summary for each
    quiz. They are saved into temporary files to be loaded into the XML
    document at the end. Notice the use of lambda: it takes the match
    from /id=(.+)/ in quiz_link. For example, when the link is
    'quiz.rhtml?id=157_The_Smallest_Circle', the pattern captures
    '157_The_Smallest_Circle' and passes it to the lambda, which returns
    "http://splatbang.com/rubyquiz/157_The_Smallest_Circle/quiz.txt",
    the URL of the quiz text. The summary is gathered in the same way.

            quiz_desc_url(lambda {|quiz_dir|
              "http://splatbang.com/rubyquiz/#{quiz_dir}/quiz.txt"},
              :type => :script) do
              quiz_dl 'descriptions', :type => :download
            end
            quiz_summary_url(lambda {|quiz_dir|
              "http://splatbang.com/rubyquiz/#{quiz_dir}/summ.txt"},
              :type => :script) do
              quiz_dl 'summaries', :type => :download
            end
          end
        end
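
The capture-and-interpolate step can be tried on its own in plain Ruby, without scRUBYt. This is only a sketch: `quiz_id`, `quiz_dir`, `DESC_URL` and `SUMMARY_URL` are hypothetical names standing in for the extractor's patterns and lambdas, and nothing is downloaded.

```ruby
BASE_URL = 'http://splatbang.com/rubyquiz'

# Same pattern as quiz_id in the extractor: only the leading digits after "id=".
def quiz_id(href)
  href[/id=(\d+)/, 1]
end

# Same pattern as quiz_link: everything after "id=", i.e. the directory name.
def quiz_dir(href)
  href[/id=(.+)/, 1]
end

# Stand-ins for the lambdas handed to quiz_desc_url and quiz_summary_url.
DESC_URL    = lambda { |dir| "#{BASE_URL}/#{dir}/quiz.txt" }
SUMMARY_URL = lambda { |dir| "#{BASE_URL}/#{dir}/summ.txt" }

href = 'quiz.rhtml?id=157_The_Smallest_Circle'
puts quiz_id(href)
puts DESC_URL.call(quiz_dir(href))
puts SUMMARY_URL.call(quiz_dir(href))
```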

    This next part gets all the solutions for each quiz. It follows the
    link_url from the side area. Once on the new page it creates a node
    for each solution, again by using XPath to get all the links in the
    list on the side. It populates each solution with an author: the text
    from the html anchor tag. It populates the ruby_talk_reference with
    the href attribute of the tag. In order to get the solution text it
    follows (resolves) the link and returns the text within the "//pre[1]"
    element, again using XPath to specify. The text node is added as a
    child node to the solution.

        quiz_detail :resolve => "http://splatbang.com/rubyquiz" do
          solution "/html/body/div/div[2]/ol/li/a" do
            author lambda {|solution_link_text| solution_link_text},
                   :type => :script
            ruby_talk_reference "href", :type => :attribute
            solution_detail :resolve => :full do
              text "//pre[1]"
            end
          end
        end
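
To see what the XPath selection, the link text, and the href attribute each contribute, here is a rough sketch using REXML, which ships with Ruby, rather than scRUBYt. The markup, author names and example.org links are invented stand-ins for a quiz page, the XPath is simplified to match them, and `solutions_from` is a hypothetical helper.

```ruby
require 'rexml/document'

# Invented, well-formed stand-in for the solution list on a quiz page.
PAGE = <<~HTML
  <html><body>
    <div id="side">
      <ol>
        <li><a href="http://example.org/ruby-talk/1001">Alice Example</a></li>
        <li><a href="http://example.org/ruby-talk/1002">Bob Example</a></li>
      </ol>
    </div>
  </body></html>
HTML

# Return one hash per matching anchor: link text as author, href as reference.
def solutions_from(markup, quiz_id)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, "//div[@id='side']/ol/li/a").map do |a|
    {
      'quiz_id'             => quiz_id,
      'author'              => a.text,
      'ruby_talk_reference' => a.attributes['href']
    }
  end
end

SOLUTIONS = solutions_from(PAGE, 184)
SOLUTIONS.each { |s| puts "#{s['author']}: #{s['ruby_talk_reference']}" }
```

Real pages are rarely well-formed XML, which is one reason to prefer a forgiving HTML parser for the actual scrape.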

    This select_indices limits the scope of the quiz gathering to just the
    first three, which is useful for testing since we don't want to
    traverse the entire site to see if the code works. I removed it when
    gathering the full dataset.

      end.select_indices(0..2)
    end

    This next part, using Nokogiri, loads the files that were saved
    temporarily and inserts them into the XML document. It also removes
    the link_url nodes to clean up the final output to match the output
    specified in the quiz.

    result = Nokogiri::XML(data.to_xml)

    (result/"//quiz").each do |quiz|
      quiz_id = quiz.text[/\s(\d+)\s/,1].to_i
      file_index = quiz_id > 157 ? "_#{(quiz_id - 157)}" : ""
      (quiz/"//link_url").first.unlink

      desc = Nokogiri::XML::Element.new("description", quiz.document)
      desc.content = open("descriptions/quiz#{file_index}.txt").read
      quiz.add_child(desc)

      summary = Nokogiri::XML::Element.new("summary", quiz.document)
      summary.content = open("summaries/summ#{file_index}.txt").read
      quiz.add_child(summary)
    end
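
The `file_index` expression is the least obvious line here. It appears to assume that the downloads were saved as `quiz.txt`, `quiz_1.txt`, `quiz_2.txt` and so on, in quiz order starting from quiz 157. A small sketch of that mapping, with `download_name` and `FIRST_QUIZ_ID` as hypothetical names:

```ruby
FIRST_QUIZ_ID = 157 # assumed: the lowest quiz id in the archive

# Map a quiz id to the name of its downloaded file, mirroring file_index above.
def download_name(stem, quiz_id)
  offset = quiz_id - FIRST_QUIZ_ID
  suffix = offset > 0 ? "_#{offset}" : ''
  "#{stem}#{suffix}.txt"
end

[157, 158, 184].each do |id|
  puts "#{id}: #{download_name('quiz', id)} / #{download_name('summ', id)}"
end
```

If the downloads were ever fetched in a different order, or a quiz were skipped, this arithmetic would pair quizzes with the wrong files, so it is worth spot-checking a few entries in the output.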

    And finally save the result to an xml file on the filesystem:

    open("ruby_quiz_archive.xml", "w") {|f| f.write result}

    This was my first experience with scRUBYt and it took me a little
    while to "get it". It packs a lot of power into a concise syntax and
    is definitely worth considering for your next web scraping needs.

    --
    -Daniel
    http://rubyquiz.strd6.com
     
    Daniel Moore, Jan 31, 2009
    #10
  11. Daniel Moore

    James Gray Guest

    Re: [SUMMARY] Gathering Ruby Quiz 2 Data (#189)

    On Jan 31, 2009, at 1:36 PM, Daniel Moore wrote:

    > This quiz was an exercise in Web Scraping
    > [http://en.wikipedia.org/wiki/Web_scraping].


    Great summary Daniel. You've got the new quiz off to a great start.

    James Edward Gray II
     
    James Gray, Jan 31, 2009
    #11
