Challenge: Extract episode descriptions.

Discussion in 'Ruby' started by Stedwick, Jan 19, 2008.

  1. Stedwick

    Stedwick Guest

    This is just a whimsical question, really. I've been working on a
    website where people can vote on episodes of TV shows (and I happen to
    be a big Star Trek fan, so I'm starting there ha ha). By the way, the
    website is, literally, 40 lines of code. I'm loving Ruby on Rails so
    far.

    http://brocoum.com/voter/startrekvoyager/episodes

    Anyway, I need to extract the episode descriptions for the tool tips,
    and the descriptions come from TV.com. Unfortunately, this has turned
    out to be rather harder than it looks!

    http://www.tv.com/star-trek-deep-sp....html?season=0&tag=season_dropdown;dropdown;7

    If any of you feel up to the challenge, see if you can streamline my
    code below, or write better code yourself. I can't help but think that
    there's an easier way to do this!

    # open html file
    f = File.read("episode_guide.html")

    # keep track of the number of descriptions found
    count = 0

    # each description is enclosed in a multiline <p> </p> tag
    f.scan(/<p>.*?<\/p>/m) do |match|
      # start with a blank description
      desc = ''
      # i want to condense each desc into a single line, and remove the stardate info
      match.each_line {|m|
        # remove stardate...<br /> because the stardate is not always on its own line
        m.sub!(/^.*<br \/>/,'')
        # remove unnecessary whitespace from beginning
        m.sub!(/^\s*/,'')
        # add non-stardate and non-blank lines to the desc and remove trailing \n
        desc += m.chomp unless m =~ /stardate:/i or !(m =~ /\w/)
      }

      # remove html tags
      desc.gsub!(/<.*?>/,'')
      # fix periods ie. "Hi there.I love you." => "Hi there. I love you."
      # these period problems were caused by concatenating the paragraphs above into one line
      desc.gsub!(/(\w\.)(\w)/,'\1 \2')
      # fix stupid html &nbsp; type stuff
      # (the apostrophe entity is assumed to be &#39;)
      desc.gsub!(/&nbsp;/," ")
      desc.gsub!(/&#39;/,"'")
      # make all spaces single
      desc.gsub!(/ {2,}/,' ')

      # output finished description followed by blank line and increment counter
      puts desc + "\n\n"
      count += 1
    end

    # make sure i got all 176 episode descriptions
    puts count

    Philip
     
    Stedwick, Jan 19, 2008
    #1
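
A side note on the entity clean-up step in the code above: instead of substituting entities one at a time, Ruby's standard library can decode the common ones in a single call. This is only a minimal sketch, assuming the descriptions have already had their tags removed. `clean_entities` is a hypothetical helper name, not something from the thread, and `&nbsp;` is handled separately because `CGI.unescapeHTML` leaves it untouched.

```ruby
require "cgi"

# Hypothetical helper (not from the thread): decode HTML entities in an
# already tag-free description and tidy the whitespace.
def clean_entities(text)
  # &nbsp; is not decoded by CGI.unescapeHTML, so turn it into a plain space first
  text = text.gsub("&nbsp;", " ")
  # decodes &amp; &lt; &gt; &quot; and numeric references such as &#39;
  text = CGI.unescapeHTML(text)
  # collapse runs of whitespace and trim the ends
  text.gsub(/\s+/, " ").strip
end

puts clean_entities("Sisko&#39;s  crew&nbsp;&amp; the &quot;Defiant&quot; ")
```

The sample string is made up for illustration. Decoding after tag removal matters: if `&lt;` were decoded first, a later tag-stripping regex could eat legitimate text.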

  2. yermej

    yermej Guest

    On Jan 18, 10:18 pm, Stedwick <> wrote:
    > [...]


    Look into Hpricot - http://code.whytheluckystiff.net/hpricot/ - or
    another HTML parser. It makes things like this much easier - no need
    for regexes.
     
    yermej, Jan 19, 2008
    #2

  3. 2008/1/19, Stedwick <>:

    > If any of you feel up to the challenge, see if you can streamline my
    > code below, or write better code yourself. I can't help but think that
    > there's an easier way to do this!
    >
    > # open html file
    > f = File.read("episode_guide.html")
    >
    > # keep track of the number of descriptions found
    > count = 0
    >
    > # each description is enclosed in a multiline <p> </p> tag
    > f.scan(/<p>.*?<\/p>/m) do |match|


    [...]

    You should take a look at the Hpricot gem to make the
    HTML scraping easier.

    -- Jean-François.
     
    Jean-François Trân, Jan 19, 2008
    #3
  4. Stedwick

    Guest

    On Jan 18, 10:18 pm, Stedwick <> wrote:
    > [...]


    This is not exactly what you want, but you may find it helpful:

    require 'hpricot'
    require 'open-uri'

    url = 'http://www.tv.com/star-trek-deep-space-nine/show/166/episode_guide.html?printable=1'
    @doc = Hpricot(open(url))

    @doc.search("/html/body/div[1]/div").each do |div|

      # print the heading link text on one line
      div.search("h1/a") do |h1|
        puts h1.inner_text.strip().squeeze(" ").gsub("\n"," ")
      end

      # print the text of each matching description block on one line
      div.search("//div[@class='f-verdana f-small lh-16 mt-15 mb-15']") do |desc|
        puts desc.inner_text.strip().squeeze(" ").gsub("\n"," ")
        puts
      end

    end
     
    , Jan 19, 2008
    #4
  5. Stedwick wrote:
    > [...]


    text = IO.read("episode_guide.html")
    a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
    map{|s|
    s.strip.gsub(/&nbsp;/," ").gsub(/<.*?>|&[^;]+;/m,"").
    gsub(/\s+/, " ") }
    puts a.join("\n\n")
    puts
    puts a.size
     
    William James, Jan 20, 2008
    #5
  6. On Jan 19, 10:39 pm, William James <> wrote:

    > text = IO.read("episode_guide.html")
    > a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
    > map{|s|
    > s.strip.gsub(/&nbsp;/," ").gsub(/<.*?>|&[^;]+;/m,"").
    > gsub(/\s+/, " ") }
    > puts a.join("\n\n")
    > puts
    > puts a.size


    Corrected:

    text = IO.read("episode_guide.html")
    a = text.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.
    map{|s|
    s.gsub(/&nbsp;/," ").gsub(/<.*?>/m,"").gsub("&#39;","'").
    gsub(/\s+/, " ").strip }
    puts a.join("\n\n")
    puts
    puts a.size
     
    William James, Jan 20, 2008
    #6
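
To try the scan-based approach without the saved `episode_guide.html`, it can be wrapped in a method and fed a small string. This is a rough sketch, not William's exact code: `extract_descriptions` is a hypothetical name, and the sample HTML is invented to imitate the structure described in the thread (a `<p>` that opens with a stardate, followed by `<br />` and the description).

```ruby
# Hypothetical wrapper around the scan-based approach from the thread.
def extract_descriptions(html)
  # capture everything between the stardate prefix and the closing </p>
  html.scan(/<p>\s*stardate:[ a-z.\d]*(.*?)<\/p>/mi).flatten.map do |s|
    s.gsub("&nbsp;", " ")     # non-breaking spaces become plain spaces
     .gsub(/<.*?>/m, "")      # drop remaining tags such as <br />
     .gsub(/\s+/, " ")        # collapse newlines and repeated spaces
     .strip
  end
end

# Made-up sample input, for illustration only.
sample = <<~HTML
  <p>
  Stardate: 46379.1<br />
  Commander Sisko arrives
  at the station.
  </p>
  <p>Stardate: Unknown<br />Odo investigates&nbsp;a theft.</p>
HTML

descriptions = extract_descriptions(sample)
puts descriptions.join("\n\n")
puts descriptions.size
```

Note that the character class after `stardate:` stops at the first `<`, so this relies on a tag (here `<br />`) separating the stardate from the description; a paragraph without one would lose the start of its text.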
  7. Stedwick

    Stedwick Guest

    On Jan 20, 4:38 am, William James <> wrote:
    > [...]


    I'm liking yours so far William :) It's pretty elegant.
     
    Stedwick, Jan 21, 2008
    #7