Extracting Data from a Webpage

Discussion in 'Ruby' started by Tj Superfly, Jan 27, 2008.

  1. Tj Superfly

    Tj Superfly Guest

    Hello everyone.

    I was wondering if anyone knows a way to extract the web page title from
    a specific URL that you input into a program?

    I give it the URL, say www.google.com. It then gives me "Google" - its
    title.

    Also, is there any way that the program could extract the next 5
    characters after a certain phrase that doesn't change on the webpage?

    Thanks for your help in advance!
    --
    Posted via http://www.ruby-forum.com/.
     
    Tj Superfly, Jan 27, 2008
    #1

  2. Tj Superfly

    s.ross Guest

    On Jan 26, 2008, at 7:21 PM, Tj Superfly wrote:

    > Hello everyone.
    >
    > I was wondering if anyone knew a way to extract the web page title off
    > of a specific URL that you input into a program?
    >
    > I give it the URL, say www.google.com. It then gives me "Google" -
    > it's
    > title.
    >
    > Then also, is there anyway that the program could extract the next 5
    > characters - after a certain phrase that doesn't change on the
    > webpage?
    >
    > Thanks for your help in advance!
    > --
    > Posted via http://www.ruby-forum.com/.
    >


    http://code.whytheluckystiff.net/hpricot/

    It's a snap.
     
    s.ross, Jan 27, 2008
    #2

  3. Tj Superfly

    7stud -- Guest

    Tj Superfly wrote:
    > Hello everyone.
    >
    > I was wondering if anyone knew a way to extract the web page title off
    > of a specific URL that you input into a program?
    >
    > I give it the URL, say www.google.com. It then gives me "Google" - it's
    > title.
    >
    > Then also, is there anyway that the program could extract the next 5
    > characters - after a certain phrase that doesn't change on the webpage?
    >
    > Thanks for your help in advance!


    You can do something like this:

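    require 'open-uri'

    url = "http://www.google.com"

    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
    URI.open(url) do |f|
      f.each do |line|
        # Print the page title when it sits on a single line.
        if md_obj = /<title>(.*)<\/title>/.match(line)
          puts md_obj[1]
        end

        # Print the 5 characters that follow the fixed phrase "type=".
        if md_obj = /type=(.{5})/.match(line)
          puts md_obj[1]
        end
      end
    end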

    Ruby also has various html parsing libraries that allow you to search
    html documents by tag name, tag position, etc.
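
The two tasks from the original question can also be tried on a plain string, without any network access. The following is only a minimal sketch: the helper names `extract_title` and `chars_after` and the sample HTML are made up for illustration, and a regex is a rough tool compared with a real HTML parser.

```ruby
# Return the text inside the first <title> element, or nil if there is none.
def extract_title(html)
  match = html.match(%r{<title[^>]*>(.*?)</title>}im)
  match && match[1].strip
end

# Return up to `count` characters that follow the first occurrence of `phrase`,
# or nil if the phrase does not occur in the text.
def chars_after(text, phrase, count = 5)
  index = text.index(phrase)
  return nil unless index

  text[index + phrase.length, count]
end

# Made-up sample page, standing in for a downloaded document.
sample = '<html><head><title>Google</title></head>' \
         '<body><input type=hidden name=q></body></html>'

puts extract_title(sample)         # Google
puts chars_after(sample, 'type=')  # hidde
```

Both helpers return nil when nothing is found, so the caller can decide how to report a missing title or phrase.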
     
    7stud --, Jan 27, 2008
    #3
  4. Tj Superfly

    7stud -- Guest

    7stud -- wrote:
    > You can do something like this:
    >
    > require 'open-uri'
    >
    > url = "http://www.google.com"
    >
    > open(url) do |f|
    > f.each do |line|
    > if md_obj = /<title>(.*)<\/title>/.match(line)
    > puts md_obj[1]
    > end
    >
    > if md_obj = /type=(.{6})/.match(line)
    > puts md_obj[1]
    > end
    >
    > end
    > end
    >


    This should be more efficient:

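    require 'open-uri'

    url = "http://www.google.com"
    # Plain regex literals; wrapping them in Regexp.new is not needed.
    title_re = /<title>(.*)<\/title>/
    text_re = /type=(.{5})/

    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
    URI.open(url) do |f|
      f.each do |line|
        if md_obj = title_re.match(line)
          puts md_obj[1]
        end

        if md_obj = text_re.match(line)
          puts md_obj[1]
          break # stop reading once the phrase has been found
        end
      end
    end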

    --output:
    Google
    hidde #first 5 chars of 'hidden'

     
    7stud --, Jan 27, 2008
    #4
  5. On Jan 26, 9:21 pm, Tj Superfly <> wrote:
    > Hello everyone.
    >
    > I was wondering if anyone knew a way to extract the web page title off
    > of a specific URL that you input into a program?
    >
    > I give it the URL, say www.google.com. It then gives me "Google" - its
    > title.


    "www.google.com"[/(www\.)?(.*)\./,2].capitalize
    ==>"Google"
    "google.com"[/(www\.)?(.*)\./,2].capitalize
    ==>"Google"
     
    William James, Jan 27, 2008
    #5
  6. Tj Superfly

    Tj Superfly Guest

    > This should be more efficient:
    >
    > require 'open-uri'
    >
    > url = "http://www.google.com"
    > title_re = Regexp.new(/<title>(.*)<\/title>/)
    > text_re = Regexp.new(/type=(.{5})/)
    >
    > open(url) do |f|
    > f.each do |line|
    > if md_obj = title_re.match(line)
    > puts md_obj[1]
    > end
    >
    > if md_obj = text_re.match(line)
    > puts md_obj[1]
    > break
    > end
    >
    > end
    > end
    >
    > --output:
    > Google
    > hidde #first 5 chars of 'hidden'


    I receive this error message when trying this code.

    DENTIFIER, expecting $end
    endndndreakmd_obj[1]_re.match(line))/title>/)

    Any suggestions? I did try the other clip of code posted here, but got
    more errors than this one. =/ I'm reading up on that link posted in the
    2nd post to see if I can figure any of this out.

    Thanks.

     
    Tj Superfly, Jan 27, 2008
    #6
  7. Tj Superfly

    7stud -- Guest

    Tj Superfly wrote:
    > I receive this eror message when trying this code.
    >
    > DENTIFIER, expecting $end
    > endndndreakmd_obj[1]_re.match(line))/title>/)
    >
    > Any suggestions?
    >


    1) Learn some basic ruby?

    2) Learn how to post a question on a computer programming forum?
     
    7stud --, Jan 27, 2008
    #7
  8. Tj Superfly

    Tj Superfly Guest

    7stud -- wrote:
    > Tj Superfly wrote:
    >> I receive this eror message when trying this code.
    >>
    >> DENTIFIER, expecting $end
    >> endndndreakmd_obj[1]_re.match(line))/title>/)
    >>
    >> Any suggestions?
    >>

    >
    > 1) Learn some basic ruby?
    >
    > 2) Learn how to post a question on a computer programming forum?


    Anyone else know what the matter is?
     
    Tj Superfly, Jan 27, 2008
    #8
  9. Tj Superfly

    7stud -- Guest

    Tj Superfly wrote:
    > 7stud -- wrote:
    >> Tj Superfly wrote:
    >>> I receive this eror message when trying this code.
    >>>
    >>> DENTIFIER, expecting $end
    >>> endndndreakmd_obj[1]_re.match(line))/title>/)
    >>>
    >>> Any suggestions?
    >>>

    >>
    >> 1) Learn some basic ruby?
    >>
    >> 2) Learn how to post a question on a computer programming forum?

    >
    > Anyone else know what the matter is?


    How to post a question on a computer programming Forum:

    1) Post a simple example program that demonstrates your problem.

    2) Post the error message in its entirety--not an unintelligible portion
    of it.

    3) Post your question about the code.

    4) Use a descriptive title for your post-- not something like
    "URGENT...HELP ME!"

    5) Proofread and spell-check your post before clicking submit.
     
    7stud --, Jan 27, 2008
    #9
  10. Tj Superfly

    fedzor Guest

    On Jan 27, 2008, at 4:57 PM, Tj Superfly wrote:

    > 7stud -- wrote:
    >> Tj Superfly wrote:
    >>> I receive this eror message when trying this code.
    >>>
    >>> DENTIFIER, expecting $end
    >>> endndndreakmd_obj[1]_re.match(line))/title>/)
    >>>
    >>> Any suggestions?


    I believe that $end means you're missing some sort of end delimiter,
    but NOT 'end'. Check for {} or / / for regexp

    Also, if you can, have your editor do an autoformat thing so you can
    see where the indentation screws up.
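
One possible cause, which is only a guess and is not confirmed anywhere in the thread: the overlapping text in the error message looks like lines printed on top of each other, which can happen when pasted code ends its lines with bare carriage returns instead of newlines. Ruby then sees the whole script as a single line. A minimal sketch of normalizing the line endings before running the code (the helper name `normalize_newlines` and the sample string are made up):

```ruby
# Convert CRLF (Windows) and bare CR (classic Mac) line endings to LF.
def normalize_newlines(source)
  source.gsub(/\r\n?/, "\n")
end

# Made-up example of pasted code that uses bare carriage returns.
pasted = "if true\r  puts 'ok'\rend\r"
fixed = normalize_newlines(pasted)

puts fixed.lines.length  # 3
```

Most editors can also convert line endings directly, which avoids the extra step.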
     
    fedzor, Jan 27, 2008
    #10
  11. Tj Superfly

    Marc Heiler Guest

    Marc Heiler, Jan 27, 2008
    #11
  12. Tj Superfly

    7stud -- Guest

    7stud -- wrote:
    >
    > title_re = Regexp.new(/<title>(.*)<\/title>/)
    >


    While that regex works for www.google.com, in order for the regex to be
    more general, the regex should be:

    title_re = Regexp.new(/<title>(.*)<\/title>/m)

    and then to output the match:

    puts md_obj[1].strip()
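
The effect of the /m flag can be seen on a title that spans several lines. This is a small sketch with made-up HTML; note that it only helps when the regex is matched against the whole page rather than one line at a time.

```ruby
# Made-up page whose title is spread over three lines.
html = "<html><head><title>\n  Example Page\n</title></head></html>"

single_line = %r{<title>(.*)</title>}   # . does not match newlines
multi_line  = %r{<title>(.*)</title>}m  # with /m, . also matches newlines

p single_line.match(html)          # nil
p multi_line.match(html)[1].strip  # "Example Page"
```

The strip call removes the newlines and indentation captured around the title text.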
     
    7stud --, Jan 28, 2008
    #12

  13. > I was wondering if anyone knew a way to extract the web page title off
    > of a specific URL that you input into a program?


    require 'net/http'
    puts Net::HTTP.new('www.google.com').get('/').
    body[/<title>(.*?)<.title>/i,1]
     
    William James, Jan 28, 2008
    #13
  14. Tj Superfly

    7stud -- Guest

    William James wrote:
    >> I was wondering if anyone knew a way to extract the web page title off
    >> of a specific URL that you input into a program?

    >
    > require 'net/http'
    > puts Net::HTTP.new('www.google.com').get('/').
    > body[/<title>(.*?)<.title>/i,1]


    Nice. I tested your code and it works for me. But my reading of the
    docs says that it shouldn't work: new() doesn't open a connection, and
    get(), "Gets data from path on the connected-to host." The docs seem
    to want you to do something like:

    # get_response takes a host name (without the "http://" scheme) and a path.
    resp_obj = Net::HTTP.get_response('www.google.com', '/index.html')
    page = resp_obj.body
     
    7stud --, Jan 28, 2008
    #14
  15. 7stud -- wrote:
    > Nice. I tested your code and it works for me. But my reading of the
    > docs says that it shouldn't work: new() doesn't open a connection, and
    > get(), "Gets data from path on the connected-to host." The docs seem
    > to want you to do something like:
    >
    > resp_obj = Net::HTTP.get_response('http://www.google.com',
    > '/index.html')
    > page = resp_obj.body


    Reading the Net::HTTP docs here:
    http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127

    It says that Net::HTTP#get will:
    "Send a GET request to the target and return the response as a string"

    and Net::HTTP#get_response will:
    "Send a GET request to the target and return the response as a
    Net::HTTPResponse object"

    The #new in this case is optional because both methods are class methods
    or instance methods? Someone might be able to clarify this part a
    little more. But the examples at that doc URL don't even use
    Net::HTTP#new.
     
    Joseph Pecoraro, Jan 28, 2008
    #15
  16. Tj Superfly

    7stud -- Guest

    Joseph Pecoraro wrote:
    > 7stud -- wrote:
    >> Nice. I tested your code and it works for me. But my reading of the
    >> docs says that it shouldn't work: new() doesn't open a connection, and
    >> get(), "Gets data from path on the connected-to host." The docs seem
    >> to want you to do something like:
    >>
    >> resp_obj = Net::HTTP.get_response('http://www.google.com',
    >> '/index.html')
    >> page = resp_obj.body

    >
    > Reading the Net::HTTP docs here:
    > http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127
    >
    > It says that Net::HTTP#get will:
    > "Send a GET request to the target and return the response as a string"
    >
    > and Net::HTTP#get_response will:
    > "Send a GET request to the target and return the response as a
    > Net::HTTPResponse object"
    >
    > The #new in this case is optional because both methods are class methods
    > or instance methods?


    According to the docs, Net::HTTP has class methods:

    get()
    get_response()

    and an instance method:

    get()

    As with all ruby classes, new() creates an instance. Therefore, in the
    code example I was wondering about:

    > puts Net::HTTP.new('www.google.com').get('/').
    > body[/<title>(.*?)<.title>/i,1]


    new() creates an instance, which is being used to call get(), so the
    version of get() being called is the instance method. Yet, the docs say
    the get() instance method "Gets data from path on the connected-to
    host". What connected to host? According to the docs on new() it says,
    "This method does not open the TCP connection."

    In addition, the get() version in that code cannot be the class method
    version because the class method version returns a String and Strings do
    not have a body() method, which is the next method call.
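
The class-method versus instance-method distinction described above can be checked in irb without making any request. A small sketch using only reflection calls:

```ruby
require 'net/http'

# Class-level methods, called as Net::HTTP.get(...) and Net::HTTP.get_response(...).
puts Net::HTTP.respond_to?(:get)           # true
puts Net::HTTP.respond_to?(:get_response)  # true

# Instance-level method, called on an object created with Net::HTTP.new.
puts Net::HTTP.method_defined?(:get)       # true

# new() only builds the object; no connection has been opened yet.
http = Net::HTTP.new('www.google.com')
puts http.started?                         # false
```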


    > Someone might be able to clarify this a part a
    > little more. But the examples at that doc url don't even use
    > New::HTTP#new.


     
    7stud --, Jan 28, 2008
    #16
  17. Tj Superfly

    7stud -- Guest

    7stud -- wrote:
    >
    > As with all ruby classes, new() creates an instance. Therefore, in the
    > code example I was wondering about:
    >
    >> puts Net::HTTP.new('www.google.com').get('/').
    >> body[/<title>(.*?)<.title>/i,1]

    >
    > new() creates an instance, which is being used to call get(), so the
    > version of get() being called is the instance method. Yet, the docs say
    > the get() instance method "Gets data from path on the connected-to
    > host". What connected to host? According to the docs on new() it says,
    > "This method does not open the TCP connection."
    >


    As far as I can tell, you should have to call start() on a Net::HTTP
    instance in order to open a connection, e.g.:

    str = Net::HTTP.new('www.google.com').start().get('/').body
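
For what it's worth, in the net/http that ships with current Ruby versions, calling get() on an instance that has not been started opens a connection for that one request and closes it again afterwards, which would explain why the shorter version works. The sketch below tries this against a throwaway local server so that it needs no internet access; `start_demo_server` and the page content are made up for the demo.

```ruby
require 'net/http'
require 'socket'

# Hypothetical helper: serve one fixed HTML page on a free local port.
def start_demo_server(html)
  server = TCPServer.new('127.0.0.1', 0) # port 0 = pick any free port
  Thread.new do
    loop do
      client = server.accept
      # Skip the request line and headers; they end with an empty line.
      loop do
        line = client.gets
        break if line.nil? || line == "\r\n"
      end
      client.write("HTTP/1.1 200 OK\r\n" \
                   "Content-Type: text/html\r\n" \
                   "Content-Length: #{html.bytesize}\r\n" \
                   "Connection: close\r\n\r\n" + html)
      client.close
    end
  end
  server
end

server = start_demo_server('<html><head><title>Demo</title></head></html>')
port = server.addr[1]

# Third argument nil = ignore any proxy settings from the environment.
http = Net::HTTP.new('127.0.0.1', port, nil)
started_before = http.started?
title = http.get('/').body[/<title>(.*?)<\/title>/i, 1]
started_after = http.started?

puts started_before  # false
puts title           # Demo
puts started_after   # false

# The explicit form: the block runs with an open connection,
# which is closed when the block returns.
explicit_title = Net::HTTP.start('127.0.0.1', port, nil) do |conn|
  conn.get('/').body[/<title>(.*?)<\/title>/i, 1]
end
puts explicit_title  # Demo
```

The block form of start is the safer habit when several requests go to the same host, since they can then share one connection.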
     
    7stud --, Jan 28, 2008
    #17