Extracting Data from a Webpage

Tj Superfly

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me "Google" - its
title.

Then also, is there any way that the program could extract the next 5
characters - after a certain phrase that doesn't change on the webpage?

Thanks for your help in advance!
 
s.ross

http://code.whytheluckystiff.net/hpricot/

It's a snap.
 
7stud --

You can do something like this:

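require 'open-uri'

url = "http://www.google.com"

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
URI.open(url) do |f|
  f.each do |line|
    # print the page title if this line contains it
    if md_obj = /<title>(.*)<\/title>/.match(line)
      puts md_obj[1]
    end

    # print the 6 characters that follow 'type='
    if md_obj = /type=(.{6})/.match(line)
      puts md_obj[1]
    end
  end
end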

Ruby also has various html parsing libraries that allow you to search
html documents by tag name, tag position, etc.
 
7stud --

This should be more efficient:

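require 'open-uri'

url = "http://www.google.com"
title_re = /<title>(.*)<\/title>/   # a regex literal is already a Regexp
text_re = /type=(.{5})/             # the 5 characters after 'type='

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
URI.open(url) do |f|
  f.each do |line|
    if md_obj = title_re.match(line)
      puts md_obj[1]
    end

    if md_obj = text_re.match(line)
      puts md_obj[1]
      break   # stop reading once the phrase has been found
    end
  end
end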

--output:
Google
hidde #first 5 chars of 'hidden'
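
For the second half of the question (the next 5 characters after a fixed phrase), the same idea can be written against a plain string, which makes it easy to try without a network connection. This is only a sketch: the helper names page_title and chars_after and the sample HTML are made up for illustration and are not part of the code above.

```ruby
# Hypothetical helpers; the names and the sample HTML are illustrative only.

# Returns the text between <title> and </title>, or nil if there is none.
def page_title(html)
  html[/<title>(.*?)<\/title>/im, 1]&.strip
end

# Returns the n characters that follow the first occurrence of phrase,
# or nil if the phrase is missing or fewer than n characters follow it.
# Regexp.escape keeps characters such as "?" or "." in the phrase literal.
def chars_after(text, phrase, n = 5)
  text[/#{Regexp.escape(phrase)}(.{#{n}})/m, 1]
end

html = <<~HTML
  <html><head><title>Google</title></head>
  <body><input type=hidden name=q></body></html>
HTML

puts page_title(html)             # prints "Google"
puts chars_after(html, "type=")   # prints "hidde"
```

With a real page, the string could come from URI.open(url).read instead of the sample HTML.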
 
William James

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me "Google" - its
title.

"www.google.com"[/(www\.)?(.*)\./,2].capitalize
==>"Google"
"google.com"[/(www\.)?(.*)\./,2].capitalize
==>"Google"
 
Tj Superfly

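7stud said:
This should be more efficient:

require 'open-uri'

url = "http://www.google.com"
title_re = /<title>(.*)<\/title>/   # a regex literal is already a Regexp
text_re = /type=(.{5})/             # the 5 characters after 'type='

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
URI.open(url) do |f|
  f.each do |line|
    if md_obj = title_re.match(line)
      puts md_obj[1]
    end

    if md_obj = text_re.match(line)
      puts md_obj[1]
      break   # stop reading once the phrase has been found
    end
  end
end

--output:
Google
hidde #first 5 chars of 'hidden'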

I receive this eror message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions? I did try the other clip of code posted here, but got
more errors than this one. =/ I'm reading up on that link posted in the
2nd post to see if I can figure any of this out.

Thanks.
 
7stud --

Tj said:
I receive this eror message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

1) Learn some basic ruby?

2) Learn how to post a question on a computer programming forum?
 
Tj Superfly

Anyone else know what the matter is?
 
7stud --

How to post a question on a computer programming forum:

1) Post a simple example program that demonstrates your problem.

2) Post the error message in its entirety--not an unintelligible portion
of it.

3) Post your question about the code.

4) Use a descriptive title for your post-- not something like
"URGENT...HELP ME!"

5) Proofread and spell check your post before clicking submit.
 
fedzor

7stud said:
Tj said:
I receive this eror message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

I believe that $end means you're missing some sort of end delimiter,
but NOT 'end'. Check for {} or / / for regexp

Also, if you can, have your editor do an autoformat thing so you can
see where the indentation screws up.
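
A quick way to narrow this kind of problem down is to let Ruby check the syntax without running anything; from the command line, ruby -c yourfile.rb does that. The sketch below does the same check from inside Ruby. It relies on RubyVM::InstructionSequence, which is specific to the standard (MRI) interpreter, and the helper name syntax_error_for is made up for illustration.

```ruby
# Hypothetical helper: returns nil when the source parses cleanly,
# otherwise the SyntaxError message. MRI-specific (uses RubyVM).
def syntax_error_for(source)
  RubyVM::InstructionSequence.compile(source)
  nil
rescue SyntaxError => e
  e.message
end

good = "if true\n  puts 'ok'\nend\n"
bad  = "if true\n  puts 'ok'\n"       # the closing 'end' is missing

puts syntax_error_for(good).inspect   # prints "nil"
puts syntax_error_for(bad)            # prints the parser's complaint
```

Posting the full message from a check like this, rather than a fragment, makes it much easier for others to spot the missing delimiter.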
 
7stud --

7stud said:
title_re = Regexp.new(/<title>(.*)<\/title>/)

While that regex works for www.google.com, in order for the regex to be
more general, the regex should be:

title_re = Regexp.new(/<title>(.*)<\/title>/m)

and then to output the match:

puts md_obj[1].strip()
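
One caveat: the /m flag only lets . match newlines inside the string being tested, so a loop that matches one line at a time still cannot see a title that is split across lines. A minimal sketch of the whole-page approach, assuming the page has already been read into a single string (for example with f.read instead of f.each). The helper name, the sample HTML, the non-greedy .*? and the [^>]* that tolerates attributes on the tag are my additions, not part of the regex above.

```ruby
# Hypothetical helper: match against the whole document at once.
# /m lets "." cross newlines, /i accepts <TITLE>, and ".*?" stops at the
# first closing tag instead of the last one.
TITLE_RE = /<title[^>]*>(.*?)<\/title>/im

def extract_title(html)
  md_obj = TITLE_RE.match(html)
  md_obj && md_obj[1].strip
end

html = "<html><head><title>\n  Example Page\n</title></head></html>"

puts extract_title(html)                              # prints "Example Page"

# The same regex applied line by line finds nothing in this document:
puts html.each_line.any? { |line| TITLE_RE.match(line) }   # prints "false"
```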
 
7stud --

William said:
I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
body[/<title>(.*?)<.title>/i,1]

Nice. I tested your code and it works for me. But my reading of the
docs says that it shouldn't work: new() doesn't open a connection, and
get(), "Gets data from path on the connected-to host." The docs seem
to want you to do something like:

resp_obj = Net::HTTP.get_response('www.google.com', '/index.html')
page = resp_obj.body
 
Joseph Pecoraro

7stud said:
Nice. I tested your code and it works for me. But my reading of the
docs says that it shouldn't work: new() doesn't open a connection, and
get(), "Gets data from path on the connected-to host." The docs seem
to want you to do something like:

resp_obj = Net::HTTP.get_response('www.google.com', '/index.html')
page = resp_obj.body

Reading the Net::HTTP docs here:
http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127

It says that Net::HTTP#get will:
"Send a GET request to the target and return the response as a string"

and Net::HTTP#get_response will:
"Send a GET request to the target and return the response as a
Net::HTTPResponse object"

The #new in this case is optional because both methods are class methods
or instance methods? Someone might be able to clarify this part a
little more. But the examples at that doc URL don't even use
Net::HTTP#new.
 
7stud --

Joseph said:
Reading the Net::HTTP docs here:
http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127

It says that Net::HTTP#get will:
"Send a GET request to the target and return the response as a string"

and Net::HTTP#get_response will:
"Send a GET request to the target and return the response as a
Net::HTTPResponse object"

The #new in this case is optional because both methods are class methods
or instance methods?

According to the docs, Net::HTTP has class methods:

get()
get_response()

and an instance method:

get()

As with all ruby classes, new() creates an instance. Therefore, in the
code example I was wondering about:
puts Net::HTTP.new('www.google.com').get('/').
body[/<title>(.*?)<.title>/i,1]

new() creates an instance, which is being used to call get(), so the
version of get() being called is the instance method. Yet, the docs say
the get() instance method "Gets data from path on the connected-to
host". What connected-to host? The docs on new() say,
"This method does not open the TCP connection."

In addition, the get() version in that code cannot be the class method
version because the class method version returns a String and Strings do
not have a body() method, which is the next method call.
 
7stud --

7stud said:
As with all ruby classes, new() creates an instance. Therefore, in the
code example I was wondering about:
puts Net::HTTP.new('www.google.com').get('/').
body[/<title>(.*?)<.title>/i,1]

new() creates an instance, which is being used to call get(), so the
version of get() being called is the instance method. Yet, the docs say
the get() instance method "Gets data from path on the connected-to
host". What connected-to host? The docs on new() say,
"This method does not open the TCP connection."

As far as I can tell, you should have to call start() on a Net::HTTP
instance in order to open a connection, e.g.:

str = Net::HTTP.new('www.google.com').start().get('/').body
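
A small sketch that may help settle this. As far as I can tell from the net/http source, the instance method get() opens the connection itself when the object has not been started yet, which would explain why the one-liner works; treat that as my reading of the library, not a documented guarantee. The helper name fetch_body is hypothetical, and nothing below touches the network until it is called.

```ruby
require 'net/http'

# new() only builds the object; no TCP connection exists yet.
http = Net::HTTP.new('www.google.com')
puts http.started?   # prints "false"

# Hypothetical helper: the block form of Net::HTTP.start opens the
# connection, yields the started object, and closes it when the block ends.
def fetch_body(host, path = '/')
  Net::HTTP.start(host) do |conn|
    conn.get(path).body
  end
end

# Usage (needs network access):
#   puts fetch_body('www.google.com')[/<title>(.*?)<\/title>/i, 1]
```

The block form keeps the connection open only for as long as it is needed, so there is no separate call to finish().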
 
