Extracting Data from a Webpage

Tj Superfly · Jan 27, 2008

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me "Google" - its
title.

Then also, is there any way that the program could extract the next 5
characters - after a certain phrase that doesn't change on the webpage?

Thanks for your help in advance!

s.ross · Jan 27, 2008

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me "Google" - its
title.

Then also, is there any way that the program could extract the next 5
characters - after a certain phrase that doesn't change on the webpage?

Thanks for your help in advance!

http://code.whytheluckystiff.net/hpricot/

It's a snap.

7stud -- · Jan 27, 2008

Tj said:
Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me "Google" - its
title.

Then also, is there any way that the program could extract the next 5
characters - after a certain phrase that doesn't change on the webpage?

Thanks for your help in advance!

You can do something like this:

require 'open-uri'

url = "http://www.google.com"

# URI.open needs Ruby 2.5+; older versions used plain open(url)
URI.open(url) do |f|
  f.each do |line|
    if md_obj = /<title>(.*)<\/title>/.match(line)
      puts md_obj[1]
    end

    if md_obj = /type=(.{6})/.match(line)
      puts md_obj[1]
    end
  end
end

Ruby also has various html parsing libraries that allow you to search
html documents by tag name, tag position, etc.
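
For what it's worth, the same two extractions can be tried on a plain string, with no network involved. This is only a sketch: extract_title and chars_after are made-up helper names, and the sample HTML is invented for the example.

```ruby
# Hypothetical helpers; the names are illustrative, not from any library.

# Returns the text between <title> and </title>, or nil if there is none.
def extract_title(html)
  md = /<title>(.*?)<\/title>/mi.match(html)
  md && md[1].strip
end

# Returns the n characters that follow the first occurrence of phrase,
# or nil if the phrase is absent or fewer than n characters follow it.
def chars_after(html, phrase, n = 5)
  md = /#{Regexp.escape(phrase)}(.{#{n}})/m.match(html)
  md && md[1]
end

# Invented sample page.
sample = '<html><head><title>Google</title></head>' \
         '<body><input type=hidden name=q></body></html>'

puts extract_title(sample)          # Google
puts chars_after(sample, 'type=')   # hidde
```

Regexp.escape keeps characters such as "." or "?" in the phrase from being read as regex syntax.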

7stud -- · Jan 27, 2008

7stud said:
You can do something like this:

require 'open-uri'

url = "http://www.google.com"

# URI.open needs Ruby 2.5+; older versions used plain open(url)
URI.open(url) do |f|
  f.each do |line|
    if md_obj = /<title>(.*)<\/title>/.match(line)
      puts md_obj[1]
    end

    if md_obj = /type=(.{6})/.match(line)
      puts md_obj[1]
    end
  end
end

This should be more efficient:

require 'open-uri'

url = "http://www.google.com"
title_re = Regexp.new(/<title>(.*)<\/title>/)
text_re = Regexp.new(/type=(.{5})/)

# URI.open needs Ruby 2.5+; older versions used plain open(url)
URI.open(url) do |f|
  f.each do |line|
    if md_obj = title_re.match(line)
      puts md_obj[1]
    end

    if md_obj = text_re.match(line)
      puts md_obj[1]
      break
    end
  end
end

--output:
Google
hidde #first 5 chars of 'hidden'

William James · Jan 27, 2008

Hello everyone.

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

I give it the URL, say www.google.com. It then gives me "Google" - its
title.

"www.google.com"[/(www\.)?(.*)\./,2].capitalize
==>"Google"
"google.com"[/(www\.)?(.*)\./,2].capitalize
==>"Google"

Tj Superfly · Jan 27, 2008

This should be more efficient:

require 'open-uri'

url = "http://www.google.com"
title_re = Regexp.new(/<title>(.*)<\/title>/)
text_re = Regexp.new(/type=(.{5})/)

# URI.open needs Ruby 2.5+; older versions used plain open(url)
URI.open(url) do |f|
  f.each do |line|
    if md_obj = title_re.match(line)
      puts md_obj[1]
    end

    if md_obj = text_re.match(line)
      puts md_obj[1]
      break
    end
  end
end

--output:
Google
hidde #first 5 chars of 'hidden'

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions? I did try the other clip of code posted here, but got
more errors than this one. =/ I'm reading up on that link posted in the
2nd post to see if I can figure any of this out.

Thanks.

7stud -- · Jan 27, 2008

Tj said:
I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

1) Learn some basic ruby?

2) Learn how to post a question on a computer programming forum?

Tj Superfly · Jan 27, 2008

7stud said:
Tj said:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

1) Learn some basic ruby?

2) Learn how to post a question on a computer programming forum?

Anyone else know what the matter is?

7stud -- · Jan 27, 2008

Tj said:
7stud said:

Tj said:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

1) Learn some basic ruby?

2) Learn how to post a question on a computer programming forum?

Anyone else know what the matter is?

How to post a question on a computer programming forum:

1) Post a simple example program that demonstrates your problem.

2) Post the error message in its entirety--not an unintelligible portion
of it.

3) Post your question about the code.

4) Use a descriptive title for your post-- not something like
"URGENT...HELP ME!"

5) Proofread and spell check your post before clicking submit.

fedzor · Jan 27, 2008

7stud said:
7stud said:

Tj said:

I receive this error message when trying this code.

DENTIFIER, expecting $end
endndndreakmd_obj[1]_re.match(line))/title>/)

Any suggestions?

I believe that $end means you're missing some sort of end delimiter,
but NOT 'end'. Check for {} or / / for regexp

Also, if you can, have your editor do an autoformat thing so you can
see where the indentation screws up.
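
One way to test that theory without running the script is to have Ruby compile the source only (ruby -c yourfile.rb does this from the command line). Below is a rough in-process sketch of the same check. It assumes MRI, where RubyVM::InstructionSequence is available, and syntax_ok? is a made-up helper name.

```ruby
# Hypothetical helper: true if the source compiles, false on a syntax error.
# Relies on RubyVM::InstructionSequence, which is specific to MRI (CRuby).
def syntax_ok?(source)
  RubyVM::InstructionSequence.compile(source)
  true
rescue SyntaxError
  false
end

complete     = "title_re = /<title>(.*)<\\/title>/\n"
unterminated = "title_re = /<title>(.*)<\\/title>\n"   # closing / is missing

puts syntax_ok?(complete)       # true
puts syntax_ok?(unterminated)   # false
```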

Marc Heiler · Jan 27, 2008

http://code.whytheluckystiff.net/hpricot/

It's a snap.

I believe hpricot, as fine as it may be, is a little bit overkill for
such a task.

At best a simple task should remain simple, at least as simple as
possible.

7stud -- · Jan 28, 2008

7stud said:
title_re = Regexp.new(/<title>(.*)<\/title>/)

While that regex works for www.google.com, in order for the regex to be
more general, the regex should be:

title_re = Regexp.new(/<title>(.*)<\/title>/m)

and then to output the match:

puts md_obj[1].strip()
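
One caveat, as a sketch: the /m flag only helps if the string being matched actually contains the line breaks. Matching line by line never sees a title that spans several lines, while matching the whole page as one string does. The helper names and the sample page below are invented for the illustration.

```ruby
TITLE_RE = /<title>(.*?)<\/title>/m

# Hypothetical helper: tries each line on its own, as a line-by-line loop does.
def title_by_line(page)
  page.each_line do |line|
    md = TITLE_RE.match(line)
    return md[1].strip if md
  end
  nil
end

# Hypothetical helper: matches against the whole page at once.
def title_whole(page)
  md = TITLE_RE.match(page)
  md && md[1].strip
end

# Invented sample page whose title spans three lines.
page = "<html><head><title>\n  My Page\n</title></head></html>\n"

p title_by_line(page)   # nil
p title_whole(page)     # "My Page"
```

With open-uri, that would mean calling f.read once on the opened page instead of looping with f.each.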

William James · Jan 28, 2008

I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i,1]

7stud -- · Jan 28, 2008

William said:
I was wondering if anyone knew a way to extract the web page title off
of a specific URL that you input into a program?

require 'net/http'
puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i,1]

Nice. I tested your code and it works for me. But my reading of the
docs says that it shouldn't work: new() doesn't open a connection, and
get(), "Gets data from path on the connected-to host." The docs seem
to want you to do something like:

resp_obj = Net::HTTP.get_response('www.google.com',
                                  '/index.html')
page = resp_obj.body

Joseph Pecoraro · Jan 28, 2008

7stud said:
Nice. I tested your code and it works for me. But my reading of the
docs says that it shouldn't work: new() doesn't open a connection, and
get(), "Gets data from path on the connected-to host." The docs seem
to want you to do something like:

resp_obj = Net::HTTP.get_response('www.google.com',
                                  '/index.html')
page = resp_obj.body

Reading the Net::HTTP docs here:
http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127

It says that Net::HTTP#get will:
"Send a GET request to the target and return the response as a string"

and Net::HTTP#get_response will:
"Send a GET request to the target and return the response as a
Net::HTTPResponse object"

The #new in this case is optional because both methods are class methods
or instance methods? Someone might be able to clarify this part a
little more. But the examples at that doc URL don't even use
Net::HTTP#new.

7stud -- · Jan 28, 2008

Joseph said:
Reading the Net::HTTP docs here:
http://ruby-doc.org/stdlib/libdoc/net/http/rdoc/classes/Net/HTTP.html#M001127

It says that Net::HTTP#get will:
"Send a GET request to the target and return the response as a string"

and Net::HTTP#get_response will:
"Send a GET request to the target and return the response as a
Net::HTTPResponse object"

The #new in this case is optional because both methods are class methods
or instance methods?

According to the docs, Net::HTTP has class methods:

get()
get_response()

and an instance method:

get()

As with all ruby classes, new() creates an instance. Therefore, in the
code example I was wondering about:

puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i,1]

new() creates an instance, which is being used to call get(), so the
version of get() being called is the instance method. Yet, the docs say
the get() instance method "Gets data from path on the connected-to
host". What connected-to host? The docs on new() say,
"This method does not open the TCP connection."

In addition, the get() version in that code cannot be the class method
version because the class method version returns a String and Strings do
not have a body() method, which is the next method call.

7stud -- · Jan 28, 2008

7stud said:
As with all ruby classes, new() creates an instance. Therefore, in the
code example I was wondering about:

puts Net::HTTP.new('www.google.com').get('/').
  body[/<title>(.*?)<.title>/i,1]

new() creates an instance, which is being used to call get(), so the
version of get() being called is the instance method. Yet, the docs say
the get() instance method "Gets data from path on the connected-to
host". What connected-to host? The docs on new() say,
"This method does not open the TCP connection."

As far as I can tell, you should have to call start() on a Net::HTTP
instance in order to open a connection, e.g.:

str = Net::HTTP.new('www.google.com').start().get('/').body
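
A small offline sketch that may help settle this. It sends no request; it only inspects the class and a fresh instance.

```ruby
require 'net/http'

http = Net::HTTP.new('www.google.com')

# new() only builds the object; no TCP connection has been opened yet.
puts http.started?                     # false

# get exists both as a class method and as an instance method.
puts Net::HTTP.respond_to?(:get)       # true
puts Net::HTTP.method_defined?(:get)   # true
```

As far as I can tell from the net/http source, the instance-level request methods open a connection themselves (and close it afterwards) when called on an instance that has not been started, which would explain why the one-liner works without start(). Treat that as an assumption to check against your own Ruby version.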
