How to extract URLs from the HTML source of a Google search result

sujeet kumar

Hi,
I want to make a Tk window where you enter an input string; it then
searches for that string on Google and prints the web addresses (HTTP
URLs) of the results in the TkFrame of that window. My program
connects to the net and gets the HTML source through the function "http.get".
Now, from the HTML source, how can I find the URLs of the search results?
Can I do it with a regular expression, or is there another way?
Any suggestions are welcome.
Thanks,
sujeet
 
Marcel Molina Jr.

The URI.extract method from the uri library can extract an array of URIs from
a string:

require 'uri'
URI.extract('My favorite site is http://google.com')
# => ["http://google.com"]

An optional second argument can limit the schemes that it will match against
and return:

URI.extract('Why do people use mailto:[email protected] links?')
# => ["mailto:[email protected]"]
URI.extract('Why do people use mailto:[email protected] links?', 'http')
# => []
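
Applied to the original question, the same call can be pointed at a page's HTML source. The following is only a minimal sketch: the `html` string is a made-up fragment standing in for whatever "http.get" returns, and the `urls` variable name is an assumption for illustration. Note that recent Ruby versions flag URI.extract as obsolete when warnings are enabled, so treat this as a quick approach rather than a robust one.

```ruby
require 'uri'

# Hypothetical HTML fragment standing in for the page returned by http.get.
html = <<~HTML
  <a href="http://example.com/one">One</a>
  <a href="https://example.org/two?x=1">Two</a>
  <a href="mailto:someone@example.com">Mail</a>
HTML

# Keep only http and https URIs; uniq drops links that appear more than once.
urls = URI.extract(html, %w[http https]).uniq

urls.each { |url| puts url }
```

Because URI.extract scans the whole string, it picks up anything that looks like a URI, not only the values of href attributes, so the result may need further filtering.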

marcel
 
Alexey Verkhovsky

Marcel said:
The URI.extract method from the uri library can extract an array of URIs from
a string:

A universal regexp that finds URIs in arbitrary text is a
complicated thing, indeed. Besides, it can produce false positives
(finding things that look like URIs but aren't).

If you are sure that the page is well-formed XHTML (I'm not sure whether
that's the case with Google), you might instead parse it with
REXML and use XPath to retrieve the href attributes of all <a>..</a>
elements, selecting only those that start with "http://" (there may also
be mailto:, ftp:, javascript: calls, etc.).
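
A rough sketch of that approach, assuming the input really is well-formed XHTML (REXML raises a parse error otherwise). The `xhtml` string below is a made-up document, not real Google output, and `http_links` is just an illustrative name.

```ruby
require 'rexml/document'

# Hypothetical, well-formed XHTML standing in for a fetched page.
xhtml = <<~XHTML
  <html>
    <body>
      <a href="http://example.com/one">One</a>
      <a href="/relative/path">Relative</a>
      <a href="mailto:someone@example.com">Mail</a>
      <a href="http://example.org/two">Two</a>
    </body>
  </html>
XHTML

doc = REXML::Document.new(xhtml)

# Collect the href of every <a> element, skip anchors without one,
# and keep only the values that start with "http://".
http_links = REXML::XPath.match(doc, '//a')
                         .map { |a| a.attributes['href'] }
                         .compact
                         .select { |href| href.start_with?('http://') }

http_links.each { |link| puts link }
```

Unlike scanning the raw text, this only looks at link targets, which avoids matching URI-like strings elsewhere in the page.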

Best regards,
Alexey Verkhovsky
 
Eric Hodel

Why not use the Google API?
 
