Some problems with URI.extract?

Nicolas Cavigneaux · May 25, 2004

Hello,

Some time ago I wrote a Ruby program that follows web links and makes
it easy to retrieve interesting files. It works well. To extract the
links from a downloaded web page I use URI.extract, and I've noticed
that it misses a lot of links. In fact, URI.extract doesn't seem to
understand (resolve?) relative links (for example
<a href="../dir/file.pdf">link</a>). Am I wrong? If not, what would
you advise to make sure I retrieve all the relative links?

Thank you and good evening.

Simon Strandgaard · May 25, 2004

I guess you are using Ruby 1.9 from CVS?

I just read in Oniguruma's ChangeLog:
2004/05/25: [bug] (thanks Masahiro Sakai) [ruby-dev:23560]
ruby -ruri -ve 'URI::ABS_URI =~ "http://example.org/Andr\xC3\xA9"'
nested STK_REPEAT type stack can't backtrack repeat_stk[].
add OP_REPEAT_INC_SG and OP_REPEAT_INC_NG_SG.

I have no idea what that problem was, only that it was URI related.

Does it work on Ruby 1.8.1 or 1.8.2?

Robert Klemme · May 25, 2004

I think I remember there was a method URI.join which could join an absolute
URI and a relative one. Or was it URL.join?
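
For reference, a minimal sketch of what that looks like with URI.join from the standard uri library. The page address below is made up for illustration; only the relative link comes from the original question.

```ruby
require 'uri'

# Hypothetical address of the page the link was found on
page = "http://example.org/docs/index.html"
# Relative link taken from the question
link = "../dir/file.pdf"

# URI.join resolves the relative reference against the base
absolute = URI.join(page, link)
puts absolute   # prints http://example.org/dir/file.pdf
```

The result is a URI object, so call to_s on it if a plain string is needed.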

robert

Nicolas Cavigneaux · May 25, 2004

Simon Strandgaard said:
I guess you are using Ruby 1.9 from CVS?

No, I'm using Ruby 1.8.1.

Mark Hubbart · May 25, 2004

IIRC, URI.extract(str) just scans plain text for URIs. So, links in
html would have to be absolute, not relative, i.e.
"http://google.com/help", not just "/help".

To get all the links out of html, you would probably need to create a
regular expression that finds all link-ish html attributes (<a
href="">, <link href="">, <img src="">, etc.), parse them to see what
type of link they are, then construct a full URI based on the page's
original location.
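
To make the point about URI.extract concrete, here is a small check on a made-up string that holds one absolute URI and the relative link from the question. The sample text is an assumption for illustration. Recent Ruby versions mark URI.extract as obsolete, so treat this as a demonstration of the behaviour rather than a recommendation.

```ruby
require 'uri'

# Made-up sample: one absolute URI and one relative href
text = 'See http://example.org/a.pdf or <a href="../dir/file.pdf">this</a>.'

# Restrict the scan to http and https URIs
found = URI.extract(text, ['http', 'https'])
p found

# The relative link has no scheme, so the scan does not report it
puts found.any? { |u| u.include?("file.pdf") }
```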

A quick, incomplete, untested example:

# open-uri is nice
require 'open-uri'

def get_URI_list(uri)
  # download the page at the uri passed
  # (URI.open on Ruby 2.5+; older versions used plain open)
  page_data = URI.open(uri) { |f| f.read }

  # scan it for the contents of html
  # attributes that are usually links;
  # flatten because scan returns one array per match
  uris = page_data.scan(/(?:href|src)="([^"]*)"/).flatten

  # convert relative links to absolute links
  uris.map do |item|
    case item
    when /^\// # it's relative to site root
      "http://" + URI.parse(uri).host + item
    when /^https?:/ # it's absolute
      item
    else # it's relative to the current page
      # merge the two uris here. This is left as an exercise
    end
  end
end
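
One way the remaining branch could be handled is to let URI.join do the merging for every kind of link. This is a minimal sketch, not a full solution: absolute_links is a hypothetical name, the page is assumed to be in a string already, and the regex only approximates real HTML parsing (it looks at double-quoted href and src attributes only).

```ruby
require 'uri'

# Hypothetical helper: collect href/src values from an HTML string and
# resolve each one against the URI of the page it came from.
def absolute_links(html, base_uri)
  links = html.scan(/(?:href|src)\s*=\s*"([^"]*)"/i).flatten
  resolved = links.map do |link|
    begin
      # URI.join handles absolute, root-relative and page-relative links
      URI.join(base_uri, link).to_s
    rescue URI::Error
      nil # skip anything that cannot be parsed as a URI
    end
  end
  resolved.compact
end

html = '<a href="../dir/file.pdf">pdf</a> <img src="/img/a.png">'
puts absolute_links(html, "http://example.org/docs/sub/index.html")
```

Links that fail to parse are dropped silently here; a real crawler would probably want to log them instead.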

HTH,
Mark

Nicolas Cavigneaux · May 27, 2004

Mark Hubbart said:
IIRC, URI.extract(str) just scans plain text for URIs. So, links in html
would have to be absolute, not relative, i.e. "http://google.com/help", not
just "/help".

OK, that's what I was thinking.

Mark Hubbart said:
else # it's relative to the current page
  # merge the two uris here. This is left as an exercise
end

Heh heh ;-) Thank you for your help and for this little exercise.

Bye.
