How to check if a webpage exists

Davide Benini

This is probably trivial, but I have been googling for almost two hours
without finding a viable solution.
Basically, I have a Rails app that uses Hpricot to parse web pages.
There's this line:

page = Hpricot( open(url))

If the url is wrong, or the server is down, obviously I get an
exception.
First I tried my luck with

if page = Hpricot( open(url))
blah blah
end

but this did not work.

Then I started googling like crazy for a method to check if a webpage is
loadable. I only found this thread

http://markmail.org/message/iurqf4ejbndbczqq

I tried the suggested code, it does not work.
Now, I am pretty sure there's a straightforward way of checking whether
a webpage is loadable.
Can you help me?
Thanks in advance,
Davide
 
Todd Benson

Not really an answer, but should point you in the right direction.
Also, I can't test with Hpricot right now due to gem install issues.
But, using open-uri...

require 'open-uri'; begin; open('http://www.www.www') {} rescue '404 error'; end

Todd
 
Axel Etzold

-------- Original Message --------
Date: Thu, 11 Sep 2008 07:32:40 +0900
From: "Todd Benson" <[email protected]>
To: (e-mail address removed)
Subject: Re: How to check if a webpage exists

> Not really an answer, but should point you in the right direction.
> Also, I can't test with Hpricot right now due to gem install issues.
> But, using open-uri...
>
> require 'open-uri'; begin; open('http://www.www.www') {} rescue '404 error'; end
>
> Todd

Dear Davide,

I was just about to suggest the same thing. It works on my Ubuntu machine.

Best regards,

Axel
 
Davide Benini

Thanks folks,
your suggestion works, but I am not able to integrate it with my
existing code; I have some problems understanding how rescue interacts
with code blocks.
Basically, I have a chunk of code that must be executed ONLY IF there is
no exception; if there is an exception, I need to execute another chunk
of code.
I tried a couple of syntaxes:

require 'open-uri';
begin
open('http://www.www.www') {
// code to execute if everything's ok
} rescue '404 error'
// code to execute in case of error
end

I also tried

require 'open-uri';
begin
open('http://www.www.www') {}
rescue '404 error'
// code to execute in case of error
else
// code to execute if everything's ok
end

Also

require 'open-uri';
begin
open('http://www.www.www')
puts "ok"
rescue '404 error'
puts "error"
end

None of this works.
Which is the proper syntax?
Davide
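
For reference, a minimal sketch of the begin/rescue/else form, under a few assumptions. `rescue` takes exception classes (optionally followed by `=> e`), not a string such as '404 error', and the `else` branch runs only when nothing was raised. `check_page` is a hypothetical helper name, and the `opener` argument is there only so the example can be tried without a network connection. `URI.open` is the spelling used by open-uri on Ruby 2.5 and later; on the Ruby 1.8 of this thread it would be plain `open(url)`.

```ruby
require 'open-uri'

# Hypothetical helper: returns :ok or :error depending on whether the
# page could be read. The opener lambda is injectable for offline testing.
def check_page(url, opener = ->(u) { URI.open(u, &:read) })
  begin
    body = opener.call(url)
  rescue StandardError => e
    # Runs only if the code between begin and rescue raised an exception.
    puts "sorry, can't do: #{e.message}"
    :error
  else
    # Runs only if no exception was raised.
    puts "ok, read #{body.length} characters"
    :ok
  end
end
```

Rescuing StandardError is deliberately broad here; a real spider would probably list narrower classes such as OpenURI::HTTPError and SocketError.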
 
Peña, Botp

From: Davide Benini [mailto:(e-mail address removed)]
# ...
# None of this works.
# Which is the proper syntax?

compare,

> require 'open-uri'
=> false
> begin
*   open "http://www.google.com", :proxy=>true
>   p "i'm ok"   #<-- ok codes here
> rescue
>   p "sorry can't do"  #<-- not ok codes here
> end
"i'm ok"
=> nil

> begin
*   open "http://this.does.not.exist.com", :proxy=>true
>   p "i'm ok"
> rescue => e
>   p "sorry can't do"
>   p "error is: #{e}"
> end
"sorry can't do"
"error is: 503 Service Unavailable"
=> nil
 
Davide Benini

Thanks for your super-fast answer :)
Yet, none of this works on my system; to be double sure I copied and
pasted.
In a script, I get "I'm ok" even when a page does not exist; in IRB, I
always get "sorry can't do".
Any suggestion?
Davide
 
Axel Etzold

-------- Original Message --------
Date: Thu, 11 Sep 2008 17:15:39 +0900
From: Davide Benini <[email protected]>
To: (e-mail address removed)
Subject: Re: How to check if a webpage exists

> Thanks for your super-fast answer :)
> Yet, none of this works on my system; to be double sure I copied and
> pasted.
> In a script, I get "I'm ok" even when a page does not exist; in IRB , I
> always get "sorry can't do"
> Any suggestion?
> Davide

Dear Davide,

Hmm... the suggested code works on my system (Ubuntu 8.04, Ruby 1.8.7-p22), both in scripts
and in irb.
How do you enter the code in irb?
Do you type

begin (enter)
line 1 (enter)
rescue (enter)
line 2 (enter)
end (enter) ?

I tried instead

begin ; line 1 ; rescue ; line 2 ; end (enter)

This worked correctly in irb.

Best regards,

Axel
 
Davide Benini

Hi Axel,
I work on Mac OS X Leopard. Ruby works all right; I also have a number of
Rails websites running locally, with no problems so far.
Could you help me with the proper "script" syntax? I am sure the
rationale behind the mechanism is correct, but I ultimately need to
integrate this script into a Rails application, so I need to have it
working in an ordinary .rb file. As I said, I tried this:

require 'open-uri'
begin
open "www.does.not.exist.sdadasdas.com", :proxy=>true
p "i'm ok" #<-- ok codes here
rescue => e
p "sorry can't do"
p "error is: #{e}"
end

Simple as it seems, it does not work, I always end up with "i'm ok". I'm
sure it's some stupid syntactic glitch...
Any suggestion?
Davide
 
Peña, Botp

From: Davide Benini [mailto:(e-mail address removed)]
# I work on Mac Os X Leopard. Ruby works allright, I have also
# a number of
# rails websites running locally, no problems so far.
# Could you help me with the proper "script" sintax; I am sure the
# rationale beyond the mechanism is correct, but I ultimately need to
# integrate this script in a rails application, so I need to have it
# woking in a common .rb file. As I said, tried this

try

# require 'open-uri'
#  begin
#    open "www.does.not.exist.sdadasdas.com", :proxy=>true

replace the above line with

    puts open("www.does.not.exist.sdadasdas.com",:proxy=>true).read

post the output again

#    p "i'm ok"   #<-- ok codes here
#  rescue => e
#   p "sorry can't do"
#   p "error is: #{e}"
# end
 
Peña, Botp

From: Davide Benini [mailto:(e-mail address removed)]
#....
#    open "www.does.not.exist.sdadasdas.com", :proxy=>true

fwiw, mine does not work if i do not qualify the url, ie should be

   open "http://www.does.not.exist.sdadasdas.com", :proxy=>true

note the http://

but your case is weird, since it always works regardless :)
 
Peña, Botp

From: Davide Benini [mailto:(e-mail address removed)]
# > try
# >
# > # require 'open-uri'
# > #  begin
# > #    open "www.does.not.exist.sdadasdas.com", :proxy=>true
# >
# > replace the above line with
# >
# >     puts open("www.does.not.exist.sdadasdas.com",:proxy=>true).read
# >
# > post the output again
# >
# > #    p "i'm ok"   #<-- ok codes here
# > #  rescue => e
# > #   p "sorry can't do"
# > #   p "error is: #{e}"
# > # end
#
# The output is
#
# "sorry can't do"
# "error is: can't convert Hash into String"
#
# So now there is type conflict...

because i just copied your url.
try this,

  puts open("http://www.does.not.exist.sdadasdas.com",:proxy=>true).read
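
A note on the missing http://: open-uri only takes over when the string looks like a URL with a scheme. Otherwise the call falls through to the ordinary file-opening `open`, which does not understand the `:proxy` option, and that would be consistent with the "can't convert Hash into String" error reported in this thread. One way to guard against this is to check the string before fetching. The following is only a sketch; `http_url?` is a hypothetical helper name, not part of any library.

```ruby
require 'uri'

# Hypothetical helper: true only for absolute http:// or https:// URLs
# that include a host name.
def http_url?(string)
  uri = URI.parse(string)
  # URI::HTTPS is a subclass of URI::HTTP, so both schemes pass this check.
  uri.is_a?(URI::HTTP) && !uri.host.nil?
rescue URI::InvalidURIError
  # Strings that cannot be parsed as a URI at all are rejected too.
  false
end
```

A spider could skip (or prepend "http://" to) any entry for which this returns false before calling open.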
 
Davide Benini

Ok folks, I probably see what's wrong.
The latest example works

begin
puts open("http://www.does.not.exist.sdadasdas.com",:proxy=>true).read
p "i'm ok"
rescue => e
p "sorry can't do"
p "error is: #{e}"
end

I get "sorry can't do" "error is: 400 Bad Request".

But if I do this

begin
puts open("http://www.jhkhjhkj.com",:proxy=>true).read
p "i'm ok"
rescue => e
p "sorry can't do"
p "error is: #{e}"
end
I get
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
xmlns="http://www.w3.org/1999/xhtml"><style
type="text/css">body{font-family:Arial, Helvetica, FreeSans,
sans-serif;font-size:16px;color:#333;margin:10px 0 0
0;padding:0}a{color:#333;outline:0}a:hover{text-

etc etc

This is a page that my (!%&%&!) ISP loads when no page is found.
So the problem is that open gets a page from a redirect, right?
The problem is that my Rails application, a kind of spider, is supposed
to load a number of pages; if one server goes down, I don't want it to
get stuck. Moreover, when the app runs online, my hosting server
might react in a different way. Is there a way to make sure the page
loaded is the page I asked for, not an error, a redirect or anything
else?
What do you suggest?
Davide
 
Axel Etzold

-------- Original Message --------
Date: Thu, 11 Sep 2008 18:55:27 +0900
From: Davide Benini <[email protected]>
To: (e-mail address removed)
Subject: Re: How to check if a webpage exists

> [...]
> Is there a way to make sure the page
> loaded is the page I asked for, not an error, a redirect or anything
> else?
> What do you suggest?

Dear Davide,

Glad that it worked this far.
You could use Hpricot's parsing capabilities to check whether the page you loaded

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html
xmlns="http://www.w3.org/1999/xhtml"><style
type="text/css">body{font-family:Arial, Helvetica, FreeSans,
sans-serif;font-size:16px;color:#333;margin:10px 0 0
0;padding:0}a{color:#333;outline:0}a:hover{text-

is the one you asked for, by comparing the address (URL) found in what your request returns with
the address you first gave it.

Best regards,

Axel
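
One possible way to sketch the address comparison, assuming open-uri is doing the fetching: the object open-uri yields responds to `base_uri`, which is the address the content finally came from after any HTTP redirects. The helper names `host_key` and `same_site?` are hypothetical. Note the limitation: if the ISP substitutes its own page at the DNS level rather than through an HTTP redirect, the address may not change at all, and only inspecting the page content would reveal the substitution.

```ruby
require 'uri'

# Hypothetical helper: lower-cased host name without a leading "www.".
def host_key(url)
  URI.parse(url.to_s).host.to_s.downcase.sub(/\Awww\./, '')
end

# Hypothetical helper: true when both addresses point at the same host.
# Returns false when the requested address has no host part at all.
def same_site?(requested, final)
  wanted = host_key(requested)
  !wanted.empty? && wanted == host_key(final)
end

# Intended use with open-uri (needs network access, so left as a comment):
#   URI.open(url) { |page| same_site?(url, page.base_uri) }
```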
 
Davide Benini

> Dear Davide,
> glad that it worked this far.
> You could use Hpricot's parsing capabilities to check whether the page
> you loaded

Good point, Axel. I was so focused on this small chunk of code that I didn't
think I could check the page later, during processing.
At any rate, thanks to all here; now I'm able to deal with the 404
exception. I just have to take care of other problems in other parts of
the app.
Cheers,
Davide
 
Michael Fellinger

You can use something really low level:
http://p.ramaze.net/1960

Note that you can use the status to determine what happened... see the
http rfc for more information on the status codes.
http://www.w3.org/Protocols/rfc2616/rfc2616-sec6.html#sec6.1.1
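
The linked paste is not reproduced in this thread, so the following is not that code. It is only a rough sketch of how the status could be inspected with Ruby's standard Net::HTTP library. `classify` and `page_status` are hypothetical helper names. `Net::HTTP.get_response` does not follow redirects, so a redirect shows up as its own category instead of silently delivering another page.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: map a Net::HTTP response object to a coarse category
# based on its status class (2xx, 3xx, 4xx, 5xx).
def classify(response)
  case response
  when Net::HTTPSuccess     then :ok
  when Net::HTTPRedirection then :redirect
  when Net::HTTPClientError then :client_error
  when Net::HTTPServerError then :server_error
  else :unknown
  end
end

# Hypothetical helper: fetch the URL once and classify the outcome.
# Connection-level failures are reported as :unreachable.
def page_status(url)
  classify(Net::HTTP.get_response(URI.parse(url)))
rescue SocketError, SystemCallError, Timeout::Error
  :unreachable
end
```

The exact status code is still available as `response.code` (a string such as "404") when a finer distinction is needed.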
 
Michael Fellinger

> Hi Michael, the low-level solution you've linked to doesn't seem to work
> at all; I changed the code and tested http://www.google.com and I get
> "what to do when it doesn't work".
> Yours would actually be the solution I like best; does it work on your
> system?

Sorry, I should test last-minute changes...
http://p.ramaze.net/1961
but please read and try to understand the code; it's no use if you're
simply copy&pasting solutions.

^ manveru
 
Davide Benini

> but please read and try to understand the code, it's no use if you're
> simply copy&pasting solutions.

Hi Michael,
sorry if you got the impression I am just copying and pasting. I had got
the gist of your script's rationale, and I tried to modify this and that,
but I am probably too much of a beginner to debug it by myself. At any rate,
thanks for your help so far.

Anyway, for some reason the script does not work yet. Working pages
(like repubblica.it or corriere.it, Italy's most popular newspaper
websites) return a 400 status, and the script fails. Moreover, if I
insert a URL without a www (like my own website,
http://davidebenini.it), the connection doesn't even start.
Why is that? Do you think there's a problem with the request format?
Cheers,
Davide
 
Michael Fellinger

I extended the format, so this should work:
http://p.ramaze.net/1962
 
