open-uri bug

Steve H.

Hello all, I'm using open-uri combined with hpricot to make a basic
web crawler that scrapes for different links that I need. It seems to
be working perfectly, but I have encountered the following bug when
this type of link is encountered:

irb(main):015:0> URI.parse('http://hello.com/a.php?%1')
URI::InvalidURIError: bad URI(is not URI?): http://hello.com/a.php?%1
from c:/ruby/lib/ruby/1.8/uri/common.rb:436:in `split'
from c:/ruby/lib/ruby/1.8/uri/common.rb:485:in `parse'
from (irb):15

Can anyone illuminate why this is a problem? Thanks!
 
Rob Biedenharn

Probably because %1 looks like a partially escaped character. Try:

?%251
Where %25 is an escaped %

-Rob

Rob Biedenharn http://agileconsultingllc.com
(e-mail address removed)
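
To make Rob's point concrete: a percent-escape is a "%" followed by exactly two hex digits, so the "%1" in the query is one digit short. The following is a minimal sketch, not part of the original reply. The helper name malformed_escapes is hypothetical, and the \h (hex digit) regex class assumes Ruby 1.9 or later.

```ruby
require 'uri'

# Hypothetical helper: returns every "%" sequence in str that is NOT a
# complete escape, i.e. a "%" not followed by two hex digits.
def malformed_escapes(str)
  str.scan(/%(?!\h\h)\h?/)
end

p malformed_escapes('http://hello.com/a.php?%1')    # => ["%1"]
p malformed_escapes('http://hello.com/a.php?%251')  # => []

# With the "%" itself escaped as "%25", the URI parses, and the query
# decodes back to the literal text "%1".
query = URI.parse('http://hello.com/a.php?%251').query
p query                                  # => "%251"
p URI.decode_www_form_component(query)   # => "%1"
```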
 
Steve H.

I appreciate the reply. This is a bit unfortunate: I am developing a
tool that has to handle URIs the same way a browser does. While I
realize that this is not a "correct" URI, the browser still fetches the
page without a problem. In some sense, I wish I could mirror the
browser's fetch behavior using the URI module. Anyhow, thank
you for your help!
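
One way to approximate that leniency, as a rough sketch only: try the URI as given, and if the parser rejects it, escape any stray "%" and retry. The helper names escape_stray_percents and lenient_parse are hypothetical, and \h again assumes Ruby 1.9 or later. Note that newer versions of Ruby's uri library may accept the original string without needing the fallback, and that rewriting "%" to "%25" is not guaranteed to produce the same request bytes a browser would send.

```ruby
require 'uri'

# Hypothetical helper: escape every "%" that is not already the start of a
# complete escape (a "%" followed by two hex digits).
def escape_stray_percents(str)
  str.gsub(/%(?!\h\h)/, '%25')
end

# Hypothetical helper: parse the URI as-is, and only if the parser rejects
# it, repair stray "%" characters and try once more.
def lenient_parse(str)
  URI.parse(str)
rescue URI::InvalidURIError
  URI.parse(escape_stray_percents(str))
end

puts escape_stray_percents('http://hello.com/a.php?%1')  # prints http://hello.com/a.php?%251

uri = lenient_parse('http://hello.com/a.php?%1')
puts uri.host  # prints hello.com
```

A URI that is invalid for some other reason (for example, an unescaped space) still raises URI::InvalidURIError from the second attempt, so the caller can decide whether to skip that link.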
 
Siep Korteling

Maybe this helps:

URI.escape('http://hello.com/a.php?%1')

=> "http://hello.com/a.php?%1"

Regards,

Siep
 
Eric Hodel

What about Mechanize?
 
Piyush Ranjan

I too want to know how to handle invalid URIs in Mechanize. Is there any way
to override URL checking?
 
