nokogiri ip ban?


Luis G.

Hi there...

I've been playing with Ruby and Nokogiri to crawl some websites to get
text information, but after a while I realized that some of those
websites block my access while the script is running. From the moment
they block the access, the script keeps running (because I handle the
exception), but it isn't getting what it's supposed to.
After the block, if I try to access the site using the browser, I just
can't, so I guess they block the IP address, right?

I also tried using TOR, like this:

Nokogiri::HTML(open(url, :proxy => 'http://(ip_address):(port)'))

But I still have the same problem: it works at first, but after a
while it stops working.

I could just run the crawler in steps, to avoid making lots of calls to
the website at the same moment, but that's kind of tedious... :)

Has any of you faced the same problem? Does any of you have a solution
for this?

thanks,

Luis
 

Andrea Dallera

Bombing a webserver in the fashion you describe is not advisable in any
way. They're clearly not happy with what you're doing... so either lower
the frequency of the requests or ask them directly about your needs -
they might be willing to let you run your script more often or even give
you the raw data directly.
--

Andrea Dallera
http://github.com/bolthar/freightrain
http://usingimho.wordpress.com
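Lowering the frequency can be as simple as enforcing a minimum interval between requests. A minimal sketch, assuming a fixed delay (the `Throttle` class and the 5-second value are illustrative, not from any library):

```ruby
# Enforces a minimum interval between successive calls to #wait.
class Throttle
  def initialize(interval)
    @interval = interval  # minimum seconds between requests
    @last = nil
  end

  def wait
    if @last
      elapsed = Time.now - @last
      sleep(@interval - elapsed) if elapsed < @interval
    end
    @last = Time.now
  end
end

# Usage sketch (network calls, shown for context only):
#   throttle = Throttle.new(5)
#   urls.each do |url|
#     throttle.wait
#     doc = Nokogiri::HTML(open(url))
#     # ... extract text ...
#   end
```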

On 23/11/2010 12:19, Luis G. wrote:
 

Phillip Gawlowski


Run your crawler in steps (don't crawl the whole site, and only grab
what is new; that's what conditional requests with If-Modified-Since /
Last-Modified are for!), and respect robots.txt.

Otherwise, well, you get what you deserve if you hog a server's CPU
cycles and create a denial-of-service attack (nobody cares whether it
is by accident or by design).
--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.
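Respecting robots.txt can be checked by hand before each fetch. A minimal sketch of a Disallow lookup for the `*` user-agent (real robots.txt parsing also handles Allow rules, wildcards, and Crawl-delay; this is deliberately simplified):

```ruby
# Returns true if `path` is disallowed for "User-agent: *" in the
# given robots.txt body. Simplified: ignores Allow, wildcards,
# and Crawl-delay directives.
def disallowed?(robots_txt, path)
  applies = false
  robots_txt.each_line do |line|
    line = line.sub(/#.*/, '').strip  # drop comments and whitespace
    case line
    when /\AUser-agent:\s*(.+)\z/i
      applies = ($1.strip == '*')
    when /\ADisallow:\s*(\S+)\z/i
      return true if applies && path.start_with?($1)
    end
  end
  false
end
```

Fetching `http://example.com/robots.txt` once per host and caching the result keeps this cheap.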
 

Luis G.

Hey guys... Thanks for your replies.

I thought the program I built was not so heavy for the website I'm
trying to get info from.
The thing is, I'm accessing the website to get information, but I only
access specific pages in that domain, so I'm not really crawling
everything.
I build the URL based on some info I have in my DB, and once I have the
URL, I access it directly and collect the information on that specific
page. What's more, the page I'm accessing has just a few <p> HTML tags,
so there isn't much info to look through. And I'm only accessing the
pages I haven't accessed before (just the new ones).

So, of course I understand that they need to protect the webserver, but
I think my program is not really a threat :D

I'm going to run the script in steps and on different days, like I
thought before and like you told me.

Thanks a lot for your help.

Luis
 

Ammar Ali

One more thought: what are you using for the user-agent? Some sites
block empty or known-to-be-automated user-agents.

Regards,
Ammar
 

Luis G.

Hi Ammar

Yeah, that's one of the reasons I asked this question: I thought we
could solve this issue just by changing the user agent or the headers
or something... like they have here:
http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/

Anyway, right now I'm using an empty user agent, but I did try to
define one, and the result was the same. I tried something like:

Nokogiri::HTML(open(url, "User-Agent" => "Ruby/#{RUBY_VERSION}"))

I also tried the user agents we can use in Mechanize ('Linux Mozilla',
for example), but nothing worked.

Luis
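Headers with open-uri are passed as a hash, as in the snippet above; a fuller browser-like set looks like this (the header values are illustrative assumptions only, with no guarantee any site will accept them):

```ruby
require 'open-uri'

# Example browser-like request headers. The exact values are
# illustrative assumptions, not a recipe for getting unblocked.
BROWSER_HEADERS = {
  "User-Agent"      => "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 " \
                       "(KHTML, like Gecko) Chrome/87.0 Safari/537.36",
  "Accept"          => "text/html,application/xhtml+xml",
  "Accept-Language" => "en-US,en;q=0.9",
}

# Usage sketch (network call, shown for context; on Ruby >= 2.7 prefer
# URI.open over the bare Kernel#open used elsewhere in this thread):
#   doc = Nokogiri::HTML(URI.open(url, BROWSER_HEADERS))
```

Note that once an IP is banned, changing headers usually won't help, which is the point Ammar makes next.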
 

Ammar Ali

It was just a guess. But as you mentioned, your IP is being blocked, so
it's too late to change agents now. You may have been blocked for any
reason, really: frequency of requests, user-agent, or something else
entirely. Usually such blocks are temporary (it could be a dynamic IP),
so you could try again later. But who knows how long it will take, or
whether you will be blocked again.

Andrea's suggestion is probably your best bet: contact the owners of
the site and request access. You might find out why you got blocked and
avoid it in the future.

Regards,
Ammar
 

Luis G.

Actually, I was blocked before, and it lasts around 24 hours or so.
But the thing is, I'm running the crawlers on a test server, not on the
production one. And they are not on the same network, so the IPs are
different :)

Thanks guys.

Luis
 
