screen scraping gotcha

Roedy Green

I used a thread pool to speed up the screen scraping I use to find out
which bookstores carry which books. Then I discovered that some
bookstores were sometimes returning 403 Forbidden codes. I think they do
this if you have more than one request outstanding from a given IP. I
later discovered that the Xenu link checker was getting 403 codes for
links that BrokenLinks (which does one probe at a time) was finding were
200 (OK).

So I think screen scraping/link checking code needs some mechanism to
optionally avoid hitting a site with more than one request at a time, or
perhaps even to pause X seconds between requests.

It might do that with an explicit Semaphore, ordering the requests to
increase the distance between probes to the same site, reducing the pool
size... ??
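
Something along these lines is a minimal sketch of the Semaphore-plus-pause
idea, assuming java.util.concurrent; the class name, method names and the
gap figure are all invented for illustration:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Semaphore;

/**
 * One permit per site, plus a minimum gap between probes to that site.
 * Illustrative sketch only; the gap is whatever the caller configures.
 */
public class SiteThrottle {
    private final Map<String, Semaphore> permits = new ConcurrentHashMap<>();
    private final Map<String, Long> lastHit = new ConcurrentHashMap<>();
    private final long minGapMillis;

    public SiteThrottle(long minGapMillis) {
        this.minGapMillis = minGapMillis;
    }

    /** Call before probing the given host; blocks until it is polite to proceed. */
    public void acquire(String host) throws InterruptedException {
        Semaphore s = permits.computeIfAbsent(host, h -> new Semaphore(1));
        s.acquire();
        long last = lastHit.getOrDefault(host, 0L);
        long wait = last + minGapMillis - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    /** Call after the response (or failure) comes back. */
    public void release(String host) {
        lastHit.put(host, System.currentTimeMillis());
        permits.get(host).release();
    }
}

Worker threads in the pool would call acquire(host) before each probe and
release(host) in a finally block, so at most one request per site is in
flight and successive probes to the same site are spaced apart.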

--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
 
Eric Sosman

> I used a thread pool to speed up the screen scraping I use to find out
> which bookstores carry which books. Then I discovered that some
> bookstores were sometimes returning 403 Forbidden codes. I think they do
> this if you have more than one request outstanding from a given IP. I
> later discovered that the Xenu link checker was getting 403 codes for
> links that BrokenLinks (which does one probe at a time) was finding were
> 200 (OK).
>
> So I think screen scraping/link checking code needs some mechanism to
> optionally avoid hitting a site with more than one request at a time, or
> perhaps even to pause X seconds between requests.
>
> It might do that with an explicit Semaphore, ordering the requests to
> increase the distance between probes to the same site, reducing the pool
> size... ??

I'd suggest making the request scheduling explicit in the data
structures, and not burying it in the locking mechanisms. Maintain
a pool of "requests contemplated" and another of "requests in progress,"
and limit the number of in-progress requests for any one site. When
the in-progress pool completes a site S request, it can fish in the
contemplated pool for another S request, but not for a T request.
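
A rough sketch of that two-pool scheme, with invented names and the
per-site cap set by the caller; a real version would also have to wake
idle workers when completed() frees up a site:

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

/**
 * Explicit request scheduling: a "contemplated" queue per site and a count
 * of in-progress requests per site, capped at maxPerSite.
 */
public class RequestScheduler {
    private final Map<String, Queue<Runnable>> contemplated = new HashMap<>();
    private final Map<String, Integer> inProgress = new HashMap<>();
    private final int maxPerSite;

    public RequestScheduler(int maxPerSite) {
        this.maxPerSite = maxPerSite;
    }

    /** Add a request for the given site to the contemplated pool. */
    public synchronized void submit(String site, Runnable request) {
        contemplated.computeIfAbsent(site, s -> new ArrayDeque<>()).add(request);
    }

    /** Hand a worker the next request it is allowed to start, or null if none. */
    public synchronized Runnable next() {
        for (Map.Entry<String, Queue<Runnable>> e : contemplated.entrySet()) {
            String site = e.getKey();
            if (inProgress.getOrDefault(site, 0) < maxPerSite && !e.getValue().isEmpty()) {
                inProgress.merge(site, 1, Integer::sum);
                return e.getValue().poll();
            }
        }
        return null;
    }

    /** Called by a worker when a request for the given site has completed. */
    public synchronized void completed(String site) {
        inProgress.merge(site, -1, Integer::sum);
    }
}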

If you want to get fancier, you could try to discover each site's
throttling mechanism on the fly, by observing the 403s. But I think
keeping things simple to start with would be better -- after all, you
are only hypothesizing about the nature of the throttles!
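
If someone did want to try that discovery, it could start as something as
crude as a per-site backoff that doubles the delay whenever a 403 comes
back; the figures below are guesses, not anything the sites publish:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Crude adaptive throttle: double a site's delay on every 403,
 * shrink it slowly on success. Purely illustrative.
 */
public class AdaptiveDelay {
    private final Map<String, Long> delayMillis = new ConcurrentHashMap<>();

    public long currentDelay(String site) {
        return delayMillis.getOrDefault(site, 1000L);
    }

    public void record(String site, int httpStatus) {
        long d = currentDelay(site);
        if (httpStatus == 403) {
            delayMillis.put(site, Math.min(d * 2, 10 * 60 * 1000L)); // cap at 10 minutes
        } else {
            delayMillis.put(site, Math.max(d * 3 / 4, 1000L));       // floor at 1 second
        }
    }
}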
 
Roedy Green

> It might do that with an explicit Semaphore, ordering the requests to
> increase the distance between probes to the same site, reducing the pool
> size... ??

I have tried throttling so that requests are separated by 30 seconds;
it is still sending me 403s. Yet when I hit the site with a browser,
instantly all is forgiven.

The stupid buggers don't seem to realise I am trying to HELP them sell
books. If they had half a brain they would give me a SOAP interface
where I could submit a list of ISBNs and they would give me back a
list of booleans telling me which ones they have in stock.
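
The sort of bulk lookup being wished for might look something like this
from the caller's side; the interface is entirely hypothetical, since no
bookstore actually offers it:

import java.util.List;

/**
 * Hypothetical bulk stock-check interface of the sort being wished for.
 * No real bookstore API is being described; the names are invented.
 */
public interface StockCheckService {
    /**
     * @param isbns ISBNs to look up
     * @return one boolean per ISBN, true if the store has it in stock,
     *         in the same order as the input list
     */
    List<Boolean> inStock(List<String> isbns);
}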


Most online stores go to extreme lengths to foil screen scraping. Many
affiliate programs want you to go to their site and spend ten minutes
setting up the HTML just to sell one product.

Allposters.com invented a SOAP interface, but then left out sizes,
formats and prices, and it was not in sync with the web site. Not even
the sizes of the JPGs were correct. I have to start positing malice,
the incompetence is so extreme.

--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
 
Daniel Pitts

> I have tried throttling so that requests are separated by 30 seconds;
> it is still sending me 403s. Yet when I hit the site with a browser,
> instantly all is forgiven.
>
> The stupid buggers don't seem to realise I am trying to HELP them sell
> books. If they had half a brain they would give me a SOAP interface
> where I could submit a list of ISBNs and they would give me back a
> list of booleans telling me which ones they have in stock.
>
> Most online stores go to extreme lengths to foil screen scraping. Many
> affiliate programs want you to go to their site and spend ten minutes
> setting up the HTML just to sell one product.
>
> Allposters.com invented a SOAP interface, but then left out sizes,
> formats and prices, and it was not in sync with the web site. Not even
> the sizes of the JPGs were correct. I have to start positing malice,
> the incompetence is so extreme.

If you're going to violate the TOS and robots.txt, you might as well do
it right:

Make sure you spoof an appropriate "Referer" header and User-Agent
header. Keep track of cookies. If possible, pre-process which requests
you will make, and then build a thread-per-site thread pool, each with a
Queue of requests to make and a randomized delay between each request.
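
A rough sketch of what that could look like using the plain java.net
classes; the header values, the delay range and the class name are all
made up for illustration:

import java.io.IOException;
import java.io.InputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Random;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * One worker thread per site, pulling URLs from that site's own queue,
 * sending browser-like headers and sleeping a random interval between hits.
 */
public class SiteWorker implements Runnable {
    private final BlockingQueue<URL> queue = new LinkedBlockingQueue<>();
    private final Random random = new Random();

    static {
        // Accept and return cookies for all requests made through HttpURLConnection.
        CookieHandler.setDefault(new CookieManager());
    }

    public void enqueue(URL url) {
        queue.add(url);
    }

    @Override
    public void run() {
        try {
            while (true) {
                URL url = queue.take();
                fetch(url);
                // Randomized pause of 5 to 15 seconds between requests to this site.
                Thread.sleep(5000 + random.nextInt(10000));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void fetch(URL url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("User-Agent",
                    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"); // pretend to be a browser
            conn.setRequestProperty("Referer", "http://" + url.getHost() + "/");
            int status = conn.getResponseCode();
            if (status == 200) {
                try (InputStream in = conn.getInputStream()) {
                    // ... read and parse the page from in here ...
                }
            }
            conn.disconnect();
        } catch (IOException e) {
            // log and move on to the next URL
        }
    }
}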

Also, I would recommend supporting cache headers of various sorts
(ETags, Expires, time-to-live, etc.). This reduces load on the remote
server, bandwidth, and processing time.
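
The practical payoff is the conditional GET: when the app has already seen
a page, it can ask the server whether the page has changed and skip the
download when it has not. A small sketch with an invented helper class,
assuming the caller has stored the ETag and Last-Modified values from the
previous fetch:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

/**
 * Conditional GET: re-request a page only if it changed since the last fetch.
 */
public class ConditionalFetch {
    public static boolean changed(URL url, String etag, long lastModified) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        if (etag != null) {
            conn.setRequestProperty("If-None-Match", etag);
        }
        if (lastModified > 0) {
            conn.setIfModifiedSince(lastModified);
        }
        int status = conn.getResponseCode();
        // 304 Not Modified: the cached copy is still good, nothing to download.
        return status != HttpURLConnection.HTTP_NOT_MODIFIED;
    }
}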
 
Roedy Green

> If you're going to violate the TOS and robots.txt, you might as well do
> it right:

TOS = Terms of Service

> Make sure you spoof an appropriate "Referer" header and User-Agent
> header. Keep track of cookies. If possible, pre-process which requests
> you will make, and then build a thread-per-site thread pool, each with a
> Queue of requests to make and a randomized delay between each request.

I figured I did not need a referrer since browsers don't send one. I
can try supporting cookies. I figured they too would not be necessary
since many browsers refuse them.

> Also, I would recommend supporting cache headers of various sorts
> (ETags, Expires, time-to-live, etc.). This reduces load on the remote
> server, bandwidth, and processing time.

It seems to me those are about asking for the same page more than once.
I don't understand what my app would do differently.

I have written Abe Books asking for a computer-friendly interface,
arguing that it would attract more bulk book displayers and bookfinders,
and that the bandwidth would be much lower than with screen scraping.

I once wrote the ASP people about their giant list of PAD sites,
suggesting some things to make it more computer friendly. They
responded that they did not want anyone USING the list, just casually
looking at small parts of it. So their goofy formatting was
deliberately designed to frustrate those trying to import information
from it. I was baffled by the dog-in-a-manger attitude. Why go to all
that work and then not let people use it?

I maintain a similar list, http://mindprod.com/jgloss/hassle.html, but
better pruned of deadwood. I let people view it in HTML or download it
as CSV files.

--
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.
 
Daniel Pitts

> TOS = Terms of Service
>
> I figured I did not need a referrer since browsers don't send one. I
> can try supporting cookies. I figured they too would not be necessary
> since many browsers refuse them.
>
> It seems to me those are about asking for the same page more than once.
> I don't understand what my app would do differently.
>
> I have written Abe Books asking for a computer-friendly interface,
> arguing that it would attract more bulk book displayers and bookfinders,
> and that the bandwidth would be much lower than with screen scraping.
>
> I once wrote the ASP people about their giant list of PAD sites,
> suggesting some things to make it more computer friendly. They
> responded that they did not want anyone USING the list, just casually
> looking at small parts of it. So their goofy formatting was
> deliberately designed to frustrate those trying to import information
> from it. I was baffled by the dog-in-a-manger attitude. Why go to all
> that work and then not let people use it?
>
> I maintain a similar list, http://mindprod.com/jgloss/hassle.html, but
> better pruned of deadwood. I let people view it in HTML or download it
> as CSV files.

Well, in any case, a few hours with Wireshark might help you understand
what is different. The rest of my advice, involving queues and all, is
still potentially worth looking at.
 
