a few questions about scrapy


Nomen Nescio

I've installed scrapy and gotten a basic set-up working, and I have a
few odd questions whose answers I haven't been able to find in the
documentation.


I plan to run it occasionally from the command line or as a cron job,
to scrape new content from a few sites. To avoid duplication, I keep
in memory two sets of longs holding the md5 hashes of the URLs and
files already crawled, and the spider ignores anything it has seen
before. I need to load the sets from two disk files when the scrapy
job starts, and save them back to disk when it ends. Are there hooks
or something similar for start-up and shut-down tasks?
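For concreteness, this is roughly what those start-up and shut-down steps look like as plain Python today (the file name and example URL are just placeholders):

```python
import hashlib
import os
import pickle
import tempfile

def url_hash(url):
    # md5 of the URL, stored as a long integer to keep the sets compact
    return int(hashlib.md5(url.encode("utf-8")).hexdigest(), 16)

def load_seen(path):
    # start-up task: load the previously crawled hashes, or start empty
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return set()

def save_seen(path, seen):
    # shut-down task: write the hashes back out to disk
    with open(path, "wb") as f:
        pickle.dump(seen, f)

# quick round-trip check (temp file just for the demo)
path = os.path.join(tempfile.mkdtemp(), "seen_urls.pkl")
seen = load_seen(path)                      # empty on the first run
seen.add(url_hash("http://example.com/a"))  # pretend we crawled this
save_seen(path, seen)
assert url_hash("http://example.com/a") in load_seen(path)
```

I'd like to hang `load_seen` and `save_seen` off whatever scrapy calls when a crawl starts and finishes, rather than doing it by hand around the job.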

How can I insert a random delay between consecutive HTTP GET requests?
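If there is no built-in setting for this, I assume it would amount to something like the sketch below between requests (the bounds are made-up examples), but I'd rather use scrapy's own mechanism if one exists:

```python
import random
import time

def polite_pause(low=1.0, high=3.0):
    # sleep for a random interval so requests aren't evenly spaced
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Blocking the whole process with `time.sleep` seems crude given that scrapy is asynchronous, which is why I'm asking about the proper way.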

Is there any way to set the proxy configuration in my Python code, or
do I have to set the environment variables http_proxy and https_proxy
before running scrapy?

thanks
