Screen Scraping Advice


Charles Pareto

I work for Cisco Systems in San Jose, CA. I proposed a project to perform
a screen scrape/spider hack to go out and look for websites with the
Cisco name in their domain names (e.g. usedcisco.com, ciscoequipment.com,
etc.) and see if those companies are selling Cisco equipment. I want to
look for specific products (e.g. WIC-1T, NM-4E, WS-2950-24) on these
websites and see if they are being sold for under 60% of their MSRP. We
are trying to track down companies that are selling counterfeit
equipment. So I started by downloading the DNS list of all domain names
so I could read through it and extract all domain names with Cisco in
them. Once I do that I want to go to each page and search/scrape for these
products, but I don't really know the best approach to take. Can anyone
give me advice? Should I just do keyword searches for those 20+
products? Or is there a better approach?
 

John Joyce

Charles Pareto said:
Can anyone give me advice? Should I just do keyword searches for those
20+ products? Or is there a better approach?
Doesn't sound like much scraping, just searching text for a string.
You could even do a lot of that work with Google.
Or just download each page and search it for a string, then write a data
file of your own that records which line the string was found on.
Scraping is really for getting data out of other sites by using the DOM
structure they have to pull out (for example) the weather report.
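
Something along these lines (just a sketch; the domain list file, the
product list, and the output file name are placeholders):

require "open-uri"

PRODUCTS = %w[WIC-1T NM-4E WS-2950-24]   # your 20+ part numbers go here

File.open("hits.txt", "w") do |out|
  File.readlines("cisco_domains.txt").each do |domain|
    domain = domain.strip
    begin
      lines = open("http://#{domain}/").read.split("\n")
    rescue StandardError
      next   # unreachable site, skip it
    end
    lines.each_with_index do |line, i|
      PRODUCTS.each do |product|
        out.puts "#{domain}: line #{i + 1}: #{product}" if line.include?(product)
      end
    end
  end
end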
 

Chuck Dawit

John said:
Doesn't sound like much scraping, just searching text for a string.
You could even do a lot of that work with Google.
Or just download each page and search it for a string, then write a data
file of your own that records which line the string was found on.
Scraping is really for getting data out of other sites by using the DOM
structure they have to pull out (for example) the weather report.


Well, I disagree. Once I have all the websites with Cisco in their domain
names and I look through them, there are lots of pages that won't show me
info unless I do a search within the site itself (e.g. usedcisco.com).
To search for specific items on such a site I would have to use the
search bar located on its page to search for, say, "WIC-1T" and then
check whether the price is below a specific amount for that item.
 

Konrad Meyer

Quoth Chuck Dawit:
Well, I disagree. Once I have all the websites with Cisco in their domain
names and I look through them, there are lots of pages that won't show me
info unless I do a search within the site itself (e.g. usedcisco.com).
To search for specific items on such a site I would have to use the
search bar located on its page to search for, say, "WIC-1T" and then
check whether the price is below a specific amount for that item.

Do a search on froogle for "cisco productname" with the max price set at
60% MSRP. Should turn up a few hits.

HTH,
--
Konrad Meyer <[email protected]> http://konrad.sobertillnoon.com/

 

John Joyce

Chuck Dawit said:
Well, I disagree. Once I have all the websites with Cisco in their domain
names and I look through them, there are lots of pages that won't show me
info unless I do a search within the site itself (e.g. usedcisco.com).
To search for specific items on such a site I would have to use the
search bar located on its page to search for, say, "WIC-1T" and then
check whether the price is below a specific amount for that item.

What I mean is, scraping usually relies on the document's structure
in some way. Without looking at the structure that a given site uses
(or a given page, if it isn't a templated, dynamically generated page)
there is no way to know what corresponds to what. Page structure is
pretty arbitrary. Presentation and structure don't necessarily
correspond well, or in a way you could guess.
Ironically, the better their web designers, the easier it will be.

But if you are talking about searching a dynamically generated site,
you still have to find out whether it has a search mechanism and what it
calls the form fields and submit buttons. The names in HTML can be
arbitrary, especially if they use graphic buttons.

If you have a long list of products to search for, you will still save
yourself some work, but scraping involves some visual inspection of
pages and page source to get things going. Be aware that their
sysadmin may spot you doing a big blast of searches all at once and
block you from the site. If they check their logs and see that
somebody is searching for all the Cisco stuff in an automated fashion,
they might just block you anyway, whether or not they are legit
themselves. Many sysadmins don't like bots searching their
databases! They might see it as searching for exploits.
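
If it helps, here's a rough sketch (using the mechanize gem; the URL is
just a placeholder) that prints what a given site calls its forms,
fields, and buttons before you try to automate anything:

require "rubygems"
require "mechanize"   # on older versions the class is WWW::Mechanize

page = Mechanize.new.get("http://www.example.com/")

page.forms.each do |form|
  puts "form #{form.name.inspect} -> #{form.action}"
  form.fields.each  { |field|  puts "  field:  #{field.name}" }
  form.buttons.each { |button| puts "  button: #{button.name}" }
end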
 

brabuhr

Charles Pareto said:
Can anyone give me advice? Should I just do keyword searches for those
20+ products? Or is there a better approach?

If someone knows of a super library that can recognize and interact
with arbitrary search forms, I would love to see it :)

My first suggestion would be to write a simple script using Mechanize
to connect to the homepage of each site in an input list and check for
any forms. Bin the sites into three groups: no forms, at least one
form matching the regex /search/i, and at least one form but none that
match. Then start by focusing on the ones which appear to have some
sort of search form (which may be a small or a large subset :).
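
As a starting point, something like this (a sketch assuming the
mechanize gem and a plain-text file of domains, one per line):

require "rubygems"
require "mechanize"   # on older versions the class is WWW::Mechanize

no_forms, search_forms, other_forms = [], [], []
agent = Mechanize.new

File.readlines("cisco_domains.txt").each do |domain|
  domain = domain.strip
  begin
    page = agent.get("http://#{domain}/")
  rescue StandardError
    next   # unreachable or broken site, skip it
  end

  if page.forms.empty?
    no_forms << domain
  elsif page.forms.any? { |f| f.name.to_s =~ /search/i ||
                              f.fields.any? { |fld| fld.name.to_s =~ /search|\Aq\z/i } }
    search_forms << domain
  else
    other_forms << domain
  end
end

puts "no forms: #{no_forms.size}, search-like: #{search_forms.size}, other: #{other_forms.size}"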
 

franco

Charles Pareto said:
Can anyone give me advice? Should I just do keyword searches for those
20+ products? Or is there a better approach?

Hpricot (http://code.whytheluckystiff.net/hpricot/) is a great screen
scraping library for Ruby.
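
For example, a quick sketch (the URL and the selector are placeholders;
the real markup differs per site):

require "rubygems"
require "hpricot"
require "open-uri"

doc = Hpricot(open("http://www.example.com/catalog.html"))

# look at every table cell and print the ones that mention a product code
(doc/"td").each do |cell|
  text = cell.inner_text.strip
  puts text if text =~ /WIC-1T/i
end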

Scraping might not be the best approach here, because each site/page uses
a different layout, so the same scrape recipe probably won't work for
another page.

You could scrape froogle (Google Products?) or some other aggregate
consumer sales site instead: it will have one interface and probably a
lot of data. You might also want to see if there are web services for
froogle; those are usually better than scraping.
 

Glenn Gillen

Charles Pareto said:
Can anyone give me advice? Should I just do keyword searches for those
20+ products? Or is there a better approach?

I'm slightly biased, but scrubyt should be able to do most of the
remaining heavy lifting for you:

http://scrubyt.org/

Glenn
 

brabuhr

Glenn Gillen said:
I'm slightly biased, but scrubyt should be able to do most of the
remaining heavy lifting for you:

http://scrubyt.org/

On that note:

require "rubygems"
require "scrubyt"

froogle_data = Scrubyt::Extractor.define do
  # search froogle for the product code
  fetch "http://www.google.com/products"
  fill_textfield "q", "WIC-1T"
  submit

  # example-based pattern: scrubyt generalizes from these sample values
  # to the other records on the page
  info do
    product "WIC-1T"
    vendor "NEW2U Hardware from ..."
    price "$40.00"
  end
  next_page "Next", :limit => 10
end

puts froogle_data.to_xml

(tons of improvement needed, but):

<root>
  <info>
    <product>WIC-1T</product>
    <vendor>NEW2U Hardware from ...</vendor>
    <price>$40.00</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>ATS Computer Systems...</vendor>
    <price>$353.95</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>eBay</vendor>
    <price>$49.95</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>eBay</vendor>
    <price>$149.99</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>PCsForEveryone.com</vendor>
    <price>$337.07</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>COL - Computer Onlin...</vendor>
    <price>$149.00</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>eCOST.com</vendor>
    <price>$297.14</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>eBay</vendor>
    <price>$45.00</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>ATACOM</vendor>
    <price>$291.95</price>
  </info>
  <info>
    <product>WIC-1T</product>
    <vendor>Express IT Options</vendor>
    <price>$216.44</price>
  </info>
</root>
 

Glenn Gillen

brabuhr said:
On that note:
<snip>
(tons of improvement needed, but):
<snip>

It's by no means a silver bullet, but it could very well get you 80% of
the way there. Set up a basic learning extractor that looks for fairly
generic terms you know will exist on the domains you want (say a model
number and a dollar sign?), have it loop over the URLs with products
on them, export the learning extractor to a production extractor, and
then tweak the sites that aren't giving you the exact results you want.
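
As a very rough illustration of that generic idea (plain Ruby rather
than scrubyt; the URL and the MSRP figures are made up for the example):
scan each fetched page for a product code followed closely by a dollar
amount, and flag anything under 60% of MSRP.

require "open-uri"

MSRP = { "WIC-1T" => 400.00, "NM-4E" => 1800.00 }   # hypothetical list prices

def suspicious_prices(url)
  html = open(url).read
  hits = []
  MSRP.each do |model, msrp|
    # the product code, then a dollar amount within the next ~200 characters
    html.scan(/#{Regexp.escape(model)}.{0,200}?\$([\d,]+(?:\.\d\d)?)/m) do |price,|
      price = price.delete(",").to_f
      hits << [model, price] if price < 0.6 * msrp
    end
  end
  hits
end

# e.g. suspicious_prices("http://www.example.com/catalog.html")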

Or, make life easier if you can and let froogle put it all into a
single format for you.

Best of luck,

Glenn
 
