scraping web pages for cisco products


Chuck Dawit

I submitted a post a few days ago about scraping the web for Cisco
products. I didn't receive that much input so I thought I would ask
again. Here are the requirements. I have a list of 2000 urls that all
have Cisco in their domain names.
(ex. http://www.soldbycisco.net
http://www.ciscoindia.net
http://www.ciscobootcamp.net
http://www.cisco-guy.net

and I want to scrape through them to determine which websites are
selling new Cisco products; I'm actually looking for around 20 or so
products (ex. WIC-1T, NM-4E, WS-G2950-24). One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But then I really don't know how to proceed after that. Does
anyone have a different/better approach? Any help would be appreciated.
 

Konrad Meyer

Quoth Glen Holcomb:

I don't remember who but someone suggested using Froogle and parsing that
output. Froogle and a few other sites like Pricewatch might be a far less
complicated approach; you won't find all of them, but then again I don't
think you can possibly find everything anyway.

--
"Hey brother Christian with your high and mighty errand, Your actions speak
so loud, I can't hear a word you're saying."

-Greg Graffin (Bad Religion)

That was me. Seems to me you shouldn't parse froogle so much as just use it.
Writing a script is a lot more work and won't get you what you want; froogle
will.

--
Konrad Meyer <[email protected]> http://konrad.sobertillnoon.com/

 

Chuck Dawit

Konrad said:
Quoth Glen Holcomb:

That was me. Seems to me you shouldn't parse froogle so much as just use it.
Writing a script is a lot more work and won't get you what you want; froogle
will.

But see, I need to use only the list that I have with Cisco in the domain
name (ex. usedcisco.com, ciscoequipment.com). Can froogle look up
website names like the ones I have?
 

Konrad Meyer

Quoth Chuck Dawit:

But see, I need to use only the list that I have with Cisco in the domain
name (ex. usedcisco.com, ciscoequipment.com). Can froogle look up
website names like the ones I have?

Assuming it uses a similar interface to google (I don't know much about it),
yes, "site:usedcisco.com" etc.

Why do you need the list? Just search for anything below 60% MSRP, and ANY
website selling counterfeit cisco devices should come up.

--
Konrad Meyer <[email protected]> http://konrad.sobertillnoon.com/

 

Chuck Dawit

Glen said:
Why is the domain important if you are looking for fraudulent equipment
based on selling price? I don't think you can search by url; I don't see
why anyone looking for a specific product would need to do that.

--
"Hey brother Christian with your high and mighty errand, Your actions speak
so loud, I can't hear a word you're saying."

-Greg Graffin (Bad Religion)



I'm looking for copyright infringement on Cisco's name too. So I'm not only
looking for those companies that are selling counterfeit Cisco equipment
but also those who are infringing on Cisco's name.
 

brabuhr

One idea I was given was to
split the pages into ones with forms and those without forms. Those
without forms probably won't have anything for sale, so I can eliminate
those. But then I really don't know how to proceed after that.

Here's a naive implementation of binning by forms:
$ cat sites
www.cnn.com
www.usedcisco.com
www.rubyforge.org
slashdot.org
technocrat.net
bk.com

$ cat firstbin.rb
#!/usr/bin/env ruby

require 'rubygems'
require 'mechanize'

agent = WWW::Mechanize.new

sites = File.readlines("sites")
bin1 = []  # sites with a search form
bin2 = []  # sites with other forms
bin3 = []  # sites with no forms at all

sites.each do |site|
  site.chomp!

  page = agent.get "http://#{site}"
  forms = page.forms
  search_forms = forms.select{|f|
    (f.name and f.name.match /search/i) or
    (f.action and f.action.to_s.match /search/i)
  }

  if search_forms.size > 0
    bin1 << site
  elsif forms.size > 0
    bin2 << site
  else
    bin3 << site
  end
end

p bin1
p bin2
p bin3

$ ruby firstbin.rb
["www.cnn.com", "www.rubyforge.org", "slashdot.org"]
["www.usedcisco.com", "technocrat.net"]
["bk.com"]
 

Chuck Dawit

With this method, do I need to know the name of the form to use it? With
mechanize I thought you had to look at the form name first before you
could use it?
 

brabuhr

With this method do I need to know the name of the form to use it? With
mechanize I thought you had to look at the form name first before you
could use it?

It helps to know some way to distinguish the form you're looking for
from the other forms on the page. It would be possible to iterate
through all the forms on a page, entering some text into the text
fields in the form and submitting them; but most of the time the
script would probably be in either the wrong form or the wrong field
in the right form (and, of course, there are other issues, e.g. forms
that require multiple fields to be edited). I don't see any way to
avoid customizing the code for each site (though, if you get a good
framework built, the effort per site should decrease?).
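The per-site customization described above could be sketched as a small rules table plus one generic lookup: each new site only needs a config entry saying which form and field to use. Everything here (SITE_RULES, find_search_form, and the Form/Field structs) is an illustrative stand-in, not the mechanize API:

```ruby
# Illustrative stand-ins for mechanize's form/field objects.
Field = Struct.new(:name, :value)
Form  = Struct.new(:name, :action, :fields)

# Per-site overrides: which form to pick and which field holds the query.
SITE_RULES = {
  "www.usedcisco.com" => { form: /products/i, field: "q" },
}
DEFAULT_RULE = { form: /search/i, field: "query" }

def find_search_form(forms, site)
  rule = SITE_RULES.fetch(site, DEFAULT_RULE)
  # Match the rule's pattern against either the form name or its action.
  form = forms.find { |f| f.name =~ rule[:form] || f.action =~ rule[:form] }
  return nil unless form
  # Prefer the configured field name; fall back to the first field.
  field = form.fields.find { |fld| fld.name == rule[:field] } || form.fields.first
  [form, field]
end

forms = [
  Form.new("login",    "/login",  [Field.new("user", nil)]),
  Form.new("products", "/browse", [Field.new("q", nil)]),
]

form, field = find_search_form(forms, "www.usedcisco.com")
field.value = "WIC-1T"
puts "#{form.action}?#{field.name}=#{field.value}"  # /browse?q=WIC-1T
```

With real mechanize you'd substitute page.forms for the structs; the point is that the generic lookup stays fixed and only SITE_RULES grows per site.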
 

Chuck Dawit

unknown said:
It helps to know some way to distinguish the form you're looking for
from the other forms on the page. It would be possible to iterate
through all the forms on a page, entering some text into the text
fields in the form and submitting them; but most of the time the
script would probably be in either the wrong form or the wrong field
in the right form (and, of course, there are other issues, e.g. forms
that require multiple fields to be edited). I don't see any way to
avoid customizing the code for each site (though, if you get a good
framework built, the effort per site should decrease?).

I agree, but I have around 2000 sites to look at and I can't look at each
and every form; that would take way too long. Do you think a better
approach would be to use a search engine's API to search for the products
on each site? I've never used any search engine API. If I know the
website name, the product name, and a price I want, can I use those
parameters in the search to find results?
 

Brad Phelan

Chuck said:
I agree, but I have around 2000 sites to look at and I can't look at each
and every form; that would take way too long. Do you think a better
approach would be to use a search engine's API to search for the products
on each site? I've never used any search engine API. If I know the
website name, the product name, and a price I want, can I use those
parameters in the search to find results?


This query seems to work

site:solecentral.com.au OR site:xtargets.com AND crocs

I advertise my brother's e-commerce site on my site and they both contain
the same keyword "crocs". Google returns all the pages from my site and
his site that contain the word "crocs". However, I am not sure how well
the query scales, as I think Google truncates the search string after
some length, so adding 2000 sites to the query string might break it.

Not sure if the same query trick also works in froogle as well as
vanilla google.

Hope this is somewhat helpful.
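One way around the truncation concern is to batch the site: operators into several shorter queries instead of OR-ing all 2000 sites at once. This is a sketch; MAX_QUERY_LEN = 256 is a guessed limit, not a documented Google value:

```ruby
MAX_QUERY_LEN = 256  # assumed cap on query length, not a documented value

# Split sites into "site:a OR site:b ... product" queries, each kept
# under max_len. A single oversized site still gets its own query.
def batch_queries(sites, product, max_len = MAX_QUERY_LEN)
  batches = []
  current = []
  sites.each do |site|
    candidate = (current + ["site:#{site}"]).join(" OR ") + " #{product}"
    if candidate.length > max_len && !current.empty?
      # Adding this site would overflow: flush the current batch first.
      batches << current.join(" OR ") + " #{product}"
      current = []
    end
    current << "site:#{site}"
  end
  batches << current.join(" OR ") + " #{product}" unless current.empty?
  batches
end

sites = ["usedcisco.com", "ciscoequipment.com", "soldbycisco.net"]
batch_queries(sites, "WIC-1T").each { |q| puts q }
# site:usedcisco.com OR site:ciscoequipment.com OR site:soldbycisco.net WIC-1T
```

Each resulting string could then be fed to a search API one product at a time, which keeps every query short regardless of how long the site list grows.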
 

Chuck Dawit

unknown said:
Here's a naive implementation of binning by forms:

  page = agent.get "http://#{site}"
  forms = page.forms
  search_forms = forms.select{|f|
    (f.name and f.name.match /search/i) or
    (f.action and f.action.to_s.match /search/i)
  }

  if search_forms.size > 0
    bin1 << site
  elsif forms.size > 0
    bin2 << site
  else
    bin3 << site
  end
end

I'm checking the size of the forms as in the code above, but when it
gets to the 13th url the script just exits. Does anyone know
why? How can I run a check on this?
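The usual cause of a mid-list die-off like this is an unrescued exception from agent.get (a timeout, a DNS failure, or a non-200 response will all raise, and one bad site then kills the whole loop). A minimal sketch of the rescue-and-continue pattern follows; fetch is a stub standing in for agent.get so the sketch runs offline, and bad.example.com is an invented host name:

```ruby
# Stub for agent.get "http://#{site}"; one site simulates a dead host.
def fetch(site)
  raise "connection refused" if site == "bad.example.com"
  "<html>ok</html>"
end

sites  = ["www.usedcisco.com", "bad.example.com", "www.cnn.com"]
pages  = {}
errors = {}

sites.each do |site|
  begin
    pages[site] = fetch(site)
  rescue => e
    errors[site] = e.message  # record the failure and keep going
  end
end

p pages.keys   # ["www.usedcisco.com", "www.cnn.com"]
p errors       # {"bad.example.com"=>"connection refused"}
```

In the real script you would wrap the agent.get call the same way; printing the errors hash afterwards would show exactly which url number 13 is and why it failed.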
 

Todd Benson

I submitted a post a few days ago about scraping the web for Cisco
products. I didn't receive that much input so I thought I would ask
again. Here are the requirements. I have a list of 2000 urls that all
have Cisco in their domain names.
(ex. http://www.soldbycisco.net
http://www.ciscoindia.net
http://www.ciscobootcamp.net
http://www.cisco-guy.net

I suspect that if Cisco has a problem with counterfeit products that
hurt their long-term bottom line, it would most certainly come from
web sites that do not have the word cisco in the DNS name.

You should have asked about scraping for some more generic term, maybe?

There are basically two things that bother me with your question.

1. There is something fundamentally wrong with using an open source
product to protect the integrity of a select few relatively expensive
products.

2. An employee of Cisco would have no problem securing funds for a
proposal that was delivered at a hardware level (unless Cisco is
having some monetary problems I'm not aware of). If you don't know
what I'm talking about, then I'll shut up.

Todd
 
