Ken Fine
I would like some guidance regarding a "content scanner" I'm trying to
build. This ASP widget will automatically scan remote web sites for certain
kinds of content using a screen scraping component and simple pattern
matching. The widget will generate reports about what it found and where.
Ideally, I would like the widget to follow all of the http:// links on the
remote page for one level, and scan the child pages for certain kinds of
content. I'm trying to figure out the best way to do this.
Here's what I'm thinking:
1) Scan known URL, make string of page content, embed that in a variable
strPageContent
2) Examine strPageContent for search term, generate report
3) Use a function to strip out everything from strPageContent except a list
of valid URLs
4) Use another function to remove all duplicate URLs from modified
strPageContent
5) Move strPageContent to an array
6) Loop through all items in the array, screen scraping each URL and
testing it for the search term. Repeat as necessary.
7) Repeat as necessary for other search terms
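To sketch the logic of the steps above in a language-agnostic way (Python here purely for illustration; in classic ASP the fetch would be something like MSXML2.ServerXMLHTTP and the pattern matching a VBScript RegExp object, and names like extract_links and scan are placeholders of my own invention):

```python
import re
from urllib.parse import urljoin

# Simple pattern matching, as described: a regex for absolute http:// links,
# not a full HTML parser.
LINK_RE = re.compile(r'href\s*=\s*["\']?(http://[^"\'\s>]+)', re.IGNORECASE)

def extract_links(page_content, base_url):
    """Steps 3-5: strip everything except valid http:// URLs,
    remove duplicates, and return them as an array (list)."""
    seen = []
    for url in LINK_RE.findall(page_content):
        absolute = urljoin(base_url, url)
        if absolute not in seen:
            seen.append(absolute)
    return seen

def scan(fetch, start_url, search_terms):
    """Steps 1-2 and 6-7: fetch the known URL, test it for each search
    term, then follow its links one level deep and test the child pages.
    `fetch` is a callable taking a URL and returning the page content as
    a string, so the screen-scraping component stays pluggable."""
    report = []
    page = fetch(start_url)
    for term in search_terms:
        if term.lower() in page.lower():
            report.append((start_url, term))
    for child_url in extract_links(page, start_url):
        child = fetch(child_url)
        for term in search_terms:
            if term.lower() in child.lower():
                report.append((child_url, term))
    return report
```

A quick usage example with canned pages standing in for the scraper:

```python
pages = {
    "http://example.com/": '<a href="http://example.com/a">child</a> widget',
    "http://example.com/a": "scanner content here",
}
print(scan(pages.__getitem__, "http://example.com/", ["scanner"]))
```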
I think this will probably work. However, I can't escape the nagging feeling
that either a) someone's already done this far more elegantly, or b) the
functionality may be baked into ASP or ASP.NET, or available as an add-on.
Any pointers or good ideas out there?
Thanks.