Design guidance needed: traversing links in ASP


Ken Fine

I would like some guidance regarding a "content scanner" I'm trying to
build. This ASP widget will automatically scan remote web sites for certain
kinds of content using a screen scraping component and simple pattern
matching. The widget will generate reports about what it found and where.

Ideally, I would like the widget to follow all of the http:// links on the
remote page for one level, and scan the child pages for certain kinds of
content. I'm trying to figure out the best way to do this.

Here's what I'm thinking:

1) Scan a known URL, read the page content into a string, and store it in a
variable, strPageContent
2) Examine strPageContent for search term, generate report
3) Use a function to strip out everything from strPageContent except a list
of valid URLs
4) Use another function to remove all duplicate URLs from modified
strPageContent
5) Move strPageContent to an array
6) Loop through all items in the array, screen scraping each URL and testing
it for the search term.
7) Repeat as necessary for other search terms

I think this will probably work. However, I can't escape the nagging feeling
that either a) someone's already done this far more elegantly, or b) the
functionality may be baked into ASP or ASP.NET, or available as an add-on.

Any pointers or good ideas out there?

Thanks.
 

David Morgan

Your lack of responses to this post is probably down to the fact that you
have written a functional specification, almost pseudocode, rather than
posting details of an ASP problem.

What is preventing you from actually getting started with this?

I have inserted some keywords into your original comments to look up on
MSDN, Google or ASPFAQ. They relate to classic ASP, as this is not a .NET
forum.

HTH

Ken Fine said:
I would like some guidance regarding a "content scanner" I'm trying to
build. This ASP widget will automatically scan remote web sites for certain
kinds of content using a screen scraping component and simple pattern
matching. The widget will generate reports about what it found and where.

Ideally, I would like the widget to follow all of the http:// links on the
remote page for one level, and scan the child pages for certain kinds of
content. I'm trying to figure out the best way to do this.

Here's what I'm thinking:

1) Scan a known URL, read the page content into a string, and store it in a
variable, strPageContent

MSXML2.ServerXMLHTTP
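
For example, fetching the page might look something like this (a bare-bones
sketch with no error handling, and the URL is just a placeholder):

Dim objHTTP, strPageContent
Set objHTTP = Server.CreateObject("MSXML2.ServerXMLHTTP")
objHTTP.Open "GET", "http://www.example.com/", False   ' synchronous request
objHTTP.Send
strPageContent = objHTTP.ResponseText   ' the page source as one string
Set objHTTP = Nothing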

2) Examine strPageContent for search term, generate report

ResponseText, InStr
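
Testing the returned text could be as simple as this (assuming a variable
strSearchTerm that holds the term you are scanning for):

If InStr(1, strPageContent, strSearchTerm, vbTextCompare) > 0 Then
    ' Case-insensitive match found; note it for the report
    Response.Write "Found """ & strSearchTerm & """ in this page.<br>"
End If
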
3) Use a function to strip out everything from strPageContent except a list
of valid URLs

Regular Expressions
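
The VBScript RegExp object will pull the links out for you. A rough sketch
(the pattern is deliberately crude and will not catch every legal URL):

Dim objRegExp, colMatches, objMatch
Set objRegExp = New RegExp
objRegExp.Pattern = "http://[^""'<>\s]+"   ' stop at quotes, brackets, whitespace
objRegExp.Global = True                    ' find all matches, not just the first
objRegExp.IgnoreCase = True
Set colMatches = objRegExp.Execute(strPageContent)
For Each objMatch In colMatches
    Response.Write objMatch.Value & "<br>"
Next
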
4) Use another function to remove all duplicate URLs from modified
strPageContent

Deduplicate the array (i.e. swap this step with 5: build the array first,
then remove the duplicates)

5) Move strPageContent to an array

Split(strPageContent, "http://")
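
However you build the list (Split on "http://" or the Matches collection
from the RegExp sketch above), a Scripting.Dictionary is an easy way to drop
the duplicates, and its Keys array gives you something to loop over for
step 6. Continuing from the RegExp example:

Dim objDict, arrURLs, strURL
Set objDict = Server.CreateObject("Scripting.Dictionary")
objDict.CompareMode = vbTextCompare   ' compare URLs case-insensitively
For Each objMatch In colMatches
    If Not objDict.Exists(objMatch.Value) Then
        objDict.Add objMatch.Value, True   ' the key is the URL; the value is unused
    End If
Next
arrURLs = objDict.Keys   ' a zero-based array of unique URLs

For Each strURL In arrURLs
    ' Fetch strURL with ServerXMLHTTP as in step 1, then test the
    ' returned text with InStr as in step 2
Next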
 

Ken Fine

David,

Thanks much for your helpful reply. I posted because I didn't really know
whether my specification was actually functional, or whether it duplicated
functionality already baked into ASP/ASP.NET.

And although I might seem to have some idea of what I'm talking about,
I've never actually done many of these things. I've never built an array
or looped through it, for instance, even though I think I understand why
people make them and what they're useful for.

I'll look around at your links and get going with this; sounds like you
think it's doable.
 
