crawling the net...

ask josephsen

Hi NG

I'm making a program to crawl the internet. It works by retrieving all the
links in a page, downloading each linked page, and then retrieving all the
links again. (If there are better ways, I'd like to hear them.)
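The loop described above is essentially a breadth-first crawl with a seen-set so the same page is not fetched twice. A minimal sketch in Python (used here for illustration; the `fetch_links` callback is hypothetical and stands in for "download the page and extract its links"):

```python
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, fetch_links, limit=100):
    """Breadth-first crawl: visit pages, collect their links, repeat."""
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        page = queue.popleft()
        for href in fetch_links(page):       # hypothetical: fetch + extract hrefs
            url = urljoin(page, href)        # resolve relative links against the page
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return seen
```

The seen-set doubles as the result and as the guard against re-fetching, which also keeps the crawl from looping forever on pages that link to each other.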

My problem is relative links (like "../../wohoo.asp"). What is the smartest
way to get the full URL (http://www.xyz.com/wohoo.asp)? Do I have to parse
the relative link in relation to the URL where it was found and then
concatenate them? Does anyone know how other search engines/crawlers walk
the net?
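For what it's worth, this base-plus-relative resolution is a standardized algorithm (RFC 3986), and URL libraries implement it directly, so there is no need to hand-parse the "../" segments. A minimal sketch using Python's `urllib.parse.urljoin` for illustration (in .NET, the `Uri(baseUri, relativeUri)` constructor does the same job):

```python
from urllib.parse import urljoin

# Resolve a relative link against the URL of the page it was found on.
base = "http://www.xyz.com/a/b/page.asp"
print(urljoin(base, "../../wohoo.asp"))  # http://www.xyz.com/wohoo.asp
print(urljoin(base, "other.asp"))        # http://www.xyz.com/a/b/other.asp
print(urljoin(base, "/top.asp"))         # http://www.xyz.com/top.asp
```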


Thanks :)

../ask
 
JKop

ask josephsen posted:
Hi NG

I'm making a program to crawl the internet. It works by retrieving all the
links in a page, downloading each linked page, and then retrieving all the
links again. (If there are better ways, I'd like to hear them.)

My problem is relative links (like "../../wohoo.asp"). What is the smartest
way to get the full URL (http://www.xyz.com/wohoo.asp)? Do I have to parse
the relative link in relation to the URL where it was found and then
concatenate them? Does anyone know how other search engines/crawlers walk
the net?


Thanks :)

../ask

You should have posted this on:

alt.sports.gymnastics


It would've been more on-topic _there_.

-JKop
 
Morten Wennevik

Hi Ask,

You could try using Path.GetFullPath, which collapses /../ and /./ and
returns the proper path. However, it insists on prepending the application
path, so you will need to do something like

string newUrl =
Path.GetFullPath(url).Substring(Application.StartupPath.Length + 1);

It will switch the / to \ though. Oh, and remove the http:// from the url
first.
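The collapsing Morten describes can be sketched outside .NET as well. Here `posixpath.normpath` is an illustrative stand-in for `Path.GetFullPath` (without the application-path and backslash quirks), with the scheme stripped off first as he suggests:

```python
import posixpath

url = "http://www.xyz.com/a/b/../../wohoo.asp"
scheme, rest = url.split("://", 1)                  # remove the http:// first
clean = scheme + "://" + posixpath.normpath(rest)   # collapse ../ and ./
print(clean)  # http://www.xyz.com/wohoo.asp
```

Stripping the scheme matters because the "//" in "http://" would otherwise be collapsed by the path normalizer along with everything else.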

There are plenty of web crawlers; just do a web search on "web crawler" and
"web bot".


Happy coding!
Morten Wennevik [C# MVP]
 
mortb

I'm not developing web crawlers, but a quick thought of mine:

string link = "../../wohoo.asp";
string thisPageURL = "http://www.xyz.com/wohoo.asp";
string[] linkParts = System.Text.RegularExpressions.Regex.Split(link,
@"\x2E\x2E/"); // split on ../
string[] URLParts = System.Text.RegularExpressions.Regex.Split(thisPageURL,
"/");

linkParts.Length - 1 will now be the wanted number of "../" "directory
recursions", and the last element will be the wanted page. The URL to the
new page can then be concatenated from the URLParts array, excluding that
many elements from the end, plus the last element of linkParts.
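mortb's counting idea can be made concrete. A small sketch in Python (the `resolve` helper is hypothetical): count the "../" prefixes, then drop that many path segments plus the page name from the page's URL before appending the target page:

```python
def resolve(page_url: str, link: str) -> str:
    """Resolve a '../'-style relative link against the URL it was found on."""
    link_parts = link.split("../")        # ['', '', 'wohoo.asp'] for "../../wohoo.asp"
    ups = len(link_parts) - 1             # number of "../" directory recursions
    url_parts = page_url.split("/")       # ['http:', '', 'www.xyz.com', 'a', 'b', 'page.asp']
    # drop the page name plus one segment per "../", then append the target page
    base = url_parts[: len(url_parts) - 1 - ups]
    return "/".join(base + [link_parts[-1]])

print(resolve("http://www.xyz.com/a/b/page.asp", "../../wohoo.asp"))
# http://www.xyz.com/wohoo.asp
```

Note this only handles leading "../" segments; mixed forms like "./x" or "a/../b" still need the full RFC 3986 algorithm that URL libraries provide.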

Just a quick shot at a solution...

/mortb
 
Christopher Benson-Manica

ask josephsen said:
I'm making a program to crawl the internet. It works by retrieving all the
links in a page, downloading each linked page, and then retrieving all the
links again. (If there are better ways, I'd like to hear them.)

(You could look at how wget is implemented. Or, better, just USE wget.)

Your post is off-topic for comp.lang.c++. Please visit

http://www.slack.net/~shiva/welcome.txt
http://www.parashift.com/c++-faq-lite/

for posting guidelines and frequently asked questions. Thank you.
 
