crawling the net...

ask josephsen

Hi NG

I'm making a program to crawl the internet. It works by retrieving all the
links in a page, downloading each linked page, and then retrieving all the
links again. (If there are better ways, I'd like to hear them.)
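The loop described above is essentially a breadth-first crawl with a seen-set so the same page is not fetched twice. A minimal sketch in Python (used here for illustration; the `fetch_links` callback is hypothetical and stands in for "download the page and extract its links"):

```python
from collections import deque
from urllib.parse import urljoin

def crawl(start_url, fetch_links, limit=100):
    """Breadth-first crawl: visit pages, collect their links, repeat."""
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < limit:
        page = queue.popleft()
        for href in fetch_links(page):       # hypothetical: fetch + extract hrefs
            url = urljoin(page, href)        # resolve relative links against the page
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return seen
```

The seen-set doubles as the result and as the guard against re-fetching, which also keeps the crawl from looping forever on pages that link to each other.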

My problem is relative links (like "../../wohoo.asp"). What is the smartest
way to get the full URL (http://www.xyz.com/wohoo.asp)? Do I have to parse
the relative link in relation to the URL where it was found and then
concatenate them? Does anyone know how other search engines/crawlers walk
the net?
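For what it's worth, this base-plus-relative resolution is a standardized algorithm (RFC 3986), and URL libraries implement it directly, so there is no need to hand-parse the "../" segments. A minimal sketch using Python's `urllib.parse.urljoin` for illustration (in .NET, the `Uri(baseUri, relativeUri)` constructor does the same job):

```python
from urllib.parse import urljoin

# Resolve a relative link against the URL of the page it was found on.
base = "http://www.xyz.com/a/b/page.asp"
print(urljoin(base, "../../wohoo.asp"))  # http://www.xyz.com/wohoo.asp
print(urljoin(base, "other.asp"))        # http://www.xyz.com/a/b/other.asp
print(urljoin(base, "/top.asp"))         # http://www.xyz.com/top.asp
```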


Thanks :)

../ask
 
JKop

ask josephsen posted:
Hi NG

I'm making a program to crawl the internet. It works by retrieving all the
links in a page, downloading each linked page, and then retrieving all the
links again. (If there are better ways, I'd like to hear them.)

My problem is relative links (like "../../wohoo.asp"). What is the smartest
way to get the full URL (http://www.xyz.com/wohoo.asp)? Do I have to parse
the relative link in relation to the URL where it was found and then
concatenate them? Does anyone know how other search engines/crawlers walk
the net?


Thanks :)

../ask

You should have posted this on:

alt.sports.gymnastics


It would've been more on-topic _there_.

-JKop
 
Morten Wennevik

Hi Ask,

You could try using Path.GetFullPath, which collapses /../ and /./ and
returns the proper path. However, it insists on prepending the application
path, so you will need to do something like

string newUrl =
Path.GetFullPath(url).Substring(Application.StartupPath.Length + 1);

It will switch the / to \ though. Oh, and remove the http:// from the url
first.
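The collapsing Morten describes can be sketched outside .NET as well. Here `posixpath.normpath` is an illustrative stand-in for `Path.GetFullPath` (without the application-path and backslash quirks), with the scheme stripped off first as he suggests:

```python
import posixpath

url = "http://www.xyz.com/a/b/../../wohoo.asp"
scheme, rest = url.split("://", 1)                  # remove the http:// first
clean = scheme + "://" + posixpath.normpath(rest)   # collapse ../ and ./
print(clean)  # http://www.xyz.com/wohoo.asp
```

Stripping the scheme matters because the "//" in "http://" would otherwise be collapsed by the path normalizer along with everything else.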

There are plenty of web crawlers; just do a web search on "web crawler" and
"web bot".


Happy coding!
Morten Wennevik [C# MVP]
 
mortb

I'm not developing web crawlers, but a quick thought of mine:

string link = "../../wohoo.asp";
string thisPageURL = "http://www.xyz.com/wohoo.asp";
string[] linkParts = System.Text.RegularExpressions.Regex.Split(link,
@"\x2E\x2E/"); // split on ../
string[] URLParts = System.Text.RegularExpressions.Regex.Split(thisPageURL,
"/");

linkParts.Length - 1 will now be the wanted number of "../" "directory
recursions", and the last element will be the wanted page. The URL to the
new page can then be concatenated from the URLParts array, excluding that
many elements from the end, plus the last element of linkParts.
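mortb's counting idea can be made concrete. A small sketch in Python (the `resolve` helper is hypothetical): count the "../" prefixes, then drop that many path segments plus the page name from the page's URL before appending the target page:

```python
def resolve(page_url: str, link: str) -> str:
    """Resolve a '../'-style relative link against the URL it was found on."""
    link_parts = link.split("../")        # ['', '', 'wohoo.asp'] for "../../wohoo.asp"
    ups = len(link_parts) - 1             # number of "../" directory recursions
    url_parts = page_url.split("/")       # ['http:', '', 'www.xyz.com', 'a', 'b', 'page.asp']
    # drop the page name plus one segment per "../", then append the target page
    base = url_parts[: len(url_parts) - 1 - ups]
    return "/".join(base + [link_parts[-1]])

print(resolve("http://www.xyz.com/a/b/page.asp", "../../wohoo.asp"))
# http://www.xyz.com/wohoo.asp
```

Note this only handles leading "../" segments; mixed forms like "./x" or "a/../b" still need the full RFC 3986 algorithm that URL libraries provide.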

Just a quick shot at a solution...

/mortb
 
Christopher Benson-Manica

ask josephsen said:
I'm making a program to crawl the internet. It works by retrieving all the
links in a page, downloading each linked page, and then retrieving all the
links again. (If there are better ways, I'd like to hear them.)

(You could look at how wget is implemented. Or, better, just USE wget.)

Your post is off-topic for comp.lang.c++. Please visit

http://www.slack.net/~shiva/welcome.txt
http://www.parashift.com/c++-faq-lite/

for posting guidelines and frequently asked questions. Thank you.
 
