crawling the net...

Discussion in 'C++' started by ask josephsen, Apr 29, 2004.

  1. Hi NG

    I'm making a program to crawl the internet. It works by retrieving all links
    in a page, downloading the page of each link and again retrieving all the
    links. (If there are better ways, I'd like to hear them.)

    My problem is relative links (like "../../wohoo.asp"). What is the smartest
    way to get the full URL (http://www.xyz.com/wohoo.asp)? Do I have to parse
    the relative link in relation to the URL where the relative link was found
    and then concatenate it? Does anyone know how other search engines/crawlers
    walk the net?


    Thanks :)

    ../ask
     
    ask josephsen, Apr 29, 2004
    #1
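
The question above (turning "../../wohoo.asp" into a full URL) is what RFC 3986 calls resolving a relative reference against a base URL. As a minimal sketch in Python, the standard library's urllib.parse.urljoin does this; the page URL "http://www.xyz.com/a/b/page.asp" and the helper name resolve_link are assumptions made up for the example, not something from the thread.

```python
from urllib.parse import urljoin


def resolve_link(page_url: str, href: str) -> str:
    """Resolve href (as found on page_url) into an absolute URL."""
    return urljoin(page_url, href)


# Assumed example page; the original post does not say where the link was found.
page = "http://www.xyz.com/a/b/page.asp"

print(resolve_link(page, "../../wohoo.asp"))       # http://www.xyz.com/wohoo.asp
print(resolve_link(page, "other.asp"))             # http://www.xyz.com/a/b/other.asp
print(resolve_link(page, "/top.asp"))              # http://www.xyz.com/top.asp
print(resolve_link(page, "http://example.org/x"))  # http://example.org/x
```

Links that are already absolute pass through unchanged, so a crawler can run every extracted href through the same call.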

  2. JKop Guest

    ask josephsen posted:

    > Hi NG
    >
    > I'm making a program to crawl the internet. It works by retrieving all
    > links in a page, downloading the page of each link and again retrieving
    > all the links. (If there are better ways, I'd like to hear them.)
    >
    > My problem is relative links (like "../../wohoo.asp"). What is the
    > smartest way to get the full URL (http://www.xyz.com/wohoo.asp)? Do I
    > have to parse the relative link in relation to the URL where the
    > relative link was found and then concatenate it? Does anyone know how
    > other search engines/crawlers walk the net?
    >
    >
    > Thanks :)
    >
    > ./ask


    You should have posted this on:

    alt.sports.gymnastics


    It would've been more on-topic _there_.

    -JKop
     
    JKop, Apr 29, 2004
    #2

  3. Hi Ask,

    You could try using Path.GetFullPath, which collapses /../ and /./ and
    returns the proper path. However, it insists on prepending the application
    path, so you will need to do something like

    string newUrl =
        Path.GetFullPath(url).Substring(Application.StartupPath.Length + 1);

    It will switch the / to \ though. Oh, and remove the http:// from the URL
    first.

    There are plenty of web crawlers; just do a web search on "web crawler" and
    "web bot".


    Happy coding!
    Morten Wennevik [C# MVP]
     
    Morten Wennevik, Apr 29, 2004
    #3
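
The dot-collapsing idea in the post above can also be sketched outside .NET. This Python version is only an illustration, assuming the relative link has already been appended to the page's directory: it applies posixpath.normpath to the path part of the URL alone, so no local directory is prepended and the forward slashes stay as they are. The helper name collapse_dots is made up for the example.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def collapse_dots(url: str) -> str:
    """Collapse "/./" and "/../" in the path part of an absolute URL."""
    parts = urlsplit(url)
    if not parts.path:
        return url  # nothing to collapse, e.g. "http://www.xyz.com"
    path = posixpath.normpath(parts.path)
    # normpath drops a trailing slash; put it back so directory URLs survive.
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    return urlunsplit(parts._replace(path=path))


print(collapse_dots("http://www.xyz.com/a/b/../../wohoo.asp"))  # http://www.xyz.com/wohoo.asp
```

Scheme, host and query string are left untouched; only the path is normalised.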
  4. mortb Guest

    I'm not developing web crawlers, but a quick thought of mine is

    string link = "../../wohoo.asp";
    string thisPageURL = "http://www.xyz.com/wohoo.asp";
    // split on "../" (\x2E is a literal dot in the regex)
    string[] linkParts = System.Text.RegularExpressions.Regex.Split(link,
        @"\x2E\x2E/");
    string[] URLParts = System.Text.RegularExpressions.Regex.Split(thisPageURL,
        "/");

    linkParts.Length - 1 is now the number of "../" steps, i.e. how many
    directory levels to go up, and the last element of linkParts is the wanted
    page. The URL of the new page is then concatenated from the URLParts array,
    excluding its trailing elements (the current page's file name and one
    directory per "../"), followed by the last element of linkParts.

    Just a quick shot at a solution...

    /mortb
     
    mortb, Apr 29, 2004
    #4
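
The counting approach from the post above might look like this in Python. It is a rough sketch under the same assumptions as the post: the link consists of leading "../" steps followed by a path, and the page URL is absolute. The function name resolve_by_counting and the example page URL are hypothetical.

```python
def resolve_by_counting(page_url: str, link: str) -> str:
    """Resolve a "../"-style link by counting how many levels to go up."""
    ups = 0
    while link.startswith("../"):
        link = link[3:]  # strip one "../"
        ups += 1

    scheme, rest = page_url.split("://", 1)
    segments = rest.split("/")      # [host, dir, ..., file]
    host = segments[0]
    dirs = segments[1:-1]           # directories only, without the file name
    # Drop one directory per "../", but never climb above the site root.
    kept = dirs[:max(len(dirs) - ups, 0)]
    return scheme + "://" + "/".join([host] + kept + [link])


print(resolve_by_counting("http://www.xyz.com/a/b/page.asp", "../../wohoo.asp"))
# http://www.xyz.com/wohoo.asp
```

This ignores "./" segments, query strings and links starting with "/", so it is narrower than a general URL resolver.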
  5. ask josephsen <jaj(((a)))oticon.dk> spoke thus:

    > I'm making a program to crawl the internet. It works by retrieving all links
    > in a page, downloading the page of each link and again retrieving all the
    > links. (If there are better ways, I'd like to hear them.)


    (You could look at how wget is implemented. Or, better, just USE wget.)

    Your post is off-topic for comp.lang.c++. Please visit

    http://www.slack.net/~shiva/welcome.txt
    http://www.parashift.com/c++-faq-lite/

    for posting guidelines and frequently asked questions. Thank you.

    --
    Christopher Benson-Manica | I *should* know what I'm talking about - if I
    ataru(at)cyberspace.org | don't, I need to know. Flames welcome.
     
    Christopher Benson-Manica, Apr 29, 2004
    #5
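
For readers who want to see the loop from the original post spelled out, here is a minimal breadth-first sketch in Python. It keeps a set of already-seen URLs so pages that link to each other are not fetched forever. The "web" is a hypothetical in-memory dictionary (FAKE_WEB) standing in for real downloads and HTML parsing, and the names extract_links and crawl are made up for the example.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin

# Hypothetical stand-in for the network: page URL -> hrefs found on that page.
FAKE_WEB = {
    "http://www.xyz.com/index.asp": ["a/b/page.asp", "/about.asp"],
    "http://www.xyz.com/a/b/page.asp": ["../../wohoo.asp", "../../index.asp"],
    "http://www.xyz.com/about.asp": ["index.asp#top"],
    "http://www.xyz.com/wohoo.asp": [],
}


def extract_links(page_url):
    """Placeholder for 'download the page and pull out its hrefs'."""
    return FAKE_WEB.get(page_url, [])


def crawl(start_url, max_pages=100):
    """Visit pages breadth-first and return the URLs in visiting order."""
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        for href in extract_links(url):
            # Make the link absolute and drop any "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


for visited in crawl("http://www.xyz.com/index.asp"):
    print(visited)
```

A real crawler would also need politeness rules (robots.txt, rate limiting) and error handling, which this sketch leaves out.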