What's not used anymore?

Discussion in 'HTML' started by Travis Newbury, Jun 16, 2005.

  1. Does anyone know of a program that can crawl a website and tell what
    files are not used any more?

    The servers are running on IIS

    Thanks

    --
    -=tn=-
     
    Travis Newbury, Jun 16, 2005
    #1

  2. In article <>,
    "Travis Newbury" <> wrote:

    > Does anyone know of a program that can crawl a website and tell what
    > files are not used any more?
    >
    > The servers are running on IIS


    Short answer: no, I don't.

    Hand-wavy answer: Are you talking static HTML files? Or image files? In
    that case, I'd be inclined to trawl the server logs, run a find on the
    web root to get a list of all files, and do a diff. I know it's not
    quite that easy on a Windows box perhaps, but that'd be the basic idea.

    If you're talking about some sort of server-side scripts, it might be
    possible to do the same thing, and also grep around to see which scripts
    are included in which.
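
    A rough sketch of that logs-vs-filesystem diff, assuming Unix-ish tools;
    the webroot, the filenames and the requested-URL list are all invented
    for the demo (on a real box the URL list would be harvested from the
    server logs first):

```shell
# Toy demo of the logs-vs-filesystem diff; every path here is invented.
mkdir -p webroot
touch webroot/index.html webroot/old.html
printf '/index.html\n' > accessed.txt        # URLs pulled from the logs
sort -u accessed.txt > accessed.sorted.txt
# list everything under the webroot as server-relative paths
( cd webroot && find . -type f | sed 's|^\.||' ) | sort > allfiles.txt
# lines only in allfiles.txt = on disk but never requested
comm -13 accessed.sorted.txt allfiles.txt    # -> /old.html
```

    On Windows/IIS the same idea would mean parsing the W3C-format logs for
    the cs-uri-stem field instead of a plain text file of URLs.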

    --
    Joel.
     
    Joel Shepherd, Jun 17, 2005
    #2

  3. Travis Newbury

    Toby Inkster Guest

    Travis Newbury wrote:

    > Does anyone know of a program that can crawl a website and tell what
    > files are not used any more?


    Aren't you reading AWW?

    If your site is entirely static, try using wget or similar to crawl the
    site and create a mirror. That way you'll know which files *are* still
    being used and can infer the ones which aren't.
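
    One way that mirror-and-compare might look, as a sketch only: the wget
    line is commented out and the mirror directory is faked so the
    comparison step runs anywhere; the hostname, webroot and filenames are
    invented for the demo:

```shell
# Real crawl step would be something like:
# wget --mirror --no-parent http://example.com/ -P mirror/
mkdir -p mirror/example.com webroot
touch mirror/example.com/index.html webroot/index.html webroot/orphan.pdf
( cd mirror/example.com && find . -type f ) | sort > reachable.txt
( cd webroot && find . -type f ) | sort > onserver.txt
# on the server but never reached by the crawl
comm -13 reachable.txt onserver.txt          # -> ./orphan.pdf
```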

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, Jun 17, 2005
    #3
  4. Joel Shepherd wrote:
    > Short answer: no, I don't.
    > Hand-wavy answer: Are you talking static HTML files? Or image files? In
    > that case, I'd be inclined to trawl the server logs...


    Yeah, we are about to write our own home-grown crawler that will do
    what we need; I was just hoping there was something out there that was
    already written. The HTML is pretty clean and well maintained, so we
    are not as worried about that; it is mostly the image files, PDFs and
    the like.

    Rather than use the logs, as some of the pages may not get accessed
    but once a quarter, we are walking all the HTML and CSS files looking
    for every instance of the extensions, then doing a diff on the files
    themselves.
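
    That extension-walking plan might be roughed out like this (the file
    names, the extension list and the regex are all invented for the demo;
    a real run would point at the actual webroot):

```shell
mkdir -p site
printf '<img src="logo.gif"><a href="guide.pdf">' > site/index.html
touch site/logo.gif site/guide.pdf site/unused.jpg
# every .gif/.jpg/.png/.pdf name mentioned in the markup or stylesheets
grep -rhoE --include='*.html' --include='*.css' \
    '[A-Za-z0-9_./-]+\.(gif|jpe?g|png|pdf)' site | sort -u > referenced.txt
# every such asset actually on disk
( cd site && find . -type f \( -name '*.gif' -o -name '*.jpg' -o -name '*.pdf' \) \
    | sed 's|^\./||' ) | sort > assets.txt
# referenced nowhere -> candidate for deletion
comm -13 referenced.txt assets.txt           # -> unused.jpg
```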

    Thanks

    --
    -=tn=-
     
    Travis Newbury, Jun 17, 2005
    #4
  5. Travis Newbury

    Andy Dingley Guest

    On 16 Jun 2005 11:28:50 -0700, "Travis Newbury"
    <> wrote:

    >Does anyone know of a program that can crawl a website and tell what
    >files are not used any more?


    What's "not used"? No one reading it lately? Or no longer linked to
    the main site?
     
    Andy Dingley, Jun 17, 2005
    #5
  6. Travis Newbury

    data64 Guest

    "Travis Newbury" <> wrote in
    news::

    > Does anyone know of a program that can crawl a website and tell what
    > files are not used any more?
    >
    > The servers are running on IIS
    >
    > Thanks
    >


    We did something similar using perl, essentially comparing the files
    indexed by our search engine with the files in the webserver directory.
    Being static files, this was fairly simple.

    If you are looking for a spider to crawl things, and don't mind using perl
    there's Merlyn's article on a simple spider
    http://www.stonehenge.com/merlyn/WebTechniques/col07.html

    The swish-e open source search engine ships with a spider that you could
    use to return a list of files for your site and another for your
    filesystem. You would have to modify it to only return the name rather
    than the entire document in your case.

    http://swish-e.org/docs/spider.html
    data64
     
    data64, Jun 17, 2005
    #6
  7. Andy Dingley wrote:
    > >Does anyone know of a program that can crawl a website and tell what
    > >files are not used any more?

    > What's "Not used" ? No-one reading it lately? Or no longer linked to
    > the main site?


    Not linked.

    --
    -=tn=-
     
    Travis Newbury, Jun 17, 2005
    #7
  8. Travis Newbury

    Mitja Guest

    On Thu, 16 Jun 2005 20:28:50 +0200, Travis Newbury
    <> wrote:

    > Does anyone know of a program that can crawl a website and tell what
    > files are not used any more?


    Obviously, just by crawling the site (i.e. following links) you can only
    tell which files ARE in use (if you disregard files that may be
    dynamically referenced by scripts).

    Try Xenu. Its primary intended use is checking for broken links, but it
    can also crawl a website, then crawl the server using FTP and finally
    compare the two structures to find redundant files. You can easily get,
    install and configure a simple free FTP server just for this purpose
    (not as much work as it sounds).

    > The servers are running on IIS


    Servers, plural? That may be less convenient... Don't know how Xenu
    handles that; play with it :)
     
    Mitja, Jun 17, 2005
    #8