What's not used anymore?

Travis Newbury

Does anyone know of a program that can crawl a website and tell what
files are not used any more?

The servers are running on IIS

Thanks
 

Joel Shepherd

Travis Newbury said:
Does anyone know of a program that can crawl a website and tell what
files are not used any more?

The servers are running on IIS

Short answer: no, I don't.

Hand-wavy answer: Are you talking static HTML files? Or image files? In
that case, I'd be inclined to trawl the server logs, run a find on the
web root to get a list of all files, and do a diff. It's perhaps not
quite that easy on a Windows box, but that'd be the basic idea.
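
For instance, a rough sketch along those lines (assuming a Unix-like
toolchain, e.g. Cygwin on the IIS box, W3C-format logs, and with /logs
and /webroot standing in for your real paths):

    # Unique URI paths actually requested, per the access logs.
    # W3C extended logs put cs-uri-stem in field 5 by default, but
    # check the #Fields: header in your own logs and adjust $5.
    awk '!/^#/ { print $5 }' /logs/ex*.log | sort -u > used.txt

    # Every file under the web root, expressed as a URI path.
    ( cd /webroot && find . -type f | sed 's|^\.||' ) | sort > all.txt

    # Files on disk that were never requested.  IIS URLs are
    # case-insensitive, so lowercase both lists first if they
    # differ only in case.
    comm -13 used.txt all.txt

The obvious caveat is that the logs only cover whatever window you
keep them for.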

If you're talking about some sort of server-side scripts, it might be
possible to do the same thing, and also grep around to see which scripts
are included in which.
 

Toby Inkster

Travis said:
Does anyone know of a program that can crawl a website and tell what
files are not used any more?

Aren't you reading AWW?

If your site is entirely static, try using wget or similar to crawl the
site and create a mirror. That way you'll know which files *are* still
being used and can infer the ones which aren't.
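
Roughly like this (an untested sketch; substitute your own host name,
and note that wget drops the mirror into a directory named after the
host):

    # Fetch everything reachable by following links from the front page.
    wget --mirror --page-requisites --no-parent http://www.example.com/

    # What wget could reach, versus what is actually on the server.
    ( cd www.example.com && find . -type f | sort ) > reachable.txt
    ( cd /webroot && find . -type f | sort ) > all.txt

    # On the server, but not reachable by following links.
    # Caveat: wget saves a request for /dir/ as dir/index.html, so
    # directory default documents may need normalising by hand, and
    # this assumes the site is served straight out of /webroot.
    comm -13 reachable.txt all.txt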
 

Travis Newbury

Joel said:
Short answer: no, I don't.
Hand-wavy answer: Are you talking static HTML files? Or image files? In
that case, I'd be inclined to trawl the server logs...

Yeah, we are about to write our own home-grown crawler that will do
what we need; I was just hoping there was something out there that was
already written. The HTML is pretty clean and well maintained, so we
are not as worried about that: it is mostly the image files, PDFs and
the like.

Rather than use the logs, as some of the pages may not get accessed
but once a quarter, we are walking all the HTML and CSS files looking
for every instance of the extensions, then doing a diff against the
files themselves.
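
A rough sketch of that approach (assuming GNU grep and find, and that
the references appear literally in the markup rather than being built
up by script; /webroot is a placeholder):

    # Bare file names of every .gif/.jpg/.png/.pdf referenced in the
    # HTML and CSS.  Excluding '/' keeps just the final name component;
    # add --include patterns for .asp etc. as needed.
    grep -rhoiE "[^\"'()/[:space:]]+\.(gif|jpe?g|png|pdf)" \
        --include='*.htm*' --include='*.css' /webroot \
        | tr 'A-Z' 'a-z' | sort -u > referenced.txt

    # Bare file names of every such asset on disk.
    find /webroot -type f \( -iname '*.gif' -o -iname '*.jpg' \
        -o -iname '*.jpeg' -o -iname '*.png' -o -iname '*.pdf' \) \
        -printf '%f\n' | tr 'A-Z' 'a-z' | sort -u > present.txt

    # On disk but never referenced.  Matching is by bare name, so two
    # different files sharing a name will mask each other.
    comm -13 referenced.txt present.txt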

Thanks
 

Andy Dingley

Travis Newbury said:
Does anyone know of a program that can crawl a website and tell what
files are not used any more?

What's "Not used" ? No-one reading it lately? Or no longer linked to
the main site?
 

data64

Travis Newbury said:
Does anyone know of a program that can crawl a website and tell what
files are not used any more?

The servers are running on IIS

Thanks

We did something similar using Perl, essentially comparing the files
indexed by our search engine with the files in the webserver
directory. Being static files, this was fairly simple.

If you are looking for a spider to crawl things and don't mind using
Perl, there's Merlyn's article on a simple spider:
http://www.stonehenge.com/merlyn/WebTechniques/col07.html

The swish-e open source search engine ships with a spider that you
could use to return a list of files for your site and another for your
filesystem. You would have to modify it to return only the name rather
than the entire document in your case.

http://swish-e.org/docs/spider.html
data64
 

Mitja

Travis Newbury said:
Does anyone know of a program that can crawl a website and tell what
files are not used any more?

Obviously, just by crawling the site (i.e. following links) you can
only tell which files ARE in use (disregarding files that may be
dynamically referenced by scripts).

Try Xenu. Its primary intended use is checking for broken links, but
it can also crawl a website, then crawl the server using FTP, and
finally compare the two structures to find redundant files. You can
easily get, install and configure a simple free FTP server just for
this purpose (not as much work as it sounds).

Travis Newbury said:
The servers are running on IIS

Servers, plural? That may be less convenient... I don't know how Xenu
handles that; play with it :)
 
