How to make list of all htm file...

Pero

I want to write a search script in Perl.
How do I make a list of all .htm files on a Linux/Apache web server?

Thanks.
 
Bjoern Hoehrmann

* Pero wrote in comp.lang.perl.misc:
I want to write a search script in Perl.
How do I make a list of all .htm files on a Linux/Apache web server?

There may not be any files on a web server (all pages could be generated
by the web server software directly in memory) or an infinite number of
files (dynamically created based on user input). Further, if you do not
have direct access to the server but rather want to create this list for
a remote server, you are limited by the options of the protocol the web
server system supports (usually only HTTP for the general public). You'd
have to write a crawler, or use an existing one, that visits a page and
follows all the links on it, recursively, until "all" pages have been
visited. This is a rather limited approach as some pages might only be
accessible via links from third party web pages, so you would have to
index "the whole web" for a usable list.
 
Gunnar Hjalmarsson

Pero said:
I want to write a search script in Perl.
How do I make a list of all .htm files on a Linux/Apache web server?

locate -r '\.html$' > htmlfiles.txt
 
Jürgen Exner

Pero said:
I want to write a search script in Perl.
How do I make a list of all .htm files on a Linux/Apache web server?

I'd use File::Find to loop through all files. Then for each file found
you could use one of the tools from http://validator.w3.org to check if
the file contains valid HTML code. You can also download the validator
code and install it locally to avoid calling their service a gazillion
times.

jue
 
szr

David said:
Perl is a big hammer for such a small nail.

How about just typing this at your commandline:

find . -name "*.htm"

(that recurses down from your current directory. cd to \ if you want
to find ALL such files anywhere they may exist. But you probably
want to start at your Apache DocumentRoot).

Or, to find .htm or .html:

$ find . | grep -P 'html?$'

Or also .shtml and .pshtml:

$ find . | grep -P '[sp]?html?$'

Or to also find .xml

$ find . | grep -P '([sp]?html?|xml)$'


You get the idea. Also, grep with the -P arg uses a Perl style regex :)
 
Andrew DeFaria

David said:
Perl is a big hammer for such a small nail.

How about just typing this at your commandline:

find . -name "*.htm"

(that recurses down from your current directory. cd to \ if you want
to find ALL such files anywhere they may exist. But you probably want
to start at your Apache DocumentRoot).
"find" doesn't do this on Windows. On Unix there is no "\" to cd to. So
which OS are you speaking of?
 
Dr.Ruud

szr schreef:
$ find . | grep -P 'html?$'

That is quite wasteful, even if the current directory doesn't contain
millions of subdirectories and files.

And it would erroneously return ./test_html and such.

$ find . -type f -name "*.htm" -or -name "*.html"

$ find . -type f -regex ".*\.html?"
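A side note, not from the original post: in the first command, -or binds more loosely than the implicit -and between tests, so -type f only applies to the "*.htm" branch. Grouping the name tests with parentheses keeps it applying to both:

```shell
# Parenthesize the -name tests so -type f constrains both extensions.
find . -type f \( -name "*.htm" -o -name "*.html" \)
```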
 
szr

Dr.Ruud said:
szr schreef:


That is quite wasteful, even if the current directory doesn't contain
millions of subdirectories and files.

And it would erroneously return ./test_html and such.

$ find . -type f -name "*.htm" -or -name "*.html"

$ find . -type f -regex ".*\.html?"

Ah, yes, I forgot the *. in my examples. And I forgot you could use
regex with find.
 
szr

Dr.Ruud said:
szr schreef:


That is quite wasteful, even if the current directory doesn't contain
millions of subdirectories and files.

Aside from forgetting the *., which should have been at the beginning of
my patterns, is it really more wasteful? Does find not have to also check
each file it comes across too? Or is it just the over of piping the
final output from find over to grep? Other than that I don't see why it
would be more wasteful. On both my dual-core Linux system as well as
an old P2 400 also running Linux, I see no difference in speed, even on
a large sprawling directory. find does its thing, grep prunes its
results.
 
szr

Glenn said:
They're not regular expressions: they're shell glob patterns.

I know that. I didn't mean it as a regex. The *.htm means anything
ending with .htm

It is nice, though, that one can use just -regex when using find :)
 
Doug Miller

Aside from forgetting the *., which should have been at the beginning of
my patterns, is it really more wasteful?

Yes, absolutely.
Does find not have to also check
each file it comes across too?

Certainly. But you're piping *all* of them to grep, thus making both find
*and* grep process all of them.
Or is it just the over of piping the
final output from find over to grep?

That, too.
Other than that I don't see why it
would be more wasteful.

Because it:
a) creates, opens, and closes a pipe that is not necessary
b) spawns an additional process (grep) that is not necessary
c) ships *every* filename across that unnecessary pipe to that unnecessary
process to be filtered
... when you could instead simply filter the filenames at the source, as
they're generated by find.
On both my dual-core Linux system as well as
an old P2 400 also running Linux, I see no difference in speed, even on
a large sprawling directory.

That's because
a) you're on a single-user machine, and
b) you're not examining a large enough directory to notice the difference.
Try that in a multi-user environment with typical production directory trees,
and the difference will become visible.
find does its thing, grep prunes its results.

Pointless. find can both find *and* prune.
 
szr

Doug said:
Yes, absolutely.


Certainly. But you're piping *all* of them to grep, thus making both
find *and* grep process all of them.
Yep.

s/over/overhead/

That, too.


Because it:
a) creates, opens, and closes a pipe that is not necessary
b) spawns an additional process (grep) that is not necessary
c) ships *every* filename across that unnecessary pipe to that
unnecessary process to be filtered
... when you could instead simply filter the filenames at the source,
as they're generated by find.


That's because
a) you're on a single-user machine, and
b) you're not examining a large enough directory to notice the
difference.
Try that in a multi-user environment with typical production
directory trees, and the difference will become visible.

I logged into one of the large servers that I manage and ran the same
test, and found there to be a difference, especially when running it
using the system root (/) as the starting point. It is indeed better to
go the efficient route.
Pointless. find can both find *and* prune.

True. Wonderful, -regex, is.
 
Dr.Ruud

szr schreef:
find does its thing, grep prunes its results.

Be very careful with that approach; it can easily get you fired.

On a heavily loaded production server, not only make your find do the
pruning itself, but nice it too.

Just accept that a wide find can take tons of minutes. When you need a
wide find, you shouldn't be in a hurry.
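Those two suggestions combined might look like this sketch (the docroot path is a placeholder and the niceness value is arbitrary):

```shell
# Run the whole search at low CPU priority and let find itself do the
# filtering, so nothing extra is piped to a second process.
nice -n 19 find /var/www -type f -regex '.*\.html?' > htmlfiles.txt
```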
 
