How to make list of all htm file...

Discussion in 'Perl Misc' started by Pero, Jun 21, 2008.

  1. Pero

    Pero Guest

    I want to write search script in perl.
    How to make list of all htm file on Linux - Apache web server?

    Tnx.
     
    Pero, Jun 21, 2008
    #1

  2. * Pero wrote in comp.lang.perl.misc:
    >I want to write search script in perl.
    >How to make list of all htm file on Linux - Apache web server?


    There may not be any files on a web server (all pages could be generated
    by the web server software directly in memory) or an infinite number of
    files (dynamically created based on user input). Further, if you do not
    have direct access to the server but rather want to create this list for
    a remote server, you are limited by the options of the protocol the web
    server system supports (usually only HTTP for the general public). You'd
    have to write a crawler, or use an existing one, that visits a page and
    follows all the links on it, recursively, until "all" pages have been
    visited. This is a rather limited approach as some pages might only be
    accessible via links from third party web pages, so you would have to
    index "the whole web" for a usable list.
    --
    Björn Höhrmann · mailto: · http://bjoern.hoehrmann.de
    Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
    68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
     
    Bjoern Hoehrmann, Jun 21, 2008
    #2

  3. Pero wrote:
    > I want to write search script in perl.
    > How to make list of all htm file on Linux - Apache web server?


    locate -r '\.html$' > htmlfiles.txt

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
     
    Gunnar Hjalmarsson, Jun 21, 2008
    #3
  4. "Pero" <> wrote:
    >I want to write search script in perl.
    >How to make list of all htm file on Linux - Apache web server?


    I'd use File::Find to loop through all files. Then for each file found
    you could use one of the tools from http://validator.w3.org to check if
    the file contains valid HTML code. You can also download the validator
    code and install it locally to avoid calling their service a gazillion
    times.

    jue
     
    Jürgen Exner, Jun 21, 2008
    #4
  5. Pero

    szr Guest

    David Filmer wrote:
    > Pero wrote:
    >> I want to write search script in perl.
    >> How to make list of all htm file on Linux - Apache web server?

    >
    > Perl is a big hammer for such a small nail.
    >
    > How about just typing this at your commandline:
    >
    > find . -name "*.htm"
    >
    > (that recurses down from your current directory. cd to \ if you want
    > to find ALL such files anywhere they may exist. But you probably
    > want to start at your Apache DocumentRoot).


    Or, to find .htm or .html:

    $ find . | grep -P 'html?$'

    Or also .shtml and .pshtml:

    $ find . | grep -P '[sp]?html?$'

    Or to also find .xml

    $ find . | grep -P '([sp]?html?|xml)$'


    You get the idea. Also, grep with the -P arg uses a Perl style regex :)

    --
    szr
     
    szr, Jun 21, 2008
    #5
  6. David Filmer wrote:
    > Pero wrote:
    >> I want to write search script in perl.
    >> How to make list of all htm file on Linux - Apache web server?

    > Perl is a big hammer for such a small nail.
    >
    > How about just typing this at your commandline:
    >
    > find . -name "*.htm"
    >
    > (that recurses down from your current directory. cd to \ if you want
    > to find ALL such files anywhere they may exist. But you probably want
    > to start at your Apache DocumentRoot).

    "find" doesn't do this on Windows. On Unix there is no "\" to cd to. So
    which OS are you speaking of?
    --
    Andrew DeFaria <http://defaria.com>
    If all the world is a stage, where is the audience sitting?
     
    Andrew DeFaria, Jun 22, 2008
    #6
  7. Pero

    Dr.Ruud Guest

    Re: [OT] How to make list of all htm file...

    szr schreef:

    > $ find . | grep -P 'html?$'


    That is quite wasteful, even if the current directory doesn't contain
    millions of subdirectories and files.

    And it would erroneously return ./test_html and such.

    $ find . -type f -name "*.htm" -or -name "*.html"

    $ find . -type f -regex ".*\.html?"
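A caveat on the first command (a sketch, not something raised in the thread): find's implicit -and binds tighter than -or, so -type f applies only to the "*.htm" test, and a directory named something.html would still be listed. Grouping the two -name tests avoids that:

```shell
# Group the two -name tests so -type f applies to both
# (the parentheses are escaped to hide them from the shell)
find . -type f \( -name "*.htm" -o -name "*.html" \)
```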

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 22, 2008
    #7
  8. Pero

    szr Guest

    Dr.Ruud wrote:
    > szr schreef:
    >
    >> $ find . | grep -P 'html?$'

    >
    > That is quite wasteful, even if the current directory doesn't contain
    > millions of subdirectories and files.
    >
    > And it would erroneously return ./test_html and such.
    >
    > $ find . -type f -name "*.htm" -or -name "*.html"
    >
    > $ find . -type f -regex ".*\.html?"


    Ah, yes, I forgot the *. in my examples. And I forgot you could use
    regex with find.

    --
    szr
     
    szr, Jun 22, 2008
    #8
  9. Pero

    szr Guest

    Re: [OT] How to make list of all htm file...

    Dr.Ruud wrote:
    > szr schreef:
    >
    >> $ find . | grep -P 'html?$'

    >
    > That is quite wasteful, even if the current directory doesn't contain
    > millions of subdirectories and files.


    Aside from forgetting the *. which should have been at the beginning of
    my patterns, is it really more wasteful? Does find not have to also
    check each file it comes across too? Or is it just the over of piping
    the final output from find over to grep? Other than that I don't see
    why it would be more wasteful. On both my dual-core Linux system and an
    old P2 400 also running Linux, I see no difference in speed, even on a
    large sprawling directory: find does its thing, grep prunes its
    results.

    --
    szr
     
    szr, Jun 22, 2008
    #9
  10. Pero

    szr Guest

    Glenn Jackman wrote:
    > At 2008-06-22 05:55PM, "szr" wrote:
    >> Dr.Ruud wrote:
    >>> szr schreef:
    >>>
    >>>> $ find . | grep -P 'html?$'
    >>>
    >>> That is quite wasteful, even if the current directory doesn't
    >>> contain millions of subdirectories and files.
    >>>
    >>> And it would erroneously return ./test_html and such.
    >>>
    >>> $ find . -type f -name "*.htm" -or -name "*.html"
    >>>
    >>> $ find . -type f -regex ".*\.html?"

    >>
    >> Ah, yes, I forgot the *. in my examples. And I forgot you could use
    >> regex with find.

    >
    > They're not regular expressions: they're shell glob patterns.
    >
    >
    > --
    > Glenn Jackman
    > Write a wise saying and your name will live forever. -- Anonymous


    I know that. I didn't mean it as a regex. The *.htm glob means anything
    ending in .htm

    It is nice, though, that one can use just -regex when using find :)

    --
    szr
     
    szr, Jun 23, 2008
    #10
  11. Pero

    Doug Miller Guest

    Re: [OT] How to make list of all htm file...

    In article <>, "szr" <> wrote:
    >Dr.Ruud wrote:
    >> szr schreef:
    >>
    >>> $ find . | grep -P 'html?$'

    >>
    >> That is quite wasteful, even if the current directory doesn't contain
    >> millions of subdirectories and files.

    >
    >Aside from forgetting the *. which should have been at the beginning of my
    >patterns, is it really more wasteful?


    Yes, absolutely.

    >Does find not have to also check
    >each file it comes across too?


    Certainly. But you're piping *all* of them to grep, thus making both find
    *and* grep process all of them.

    >Or is it just the over of piping the
    >final output from find over to grep?


    That, too.

    >Other than that I don't see why it
    >would be more wasteful.


    Because it:
    a) creates, opens, and closes a pipe that is not necessary
    b) spawns an additional process (grep) that is not necessary
    c) ships *every* filename across that unnecessary pipe to that unnecessary
    process to be filtered
    ... when you could instead simply filter the filenames at the source, as
    they're generated by find.

    >On my both my Dual core Linux system as well as
    >an old P2 400 also running Linux, I see no difference in speed, even on
    >a large sprawling directory.


    That's because
    a) you're on a single-user machine, and
    b) you're not examining a large enough directory to notice the difference.
    Try that in a multi-user environment with typical production directory trees,
    and the difference will become visible.

    > find does its thing, grep prunes its results.


    Pointless. find can both find *and* prune.
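find can also prune in the stronger sense, skipping whole subtrees so they are never descended into at all; a hypothetical example that ignores a ./tmp directory:

```shell
# Never descend into ./tmp; print .html files found everywhere else
find . -path ./tmp -prune -o -type f -name "*.html" -print
```

The explicit -print is needed here so that pruned paths themselves are not listed.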
     
    Doug Miller, Jun 29, 2008
    #11
  12. Pero

    szr Guest

    Re: [OT] How to make list of all htm file...

    Doug Miller wrote:
    > In article <>, "szr"
    > <> wrote:
    >> Dr.Ruud wrote:
    >>> szr schreef:
    >>>
    >>>> $ find . | grep -P 'html?$'
    >>>
    >>> That is quite wasteful, even if the current directory doesn't
    >>> contain millions of subdirectories and files.

    >>
    >> Aside from forgetting the *. which should have been at the beginning of my
    >> patterns, is it really more wasteful?

    >
    > Yes, absolutely.
    >
    >> Does find not have to also check
    >> each file it comes across too?

    >
    > Certainly. But you're piping *all* of them to grep, thus making both
    > find *and* grep process all of them.


    Yep.

    >> Or is it just the over of piping the
    >> final output from find over to grep?


    s/over/overhead/

    > That, too.
    >
    >> Other than that I don't see why it
    >> would be more wasteful.

    >
    > Because it:
    > a) creates, opens, and closes a pipe that is not necessary
    > b) spawns an additional process (grep) that is not necessary
    > c) ships *every* filename across that unnecessary pipe to that
    > unnecessary process to be filtered
    > .. when you could instead simply filter the filenames at the source,
    > as
    > they're generated by find.
    >
    >> On my both my Dual core Linux system as well as
    >> an old P2 400 also running Linux, I see no difference in speed, even
    >> on a large sprawling directory.

    >
    > That's because
    > a) you're on a single-user machine, and
    > b) you're not examining a large enough directory to notice the
    > difference.
    > Try that in a multi-user environment with typical production
    > directory trees, and the difference will become visible.


    I logged into one of the large servers that I manage and ran the same
    test, and found there to be a difference, especially when running it
    using the system root (/) as the starting point. It is indeed better to
    go the efficient route.

    >> find does its thing, grep prunes its results.

    >
    > Pointless. find can both find *and* prune.


    True. Wonderful, -regex, is.

    --
    szr
     
    szr, Jun 29, 2008
    #12
  13. Pero

    Dr.Ruud Guest

    Re: [OT] How to make list of all htm file...

    szr schreef:

    > find does its thing, grep prunes its results.


    Be very careful with that approach, it can easily get you fired.

    On a heavily loaded production server, not only make your find do the
    pruning itself, but nice it too.

    Just accept that a wide find can take tons of minutes. When you need a
    wide find, you shouldn't be in a hurry.
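That advice, sketched (the starting directory and output file are placeholders; on Linux, ionice is a further option for I/O priority, not shown here):

```shell
# Run the scan at the lowest CPU priority so production work is not starved
nice -n 19 find . -type f -name "*.html" > htmlfiles.txt
```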

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Jun 29, 2008
    #13
