George Orwell said:
#!/usr/bin/perl
# Public Domain
use strict;
use warnings;
Good so far.
open(IN, "$ARGV[0]") or die("No input file. Use perldoc $0 for more
info.\n");
You needlessly quote $ARGV[0] there; I'm not sure why. This can be
improved with lexical filehandles and the three-arg form of open(),
but major kudos for handling the failure with die. However, there are
more reasons for failing to open than the file not existing-- you may
not have read permissions, for instance. Just let Perl tell you why
it didn't work:
open my $infile, '<', $ARGV[0] or die "Couldn't open $ARGV[0]: $!";
You're only processing it line-by-line below, so why bother reading it
in all at once? If the file is large, this could use a lot more
memory than you are expecting. Omit this line entirely.
if($ARGV[1]){
open(OUT, ">>", "$ARGV[1]");
select OUT;
}
See above-- you need to check for failure to open the output file as
well, because you might not have permissions to write in that
directory, or the file could already exist, and you don't have
permissions to write to it, or any number of other things.
if($ARGV[1]){
open my $outfile, ">>", $ARGV[1])
or die "Couldn't open $ARGV[1] for output: $!";
select $outfile;
}
Note: because $outfile is declared in the if block, you can't refer to
it later. If you want to modify the program to use it later, you'll
need to declare $outfile outside the loop:
my $outfile;
if ($ARGV[1]) {....}
# The regex below does all the work... (I'm sure you could write this
program in a one-liner...)
foreach(@in){
if($_=~m/(http:\/\/\S+\.[\w\d.\/_%]+)/){
print "$1\n";
}
}
close(IN);
It's hard to tell what you want from your regex, as you are not
matching URLs with port numbers in them, or with GET parameters
(following a '?' character). I'll fix it in a VERY ugly way below.
Use HTML::LinkExtor instead.
# this reads one line at a time from $infile, which means you don't have
# to have the whole thing in memory at once.
# also, the regex is untested, and probably misses a few special cases--
# kinda rushed through rfc1808
while(<$infile>) {
print $1 if m|(http:// # protocol
[-\w.]+ # host
(?: :\d+)? # port number (optional)
/[\`-%$.+!*'(),\w

&=]* # path
(?: \?[\`-%$.+!*'(),\w;/?

&=]) # query
)
|igsmx;
}
Ugh! That's ugly (and probably broken, because I can't be arsed to
test it.)
Here's something pretty:
#!/usr/bin/perl
use warnings;
use strict;
use HTML::LinkExtor;
my $extractor = HTML::LinkExtor->new();
$extractor->parse_file($ARGV[0]);
if($ARGV[1]){
open my $outfile, ">>", $ARGV[1])
or die "Couldn't open $ARGV[1] for output: $!";
select $outfile;
}
print "Links found in $ARGV[0]:\n"
foreach ($extractor->links()) {
my ($tag, %links) = @{$_};
print join("\n", grep /^http:/ values %links);
}
NOTE: also untested, but I'm a heck of a lot more confident in that
one than I am in the previous one.
-=Eric