Silly li'l perl script, plx improve.

G

George Orwell

#!/usr/bin/perl
# Public Domain
use strict;
use warnings;
open(IN, "$ARGV[0]") or die("No input file. Use perldoc $0 for more
info.\n");
my @in = <IN>;
if($ARGV[1]){
open(OUT, ">>", "$ARGV[1]");
select OUT;
}
# The regex below does all the work... (I'm sure you could write this
program in a one-liner...)
foreach(@in){
if($_=~m/(http:\/\/\S+\.[\w\d.\/_%]+)/){
print "$1\n";
}
}
close(IN);
 
K

Keith Keller

}
# The regex below does all the work... (I'm sure you could write this
program in a one-liner...)
foreach(@in){
if($_=~m/(http:\/\/\S+\.[\w\d.\/_%]+)/){
print "$1\n";
}

perl -n -e 'print "$1\n" if (m/(your_pattern_here)/)' infile > outfile

I'll let somebody else nitpick the regex. I will say that it's hard to
suggest improvements when you don't state what you expect to send to the
program or what you expect to be output. Have you read the Posting
Guidelines that are frequently posted here?

--keith
 
E

Eric Schwartz

George Orwell said:
#!/usr/bin/perl
# Public Domain
use strict;
use warnings;

Good so far.
open(IN, "$ARGV[0]") or die("No input file. Use perldoc $0 for more
info.\n");

You needlessly quote $ARGV[0] there; I'm not sure why. This can be
improved with lexical filehandles and the three-arg form of open(),
but major kudos for handling the failure with die. However, there are
more reasons for failing to open than the file not existing-- you may
not have read permissions, for instance. Just let Perl tell you why
it didn't work:

open my $infile, '<', $ARGV[0] or die "Couldn't open $ARGV[0]: $!";
my @in = <IN>;

You're only processing it line-by-line below, so why bother reading it
in all at once? If the file is large, this could use a lot more
memory than you are expecting. Omit this line entirely.
if($ARGV[1]){
open(OUT, ">>", "$ARGV[1]");
select OUT;
}

See above-- you need to check for failure to open the output file as
well, because you might not have permissions to write in that
directory, or the file could already exist, and you don't have
permissions to write to it, or any number of other things.

if($ARGV[1]){
open my $outfile, ">>", $ARGV[1])
or die "Couldn't open $ARGV[1] for output: $!";
select $outfile;
}

Note: because $outfile is declared in the if block, you can't refer to
it later. If you want to modify the program to use it later, you'll
need to declare $outfile outside the loop:

my $outfile;
if ($ARGV[1]) {....}
# The regex below does all the work... (I'm sure you could write this
program in a one-liner...)
foreach(@in){
if($_=~m/(http:\/\/\S+\.[\w\d.\/_%]+)/){
print "$1\n";
}
}
close(IN);

It's hard to tell what you want from your regex, as you are not
matching URLs with port numbers in them, or with GET parameters
(following a '?' character). I'll fix it in a VERY ugly way below.
Use HTML::LinkExtor instead.

# this reads one line at a time from $infile, which means you don't have
# to have the whole thing in memory at once.
# also, the regex is untested, and probably misses a few special cases--
# kinda rushed through rfc1808
while(<$infile>) {
print $1 if m|(http:// # protocol
[-\w.]+ # host
(?: :\d+)? # port number (optional)
/[\`-%$.+!*'(),\w:mad:&=]* # path
(?: \?[\`-%$.+!*'(),\w;/?:mad:&=]) # query
)
|igsmx;
}

Ugh! That's ugly (and probably broken, because I can't be arsed to
test it.)

Here's something pretty:

#!/usr/bin/perl
use warnings;
use strict;
use HTML::LinkExtor;

my $extractor = HTML::LinkExtor->new();

$extractor->parse_file($ARGV[0]);

if($ARGV[1]){
open my $outfile, ">>", $ARGV[1])
or die "Couldn't open $ARGV[1] for output: $!";
select $outfile;
}

print "Links found in $ARGV[0]:\n"
foreach ($extractor->links()) {
my ($tag, %links) = @{$_};
print join("\n", grep /^http:/ values %links);
}

NOTE: also untested, but I'm a heck of a lot more confident in that
one than I am in the previous one.

-=Eric
 
K

Klaus

George said:
#!/usr/bin/perl
# Public Domain
use strict;
use warnings;
open(IN, "$ARGV[0]") or die("No input file. Use perldoc $0 for more
info.\n");
my @in = <IN>;
if($ARGV[1]){
open(OUT, ">>", "$ARGV[1]");
select OUT;
}
# The regex below does all the work... (I'm sure you could write this
program in a one-liner...)
foreach(@in){
if($_=~m/(http:\/\/\S+\.[\w\d.\/_%]+)/){

I have some doubts about the second "." in the regexp (the one after
\d.), but I'll ignore my doubts...
print "$1\n";
}
}
close(IN);


#!/usr/bin/perl
# Public Domain
use strict;
use warnings;

defined $ARGV[0] or die "Error - No parameters";

my $OUT;
if (defined $ARGV[1]){
open $OUT, '>>', $ARGV[1] or die "Error - Open >>$ARGV[1]: $!";
select $OUT;
}

open my $IN, '<', $ARGV[0] or die "Error - Open <$ARGV[0]: $!";

while (<$IN>) {
# The regex below does all the work...
# (I'm sure you could write this program in a one-liner...)
if(m{(http://\S+\.[\w\d./_%]+)}){ print $1, "\n" }
}
 
K

Klaus

Eric said:
if($ARGV[1]){
open my $outfile, ">>", $ARGV[1])
or die "Couldn't open $ARGV[1] for output: $!";
select $outfile;
}

Note: because $outfile is declared in the if block, you can't refer to
it later. If you want to modify the program to use it later, you'll
need to declare $outfile outside the loop:

I agree, and I would like to add what I think will happen in the above
code fragment under the condition that $ARGV[1] is defined and not ""
and not "0" and not 0:

- the condition if($ARGV[1]) is true, so we enter the if-branch
- the file $ARGV[1] is opened as $outfile for output (append)
- and, given that the open was successful,
- sets the default to $outfile for all subsequent prints (select)
- the file $outfile is closed as the scope of the if exits

Now, for the rest of the program, all subsequent prints fail silently
because print now defaults to the closed filehandle $outfile.

I did not test the above scenario, but would like to ask anyway:
Am I right ?
my $outfile;
if ($ARGV[1]) {....}
 
E

Eric Schwartz

Klaus said:
Eric said:
if($ARGV[1]){
open my $outfile, ">>", $ARGV[1])
or die "Couldn't open $ARGV[1] for output: $!";
select $outfile;
}

Note: because $outfile is declared in the if block, you can't refer to
it later. If you want to modify the program to use it later, you'll
need to declare $outfile outside the loop:

I agree, and I would like to add what I think will happen in the above
code fragment under the condition that $ARGV[1] is defined and not ""
and not "0" and not 0:

How about testing what does happen? It probably would have taken you
less time than it took to compose your post. That's not an insult, by
the way-- I'm just pointing out that reality trumps logic by a large
factor.

$ cat /tmp/foo
if ($ARGV[0]) {
open my $outfile, '>', $ARGV[0] or die "couldn't open: $!";
select $outfile;
}

for (0..20_000_000) {
print
}

emschwar@aragorn:~$ perl /tmp/foo /tmp/beh
emschwar@aragorn:~$ cat /tmp/beh
01234567891011121314151617181920212223242526272829303132.....
- the condition if($ARGV[1]) is true, so we enter the if-branch
- the file $ARGV[1] is opened as $outfile for output (append)
- and, given that the open was successful,
- sets the default to $outfile for all subsequent prints (select)

So far, you're correct.
- the file $outfile is closed as the scope of the if exits

Where do you get this last assertion? I don't find it anywhere in
perldoc -f open-- in fact, it suggests you use IO::File if you want
the behaviour you're asking for. In this case I cheated, and let the
OS close all open filehandles on exit for me. Yes, it's a bit
naughty, but not enough (in this context only) to keep me up at
nights. Toss in a

close();

at the end if you like.
Now, for the rest of the program, all subsequent prints fail silently
because print now defaults to the closed filehandle $outfile.

I did not test the above scenario, but would like to ask anyway:
Am I right ?

Why didn't you test it? It's very easy to create such a test (see
above, took me about 2 minutes), and you'd have answered your own
question before I even had my morning caffeine.
my $outfile;
if ($ARGV[1]) {....}

That gives you a filehandle that is accessible in a much larger scope
than you need it. Which is not evil, but I prefer to keep things as
tightly scoped as possible.

-=Eric
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,744
Messages
2,569,484
Members
44,903
Latest member
orderPeak8CBDGummies

Latest Threads

Top