regex for URL in a log file

Jaga

Hail all,
I am trying to write a regular expression to match URLs in a text file.
The test file looks like the sample below the row of asterisks.
I would like to match all the URLs and print them out...
I think this is easy for most, but it is a pain in the neck for me.

Thanks!


************
°;V8q|Ã`<F- ÃL/&¤ ?Q ` h þ  
6/$h :2003091520030922:
tfred@http://quintillium.com/mslegal/tssi986
URL  ssóq|Ã`<F- ÃL/²¥ ?Q ` h þ  
6/$h :2003091520030922:
tfred@http://ninet/Lists/Announcements/DispForm.h
 
Jaga

Hail again,
here is some code I 'lifted' from different places to do pretty much
what I want... unfortunately, it doesn't work, and I am trying to
fix it...
##########################
open IFILE,"<log.txt" or die "Can't Open file:: $!";

@lines=<IFILE>;

$text = join "\n", @lines;

@hrefs=($text=~ m{ \"(?:(-)|http\:\/\/(.*?))\"\s+ }x);

print "list of href values\n";
$count = 1;
foreach $href (@hrefs) {
print "$href\n";
$count++;
}
print $count;

close IFILE;
##########################
thanks,
Jaga
 
Jaga

I changed the regex to look like this:
@hrefs=($text=~ m{http\:\/\/(.*?)\s+ }x);
Unfortunately, it only returns:
quintillium.com/mslegal/tssi986

and doesn't return the other URL.
How can I do it recursively throughout the whole $text string?
Or how can I do this more efficiently?
 
Glenn Jackman

Jaga said:
I am trying to write a regular expression to match a url in a text file.

Don't reinvent the wheel:

use Regexp::Common qw(URI);
my @urls;
while (<>) {
    # /g in list context returns every HTTP URL found on the line
    push @urls, /$RE{URI}{HTTP}/g;
}
print "$_\n" for @urls;
 
Florian von Savigny

One way to do it:

$text = "blabla soiu apoj match poi aigjpo match poua ier";

while ($text =~ /[^a-z](match)[^a-z]/g) {
print $1, "\n";
}

this outputs:

match
match

The crucial thing is the /g (global) modifier, which causes the
matching to go on after the first match, until there's no more.

Jaga said:
@hrefs=($text=~ m{http\:\/\/(.*?)\s+ }x);
unfortunately, it only returns:
quintillium.com/mslegal/tssi986

This seems obvious, since you've excluded the "http://" from the
parentheses. I've never formulated such a thing the way you have done
here, but you might try adding the g modifier. (x does not do what you
seem to expect: it means "extended regular expressions", which means that
you can use comments and whitespace inside your regex to make it more
readable. If you drop it, also remove the space before the closing
brace, or that space will have to match literally.) With g it might
work similarly to my while () loop. However, as this
seems to return the contents of the first pair of parentheses (all $1,
so to speak), I wouldn't want to guess what it returns if you use more
than one pair.
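
To make the difference concrete, here is a minimal sketch of the list-context match with and without /g. The $text value is only a stand-in shaped like the sample log (the binary junk is replaced with plain words), and the variable names @first and @all are made up for illustration.

```perl
use strict;
use warnings;

# Stand-in for the log contents (an assumption; the real file has binary junk).
my $text = join "\n",
    'junk :2003091520030922:',
    'tfred@http://quintillium.com/mslegal/tssi986',
    'URL junk :2003091520030922:',
    'tfred@http://ninet/Lists/Announcements/DispForm.h',
    '';

# Without /g, a list-context match returns the captures of the first match only.
my @first = $text =~ m{http://(.*?)\s+}x;

# With /g, it returns the captures of every match in the string.
my @all = $text =~ m{http://(.*?)\s+}xg;

print scalar(@first), " capture without /g\n";
print "$_\n" for @all;
```

With one pair of parentheses, @all simply holds one entry per URL, each without its "http://" prefix because the prefix sits outside the parentheses.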

Some more hints:

- if you use delimiters other than //, as you have done, you need not
escape the "/" in the regex; and you never need to escape ":"

- it is often a good idea to define matches by what they must NOT be:
e.g., formulate the body of the URL as "[^\s]+" (assuming it is
indeed delimited by some whitespace character). This has the side
effect of being helpful with tools such as grep, which don't support
minimal matching quantifiers (*?).

- if you do not want to exclude protocols other than HTTP, you might
want to say something like "(http|ftp|news|mailto)" instead of just
"http" (but see above). You'd have to adjust the slashes, of course.
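
The hints above can be put together roughly as follows. This is only a sketch: the sample text is an assumption, the ftp line and its example.org address are invented to show the alternation, and the protocol list is limited to the two schemes that are followed by "//".

```perl
use strict;
use warnings;

# Sample text is an assumption; the ftp line is made up for illustration.
my $text = "tfred\@http://quintillium.com/mslegal/tssi986\n"
         . "see ftp://example.org/pub/file.txt for details\n";

# m{...} delimiters: "/" needs no escaping, and ":" never does.
# [^\s]+ takes everything up to the next whitespace character.
# (?:...) groups the protocols without capturing, so the only capture
# is the outer pair of parentheses, which holds the whole URL.
my @urls = $text =~ m{((?:http|ftp)://[^\s]+)}g;

print "$_\n" for @urls;
```

Putting the parentheses around the whole pattern keeps the protocol in each result, so nothing has to be glued back on afterwards.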

--


Florian v. Savigny

If you are going to reply in private, please be patient, as I only
check for mail something like once a week. - Si vous allez répondre
personellement, patientez s.v.p., car je ne lis les courriels
qu'environ une fois par semaine.
 
Florian von Savigny

Florian von Savigny said:
However, as this
seems to return the contents of the first pair of parentheses (all $1,
so to speak), I wouldn't want to guess what it returns if you use more
than one pair.

Sorry, got it: it returns What You Would Expect: if you have two pairs
of parentheses, it will return $1, $2, for the first match, then $1,
$2 for the second, and so on. So using more than one pair of
parentheses probably makes your approach unwieldy, as you'd probably
have to post-process your list.
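
A small sketch of what that looks like in practice, assuming two capture groups that split each URL into host and path. The sample text is modelled on the log lines in the question, and the names @flat and @pairs are hypothetical.

```perl
use strict;
use warnings;

# Sample text is an assumption modelled on the log lines in the question.
my $text = "tfred\@http://quintillium.com/mslegal/tssi986\n"
         . "tfred\@http://ninet/Lists/Announcements/DispForm.h\n";

# Two capture groups: $1 is the host, $2 is the path.
my @flat = $text =~ m{http://([^/\s]+)(/[^\s]*)}g;
# @flat is now (host1, path1, host2, path2): one flat list, not pairs.

# Option 1: post-process the flat list two items at a time.
my @pairs;
for (my $i = 0; $i < @flat; $i += 2) {
    push @pairs, [ @flat[$i, $i + 1] ];
}

# Option 2: skip the post-processing by looping in scalar context,
# where $1 and $2 are available for each match in turn.
while ($text =~ m{http://([^/\s]+)(/[^\s]*)}g) {
    print "host=$1 path=$2\n";
}
```

The while () loop is usually the less unwieldy of the two once there is more than one pair of parentheses.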

 
