regex for URL in a log file

Discussion in 'Perl Misc' started by Jaga, Oct 2, 2003.

  1. Jaga

    Jaga Guest

    Hail all,
    I am trying to write a regular expression to match a URL in a text file.
    The test file looks like the sample below the row of asterisks.
    I would like to match all the URLs and print them out.
    I think this is easy for most, but it is a pain in the neck for me.

    thanks!


    ************
    °;V8q|Ã`<F- ÃL/&¤ ?Q ` h þ  
    6/$h :2003091520030922:
    tfred@http://quintillium.com/mslegal/tssi986
    URL  ssóq|Ã`<F- ÃL/²¥ ?Q ` h þ  
    6/$h :2003091520030922:
    tfred@http://ninet/Lists/Announcements/DispForm.h
    Jaga, Oct 2, 2003
    #1

  2. Jaga

    Jaga Guest

    Hail again,
    here is some code I 'lifted' from different places to do pretty much
    what I want. Unfortunately, it doesn't work, and I am still trying to
    fix it.
    ##########################
    open IFILE,"<log.txt" or die "Can't Open file:: $!";

    @lines=<IFILE>;

    $text = join "\n", @lines;

    @hrefs=($text=~ m{ \"(?:(-)|http\:\/\/(.*?))\"\s+ }x);

    print "list of href values\n";
    $count = 1;
    foreach $href (@hrefs) {
    print "$href\n";
    $count++;
    }
    print $count;

    close IFILE;
    ##########################
    thanks,
    Jaga
    Jaga, Oct 2, 2003
    #2

  3. Jaga

    Jaga Guest

    I changed the regex to look like this:
    @hrefs=($text=~ m{http\:\/\/(.*?)\s+ }x);
    Unfortunately, it only returns:
    quintillium.com/mslegal/tssi986

    and doesn't return the other URL.
    How can I apply it repeatedly throughout the whole $text string?
    Or how can I do this more efficiently?
    Jaga, Oct 2, 2003
    #3
  4. Jaga <> wrote:
    > I am trying to write a regular expression to match a url in a text file.


    Don't reinvent the wheel:

    use Regexp::Common qw(URI);
    my @urls;
    while (<>) {
        push @urls, /$RE{URI}{HTTP}/g;
    }

    --
    Glenn Jackman
    NCF Sysadmin
    Glenn Jackman, Oct 2, 2003
    #4
  5. One way to do it:

    $text = "blabla soiu apoj match poi aigjpo match poua ier";

    while ($text =~ /[^a-z](match)[^a-z]/g) {
        print $1, "\n";
    }

    this outputs:

    match
    match

    The crucial thing is the /g (global) modifier, which causes the
    matching to go on after the first match, until there are no more matches.

    > @hrefs=($text=~ m{http\:\/\/(.*?)\s+ }x);
    > unfortunately, it only returns:
    > quintillium.com/mslegal/tssi986


    This seems obvious, since you've excluded the "http://" from the
    parentheses. I've never formulated such a thing the way you have done
    here, but you might try adding the g modifier next to your x (x does
    something unrelated: it means "extended regular expressions", which
    lets you use comments and whitespace inside your regex to make it more
    readable); it might then work similarly to my while () loop. However, as
    this seems to return the contents of the first pair of parentheses (all
    $1, so to speak), I wouldn't want to guess what it returns if you use
    more than one pair.
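
    A minimal sketch of that list-context form with a single pair of
    parentheses, assuming a sample string modelled on the log excerpt in
    the original post (the binary lines are left out, and the variable
    names are only for illustration):

```perl
use strict;
use warnings;

# Sample text modelled on the log excerpt in the original post.
my $text = "tfred\@http://quintillium.com/mslegal/tssi986\n"
         . "URL  ss\n"
         . "tfred\@http://ninet/Lists/Announcements/DispForm.h\n";

# In list context, m//g returns the capture group of every match.
# The parentheses include "http://", so the whole URL is returned.
my @hrefs = $text =~ m{(http://\S+)}g;

print "$_\n" for @hrefs;
```

    Here the parentheses wrap the whole URL, so nothing is lost from the
    front of each result.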

    Some more hints:

    - if you use delimiters other than //, as you have done, you need not
    escape the "/" in the regex; and you never need to escape ":"

    - it is often a good idea to define matches by what they must NOT be:
    e.g., formulate the body of the URL as "[^\s]+" (assuming it is
    indeed delimited by some whitespace character). This has the side
    effect of being helpful with tools such as grep, which don't support
    minimal matching quantifiers (*?).

    - if you do not want to exclude protocols other than HTTP, you might
    want to say something like "(http|ftp|news|mailto)" instead of just
    "http" (but see above). You'd have to adjust the slashes, of course.
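
    The hints above might be combined roughly as follows. This is only a
    sketch: the sample string and the ftp.example.org address are made up
    for illustration, and the alternation is limited to http and ftp
    because news and mailto URLs do not use "//".

```perl
use strict;
use warnings;

# Hypothetical sample text; the ftp address is invented for illustration.
my $text = "see http://quintillium.com/mslegal/tssi986 and "
         . "ftp://ftp.example.org/pub/file.txt for details";

# m{...} delimiters: neither "/" nor ":" needs a backslash.
# (?:http|ftp) accepts either scheme without adding a capture group.
# [^\s]+ takes everything up to the next whitespace character.
my @urls = $text =~ m{((?:http|ftp)://[^\s]+)}g;

print "$_\n" for @urls;
```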

    --


    Florian v. Savigny

    If you are going to reply in private, please be patient, as I only
    check for mail something like once a week.
    Florian von Savigny, Oct 2, 2003
    #5
  6. Florian von Savigny <> writes:

    > However, as this
    > seems to return the contents of the first pair of parentheses (all $1,
    > so to speak), I wouldn't want to guess what it returns if you use more
    > than one pair.


    Sorry, got it: it returns What You Would Expect: if you have two pairs
    of parentheses, it will return $1, $2, for the first match, then $1,
    $2 for the second, and so on. So using more than one pair of
    parentheses probably makes your approach unwieldy, as you'd probably
    have to post-process your list.
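
    A small sketch of that behaviour and of the post-processing it calls
    for, assuming we want host and path as two separate captures (the
    split into host and path is my own choice for illustration):

```perl
use strict;
use warnings;

my $text = "http://quintillium.com/mslegal/tssi986 "
         . "http://ninet/Lists/Announcements/DispForm.h";

# Two capture groups: host and path. With /g in list context the
# result is one flat list: host, path, host, path, ...
my @parts = $text =~ m{http://([^/\s]+)(/\S*)?}g;

# Post-process the flat list two elements at a time.
my @pairs;
while (my ($host, $path) = splice(@parts, 0, 2)) {
    push @pairs, [ $host, defined $path ? $path : '' ];
}

printf "%s  %s\n", @$_ for @pairs;
```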

    Florian von Savigny, Oct 2, 2003
    #6
  7. Florian von Savigny <> wrote:

    > e.g., formulate the body of the URL as "[^\s]+"



    or as \S+ which matches exactly the same characters.


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Oct 3, 2003
    #7
  8. Jaga

    Ted Zlatanov Guest

    Ted Zlatanov, Oct 3, 2003
    #8
