regex to match any url

Discussion in 'Perl Misc' started by nodiseos@gmail.com, Feb 14, 2006.

  1. Guest

    , Feb 14, 2006
    #1

  2. wrote in news:1139950940.817938.158230
    @g14g2000cwa.googlegroups.com:

    > I am struggling way too much with this. Does someone have a regex that
    > will match any url-ish string like. Not worried about mail links.
    >
    > http://sd.org
    > www.dssd.com
    > ibm.mil
    > https://sdsdsd.jobs
    > xyz.travel


    Please show what you have tried and what has not worked so that we can
    help you with what you don't know rather than acting as a "write-my-
    code-for-me" service.

    #!/usr/bin/perl

    use strict;
    use warnings;

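    while ( <DATA> ) {
        # Optional scheme, then two or more dot-separated words.
        # The \n is needed because the lines are not chomped.
        print if m{ \A (?: https?:// )? \w+ (?: \. \w+)+ \n \z }x;
    }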

    __DATA__
    http://sd.org
    www.dssd.com
    ibm.mil
    https://sdsdsd.jobs
    xyz.travel

    D:\Home\asu1\UseNet\clpmisc> u
    http://sd.org
    www.dssd.com
    ibm.mil
    https://sdsdsd.jobs
    xyz.travel



    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
     
    A. Sinan Unur, Feb 14, 2006
    #2

  3. Keith Keller Guest

    On 2006-02-14, <> wrote:
    > I am struggling way too much with this. Does someone have a regex that
    > will match any url-ish string like. Not worried about mail links.
    >
    > http://sd.org
    > www.dssd.com
    > ibm.mil
    > https://sdsdsd.jobs
    > xyz.travel


    What code did you actually try, and what was the actual output versus
    the expected output?

    Have you read the Posting Guidelines for this newsgroup?

    --keith

    --
    -francisco.ca.us
    (try just my userid to email me)
    AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom
    see X- headers for PGP signature information
     
    Keith Keller, Feb 14, 2006
    #3
  4. DJ Stunks Guest

    wrote:
    > I am struggling way too much with this.


    three words: Regexp::Common::URI

    -jp
     
    DJ Stunks, Feb 15, 2006
    #4
  5. robic0 Guest

    On 14 Feb 2006 13:02:20 -0800, wrote:

    >I am struggling way too much with this. Does someone have a regex that
    >will match any url-ish string like. Not worried about mail links.
    >
    >http://sd.org
    >www.dssd.com
    >ibm.mil
    >https://sdsdsd.jobs
    >xyz.travel
    >
    >Thanks!


    I do, but I won't give it away for free
     
    robic0, Feb 15, 2006
    #5
  6. wrote:
    > I am struggling way too much with this. Does someone have a regex
    > that will match any url-ish string like. Not worried about mail
    > links.
    >
    > http://sd.org
    > www.dssd.com
    > ibm.mil
    > https://sdsdsd.jobs
    > xyz.travel


    That's easy: /.*/ will match not only all of your examples but any URL you
    can imagine.

    Now, having said that, maybe it actually was a different question you wanted
    to ask?

    jue
     
    Jürgen Exner, Feb 15, 2006
    #6
  7. Keith Keller Guest

    On 2006-02-15, robic0 <robic0> wrote:
    >
    > I do, but I won't give it away for free


    If you exchanged your code for what it is worth, you'd need to pay the
    OP to take it and fix it.

    --keith

     
    Keith Keller, Feb 15, 2006
    #7
  8. Guest

    A. Sinan Unur <> wrote:
    > wrote in news:1139950940.817938.158230
    > @g14g2000cwa.googlegroups.com:


    >> I am struggling way too much with this. Does someone have a regex that
    >> will match any url-ish string like. Not worried about mail links.
    >>
    >> http://sd.org
    >> www.dssd.com
    >> ibm.mil
    >> https://sdsdsd.jobs
    >> xyz.travel


    > Please show what you have tried and what has not worked so that we can
    > help you with what you don't know rather than acting as a "write-my-
    > code-for-me" service.


    > #!/usr/bin/perl
    >
    > use strict;
    > use warnings;
    >
    > while ( <DATA> ) {
    > print if m{ \A (?: https?:// )? \w+ (?: \. \w+)+ \n \z }x;

    ^
    |
    Perhaps this (the + after the last group) should be changed to *
    to reflect one-word valid URLs
    such as 'localhost' :)

    > }
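
A minimal sketch of that change, assuming the input strings are already chomped (so the `\n` from the original pattern is dropped). The helper names `matches_original` and `matches_relaxed` are made up for illustration and do not come from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Original quantifier: at least one ".word" group after the first word.
my $original = qr{ \A (?: https?:// )? \w+ (?: \. \w+ )+ \z }x;

# Suggested variant: zero or more ".word" groups, so a bare word passes.
my $relaxed  = qr{ \A (?: https?:// )? \w+ (?: \. \w+ )* \z }x;

# Hypothetical helpers wrapping each pattern.
sub matches_original { my ($candidate) = @_; return $candidate =~ $original ? 1 : 0 }
sub matches_relaxed  { my ($candidate) = @_; return $candidate =~ $relaxed  ? 1 : 0 }

for my $candidate (qw{localhost http://localhost www.dssd.com}) {
    printf "%-18s original=%d relaxed=%d\n",
        $candidate, matches_original($candidate), matches_relaxed($candidate);
}
```

The price of the `*` is that every single word, not just a hostname like 'localhost', now counts as url-ish.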


    Axel
     
    , Feb 15, 2006
    #8
  9. DJ Stunks wrote:

    > wrote:
    >
    >>I am struggling way too much with this.

    >
    >
    > three words: Regexp::Common::URI
    >
    > -jp
    >


    Hm, let me have a look again at what the OP wrote:

    wrote:
    > I am struggling way too much with this. Does someone have a regex that
    > will match any url-ish string like. Not worried about mail links.
    >
    > http://sd.org
    > www.dssd.com
    > ibm.mil
    > https://sdsdsd.jobs
    > xyz.travel
    >
    > Thanks!
    >


    I read this as: 'I want a RE that matches all of my example URIs, because they
    all look url-ish' (a very vague and, at least in my eyes, error-prone
    criterion, tempting me to give this: /.*\.\w{2,6}/ as an answer). To the OP:
    What, exactly, do you want to accomplish?

    If my assumption of the OP's intention is correct, then you're out of luck with
    Regexp::Common, as it will only match valid URIs, as shown here:

    D:\Temp\test_area>cat stunks.pl
    #!/usr/bin/perl

    use warnings;
    use strict;

    use Regexp::Common qw/URI/;

    chomp ( my @uris = ( <DATA> ) );
    foreach ( @uris ) {
        /$RE{URI}{-keep}/ ? print "Found: $1\n" : print "Discarding: $_\n";
    }

    __DATA__
    http://sd.org
    www.dssd.com
    ibm.mil
    https://sdsdsd.jobs
    xyz.travel

    D:\Temp\test_area>perl stunks.pl
    Found: http://sd.org
    Discarding: www.dssd.com
    Discarding: ibm.mil
    Discarding: https://sdsdsd.jobs
    Discarding: xyz.travel

    If I did misunderstand the OP, I sincerely apologize for jumping at you when you
    were giving a perfectly valid solution (though I still see some issues coming
    up with the https URIs..., but hey, here's where the Fun(tm) begins: hooking
    your own REs into Regexp::Common :-> )


    Greetings,
    Andreas Pürzer

    --
    Have Fun,
    and if you can't have fun,
    have someone else's fun.
    The Beautiful South
     
    Andreas Puerzer, Feb 15, 2006
    #9
  10. wrote in news:yLMIf.20922$wl.12746
    @text.news.blueyonder.co.uk:

    > A. Sinan Unur <> wrote:
    >> wrote in news:1139950940.817938.158230
    >> @g14g2000cwa.googlegroups.com:

    >
    >>> I am struggling way too much with this. Does someone have a regex that
    >>> will match any url-ish string like. Not worried about mail links.
    >>>
    >>> http://sd.org
    >>> www.dssd.com
    >>> ibm.mil
    >>> https://sdsdsd.jobs
    >>> xyz.travel

    >
    >> Please show what you have tried and what has not worked so that we
    >> can help you with what you don't know rather than acting as a "write-
    >> my-code-for-me" service.

    >

    ....

    >> print if m{ \A (?: https?:// )? \w+ (?: \. \w+)+ \n \z }x;

    > ^
    > |
    > Perhaps this should be changed to *
    > to reflect one-word valid URLs
    > such as 'localhost' :)
    >


    I wrote it to match the strings the OP provided. Further extension is
    left to the reader as an exercise ;-)

    Sinan

     
    A. Sinan Unur, Feb 15, 2006
    #10
  11. DJ Stunks Guest

    Andreas Puerzer wrote:

    > If my assumption of the OP's intention is correct, then you're out of luck with
    > Regexp::Common, as it will only match valid URIs, as shown here:
    >
    > D:\Temp\test_area>cat stunks.pl
    > #!/usr/bin/perl
    >
    > use warnings;
    > use strict;
    >
    > use Regexp::Common qw/URI/;
    >
    > chomp ( my @uris = ( <DATA> ) );
    > foreach ( @uris ) {
    > /$RE{URI}{-keep}/ ? print "Found: $1\n" : print "Discarding: $_\n";
    > }
    >
    > __DATA__
    > http://sd.org
    > www.dssd.com
    > ibm.mil
    > https://sdsdsd.jobs
    > xyz.travel
    >
    > D:\Temp\test_area>perl stunks.pl
    > Found: http://sd.org
    > Discarding: www.dssd.com
    > Discarding: ibm.mil
    > Discarding: https://sdsdsd.jobs
    > Discarding: xyz.travel
    >


    Thanks, I had assumed that "www.dssd.com" for instance, would have
    matched. Clearly, ibm.mil and xyz.travel are so ambiguous as to be
    anything and I can understand them not matching.

    I suppose one would have to insist on valid IANA gTLDs in the regex,
    e.g. my @TLDs = qw{aero biz cat com coop info jobs mobi museum} and so
    on...
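
Roughly, that idea could look like the sketch below. The TLD list is only an illustrative subset, with mil, org and travel added on the assumption that the OP's five examples should all pass. A real list would have to come from IANA. The name `looks_urlish` is a made-up helper, not anything from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative subset only, not the full IANA list.
my @TLDs = qw{aero biz cat com coop info jobs mil mobi museum org travel};

# Turn the list into a single alternation for the regex.
my $tld_alternation = join '|', map { quotemeta($_) } @TLDs;

my $urlish = qr{
    \A
    (?: https?:// )?          # optional scheme
    (?: [a-z0-9-]+ \. )+      # one or more labels, each followed by a dot
    (?: $tld_alternation )    # must end in one of the listed TLDs
    \z
}xi;

# Hypothetical helper: true if the string ends in a listed TLD.
sub looks_urlish {
    my ($candidate) = @_;
    return $candidate =~ $urlish ? 1 : 0;
}

for my $candidate (qw{http://sd.org www.dssd.com ibm.mil https://sdsdsd.jobs xyz.travel foo.bar}) {
    printf "%-22s %s\n", $candidate, looks_urlish($candidate) ? 'match' : 'no match';
}
```

This still accepts anything that merely ends in a known TLD, so it narrows the false positives rather than removing them.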

    > If I did misunderstand the OP I sincerely apologize for jumping at you when you
    > were giving a perfectly valid Solution ( though I still see some issues coming
    > up with the https-uris... , but hey, here's where the Fun(tm) begins: hooking
    > your own REs into Regexp::Common :-> )
    >


    I don't feel jumped on :) but from the docs regarding https:

    $RE{URI}{HTTP}{-scheme}

    If -scheme => P is specified the pattern P is used as the
    scheme. By default P is qr/http/. https and https? are
    reasonable alternatives.

    altering this value could also allow a match of "www.dssd.com", but
    then you're starting to get to such a generic regex it would open you
    up to a lot of false positives.
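
For what it's worth, here is a small sketch of the -scheme option quoted above. It assumes Regexp::Common is installed from CPAN (it is not a core module). `is_http_uri` is a made-up helper name, and the `\A ... \z` anchoring is my own choice for whole-string tests.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use Regexp::Common qw/URI/;    # CPAN module, not part of core Perl

# Override the default scheme pattern so https is accepted as well.
my $http_or_https = $RE{URI}{HTTP}{-scheme => 'https?'};

# Hypothetical helper: true if the whole string is an http(s) URI.
sub is_http_uri {
    my ($candidate) = @_;
    return $candidate =~ /\A${http_or_https}\z/ ? 1 : 0;
}

for my $candidate (qw{http://sd.org https://sdsdsd.jobs www.dssd.com}) {
    printf "%-22s %s\n", $candidate, is_http_uri($candidate) ? 'match' : 'no match';
}
```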

    -jp
     
    DJ Stunks, Feb 15, 2006
    #11
  12. DJ Stunks wrote:

    [previous discussion about matching ambiguous URIs with Regexp::Common snipped]

    >
    > I don't feel jumped on :) but from the docs regarding https:
    >
    > $RE{URI}{HTTP}{-scheme}
    >
    > If -scheme => P is specified the pattern P is used as the
    > scheme. By default P is qr/http/. https and https? are
    > reasonable alternatives.
    >
    > altering this value could also allow a match of "www.dssd.com", but
    > then you're starting to get to such a generic regex it would open you
    > up to a lot of false positives.
    >
    > -jp
    >


    Aaaargh, Shame on me! I don't know why I missed that part of the description, as
    it is the second sentence in the pod!!

    <blush>
    Now, where's my BOfH-Excuse-Generator? Ah, I see, it's due to the Phases of the
    Jupiter-moons and the reversed magnetic field of the Sun on a Wednesday the 15th
    in a non-leap-year that I overlooked these pretty obvious sentences...
    </blush>
    ;->

    Thanks for pointing this out,
    Andreas Pürzer

     
    Andreas Puerzer, Feb 16, 2006
    #12
