Regexp help.

Discussion in 'Perl Misc' started by Cab, Jun 2, 2006.

  1. Cab

    Cab Guest

    Hi all.

    I'm trying to set up a script to strip out URLs from the body of a
    Usenet post.

    Any clues please? I have some expressions that I'm using, but they're
    very long-winded and inefficient, as seen below. At the moment, I've
    done this in bash, but I want eventually to set up a Perl script to do
    this.

    So far I've got this small script that pulls URLs that start at the
    beginning of a line out into a file. This is the easy part. (Note: I
    know this is messy, but it's still a dev script at the moment.)

    ---
    echo remove spaces from the start of lines
    sed 's/^ *//g' sorted_file > 1

    echo Remove all lines containing '>' from the file
    sed '/>/d' 1 > 2

    echo uniq the file
    uniq 2 > 3


    echo Move all lines beginning with http or www into another file
    sed -n '/^http/p' 3 > 4
    sed -n '/^www/p' 3 >> 4

    echo Remove all junk on lines from "space" to EOL
    sed 's/ .*$//' 4 > 4.1

    echo uniq the file
    uniq 4.1 > 4.2

    echo "So far, I've got a file with all www and http only."
    mv 4.2 http_and_www_only
    ---

    Once I've stripped these lines (easy enough), I have a file that
    remains like this:

    ----
    And the URL is:
    Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
    Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
    Anyone still got the url of the pages about the woman who keeps going
    Are available on: http://www.spete.net/ukrm/sedan06/index.html
    are July 6-8. The reason being "Power Big Meet",
    http://www.bigmeet.com/ ,
    Are you sure? http://www.usgpru.net/
    a scout around www.nslu2-linux.org - and perhaps there isn't any easier
    asked where the sinks were and if you could plug curling tongs into the
    ----

    The result I want is a list like the following:

    http://ukrm.net/faq/UKRMsCBT.html
    http://www.girlsbike2.com/
    http://www.spete.net/ukrm/sedan06/index.html
    http://www.bigmeet.com/
    http://www.usgpru.net/
    www.nslu2-linux.org

    Can anyone give me some clues or pointers to websites where I can go
    into this in more detail please?
    --
    Cab
    Cab, Jun 2, 2006
    #1

  2. Mirco Wahab

    Mirco Wahab Guest

    Thus spoke Cab (on 2006-06-02 15:57):

    > I'm trying to set up a script to strip out URLs from the body of a
    > Usenet post.
    > The result I want is a list like the following:
    >
    > http://ukrm.net/faq/UKRMsCBT.html
    > http://www.girlsbike2.com/
    > http://www.spete.net/ukrm/sedan06/index.html
    > http://www.bigmeet.com/
    > http://www.usgpru.net/
    > www.nslu2-linux.org


    The following prints all links
    (starting with http or www) from the input:

    use:
    $> perl dumplinks.pl < text.txt

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $data = do {local $/; <> };
    print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

    # or:
    # while (<>) {
    # print "$1\n" while /(\b(http|www)\S+)/g;
    # }


    Of course, this can be done with a one-liner ;-)

    Regards

    Mirco
    Mirco Wahab, Jun 2, 2006
    #2

  3. Dr.Ruud

    Dr.Ruud Guest

    Cab wrote:

    > Subject: Regexp help.


    Please go and read the Posting Guidelines.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Jun 2, 2006
    #3
  4. Paul Lalli

    Paul Lalli Guest

    Cab wrote:
    > I'm trying to set up a script to strip out URLs from the body of a
    > Usenet post.


    <snip bash script>

    > Can anyone give me some clues or pointers to websites where I can go
    > into this in more detail please?


    open the original file for reading
    open two files for writing - one for the modified file, one for the
    list of URLs
    loop through each line of the original file
    Search for a URI, using Regexp::Common::URI. Replace it with nothing,
    and be sure to capture the URI.
    print the modified line to the modified file
    print the captured URI to the URI file.

    Documentation to help you in this goal:
    open a file: perldoc -f open
    Looping: perldoc perlsyn
    Reading a line from a file: perldoc -f readline
    Using search-and-replace: perldoc perlop, perldoc perlretut
    Regexp::Common::URI:
    http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm
    printing to a file: perldoc -f print

    Once you have made your *perl* attempt, if it doesn't work the way you
    want, feel free to post it here to seek assistance. In the mean time,
    be sure to read the posting guidelines for this group. They are posted
    here twice a week.

    Paul Lalli
    Paul Lalli, Jun 2, 2006
    #4
  5. Xicheng Jia

    Xicheng Jia Guest

    Cab wrote:
    > Hi all.
    >
    > I'm trying to set up a script to strip out URLs from the body of a
    > Usenet post.
    >
    > <snip bash script and sample data>
    >
    > The result I want is a list like the following:
    >
    > http://ukrm.net/faq/UKRMsCBT.html
    > http://www.girlsbike2.com/
    > http://www.spete.net/ukrm/sedan06/index.html
    > http://www.bigmeet.com/
    > http://www.usgpru.net/
    > www.nslu2-linux.org


    you can start from here:

    lynx -dump http://your_url | grep -o '\(http://\|www\.\)[^ ]*'

    then filter out any unwanted links.

    HTH,
    Xicheng
    Xicheng Jia, Jun 2, 2006
    #5
  6. Cab

    Cab Guest

    Mirco Wahab wrote:

    > Thus spoke Cab (on 2006-06-02 15:57):
    >
    > > I'm trying to set up a script to strip out URLs from the body of a
    > > Usenet post.
    > > The result I want is a list like the following:
    > >
    > > http://ukrm.net/faq/UKRMsCBT.html
    > > http://www.girlsbike2.com/
    > > http://www.spete.net/ukrm/sedan06/index.html
    > > http://www.bigmeet.com/
    > > http://www.usgpru.net/
    > > www.nslu2-linux.org

    >
    > The following prints all links
    > (starting w/http or www) from $text
    >
    > use:
    > $> perl dumplinks.pl < text.txt
    >
    > #!/usr/bin/perl
    > use strict;
    > use warnings;
    >
    > my $data = do {local $/; <> };
    > print "$1\n" while $data =~ /(\b(http|www)\S+)/g;
    >
    > # or:
    > # while (<>) {
    > # print "$1\n" while /(\b(http|www)\S+)/g;
    > # }
    >
    >
    > Of course, this can be done with a one-liner ;-)
    >
    > Regards
    >
    > Mirco


    Ta very much for that. Very helpful.

    --
    Cab
    Cab, Jun 2, 2006
    #6
  7. Cab

    Cab Guest

    Paul Lalli wrote:

    > Documentation to help you in this goal:
    > open a file: perldoc -f open
    > Looping: perldoc perlsyn
    > Reading a line from a file: perldoc -f readline
    > Using search-and-replace: perldoc perlop, perldoc perlretut
    > Regexp::Common::URI:
    > http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm
    > printing to a file: perldoc -f print

    Ah, that's handy. Thanks.

    --
    Cab
    Cab, Jun 2, 2006
    #7
  8. Dr.Ruud

    Dr.Ruud Guest

    Mirco Wahab wrote:

    > my $data = do {local $/; <> };
    > print "$1\n" while $data =~ /(\b(http|www)\S+)/g;


    { local ($", $\, $/) = ("\n", "\n", undef) ;
    print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
    }

    But read `perldoc -q URL`.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Jun 2, 2006
    #8
  9. John W. Krahn

    John W. Krahn Guest

    Dr.Ruud wrote:
    > Mirco Wahab schreef:
    >
    >> my $data = do {local $/; <> };
    >> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

    >
    > { local ($", $\, $/) = ("\n", "\n", undef) ;
    > print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
    > }



    { local ( $,, $\, $/ ) = ( "\n", "\n" );
    print <> =~ /\b(?:http:|www\.)\S+/g
    }



    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Jun 3, 2006
    #9
  10. Dr.Ruud

    Dr.Ruud Guest

    John W. Krahn wrote:
    > Dr.Ruud:
    >> Mirco Wahab:


    >>> my $data = do {local $/; <> };
    >>> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

    >>
    >> { local ($", $\, $/) = ("\n", "\n", undef) ;
    >> print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
    >> }

    >
    > { local ( $,, $\, $/ ) = ( "\n", "\n" );
    > print <> =~ /\b(?:http:|www\.)\S+/g
    > }


    Yes, that certainly is a cleaner variant. I did hesitate to put the
    C<undef> at the end of the right-side list, but decided it would be
    more educational. But then I was already trapped into using C<$">
    where C<$,> is cleaner.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
    Dr.Ruud, Jun 3, 2006
    #10
  11. John W. Krahn

    John W. Krahn Guest

    Dr.Ruud wrote:
    > John W. Krahn schreef:
    >>Dr.Ruud:
    >>>Mirco Wahab:

    >
    >>>> my $data = do {local $/; <> };
    >>>> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;
    >>>{ local ($", $\, $/) = ("\n", "\n", undef) ;
    >>> print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
    >>>}

    >>{ local ( $,, $\, $/ ) = ( "\n", "\n" );
    >> print <> =~ /\b(?:http:|www\.)\S+/g
    >>}

    >
    > Yes, that certainly is a cleaner variant. I did hesitate to put the
    > C<undef> at the end of the rightside list, but decided it would be more
    > educational. But then I was already trapped in using C<$"> where C<$,>
    > is cleaner.


    Thanks, and you could also do it like this:

    { local ( $\, $/ ) = "\n";
    print for <> =~ /\b(?:http:|www\.)\S+/g
    }


    :)

    John
    --
    use Perl;
    program
    fulfillment
    John W. Krahn, Jun 3, 2006
    #11
  12. Mumia W.

    Mumia W. Guest

    John W. Krahn wrote:
    > Dr.Ruud wrote:
    >> Mirco Wahab schreef:
    >>
    >>> my $data = do {local $/; <> };
    >>> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

    >> { local ($", $\, $/) = ("\n", "\n", undef) ;
    >> print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
    >> }

    >
    >
    > { local ( $,, $\, $/ ) = ( "\n", "\n" );
    > print <> =~ /\b(?:http:|www\.)\S+/g
    > }
    >
    >
    >
    > John


    Due to sentence structure, people like to put periods and commas at
    the end of their URLs, so I decided to strip them off. I'm sorry this
    is so long-winded compared to the others:

    use strict;
    use warnings;

    my $data = q{
    And the URL is:
    Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
    Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
    Anyone still got the url of the pages about the woman who keeps going
    Are available on: http://www.spete.net/ukrm/sedan06/index.html
    are July 6-8. The reason being "Power Big Meet",
    http://www.bigmeet.com/ www4.redhat.com,
    Get a better browser: ftp.mozilla.org.
    Are you sure? http://www.usgpru.net/
    a scout around www.nslu2-linux.org - and perhaps there isn't any easier
    asked where the sinks were and if you could plug curling tongs into the
    };

    local $_;
    open (FH, '<', \$data)
    or die("Couldn't open in-memory file: $!\n");

    my @urls =
    map { /^(.*?)[,.]?$/; }
    map { /\b(?:http|ftp|www\d*\.)\S+/g; } <FH>;
    print join "\n", @urls;

    close FH;
    Mumia W., Jun 3, 2006
    #12
  13. Xicheng Jia

    Xicheng Jia Guest

    Mumia W. wrote:
    > John W. Krahn wrote:
    > > Dr.Ruud wrote:
    > >> Mirco Wahab schreef:
    > >>
    > >>> my $data = do {local $/; <> };
    > >>> print "$1\n" while $data =~ /(\b(http|www)\S+)/g;
    > >> { local ($", $\, $/) = ("\n", "\n", undef) ;
    > >> print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
    > >> }

    > >
    > >
    > > { local ( $,, $\, $/ ) = ( "\n", "\n" );
    > > print <> =~ /\b(?:http:|www\.)\S+/g
    > > }
    > >
    > >
    > >
    > > John

    >
    > Due to sentence structure, people like to put periods and commas on the
    > end of their urls, so I decided to strip them off. I'm sorry this is so
    > longwinded compared to the others:
    >
    > use strict;
    > use warnings;
    >
    > my $data = q{
    > And the URL is:
    > Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
    > Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
    > Anyone still got the url of the pages about the woman who keeps going
    > Are available on: http://www.spete.net/ukrm/sedan06/index.html
    > are July 6-8. The reason being "Power Big Meet",
    > http://www.bigmeet.com/ www4.redhat.com,
    > Get a better browser: ftp.mozilla.org.
    > Are you sure? http://www.usgpru.net/
    > a scout around www.nslu2-linux.org - and perhaps there isn't any easier
    > asked where the sinks were and if you could plug curling tongs into the
    > };
    >
    > local $_;
    > open (FH, '<', \$data)
    > or die("Couldn't open in-memory file: $!\n");
    >
    > my @urls =
    > map { /^(.*?)[,.]?$/; }
    > map { /\b(?:http|ftp|www\d*\.)\S+/g; } <FH>;
    > print join "\n", @urls;
    >
    > close FH;


    What if I add one line at the end of your data, say $data .= "\nI like
    ftpd httpd www...."? I guess the result is not what you wanted.

    Xicheng
    Xicheng Jia, Jun 3, 2006
    #13
  14. Mumia W.

    Mumia W. Guest

    Xicheng Jia wrote:
    >
    > what if I add one line at the end of your data, say: $data .= "\nI like
    > ftpd httpd www....". I guess the result is not what you wanted..
    >
    > Xicheng
    >


    Right, writing a RE that does a complete job of separating URLs from
    text is not trivial. Tom Christiansen wrote one in FMTEYEWTK, and it's
    a lot more than two lines :)

    But the OP's requirements have been more than fulfilled.
    Mumia W., Jun 3, 2006
    #14
