Regexp help.

Cab

Hi all.

I'm trying to set up a script to strip out URLs from the body of a
Usenet post.

Any clues please? I have some expressions that I'm using, but they're
very long-winded and inefficient, as seen below. At the moment I've
done this in bash, but I eventually want to set up a Perl script to do
this.

So far I've got this small script that pulls lines beginning with a
URL into a separate file. This is the easy part. (Note: I know this is
messy, but it's still a dev script at the moment.)

---
echo remove spaces from the start of lines
sed 's/^ *//g' sorted_file > 1

echo "Delete all quoted lines (those containing '>')"
sed '/>/d' 1 > 2

echo uniq the file
uniq 2 > 3


echo Move all lines beginning with http or www into another file
sed -n '/^http/p' 3 > 4
sed -n '/^www/p' 3 >> 4

echo "Remove all junk on lines from the first space to EOL"
sed 's/ .*$//' 4 > 4.1

echo uniq the file
uniq 4.1 > 4.2

echo "So far, I've got a file with http and www lines only."
mv 4.2 http_and_www_only
---
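(As an aside, most of the pipeline above can usually be collapsed into a single grep call. A sketch, assuming GNU grep for -oE; the sample input here is made up and stands in for sorted_file:)

```shell
# Extract http/https/www tokens anywhere on a line, one match per line,
# then de-duplicate. In the script above the input would be 'sorted_file'.
printf '%s\n' \
  'Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html' \
  'a scout around www.nslu2-linux.org - and perhaps' |
grep -oE '(https?://|www\.)[^[:space:]>]+' | sort -u
```

This prints http://ukrm.net/faq/UKRMsCBT.html and www.nslu2-linux.org, one per line.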

Once I've stripped these lines (easy enough), I have a file that
remains like this:

----
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
Anyone still got the url of the pages about the woman who keeps going
Are available on: http://www.spete.net/ukrm/sedan06/index.html
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ ,
Are you sure? http://www.usgpru.net/
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
asked where the sinks were and if you could plug curling tongs into the
----

The result I want is a list like the following:

http://ukrm.net/faq/UKRMsCBT.html
http://www.girlsbike2.com/
http://www.spete.net/ukrm/sedan06/index.html
http://www.bigmeet.com/
http://www.usgpru.net/
www.nslu2-linux.org

Can anyone give me some clues or pointers to websites where I can go
into this in more detail please?
 

Mirco Wahab

Thus spoke Cab (on 2006-06-02 15:57):
I'm trying to set up a script to strip out URLs from the body of a
Usenet post.
The result I want is a list like the following:

http://ukrm.net/faq/UKRMsCBT.html
http://www.girlsbike2.com/
http://www.spete.net/ukrm/sedan06/index.html
http://www.bigmeet.com/
http://www.usgpru.net/
www.nslu2-linux.org

The following prints all links
(starting w/http or www) from $text

use:
$> perl dumplinks.pl < text.txt

#!/usr/bin/perl
use strict;
use warnings;

my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

# or:
# while (<>) {
#     print "$1\n" while /(\b(http|www)\S+)/g;
# }


Of course, this can be done with a one-liner ;-)

Regards

Mirco
 

Paul Lalli

Cab said:
I'm trying to set up a script to strip out URLs from the body of a
Usenet post.

Can anyone give me some clues or pointers to websites where I can go
into this in more detail please?

- open the original file for reading
- open two files for writing: one for the modified text, one for the
  list of URLs
- loop through each line of the original file:
  - search for a URI, using Regexp::Common::URI; replace it with
    nothing, and be sure to capture the URI
  - print the modified line to the modified file
  - print the captured URI to the URI file

Documentation to help you in this goal:
open a file: perldoc -f open
Looping: perldoc perlsyn
Reading a line from a file: perldoc -f readline
Using search-and-replace: perldoc perlop, perldoc perlretut
Regexp::Common::URI:
http://search.cpan.org/~abigail/Regexp-Common-2.120/lib/Regexp/Common/URI.pm
printing to a file: perldoc -f print

Once you have made your *perl* attempt, if it doesn't work the way you
want, feel free to post it here to seek assistance. In the meantime,
be sure to read the posting guidelines for this group. They are posted
here twice a week.

Paul Lalli
 

Xicheng Jia

Cab said:
I'm trying to set up a script to strip out URLs from the body of a
Usenet post.

[snip]

You can start from here:

lynx -dump http://your_url | grep -o '\(http://\|www\.\)[^ ]*'

then filter out any unwanted links.

HTH,
Xicheng
 

Cab

Mirco said:
Thus spoke Cab (on 2006-06-02 15:57):


The following prints all links
(starting w/http or www) from $text

[snip]

Ta very much for that. Very helpful.
 

Dr.Ruud

Mirco Wahab schreef:
my $data = do {local $/; <> };
print "$1\n" while $data =~ /(\b(http|www)\S+)/g;

{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}

But read `perldoc -q URL`.
 

John W. Krahn

Dr.Ruud said:
Mirco Wahab schreef:
[snip]

{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}


{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http:|www\.)\S+/g
}



John
 

Dr.Ruud

John W. Krahn schreef:
Dr.Ruud:
[snip]

{ local ($", $\, $/) = ("\n", "\n", undef) ;
print "@{[ <> =~ /(\b(?:http:|www\.)\S+)/g ]}"
}

{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http:|www\.)\S+/g
}

Yes, that certainly is a cleaner variant. I hesitated to put the
C<undef> at the end of the right-side list, but decided it would be
more educational. But then I was already trapped into using C<$">
where C<$,> is cleaner.
 

John W. Krahn

Dr.Ruud said:
John W. Krahn schreef:
[snip]

Yes, that certainly is a cleaner variant. I hesitated to put the
C<undef> at the end of the right-side list, but decided it would be
more educational. But then I was already trapped into using C<$">
where C<$,> is cleaner.

Thanks, and you could also do it like this:

{ local ( $\, $/ ) = "\n";
print for <> =~ /\b(?:http:|www\.)\S+/g
}


:)

John
 

Mumia W.

John said:
[snip]


{ local ( $,, $\, $/ ) = ( "\n", "\n" );
print <> =~ /\b(?:http:|www\.)\S+/g
}



John

Due to sentence structure, people like to put periods and commas at
the end of their URLs, so I decided to strip them off. I'm sorry this
is so long-winded compared to the others:

use strict;
use warnings;

my $data = q{
And the URL is:
Anton, try reading: url:http://ukrm.net/faq/UKRMsCBT.html
Anyone got any experience with http://www.girlsbike2.com/ ? SWMBO needs
Anyone still got the url of the pages about the woman who keeps going
Are available on: http://www.spete.net/ukrm/sedan06/index.html
are July 6-8. The reason being "Power Big Meet",
http://www.bigmeet.com/ www4.redhat.com,
Get a better browser: ftp.mozilla.org.
Are you sure? http://www.usgpru.net/
a scout around www.nslu2-linux.org - and perhaps there isn't any easier
asked where the sinks were and if you could plug curling tongs into the
};

local $_;
open (FH, '<', \$data)
    or die("Couldn't open in-memory file: $!\n");

my @urls =
    map { /^(.*?)[,.]?$/ }
    map { /\b(?:http|ftp|www\d*\.)\S+/g } <FH>;
print join "\n", @urls;

close FH;
 

Xicheng Jia

Mumia said:
[snip]

my @urls =
map { /^(.*?)[,.]?$/; }
map { /\b(?:http|ftp|www\d*\.)\S+/g; } <FH>;
print join "\n", @urls;

close FH;

What if I add one line at the end of your data, say:
$data .= "\nI like ftpd httpd www...."? I guess the result is not what
you want.

Xicheng
 

Mumia W.

Xicheng said:
what if I add one line at the end of your data, say: $data .= "\nI like
ftpd httpd www....". I guess the result is not what you wanted..

Xicheng

Right, writing an RE that does a complete job of separating URLs from
text is not trivial. Tom Christiansen wrote one in FMTEYEWTK, and it's
a lot more than two lines :)

But the OP's requirements have been more than fulfilled.
 
