Need regexp to rejoin URL links broken by \n

Tony

Hi regular expression experts.

Can someone help me with a regular expression that removes \n's from
the middle of URL's?

I have an email inside a variable $A like so:

Hello, this is an
email which has
been formatted to
fit a narrow
column. Here is a
URL: http://test.
com/hello?test=op
tion1&test2=optio
n2. Thanks for
reading.

As you can see, the link has been wrapped into the column by a number
of \n's. Obviously, this means the link can't be clicked on.

I'd like to pass this through a regular expression that removes all the
\n's between http:// and the next dot followed by a space (that is:
'. '), so that the result looks like this:

Hello, this is an
email which has
been formatted to
fit a narrow
column. Here is a
URL: http://test.com/hello?test=option1&test2=option2.
Thanks for
reading.

Any ideas?

Thank you!

(And are there any tools to help construct and test regular
expressions?)
 
Greg Bacon

: Can someone help me with a regular expression that removes \n's from
: the middle of URL's?
: [...]

My first thought was to suggest stripping all runs of whitespace and
feeding the result to URI::Find, but then I realized that you're
trying to reformat the message for human consumption.

Below is a cut at it:

$ cat try
#! /usr/local/bin/perl

use warnings;
use strict;

chomp(my $A = <<EOMessage);
Hello, this is an
email which has
been formatted to
fit a narrow
column. Here is a
URL: http://test.
com/hello?test=op
tion1&test2=optio
n2. Thanks for
reading.
EOMessage

$A =~ s!(http://.+?\.) !(my $url = $1) =~ tr/\n//d; "$url\n"!se;

print $A, "\n";

$ ./try
Hello, this is an
email which has
been formatted to
fit a narrow
column. Here is a
URL: http://test.com/hello?test=option1&test2=option2.
Thanks for
reading.

Using /\. / as a terminator strikes me as being *very* brittle, but
that only shows the truth of mjd's words: "Of course, this is a
heuristic, which is a fancy way of saying that it doesn't work."

Hope this helps,
Greg
 
Tad McClellan

Tony said:
: I have an email inside a variable $A like so:
:
: Hello, this is an
: email which has
: been formatted to
: fit a narrow
: column. Here is a
: URL: http://test.
: com/hello?test=op
: tion1&test2=optio
: n2. Thanks for
: reading.
: [...]
: removes all the
: \n's between http:// and the next dot followed by a space
: [...]
: Hello, this is an
: email which has
: been formatted to
: fit a narrow
: column. Here is a
: URL: http://test.com/hello?test=option1&test2=option2.
: Thanks for
: reading.


$A =~ s{(http://.*?)\. }
{my $s=$1; $s=~tr/\n//d; "$s.\n"}egsi;
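
The one-liner above is dense, so here is a sketch of the same substitution wrapped in a subroutine, with the modifiers spelled out. The name rejoin_urls is made up for illustration; the pattern and replacement are unchanged, and it still relies on '. ' marking the end of the URL.

```perl
use strict;
use warnings;

# Hypothetical helper (name not from the thread): rejoin URLs that were
# wrapped across lines, treating '. ' as the end-of-URL marker.
#   /e  evaluate the replacement as Perl code
#   /g  handle every URL in the text, not just the first
#   /s  let . match newlines, so the match can span wrapped lines
#   /i  also accept HTTP:// in upper case
sub rejoin_urls {
    my ($text) = @_;
    $text =~ s{(http://.*?)\. }{
        my $url = $1;
        $url =~ tr/\n//d;
        "$url.\n"
    }egsi;
    return $text;
}

my $wrapped = "URL: http://test.\ncom/hello?test=op\ntion1&test2=optio\nn2. Thanks for\nreading.";
print rejoin_urls($wrapped), "\n";
```

The tr/\n//d runs only on the captured text, so line breaks elsewhere in the message are left alone.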
 
Martien Verbruggen

: Can someone help me with a regular expression that removes \n's from
: the middle of URL's?
: [snip]
:
: I'd like to pass this through a regular expression that removes all the
: \n's between http:// and the next dot followed by a space (that is:
: '. ')

While Tad's solution gives you that, it won't actually solve your
problem. In text like your example, a URL can end at a line break, so
the full stop that ends it is followed by a newline rather than a space:

Here is another
URL: http://test.
com/hello?test=op
tion1&test2=optio
n2&test3=option3.
Thanks for reading.

Looking for ( |\n) following a full stop also won't work, as the first
full stop in that URL would then signify the end of the URL. I can't
really think of a RE that would work in the generic case. You'd probably
have to build something that also checks that the rejoined URL is valid
to get closer.

Martien
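
To make that failure concrete, here is a small sketch that swaps the '. ' terminator for "full stop followed by a space or a newline" and runs it on the sample above. The subroutine name rejoin_urls_ws is hypothetical; the substitution is otherwise the one posted earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical variant (not from the thread): accept a space OR a newline
# after the full stop as the end-of-URL marker.
sub rejoin_urls_ws {
    my ($text) = @_;
    $text =~ s{(http://.*?)\.[ \n]}{
        my $url = $1;
        $url =~ tr/\n//d;
        "$url.\n"
    }egsi;
    return $text;
}

my $sample = <<'EOMessage';
Here is another
URL: http://test.
com/hello?test=op
tion1&test2=optio
n2&test3=option3.
Thanks for reading.
EOMessage

# The match stops at "http://test." on the second line, so nothing is
# rejoined and the text comes back unchanged.
print rejoin_urls_ws($sample);
```

The non-greedy match ends at the first full stop that is followed by whitespace, which here is the one inside the host name, so the URL stays broken.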
 
