Need regexp to rejoin URL links broken by \n

Discussion in 'Perl Misc' started by Tony, Jun 22, 2005.

  1. Tony

    Tony Guest

    Hi regular expression experts.

    Can someone help me with a regular expression that removes \n's from
    the middle of URL's?

    I have an email inside a variable $A like so:

    Hello, this is an
    email which has
    been formatted to
    fit a narrow
    column. Here is a
    URL: http://test.
    com/hello?test=op
    tion1&test2=optio
    n2. Thanks for
    reading.

    As you can see, the link has been wrapped into the column by a number
    of \n's. Obviously, this means the link can't be clicked on.

    I'd like to pass this through a regular expression that removes all the
    \n's between http:// and the next dot followed by a space (that is:
    '. '), so that the result looks like this:

    Hello, this is an
    email which has
    been formatted to
    fit a narrow
    column. Here is a
    URL: http://test.com/hello?test=option1&test2=option2.
    Thanks for
    reading.

    Any ideas?

    Thank you!

    (And are there any tools to help construct and test regular
    expressions?)
     
    Tony, Jun 22, 2005
    #1

  2. Greg Bacon

    Greg Bacon Guest

    In article <>,
    Tony <> wrote:

    : Can someone help me with a regular expression that removes \n's from
    : the middle of URL's?
    : [...]

    My first thought was to suggest stripping all runs of whitespace and
    feeding the result to URI::Find, but then I realized that you're
    trying to reformat the message for human consumption.

    Below is a cut at it:

    $ cat try
    #! /usr/local/bin/perl

    use warnings;
    use strict;

    chomp(my $A = <<EOMessage);
    Hello, this is an
    email which has
    been formatted to
    fit a narrow
    column. Here is a
    URL: http://test.
    com/hello?test=op
    tion1&test2=optio
    n2. Thanks for
    reading.
    EOMessage

    $A =~ s!(http://.+?\.) !($a=$1) =~ tr/\n//d; "$a\n"!se;

    print $A, "\n";

    $ ./try
    Hello, this is an
    email which has
    been formatted to
    fit a narrow
    column. Here is a
    URL: http://test.com/hello?test=option1&test2=option2.
    Thanks for
    reading.

    Using /\. / as a terminator strikes me as being *very* brittle, but
    that only shows the truth of mjd's words: "Of course, this is a
    heuristic, which is a fancy way of saying that it doesn't work."

    Hope this helps,
    Greg
    --
    It should be noted that government is never so zealous in suppressing
    crime as when that crime consists of direct injury to its own sources of
    revenue, as in tax evasion and counterfeiting of its currency.
    -- Murray Rothbard
     
    Greg Bacon, Jun 22, 2005
    #2

  3. Tony <> wrote:

    > I have an email inside a variable $A like so:
    >
    > Hello, this is an
    > email which has
    > been formatted to
    > fit a narrow
    > column. Here is a
    > URL: http://test.
    > com/hello?test=op
    > tion1&test2=optio
    > n2. Thanks for
    > reading.


    > removes all the
    > \n's between http:// and the next dot followed by a space


    > Hello, this is an
    > email which has
    > been formatted to
    > fit a narrow
    > column. Here is a
    > URL: http://test.com/hello?test=option1&test2=option2.
    > Thanks for
    > reading.



    $A =~ s{(http://.*?)\. }
    {my $s=$1; $s=~tr/\n//d; "$s.\n"}egsi;
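
    For anyone who wants to try this on the sample message, here is a
    minimal sketch of the same substitution spelled out with /x and
    comments. The join-based construction of $A and the added /x flag are
    assumptions made for illustration, not part of the one-liner above;
    the other flags are the same (/e runs the replacement as code, /g
    handles every URL, /s lets . match a newline, /i ignores case).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample message from the original post, rebuilt line by line.
my $A = join "\n",
    'Hello, this is an',
    'email which has',
    'been formatted to',
    'fit a narrow',
    'column. Here is a',
    'URL: http://test.',
    'com/hello?test=op',
    'tion1&test2=optio',
    'n2. Thanks for',
    'reading.';

$A =~ s{
    ( http:// .*? )   # capture from the scheme, as little as possible
    \. [ ]            # stop at the first full stop followed by a space
}{
    my $s = $1;       # copy the capture, since it is read-only
    $s =~ tr/\n//d;   # delete every newline inside the captured URL
    "$s.\n";          # put the full stop back and start a new line
}egsix;

print "$A\n";
```

    Run against the sample, this should print the message with the URL on
    a single line, followed by "Thanks for" on the next line.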


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Jun 22, 2005
    #3
  4. On 22 Jun 2005 02:15:15 -0700,
    Tony <> wrote:
    >
    > Can someone help me with a regular expression that removes \n's from
    > the middle of URL's?


    [snip]

    > I'd like to pass this through a regular expression that removes all the
    > \n's between http:// and the next dot followed by a space (that is:
    > '. ')


    While Tad's solution gives you that, it isn't going to solve your
    problem. In text wrapped like your example, a URL can also end at a
    line break, so its final full stop is followed by a newline rather
    than a space:

    Here is another
    URL: http://test.
    com/hello?test=op
    tion1&test2=optio
    n2&test3=option3.
    Thanks for reading.

    Looking for ( |\n) after a full stop won't work either, because the
    first full stop in that URL ("test." at the end of a line) would then
    be taken as the end of the URL. I can't really think of an RE that
    would work in the general case. To get closer, you'd probably have to
    build something that also checks whether the rejoined URL is valid.
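
    Both failure modes can be reproduced with a short script. This is only
    a sketch: rejoin() is a hypothetical helper that wraps a substitution
    of the kind posted earlier in the thread and takes the terminator
    characters as a parameter, and $whole is the URL we would like to get
    back.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The example above: the URL's final full stop is followed by a newline.
my $text = join "\n",
    'Here is another',
    'URL: http://test.',
    'com/hello?test=op',
    'tion1&test2=optio',
    'n2&test3=option3.',
    'Thanks for reading.';

# The URL we would like to see on one line after rejoining.
my $whole = 'http://test.com/hello?test=option1&test2=option2&test3=option3.';

# Hypothetical helper: strip newlines from "http://" up to the first full
# stop that is followed by one of the characters in the class body $term.
sub rejoin {
    my ($s, $term) = @_;
    $s =~ s{(http://.*?)\.([$term])}
           {my $u = $1; $u =~ tr/\n//d; "$u.\n"}egs;
    return $s;
}

my $space_only   = rejoin($text, ' ');     # terminator is '. '
my $space_or_eol = rejoin($text, ' \n');   # terminator is '. ' or ".\n"

for my $case ([ 'space only',     $space_only   ],
              [ 'space or newline', $space_or_eol ]) {
    my ($label, $result) = @$case;
    my $ok = index($result, $whole) >= 0;
    print "$label: ", ($ok ? 'rejoined' : 'still broken'), "\n";
}
```

    With a space as the only terminator nothing matches at all, and with
    ( |\n) the match stops at "http://test." on the first line, so in
    both cases the URL stays broken.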

    Martien
    --
    |
    Martien Verbruggen | Computers in the future may weigh no more
    | than 1.5 tons. -- Popular Mechanics, 1949
    |
     
    Martien Verbruggen, Jun 27, 2005
    #4
