Regular expressions (extracting urls)

David Krmpotic · Feb 5, 2007

Hi!

I have to extract a URL from some text and turn it into a link (a href...).
The trick is that I have to be careful not to replace URLs that are
already part of a link.

so:

link = "Go here: http://www.something.com!"
link.gsub!(/https?:\/\/[a-z0-9\.\-\_=&\+\/\?]+/i, '<a href=\'\0\'>\0</a>')

link becomes:
link = "Go here: <a href='http://www.something.com'>http://www.something.com</a>!"

Now...

When the link is this:
link = "Go here: <a href='http://www.something.com'>http://www.something.com</a>!"

The regular expression must not replace it!

I know that for example if I want to exclude the links that start with
xhttp, I can write:
link.gsub!(/([^x]https?:\/\/[a-z0-9\.\-\_=&\+\/\?]+)/i, '<a href=\'\1\'>\1</a>')

But how can I exclude links that start with href=" or href=' ?
The problem is that I don't know how to specify that href must not
precede the link. I cannot write [^href], and [^(href)] doesn't seem to
work either; it also messes up \n.

Please help!

David
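
A note on why [^href] does not help: square brackets form a character class, so [^href] matches any single character other than h, r, e or f, not "anything except the word href". It also matches a newline, which is probably what interfered with \n. A small Ruby sketch of this behaviour (the variable names are made up for illustration):

```ruby
# [^href] is a negated character class: exactly one character that is
# not h, r, e or f. It knows nothing about the word "href".
negated = /[^href]/

x_matches       = !(negated =~ "x").nil?   # true:  "x" is not in the set
h_matches       = !(negated =~ "h").nil?   # false: "h" is in the set
newline_matches = !(negated =~ "\n").nil?  # true:  a newline is consumed too

# Parentheses do not group inside brackets; [^(href)] only adds "(" and ")"
# to the excluded set.
paren_matches = !(/[^(href)]/ =~ "(").nil?  # false

# The class therefore does not protect a URL that follows href=' because
# the character right before "http" is the quote, which the class accepts.
still_matches = !("href='http://www.something.com" =~ /[^href]https?:\/\//).nil?  # true
```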

Robert Klemme · Feb 5, 2007

Maybe it's enough to do

s.gsub( /([^"'])\b(http[^\s"']+)\b([^"'])/, '\\1<a href="\\2">\\2</a>\\3' )

That depends on your input text. This piece has some weaknesses, e.g. it
won't substitute URLs at the very beginning or end of the string, because
the pattern requires one character on each side (you could pad the string
with whitespace).

Kind regards

robert
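
The padding idea mentioned above could look roughly like this in Ruby. This is only a sketch: the constant and method names are invented for illustration, and the pattern is the one from the post, unchanged, so its other weaknesses remain.

```ruby
# Pattern and replacement taken from the post above.
PATTERN     = /([^"'])\b(http[^\s"']+)\b([^"'])/
REPLACEMENT = '\\1<a href="\\2">\\2</a>\\3'

# Hypothetical helper: pad with one space on each side so that a URL at the
# start or end of the string has a neighbouring character to match against,
# then cut the two padding characters off again.
def link_with_padding(s)
  padded = " #{s} "
  padded.gsub(PATTERN, REPLACEMENT)[1..-2]
end

bare   = "http://www.something.com"
direct = bare.gsub(PATTERN, REPLACEMENT)  # unchanged: nothing before or after the URL
padded = link_with_padding(bare)          # the URL is wrapped in a link
```

Slicing with [1..-2] rather than calling strip keeps any whitespace that was already part of the input.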

David Krmpotic · Feb 7, 2007

Thank you, but yes, the problem is that it has to detect links at the
beginning and the end of the string. It also has to check for href
specifically, because a quotation mark can appear before a URL even when
the URL is not preceded by href.

Is there a better solution? Thank you.

gga · Feb 7, 2007

You can exclude full words if you write regexes such as:

/(?!href=['"])/

Be careful about greediness, though. If you are doing any web
scraping, you should also look into something like WWW::Mechanize
instead of reinventing the wheel.

Robert Klemme · Feb 7, 2007

I'm afraid /(?!href=['"])/ won't work: that is a negative lookahead, but
what you really need here is a negative look*behind*. That is only
possible with the new regex engine in Ruby 1.9.

Kind regards

robert
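
For readers on Ruby 1.9 or later, a negative lookbehind version might look like the first method below. It is a sketch, not a complete autolinker: the method names are hypothetical, the URL character class is the one from the original post, and a second lookbehind for ">" is an added assumption so that the visible text of an existing link is skipped as well. The second method shows an alternative that needs no lookbehind: existing <a>...</a> elements are matched by the same pattern and returned untouched from the block.

```ruby
# Variant 1: negative lookbehind (needs the Ruby 1.9+ regex engine).
# A URL is skipped when it directly follows href=' or href=" or a ">".
def autolink_lookbehind(text)
  pattern = /(?<!href=['"])(?<!>)https?:\/\/[a-z0-9.\-_=&+\/?]+/i
  text.gsub(pattern) { |url| "<a href='#{url}'>#{url}</a>" }
end

# Variant 2: no lookbehind. The first alternative swallows a whole existing
# anchor element, which the block hands back unchanged; only bare URLs
# (which never start with "<") are wrapped.
def autolink_skip_anchors(text)
  pattern = /<a\b[^>]*>.*?<\/a>|https?:\/\/[a-z0-9.\-_=&+\/?]+/im
  text.gsub(pattern) { |m| m.start_with?("<") ? m : "<a href='#{m}'>#{m}</a>" }
end

plain  = "Go here: http://www.something.com!"
linked = "Go here: <a href='http://www.something.com'>http://www.something.com</a>!"

autolink_lookbehind(plain)     # bare URL becomes a link
autolink_lookbehind(linked)    # already linked, left alone
autolink_skip_anchors(linked)  # already linked, left alone
```

Both variants are heuristics. The lookbehind only inspects the characters immediately before "http", so a URL nested inside another URL's query string, or link text that merely contains a URL after other words, can still slip through variant 1; variant 2 copes with the latter but assumes anchors are not nested.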
