Regular expressions (extracting urls)


David Krmpotic

Hi!

I have to extract a URL from the text and turn it into a link (a href...).
The trick is that I have to be careful not to replace URLs that are
already part of a link.

so:

link = "Go here: http://www.something.com!"
link.gsub!(/https?:\/\/[a-z0-9\.\-\_=&\+\/\?]+/i, '<a href=\'\0\'>\0</a>')

link becomes:
link = "Go here: <a href='http://www.something.com'>http://www.something.com</a>!"

Now...

When the link is this:
link = "Go here: <a href='http://www.something.com'>http://www.something.com</a>!"

The regular expression must not replace it!

I know that for example if I want to exclude the links that start with
xhttp, I can write:
link.gsub!(/([^x]https?:\/\/[a-z0-9\.\-\_=&\+\/\?]+)/i, '<a href=\'\1\'>\1</a>')

but how can I exclude links that start with href=" and href=' ?
The problem is that I don't know how to specify that HREF cannot
precede the link (I cannot write [^href], and [^(href)] doesn't seem to work
either, and it also screws up \n ... )


Please help !

David
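
The reason [^href] cannot express "not preceded by href" is that a character class always matches exactly one character. A minimal sketch, using a hypothetical helper name (matches_not_href?) and a shortened version of the pattern from the question:

```ruby
# [^href] means "any single character other than h, r, e or f".
# It does not mean "anything except the word href".
NOT_HREF = /[^href]https?:\/\//

# Hypothetical helper for illustration: true when the pattern matches.
def matches_not_href?(text)
  !(text =~ NOT_HREF).nil?
end

# 'x' is outside the set, so the class matches it and the URL is found.
puts matches_not_href?("xhttp://www.something.com")
# The quote before the URL is outside the set too, so href='...' still matches.
puts matches_not_href?("<a href='http://www.something.com'>")
# 'f' is inside the set, so this is rejected although no "href" is present.
puts matches_not_href?("fhttp://www.something.com")
```

So the class both lets existing href attributes through and rejects unrelated text that merely ends in one of those four letters.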
 

Robert Klemme

Maybe it's enough to do

s.gsub( /([^"'])\b(http[^\s"']+)\b([^"'])/, '\\1<a href="\\2">\\2</a>\\3' )

That depends on your input text. This piece has some weaknesses, e.g.
it won't substitute URLs at the very beginning or end of the string (you
could pad the string with whitespace to work around that).

Kind regards

robert
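
The padding workaround mentioned above could look roughly like this. It is only a sketch: the constant and method names (URL_IN_TEXT, linkify_padded) are made up for illustration, and it is shown on plain text only, not on text that already contains links.

```ruby
# Robert's pattern: one non-quote character, the URL, one non-quote character.
URL_IN_TEXT = /([^"'])\b(http[^\s"']+)\b([^"'])/

# Hypothetical helper: pad with one space on each side so that a URL at the
# very start or end of the string has a neighbouring character to match,
# then drop the two padding characters again.
def linkify_padded(text)
  padded = " #{text} "
  linked = padded.gsub(URL_IN_TEXT, '\\1<a href="\\2">\\2</a>\\3')
  linked[1..-2]
end

puts linkify_padded("http://www.something.com")
puts linkify_padded("Go here: http://www.something.com!")
```

Because the pattern consumes one character on each side of the URL, the padding is what allows a URL at either edge of the string to match at all.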
 

David Krmpotic

Thank you.. but yes.. the problem is that it has to detect links at the
beginning and the end, and it also has to check for HREF, because
sometimes you can have quotation marks before the link even if it's
not preceded by href.

Is there a better solution? thank you
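
One approach that does not need lookbehind at all, sketched here as a possibility rather than as anything proposed in the thread: match either a complete existing anchor element or a bare URL, and only rewrite the bare URLs. The names ANCHOR_OR_URL and autolink are hypothetical, and matching HTML with a regex like this is a heuristic that assumes simple, well-formed <a>...</a> pairs.

```ruby
# First alternative: a whole existing <a ...>...</a> element.
# Second alternative: a bare URL, using the character set from the question.
ANCHOR_OR_URL = %r{<a\b[^>]*>.*?</a>|https?://[a-z0-9.\-_=&+/?]+}im

# Hypothetical helper: existing anchors are passed through unchanged,
# bare URLs are wrapped in a new anchor.
def autolink(text)
  text.gsub(ANCHOR_OR_URL) do |match|
    if match[0, 2].downcase == "<a"
      match                              # already a link, leave it alone
    else
      "<a href='#{match}'>#{match}</a>"  # bare URL, turn it into a link
    end
  end
end

puts autolink("Go here: http://www.something.com!")
puts autolink("Go here: <a href='http://www.something.com'>http://www.something.com</a>!")
```

Since the existing anchor is consumed as one match, neither the URL in its href attribute nor the URL in its link text is touched, and URLs at the start or end of the string need no padding.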

 

gga

but how can I exclude links that start with href=" and href=' ?

You can exclude full words if you write regexes such as:

/(?!href=['"])/

Be careful about greediness, though. If you are doing any web
scraping, you should also look into something like WWW::Mechanize
instead of reinventing the wheel.
 

Robert Klemme

but how can I exclude links that start with href=" and href=' ?

You can exclude full words if you write regexes such as:

/(?!href=['"])/

I'm afraid this won't work: this is negative lookahead, but what you
really need here is negative look*behind*. That's only possible with
the new regex engine in Ruby 1.9.

Kind regards

robert
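
To illustrate the difference Robert describes, here is a small sketch that assumes Ruby 1.9 or later (where lookbehind is available). The constant names are invented for the example and the URL pattern is the one from the original question.

```ruby
# Negative lookahead: checks the text that FOLLOWS the current position.
# At the start of the URL the following text is "http...", never "href=",
# so the assertion always succeeds and filters nothing out.
LOOKAHEAD  = /(?!href=['"])https?:\/\/[a-z0-9.\-_=&+\/?]+/i

# Negative lookbehind: checks the text that PRECEDES the current position.
LOOKBEHIND = /(?<!href=['"])https?:\/\/[a-z0-9.\-_=&+\/?]+/i

ATTRIBUTE = "href='http://www.something.com'"
LINKED    = "<a href='http://www.something.com'>http://www.something.com</a>"

p ATTRIBUTE =~ LOOKAHEAD    # the URL inside the attribute is still matched
p ATTRIBUTE =~ LOOKBEHIND   # the URL inside the attribute is skipped
p LINKED.scan(LOOKBEHIND)   # the link text after ">" is still found
```

Note that the lookbehind only protects the URL inside the href attribute. The same URL appearing as the link text is preceded by ">" and would still be replaced, so the lookbehind would need to cover that case as well, or the whole anchor has to be skipped some other way.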
 