problem with trivial regular expression

David Villa · Dec 14, 2009

Hi, sorry for my English.

I am trying to refresh my memory of regular expressions and I have a
problem with this:

I have a text containing several strings, for example URLs, like this:

fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja

I want to extract the individual URLs, but when I try
/(http\:\/\/.+com)/ it returns one long match:

1. http://www.marca.comjafosjodfahttp://www.as.com

How can I split this into 2 separate matches? For example:
1- http://www.marca.com
2- http://www.as.com

Thanks

Jesús Gabriel y Galán · Dec 14, 2009


What you are missing is the non-greedy modifier (?) for the +:

irb(main):001:0> s = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
=> "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
irb(main):003:0> s.scan(/http\:\/\/.+?\.com/)
=> ["http://www.marca.com", "http://www.as.com"]

(I also added an extra \. before com, so that it matches ".com" and not
just "com".) Then, scan goes through the full string and returns every
match.
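
To make the difference concrete, here is a minimal sketch comparing the greedy and non-greedy quantifiers on the string from the question. The variable names are only for illustration, and the capture group from the original pattern is left out because scan returns nested arrays when the pattern contains groups.

```ruby
s = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"

# Greedy: .+ consumes as much as it can and backtracks to the LAST "com".
greedy = s.scan(/http:\/\/.+com/)

# Non-greedy: .+? stops at the FIRST ".com" it can reach,
# and scan then continues searching after that match.
lazy = s.scan(/http:\/\/.+?\.com/)

p greedy  # => ["http://www.marca.comjafosjodfahttp://www.as.com"]
p lazy    # => ["http://www.marca.com", "http://www.as.com"]
```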

Hope this helps,

Jesus.

Benoit Daloze · Dec 14, 2009

irb> s = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
=> "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
irb> s.scan( %r{http://.+?\.com} )
=> ["http://www.marca.com", "http://www.as.com"]

I use scan because we want multiple results.

For the Regexp:
in ".+?", the ? makes the + non-greedy
%r{} lets you write / without escaping it
I also added a "\." to ensure there is a dot before "com"
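
As a quick check that the two notations behave the same, here is a small sketch using the string from the question. The variable names are made up for the example.

```ruby
s = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"

# Same pattern written twice: once with /.../ (slashes escaped),
# once with %r{...} (slashes written as-is).
slash_form   = s.scan(/http:\/\/.+?\.com/)
percent_form = s.scan(%r{http://.+?\.com})

p slash_form == percent_form  # => true
p percent_form                # => ["http://www.marca.com", "http://www.as.com"]
```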

Enjoy

Benoit Daloze · Dec 14, 2009

Our posts pretty much agree.

This should be the Ruby way, very clear this time!

David Masover · Dec 14, 2009

fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja

Any particular context? Or is it actually that random?

I want to extract the individual URLs, but when I try
/(http\:\/\/.+com)/ it returns a long match:

1. http://www.marca.comjafosjodfahttp://www.as.com

If you think about it, that is still a valid URL. You're trying to limit it
not to URLs, but only to http:// followed by a domain, and then only a domain
ending in .com -- there are MANY URLs on which this will break.

If you're OK with that, the basic problem is that .+ is going to match as much
as it possibly can (greedy), and . matches any character. The simple solution
is to make it match as few characters as it can (miserly). You do that by
putting a question mark after the + or *:

/(http\:\/\/.+?com)/

But again, that's not matching .com, that's matching anything ending in com.
For example, on this URL:

http://www.broadcom.com/

it will only capture "http://www.broadcom". So there's an easy solution -- add
an escaped dot:

/(http\:\/\/.+?\.com)/
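
Here is a small sketch of that broadcom case, for anyone who wants to try it in irb. The variable names are only illustrative; String#[] with a Regexp returns the first match.

```ruby
url = "http://www.broadcom.com/"

# Without the escaped dot, the lazy match stops at the "com" inside "broadcom".
without_dot = url[/http:\/\/.+?com/]

# With \. the match has to reach a literal ".com".
with_dot = url[/http:\/\/.+?\.com/]

p without_dot  # => "http://www.broadcom"
p with_dot     # => "http://www.broadcom.com"
```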

That's as much as I want to do with it. I'm guessing what you're trying to do
is auto-linkify URLs in forum posts, or something like that -- some problem
that's been solved a million times before, and better, so you should look for
those solutions. But I won't assume that applies to you...

By the way, if you don't already know:

http://rubular.com/

Jesús Gabriel y Galán · Dec 14, 2009

Our posts pretty much agree.
This should be the Ruby way, very clear this time!

Yep, but I forgot the %r. Escaping / is ugly.
So, thanks for that!

Jesus.

Brian Candler · Dec 14, 2009

David said:
If you think about it, that is still a valid URL.

That's arguable, because of the colon. RFC 1738:

URL schemes that involve the direct use
of an IP-based protocol to a specified host on the Internet use a
common syntax for the scheme-specific data:

//<user>:<password>@<host>:<port>/<url-path>

...
port
The port number to connect to. Most schemes designate
protocols that have a default port number. Another port number
may optionally be supplied, in decimal, separated from the
host by a colon. If the port is omitted, the colon is as well.

However, it says "is" rather than "MUST BE".

David Villa · Dec 14, 2009

Thanks for all the answers.

I had already used rubular.com for testing, but thanks anyway.

I posted a random string and now I understand it better, but the real
question and string is:

"\"18%7Chttp%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C34%7Chttp%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48%2C5%7Chttp%3A%2F%2Fv16.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D5%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D91FAC179A7C02DD662942CCEB71D8BE05BD7B5D3.BC4BFC97F8BA02519CC817D2541E5D0548BC2C7C%26factor%3D1.25%26id%3Dcf9829e68818de48\"

Here there are several URLs. One starts with http... and ends with %2C3:

http%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C3

and another, following it, which starts with http... and ends with %2C5:

http%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48%2C5

So, when I match the first, no problem, but when I try to match the
second, it matches everything, the first and the second together:

Match captures:

1.
http%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C34%7Chttp%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48

How can I match only the second?

Brian Candler · Dec 14, 2009

David said:
I posted a random string and i understand it better, but the real
question and string is :

But where does this string actually come from? It looks a bit like URLs
but with an extra layer of URL-encoding, in that = appears as %3D, for
example, and some extra numeric prefixes like 18|
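
To illustrate what an extra layer of URL-encoding looks like, here is a minimal sketch. The sample string is a shortened, hypothetical stand-in (example.com and example.org are placeholders), not the real data from the post.

```ruby
require 'cgi'

# Hypothetical, shortened sample: %7C is "|", %2C is ",", and %252C is
# a comma that has been encoded twice (%25 is "%").
sample = "18%7Chttp%3A%2F%2Fexample.com%2Fv%3Fsparams%3Did%252Cexpire%2C34%7Chttp%3A%2F%2Fexample.org%2Fv"

once  = CGI.unescape(sample)  # removes one layer of encoding
twice = CGI.unescape(once)    # removes the second layer

puts once   # 18|http://example.com/v?sparams=id%2Cexpire,34|http://example.org/v
puts twice  # 18|http://example.com/v?sparams=id,expire,34|http://example.org/v
```

After one pass, the commas that separate the fields are real commas, while the commas inside each URL are still encoded as %2C, so splitting on ',' at that stage is unambiguous.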

So it would be much more helpful to understand what the real structure
of this string is, rather than just guessing, in which case you don't
need to guess about how to decode it.

Removing the first level of escaping:

irb(main):007:0> CGI.unescape(s)
=> "\"18|http://v14.lscache3.c.youtube.com/v...5D0548BC2C7C&factor=1.25&id=cf9829e68818de48\""

So my total *guess* is that this is a double-quoted string, which
contains comma-separated fields, and each field is of the form nn|URL.
In which case you can unwrap it in stages:

irb(main):011:0> s.sub!(/\A"(.*)"\z/) { $1 }
irb(main):012:0> fields = CGI.unescape(s).split(',')
irb(main):013:0> fields.each { |f| num,url = f.split('|',2); puts "***",url }; nil
***
http://v14.lscache3.c.youtube.com/v...C762325913921&factor=1.25&id=cf9829e68818de48
***
http://v7.lscache8.c.youtube.com/vi...88556A871C421&factor=1.25&id=cf9829e68818de48
***
http://v16.lscache8.c.youtube.com/v...E5D0548BC2C7C&factor=1.25&id=cf9829e68818de48
=> nil

IMO it's far better to use the structure of the input to delimit the
data you're looking for, rather than guessing where the start and end of
each datum is based on what you expect the datum to look like.
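
For reference, the staged unwrapping above could be put into a small script along these lines. This is only a sketch under the same guess about the structure (a double-quoted, comma-separated list of nn|URL fields); the input is a shortened, hypothetical stand-in with example.com hosts, and urls_by_prefix is a name made up for the example.

```ruby
require 'cgi'

# Hypothetical, shortened stand-in for the real string, quotes included.
raw = "\"18%7Chttp%3A%2F%2Fv14.example.com%2Fvideoplayback%3Fsparams%3Did%252Cexpire%26itag%3D18" \
      "%2C34%7Chttp%3A%2F%2Fv7.example.com%2Fvideoplayback%3Fsparams%3Did%252Cexpire%26itag%3D34\""

# Stage 1: strip the surrounding double quotes.
inner = raw.sub(/\A"(.*)"\z/m) { $1 }

# Stage 2: remove one layer of encoding, then split into fields on ','.
fields = CGI.unescape(inner).split(',')

# Stage 3: split each field into its numeric prefix and its URL.
urls_by_prefix = fields.map { |field| field.split('|', 2) }.to_h

puts urls_by_prefix["34"]
# http://v7.example.com/videoplayback?sparams=id%2Cexpire&itag=34
```

Looking a URL up by its prefix answers the "only the second" question without any regexp that has to guess where one URL ends and the next begins.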

David Villa · Dec 23, 2009

Sorry, I forgot about this thread.

Thanks to everyone.

I think I have finally finished my script to download from YouTube and
extract the sound. Thanks again.

The script is very simple, but very useful for me, and it was a good way
to relearn the regular expressions I had forgotten.

www.dvillanueva.com/blog
