problem with trivial regular expression

David Villa

Hi, sorry for my English.

I am trying to refresh my memory of regular expressions and I have a
problem with this:

I have a text containing different strings, for example URLs, like this:

fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja

I want to extract the individual URLs, but when I try
/(http\:\/\/.+com)/ it returns one long match:

1. http://www.marca.comjafosjodfahttp://www.as.com

How can I split this into 2 separate matches? For example:
1- http://www.marca.com
2- http://www.as.com

Thanks
 
Jesús Gabriel y Galán

What you are missing is the non-greedy modifier (?) for the +:

irb(main):001:0> s = " fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
=> " fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
irb(main):003:0> s.scan(/http\:\/\/.+?\.com/)
=> ["http://www.marca.com", "http://www.as.com"]

(I also added an extra \. before com, to match ".com" and not just
"com".) Then scan goes through the full string and returns every
match.
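To see the difference side by side, here is a minimal sketch using the string from the question. The variable names are only illustrative; the first pattern is the original greedy one and the second adds the ? and the escaped dot.

```ruby
s = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"

# Greedy: .+ runs to the end of the string and backtracks to the LAST "com".
greedy = s[/http:\/\/.+com/]

# Non-greedy: .+? stops at the FIRST ".com"; scan then continues after each match.
lazy = s.scan(/http:\/\/.+?\.com/)

puts greedy        # http://www.marca.comjafosjodfahttp://www.as.com
puts lazy.inspect  # ["http://www.marca.com", "http://www.as.com"]
```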

Hope this helps,

Jesus.
 
Benoit Daloze

irb> s = "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
=> "fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
irb> s.scan( %r{http://.+?\.com} )
=> ["http://www.marca.com", "http://www.as.com"]

I use scan because we want multiple results.

For the Regexp:
The ? in ".+?" makes the + non-greedy.
The %r{} lets you write / without escaping it.
I also added a "\." to ensure there is a dot before "com".

Enjoy
 
Benoit Daloze

Our posts are in close agreement :)
This should be the Ruby way, very clear this time!

2009/12/14 Jesús Gabriel y Galán said:

What you are missing is the non-greedy modifier (?) for the +:

irb(main):001:0> s = " fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
=> " fajljsfjaosfohttp://www.marca.comjafosjodfahttp://www.as.comjfoaasjofja"
irb(main):003:0> s.scan(/http\:\/\/.+?\.com/)
=> ["http://www.marca.com", "http://www.as.com"]

(I also added an extra \. before com, to match a ".com" and not "com"
only). Then, scan helps you going through the full string retrieving
matches.

Hope this helps,

Jesus.
 
David Masover


Any particular context? Or is it actually that random?
I want to extract the individual URLs, but when I try
/(http\:\/\/.+com)/ it returns one long match:

1. http://www.marca.comjafosjodfahttp://www.as.com

If you think about it, that is still a valid URL. You're trying to limit it
not to URLs, but only to http:// followed by a domain, and then only a domain
ending in .com -- there are MANY URLs that this will break on.

If you're OK with that, the basic problem is that .+ is going to match as much
as it possibly can (greedy), and . matches any character. The simple solution
is to make it match as few characters as it can (miserly). You do that by
putting a question mark after the + or *:

/(http\:\/\/.+?com)/

But again, that's not matching .com, that's matching anything ending in com.
For example, on this URL:

http://www.broadcom.com/

it will only capture http://www.broadcom. So there's an easy solution -- add
an escaped dot:

/(http\:\/\/.+?\.com)/
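A quick check of the two patterns against that URL, as a small sketch (the variable names are illustrative only):

```ruby
url = "http://www.broadcom.com/"

# Without the escaped dot, the lazy match ends at the first "com" it finds.
without_dot = url[/http:\/\/.+?com/]

# With \. the match has to end in a literal ".com".
with_dot = url[/http:\/\/.+?\.com/]

puts without_dot  # http://www.broadcom
puts with_dot     # http://www.broadcom.com
```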

That's as much as I want to do with it. I'm guessing what you're trying to do
is auto-linkify URLs in forum posts, or something like that -- some problem
that's been solved a million times before, and better, so you should look for
those solutions. But I won't assume that applies to you...

By the way, if you don't already know:

http://rubular.com/
 
Jesús Gabriel y Galán

Our posts are in close agreement :)
This should be the Ruby way, very clear this time!

Yep, but I forgot the %r. Escaping / is ugly :).
So, thanks for that !

Jesus.
 
Brian Candler

David said:
If you think about it, that is still a valid URL.

That's arguable, because of the colon. RFC 1738:

URL schemes that involve the direct use
of an IP-based protocol to a specified host on the Internet use a
common syntax for the scheme-specific data:

//<user>:<password>@<host>:<port>/<url-path>

...
port
The port number to connect to. Most schemes designate
protocols that have a default port number. Another port number
may optionally be supplied, in decimal, separated from the
host by a colon. If the port is omitted, the colon is as well.

However, it says "is" rather than "MUST BE".
 
David Villa

Thanks for all the answers.

I already used rubular.com for testing, but thanks anyway.

I posted a random string and I understand it better now, but the real
question and string are:

"\"18%7Chttp%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C34%7Chttp%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48%2C5%7Chttp%3A%2F%2Fv16.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D5%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D91FAC179A7C02DD662942CCEB71D8BE05BD7B5D3.BC4BFC97F8BA02519CC817D2541E5D0548BC2C7C%26factor%3D1.25%26id%3Dcf9829e68818de48\"

There are several URLs in here. One starts with http... and ends with %2C3:

http%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C3

and another, following it, which starts with http... and ends with %2C5:

http%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48%2C5

So when I match the first there is no problem, but when I try to match the
second, the match covers both the first and the second:

Match captures:

1.
http%3A%2F%2Fv14.lscache3.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D18%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D4C33C8CB5787DBA4D111A2E66BFFACCC2E95E0A7.D302F87BF11A4BD47631704E858C762325913921%26factor%3D1.25%26id%3Dcf9829e68818de48%2C34%7Chttp%3A%2F%2Fv7.lscache8.c.youtube.com%2Fvideoplayback%3Fip%3D0.0.0.0%26sparams%3Did%252Cexpire%252Cip%252Cipbits%252Citag%252Calgorithm%252Cburst%252Cfactor%26fexp%3D900034%252C902305%26algorithm%3Dthrottle-factor%26itag%3D34%26ipbits%3D0%26burst%3D40%26sver%3D3%26expire%3D1260813600%26key%3Dyt1%26signature%3D158633D23CDC66CC40D171EC8CB48A84B4BCB223.86F6779A36E14684B367BCCF91588556A871C421%26factor%3D1.25%26id%3Dcf9829e68818de48

How can I match only the second?
 
Brian Candler

David said:
I posted a random string and I understand it better now, but the real
question and string are:

But where does this string actually come from? It looks a bit like URLs
but with an extra layer of URL-encoding, in that = appears as %3D, for
example, and some extra numeric prefixes like 18|

So it would be much more helpful to know what the real structure of
this string is. Then you wouldn't need to guess how to decode it.

Removing the first level of escaping:
irb(main):007:0> CGI.unescape(s)
=> "\"18|http://v14.lscache3.c.youtube.com/v...5D0548BC2C7C&factor=1.25&id=cf9829e68818de48\""

So my total *guess* is that this is a double-quoted string, which
contains comma-separated fields, and each field is of the form nn|URL.
In which case you can unwrap it in stages:

irb(main):011:0> s.sub!(/\A"(.*)"\z/) { $1 }
irb(main):012:0> fields = CGI.unescape(s).split(',')
irb(main):013:0> fields.each { |f| num,url = f.split('|',2); puts "***",url }; nil
***
http://v14.lscache3.c.youtube.com/v...C762325913921&factor=1.25&id=cf9829e68818de48
***
http://v7.lscache8.c.youtube.com/vi...88556A871C421&factor=1.25&id=cf9829e68818de48
***
http://v16.lscache8.c.youtube.com/v...E5D0548BC2C7C&factor=1.25&id=cf9829e68818de48
=> nil

IMO it's far better to use the structure of the input to delimit the
data you're looking for, rather than guessing where the start and end of
each datum is based on what you expect the datum to look like.
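The same staged approach can be written as a small standalone script. This is only a sketch under the guess above (a double-quoted string of comma-separated nn|URL fields); the extract_urls helper and the shortened example.com sample string are made up for illustration and are not the real YouTube data.

```ruby
require 'cgi'

# Hypothetical helper: unwrap the string in stages, following its guessed structure.
def extract_urls(raw)
  # Stage 1: strip the surrounding double quotes, if present.
  inner = raw[/\A"(.*)"\z/m, 1] || raw
  # Stage 2: remove one level of URL-encoding, then split into nn|URL fields.
  CGI.unescape(inner).split(',').map do |field|
    # Stage 3: drop the numeric prefix and keep the URL.
    _num, url = field.split('|', 2)
    url
  end
end

# Shortened, made-up sample with the same shape as the real input.
raw = '"18%7Chttp%3A%2F%2Fv14.example.com%2Fvideoplayback%3Fitag%3D18%26sparams%3Did%252Cexpire' \
      '%2C34%7Chttp%3A%2F%2Fv7.example.com%2Fvideoplayback%3Fitag%3D34%26sparams%3Did%252Cexpire"'

urls = extract_urls(raw)
puts urls
```

Splitting on the comma after a single unescape is safe here because the commas inside each URL are double-encoded (%252C), so they are still %2C at that point.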
 
David Villa

Sorry, I forgot about this thread.

Thanks to everyone.

I think I have finally finished my script to download from YouTube and
extract the sound. Thanks again.


The script is very simple, but very useful to me, and it was a good way
to relearn the regular expressions I had forgotten.


www.dvillanueva.com/blog
 
