gsub pattern substitution and ${...}


Sarah Allen

I'm trying to escape a URI that is matched by a regular expression with
gsub.

In irb, here's my string:
s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"

Now I want to match href="..." or href='...' and then URI.escape the
characters within the quotes:
require 'URI'
=> true

First I tried this:
s.gsub(/href=(['"])([^']*)/, 'href=\1#{URL.escape($2)}\3')
=> "<a href='\#{URL.escape($2)}'/>"

Of course, that doesn't work since #{expr} will only eval the expression
within a double quoted string.

But when it is double quoted, like this:
s.gsub(/href=(['"])([^']*)/, "href=\1#{URI.escape($2)}\3")
=> "<a href=\001http://foo.com/one=%3Etwo\003'/>"

\1 doesn't evaluate to the first match anymore

I would guess there's some basic string or regex syntax that I'm missing
here. I've looked at the gsub and string documentation, and either I
missed it or I should be looking elsewhere.

Can someone give me a clue and help me move forward with my mother's day
hacking session?

Thanks in advance,
Sarah
 

7stud --

Sarah said:
Of course, that doesn't work since #{expr} will only eval the expression
within a double quoted string.

But when it is double quoted, like this:
s.gsub(/href=(['"])([^']*)/, "href=\1#{URI.escape($2)}\3")
=> "<a href=\001http://foo.com/one=%3Etwo\003'/>"

\1 doesn't evaluate to the first match anymore

I would guess there's some basic string or regex syntax that I'm missing
here.

In double quoted strings escaped characters do not have literal
meanings. For instance, in a double quoted string "\n" is not two
characters--it is one character that represents a newline. Double
quoted strings interpret all escaped characters, which means that \1
gets interpreted into something (but who knows what!).

On the other hand, with single quoted strings there are only a couple of
automatic substitutions that take place, and interpreting \1 is not one
of them. So with single quoted strings \1 means \1.

If you need to use double quoted strings, then you need to literally
have \1 in your string, which requires the use of additional backslashes
to escape the \ in "\1". So try "\\1". With ruby if one backslash is
not enough, keep adding more backslashes until whatever you are trying
to accomplish works!
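
The difference between the three spellings can be checked directly. This is a minimal sketch; the variable names are made up for illustration, and the pattern is cut down to a single capture group so that only the backslash handling is on show:

```ruby
single  = '\1'   # single quotes: a backslash followed by "1" (two characters)
escaped = "\\1"  # double quotes with an escaped backslash: the same two characters
octal   = "\1"   # double quotes, unescaped: one character (an octal escape)

puts single.length      # two characters
puts escaped == single  # the two spellings produce the same string
puts octal.length       # a single character, so gsub never sees a backreference

s = "<a href='http://foo.com/one=>two'/>"

# With the backslash preserved, gsub expands \1 to the captured quote character.
result = s.gsub(/href=(['"])/, "HREF=\\1")
puts result
```

Only the first two forms reach gsub as a backreference; the third has already been turned into a control character by the time gsub is called.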
 

Sarah Allen

7stud said:
If you need to use double quoted strings, then you need to literally
have \1 in your string, which requires the use of additional backslashes
to escape the \ in "\1". So try "\\1". With ruby if one backslash is
not enough, keep adding more backslashes until whatever you are trying
to accomplish works!
Eureka!
s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
=> "<a href='http://foo.com/one=>two'/>"

Thanks so much for your help.

Sarah
 

Robert Klemme

7stud said:
If you need to use double quoted strings, then you need to literally
have \1 in your string, which requires the use of additional backslashes
to escape the \ in "\1". So try "\\1". With ruby if one backslash is
not enough, keep adding more backslashes until whatever you are trying
to accomplish works!
Eureka!
s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
=> "<a href='http://foo.com/one=>two'/>"

Thanks so much for your help.

That does not work as 7stud did not mention the most important point:
even with proper escaping this won't work as the string interpolation
takes place *before* gsub is invoked and hence URI.escape will insert
something but not the matched portion. In your tests it has probably
worked because $2 was properly set from the previous match.

In this case the block form of gsub is needed:

irb(main):007:0> s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"
irb(main):008:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> "<a href='http://foo.com/one=>two'/>"

irb(main):009:0> s = "<a href=\"http://foo.com/one=>two\"/>"
=> "<a href=\"http://foo.com/one=>two\"/>"
irb(main):010:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> "<a href=\"http://foo.com/one=>two\"/>"

And if quotes differ with my regexp no replacement takes place:

irb(main):011:0> s = "<a href=\"http://foo.com/one=>two'/>"
=> "<a href=\"http://foo.com/one=>two'/>"
irb(main):012:0> s.gsub(/href=(["'])([^'"]+)\1/) {
"href=#$1#{URI.escape($2)}#$1" }
=> "<a href=\"http://foo.com/one=>two'/>"

Whether you want this behaviour is up to you, but AFAIK mixing quote
types is not allowed here, so we probably do not want to do the
replacement in that case.
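
Outside irb, the same block form can be wrapped in a small script. This is only a sketch under a few assumptions: URI.escape was removed in Ruby 3.0, so `escape_uri` below is a hypothetical stand-in that percent-encodes characters outside a safe set modelled on the old behaviour, and `escape_hrefs` and `HREF_RE` are names invented for this example.

```ruby
# Hypothetical stand-in for the thread's URI.escape (removed in Ruby 3.0):
# percent-encode every byte of any character outside the "safe" set.
def escape_uri(str)
  str.gsub(/[^\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]/) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

# Same pattern as in the irb session: the closing quote must equal the opening one.
HREF_RE = /href=(["'])([^'"]+)\1/

def escape_hrefs(html)
  html.gsub(HREF_RE) do
    # Copy the captures into locals before calling anything else.
    quote = $1
    uri = $2
    "href=#{quote}#{escape_uri(uri)}#{quote}"
  end
end

puts escape_hrefs("<a href='http://foo.com/one=>two'/>")
puts escape_hrefs("<a href=\"http://foo.com/one=>two\"/>")
puts escape_hrefs("<a href=\"http://foo.com/one=>two'/>") # mixed quotes: no match
```

Copying `$1` and `$2` into locals is a defensive habit rather than a requirement here; it keeps the block readable if more regexp work is added later.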

Note that my regexp has another weakness: the quote character not used
to quote the URI should be allowed as part of the URI. I did not want
to complicate things too much but if you want to deal with this the
regular expression must be made a bit more complex.

Kind regards

robert
 

Sebastian Hungerecker

On Monday, 11 May 2009 02:13:52, 7stud -- wrote:
Double
quoted strings interpret all escaped characters, which means that \1
gets interpreted into something( but who knows what!).

The character with ASCII value 1.
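
A quick way to confirm this, as a small sketch (the variable name is only for illustration):

```ruby
ctrl = "\1"  # octal escape inside a double-quoted string

p ctrl.bytes        # the single byte that makes up the string
p ctrl.ord          # its integer value
p ctrl == "\001"    # the three-digit octal spelling is the same character
```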
 

Sarah Allen

Robert said:
even with proper escaping this won't work as the string interpolation
takes place *before* gsub is invoked and hence URI.escape will insert
something but not the matched portion. In your tests it has probably
worked because $2 was properly set from the previous match.

really? so the #{...} gets evaluated before the param is passed to gsub,
but the block is passed as code, so then it is evaluated after.

Using the previous attempt, with a fresh irb session, I can see the
issue:
$1 => nil
$2 => nil
$3 => nil
require 'URI' => true
s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
NoMethodError: private method `gsub' called for nil:NilClass
from
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:289:in
`escape'
from (irb):8
$1 => nil
$2 => nil
$3 => nil
s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"
require 'URI' => true
s.gsub(/href=(["'])([^'"]+)\1/) {
?> "href=#$1#{URI.escape($2)}#$1" }
=> "<a href='http://foo.com/one=>two'/>"

Nice!
Note that my regexp has another weakness: the quote character not used
to quote the URI should be allowed as part of the URI. I did not want
to complicate things too much but if you want to deal with this the
regular expression must be made a bit more complex.

Wow, interesting. That would be incorrect HTML that the browser doesn't
deal with well, so I'll not worry about it for this case, but I would be
curious how it might be handled.

Thanks so much,
Sarah
 

Robert Klemme

2009/5/11 Sarah Allen said:
really? so the #{...} gets evaluated before the param is passed to gsub,

All method parameters are evaluated before method invocation - this is
true for every method invocation in Ruby.
but the block is passed as code, so then it is evaluated after.

In the case of gsub the block is invoked once for each match.
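
Both points can be seen side by side in a small sketch. It uses `upcase` instead of URI.escape to stay self-contained, and the input string with two links, the "unrelated" match, and the `calls` counter are all made up for the demonstration:

```ruby
s  = %q{<a href='one'/> <a href="two"/>}
re = /href=(["'])([^'"]+)\1/

# An earlier, unrelated match leaves $2 == "related".
"unrelated" =~ /(un)(related)/

# Argument form: the replacement string is built once, before gsub runs,
# so #{$2.upcase} sees the stale $2 from the match above.
arg_form = s.gsub(re, "href=\\1#{$2.upcase}\\1")

# Block form: the block runs once per match, after $1 and $2 have been set
# for that particular match.
calls = 0
block_form = s.gsub(re) do
  calls += 1
  "href=#{$1}#{$2.upcase}#{$1}"
end

puts arg_form    # both links get the stale value
puts block_form  # each link gets its own matched text
puts calls       # number of block invocations
```
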
Using the previous attempt, with a fresh irb session, I can see the
issue:
$1 => nil
$2 => nil
$3 => nil
require 'URI' => true
s.gsub(/href=(['"])([^']*)/, "href=\\1#{URI.escape($2)}\\3")
NoMethodError: private method `gsub' called for nil:NilClass
 from
/System/Library/Frameworks/Ruby.framework/Versions/1.8/usr/lib/ruby/1.8/uri/common.rb:289:in
`escape'
 from (irb):8
$1 => nil
$2 => nil
$3 => nil
s = "<a href='http://foo.com/one=>two'/>"
=> "<a href='http://foo.com/one=>two'/>"
require 'URI' => true
s.gsub(/href=(["'])([^'"]+)\1/) {
?> "href=#$1#{URI.escape($2)}#$1" }
=> "<a href='http://foo.com/one=>two'/>"

Nice!
:)
Note that my regexp has another weakness: the quote character not used
to quote the URI should be allowed as part of the URI. I did not want
to complicate things too much but if you want to deal with this the
regular expression must be made a bit more complex.

Wow, interesting. That would be incorrect HTML that the browser doesn't
deal with well, so I'll not worry about it for this case, but I would be
curious how it might be handled.

Basically you need an alternative and more capturing groups along the
lines of

'([^']+)'|"([^"]+)"
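
One way this could be wired into gsub, as a rough sketch: each alternative gets its own capture group, and the block checks which one matched. `escape_uri` is a hypothetical stand-in for URI.escape (which was removed in Ruby 3.0), and `HREF` and `escape_hrefs` are names invented for the example.

```ruby
# Hypothetical stand-in for URI.escape: percent-encode characters outside
# a "safe" set modelled on the old behaviour.
def escape_uri(str)
  str.gsub(/[^\-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]/) do |ch|
    ch.bytes.map { |b| format('%%%02X', b) }.join
  end
end

# Group 1 is set for single-quoted values, group 2 for double-quoted ones.
HREF = /href=(?:'([^']+)'|"([^"]+)")/

def escape_hrefs(html)
  html.gsub(HREF) do
    m = Regexp.last_match
    if m[1]
      "href='#{escape_uri(m[1])}'"
    else
      %(href="#{escape_uri(m[2])}")
    end
  end
end

# The quote character not used as the delimiter may now appear in the URI.
puts escape_hrefs(%q{<a href='http://foo.com/it"s=>x'/>})
puts escape_hrefs(%q{<a href="http://foo.com/it's=>x"/>})
```
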
Thanks so much,

You're welcome.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
 

Sarah Allen

Robert said:
All method parameters are evaluated before method invocation - this is
true for every method invocation in Ruby.


In the case of gsub the block is invoked once for each match.

These are really important details to understand. Thanks for pointing
them out.
Basically you need an alternative and more capturing groups along the
lines of

'([^']+)'|"([^"]+)"

Ah, of course. I knew that, but didn't put it together.

I am so appreciative of the folks on this list.

Thank you 7stud, Sebastian & Robert!

Sarah
 
