Ruby Regex

Sriram Varahan · Jan 13, 2010

Hello,

I have a string as a = "&0&1"

I need to pass this value with the ampersand escaped to another command
in my program.

So I tried something like this:

irb(main):038:0> a.gsub(/&/,"\\&")
=> "&0&1"

But if I replace the & in the replacement string with some other
character, it works properly:

irb(main):049:0> a.gsub(/&/,"\\g")
=> "\\g0\\g1"

Can anyone explain why this is happening, and how to go about escaping
the &?

Thanks.
Sriram Varahan.

Brian Candler · Jan 13, 2010

Sriram said:
Hello,

I have a string as a = "&0&1"

I need to pass this value with the ampersand escaped to another command
in my program.

So I tried something like this:

irb(main):038:0> a.gsub(/&/,"\\&")
=> "&0&1"

That's because \& has a special meaning in a replacement string ("the
matched string"). Either use a block to provide the replacement value
(which doesn't do the backslash replacement), or put two backslashes in
the replacement string.

irb(main):002:0> a.gsub(/&/) { "\\&" }
=> "\\&0\\&1"
irb(main):003:0> a.gsub(/&/, "\\\\&")
=> "\\&0\\&1"
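
A minimal sketch of the same two approaches, printed with puts instead of irb's inspect output so the actual characters are visible. The variable names with_block and with_string are only illustrative.

```ruby
a = "&0&1"

# Block form: the block's return value is inserted as-is,
# so "\\&" (one backslash, one ampersand) is not reinterpreted.
with_block = a.gsub(/&/) { "\\&" }

# String form: gsub scans the replacement for backslash sequences,
# so the backslash has to be escaped once more.
with_string = a.gsub(/&/, "\\\\&")

puts with_block         # prints \&0\&1
puts with_string        # prints \&0\&1
puts with_block.length  # prints 6
```

Both results contain a single backslash before each ampersand; irb shows it doubled only because inspect escapes it.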

Robert Klemme · Jan 13, 2010

2010/1/13 Brian Candler said:
That's because \& has a special meaning in a replacement string ("the
matched string"). Either use a block to provide the replacement value
(which doesn't do the backslash replacement), or put two backslashes in
the replacement string.

irb(main):002:0> a.gsub(/&/) { "\\&" }
=> "\\&0\\&1"
irb(main):003:0> a.gsub(/&/, "\\\\&")
=> "\\&0\\&1"

But it's only special when preceded by a backslash, which is special
in replacement strings.

irb(main):002:0> "123".gsub(/\d/, '<&>')
=> "<&><&><&>"
irb(main):003:0> "123".gsub(/\d/, '<\\&>')
=> "<1><2><3>"
irb(main):004:0> "123".gsub(/\d/, '<\\\\&>')
=> "<\\&><\\&><\\&>"

Another example - backslash with group index:

irb(main):011:0> "abc".gsub(/\w(.)\w/, '<1>')
=> "<1>"
irb(main):012:0> "abc".gsub(/\w(.)\w/, '<\\1>')
=> "<b>"
irb(main):013:0> "abc".gsub(/\w(.)\w/, '<\\\\1>')
=> "<\\1>"

So what you basically do here is you escape the escape so it loses
its special meaning in the replacement string.

Kind regards

robert

Marnen Laibow-Koser · Jan 13, 2010

Robert said:
But it's only special when preceded by a backslash, which is special
in replacement strings.

irb(main):002:0> "123".gsub(/\d/, '<&>')
=> "<&><&><&>"
irb(main):003:0> "123".gsub(/\d/, '<\\&>')
=> "<1><2><3>"
irb(main):004:0> "123".gsub(/\d/, '<\\\\&>')
=> "<\\&><\\&><\\&>"

Another example - backslash with group index:

irb(main):011:0> "abc".gsub(/\w(.)\w/, '<1>')
=> "<1>"
irb(main):012:0> "abc".gsub(/\w(.)\w/, '<\\1>')
=> "<b>"
irb(main):013:0> "abc".gsub(/\w(.)\w/, '<\\\\1>')
=> "<\\1>"

So what you basically do here is you escape the escape so it loses
its special meaning in the replacement string.

...and Ruby's stupid backslash handling strikes again. This is a
completely brain-dead way to do it, and is one of the few things I
really hate about Ruby.

Kind regards

robert

Best,
--
Marnen Laibow-Koser
http://www.marnen.org

Xavier Noria · Jan 13, 2010

String literals have a one-pass escaping at parse time, so that

"foo\\bar\nbaz"

is an encoded way to express

foo\bar
baz

And the result of that ordinary pass is what gsub receives.

Then, at runtime gsub inspects its argument and looks in turn for
occurrences of \1, \& and friends. That is gsub's contract, and has no
relationship with string literals parsing.

You need double escaping for \1 and friends to get through both passes,
one related to literals, and the other one related to how gsub works.
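
As a rough illustration of the two passes, assuming a throwaway input string "x" and made-up variable names, one can look at the string after the parser has handled the literal and then at what gsub does with it:

```ruby
# Pass 1: the parser turns the literal "\\&" into two characters, \ and &.
single = "\\&"
p single.chars                    # => ["\\", "&"]

# Pass 2: gsub reads \& in that string as "the matched text".
pass_through = "x".gsub(/x/, single)
p pass_through                    # => "x"

# Four backslashes in the literal leave \ \ & after pass 1 ...
doubled = "\\\\&"
p doubled.length                  # => 3

# ... and gsub collapses \\ to one literal backslash in pass 2.
escaped = "x".gsub(/x/, doubled)
puts escaped                      # prints \&
```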

Gary Wright · Jan 13, 2010

...and Ruby's stupid backslash handling strikes again. This is a
completely brain-dead way to do it, and is one of the few things I
really hate about Ruby.

Is this really a Ruby snafu? It seems like it would be inherent in
any sort of character escape sequence, of which there are many
examples that have nothing at all to do with Ruby.

Any pointers to alternative encoding schemes that avoid this problem?

Gary Wright

Marnen Laibow-Koser · Jan 13, 2010

Gary said:
Is this really a Ruby snafu?

Yes. The problem is that Ruby "helpfully" does another level of
escaping, so that "\\&" is equivalent to "\&", whereas it should simply
take the escape at face value and consider it equivalent to the two
characters \ and &.

For real fun, try concatenating two strings, the first of which ends in
a backslash. It's insane.

It seems like it would be inherent in
any sort of character escape sequence, of which there are many
examples that have nothing at all to do with Ruby.

But Ruby has its own special brand of idiocy here. Even Perl and PHP
get this right.

Any pointers to alternative encoding schemes that avoid this problem?

It has nothing to do with encoding. It's a question of a particular
point of stupidity in Ruby's parser and/or String class.

Gary Wright

Best,
--
Marnen Laibow-Koser
http://www.marnen.org

Xavier Noria · Jan 13, 2010

It has nothing to do with encoding. It's a question of a particular
point of stupidity in Ruby's parser and/or String class.

I don't understand your point. The backslash is a special character in
string literals. If you want to include one you need to escape it.
That's pretty normal.

What's your complaint about parsing? This gotcha is related to gsub's
contract, not to the rules for string literals themselves.

Brian Candler · Jan 13, 2010

Marnen said:
The problem is that Ruby "helpfully" does another level of
escaping

It's necessary because it lets you use sequences like \n in
double-quoted strings, and \" if you want a double-quote, and \#{expr}
if you want literally # { expr } rather than interpolation.

so that "\\&" is equivalent to "\&", whereas it should simply
take the escape at face value and consider it equivalent to the two
characters \ and &.

Which is what happens in single-quoted strings. But you can't put any
control character sequences like \n in those.

For real fun, try concatenating two strings, the first of which ends in
a backslash. It's insane.

a = "abc\\" # that's a string ending with one backslash
b = "def"
c = a + b

Looks OK to me.

But Ruby has its own special brand of idiocy here. Even Perl and PHP
get this right.

Perl is *exactly* the same.

#!/usr/bin/perl
print "abc\\\n";

This prints abc\ - same as Ruby would.

It was probably unfortunate that gsub uses sequences like \1 on the
substitution side, though. But that's what perl does:

#!/usr/bin/perl
$_ = "ab&de\n";
s/&/\&/;
print;

That prints ab&de, which is the problem the OP was grappling with.

Marnen Laibow-Koser · Jan 13, 2010

Xavier said:
String literals have a one-pass escaping at parse time, so that

"foo\\bar\nbaz"

is an encoded way to express

foo\bar
baz

And the result of that ordinary pass is what gsub receives.

Then, at runtime gsub inspects its argument and looks in turn for
occurrences of \1, \& and friends. That is gsub's contract, and has no
relationship with string literals parsing.

You need double escaping for \1 and friends to get through both passes,
one related to literals, and the other one related to how gsub works.

Yes, I see that now. I wasn't aware that gsub did an extra parsing
step. With that in mind, doubling backslashes makes sense.

Best,

Marnen Laibow-Koser · Jan 13, 2010

Brian said:
It's necessary because it lets you use sequences like \n in
double-quoted strings, and \" if you want a double-quote, and \#{expr}
if you want literally # { expr } rather than interpolation.

Which is what happens in single-quoted strings. But you can't put any
control character sequences like \n in those.

I know that.

a = "abc\\" # that's a string ending with one backslash
b = "def"
c = a + b

Looks OK to me.

And to me too, when I just now tried it. I *did* run into a problem
with this at one point, but I can't now reproduce it. Perhaps it was
actually a gsub issue.

I'm glad to know that Ruby's backslash handling is not as weird as I'd
thought. Thanks for the correction.

Best,

Sriram Varahan · Jan 14, 2010

Thanks Brian, Robert and Xavier for your explanation. It was very
helpful.

Regards,
Sriram Varahan.

Robert Klemme · Jan 14, 2010

2010/1/13 Marnen Laibow-Koser said:
I'm glad to know that Ruby's backslash handling is not as weird as I'd
thought. Thanks for the correction.

For me there is actually something weird about Ruby's escape handling
- but it's something else: in some circumstances Ruby allows you to
*omit* a backslash which is meant to be convenient (I believe) but
which leads to a certain inconsistency:

irb(main):014:0> '\1' # this might be seen as surprising
=> "\\1"
irb(main):015:0> '\\1'
=> "\\1"

We can get a single backslash by just using one, but if we need more
backslashes we need to escape:

irb(main):027:0> '\\1'
=> "\\1"
irb(main):028:0> '\\\1'
=> "\\\\1"
irb(main):029:0> '\\\\1'
=> "\\\\1"
irb(main):030:0> '\\\\\1'
=> "\\\\\\1"
irb(main):031:0> '\\\\\\1'
=> "\\\\\\1"

For double-quoted strings we always need two backslashes, at least
when a digit follows:

irb(main):016:0> "\1"
=> "\x01"
irb(main):017:0> "\\1"
=> "\\1"

and

irb(main):018:0> '\n' # this might be seen as surprising
=> "\\n"
irb(main):019:0> '\\n'
=> "\\n"

but

irb(main):020:0> "\n" # a single newline
=> "\n"
irb(main):021:0> "\\n" # backslash and n
=> "\\n"

Bottom line for me: I do not exploit the '\1' case and have made it a
habit to use two backslashes whenever I need a literal backslash in a
string.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

Brian Candler · Jan 14, 2010

Robert said:
For me there is actually something weird about Ruby's escape handling
- but it's something else: in some circumstances Ruby allows you to
*omit* a backslash which is meant to be convenient (I believe) but
which leads to a certain inconsistency:

irb(main):014:0> '\1' # this might be seen as surprising
=> "\\1"

I think the principle is "single quoting does the absolute minimum
amount of dequoting".

However it *has* to support a way to get a single-quote within a
single-quoted string, and they chose \'. As a consequence, it *has* to
support \\ to get a single backslash within a single-quoted string.

The question then is, should any other sequence like \1 raise an error,
or return literal \ and 1 ?
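
For reference, a small sketch of how single-quoted literals currently treat these three cases (the variable names are only for illustration):

```ruby
quote     = 'It\'s'  # \' is needed to embed the delimiter itself
backslash = '\\'     # \\ yields one backslash
other     = '\1'     # any other sequence is kept as written

p quote.length      # => 4
p backslash.length  # => 1
p other.length      # => 2
p other == '\\1'    # => true
```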

The alternative would have been to use two single quotes where you want
a single quote within a string:

'It''s that time of day'

I quite like that, but arguably it's just confusing in a different way.

Robert Klemme · Jan 14, 2010

2010/1/14 Brian Candler said:
I think the principle is "single quoting does the absolute minimum
amount of dequoting".

Hmm, I never thought of it that way. I'm not sure I like this principle
though.

However it *has* to support a way to get a single-quote within a
single-quoted string, and they chose \'. As a consequence, it *has* to
support \\ to get a single backslash within a single-quoted string.

The question then is, should any other sequence like \1 raise an error,
or return literal \ and 1 ?

I opt for raising a syntax error. I know, this is unlikely to happen
anytime soon if only because of the large base of code that is
potentially affected. With what I have seen over the past years, the
number of backslashes needed for proper quoting (especially for #gsub
and friends) has caused much confusion. I believe that could be
avoided by disallowing the '\1'.

The alternative would have been to use two single quotes where you want
a single quote within a string:

'It''s that time of day'

I quite like that, but arguably it's just confusing in a different way.

I like the quoting approach better.

Kind regards

robert

--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
