Regexp, String, Symbol literals' object_ids

Pavel R. · Dec 19, 2010

Regexp literals:
5.times { p /abcdasdf/.object_id } -> same!

String literals:
5.times { p 'asdasdf'.object_id } -> different

Symbols:
5.times { p :asdfsf.object_id } -> same!

Symbols with to_s:
5.times { p :asdsdfsdf.to_s.object_id } -> different

Predefined string as a constant
CONS = 'asdfsdf'
5.times { p CONS.object_id } -> same! (sure)

Question:
Is there some special syntax for string literals ("asdfasdf") to behave
like /sadfsdf/ as in the examples above? Without predefining a string as
a constant's value. Or another elegant way to achieve the same goal?

Andrea Dallera · Dec 19, 2010

Ruby has both mutable and immutable strings. A mutable string is
declared as "string". An immutable string is declared as :string and in
ruby is called a 'symbol'. So, no, there is no way for "string" to
behave as :string, since that's by design. Well there is a way but I'd
not go there

If you want two equivalent string literals to point at the same
instance, use the symbol notation, as in:

:test.object_id == :test.object_id #true

--
Andrea Dallera
http://github.com/bolthar/freightrain
http://usingimho.wordpress.com

Pavel R. · Dec 19, 2010

Andrea Dallera wrote in post #969438:

Well there is a way but I'd
not go there

Digging into parse.y and other ruby core files?

Quintus · Dec 19, 2010

On 19.12.2010 21:07, Pavel R. wrote:

Regexp literals:
5.times { p /abcdasdf/.object_id } -> same!

How is this possible? A new regexp should be created every time the loop
body is executed... Have a look at this, which seems confusing to me:

#ruby -v: ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-linux]
irb(main):001:0> 5.times { p /abcdasdf/.object_id }
8030280
8030280
8030280
8030280
8030280
=> 5
irb(main):002:0> p /abcdasdf/.object_id
8049600
=> 8049600
irb(main):003:0> p /abcdasdf/.object_id
8063560
=> 8063560
irb(main):004:0>

The regexp in the loop always stays the same, but if I create some
others outside the loop, they get different object IDs? Can anybody
shed some light on this?

Valete,
Marvin

Pavel R. · Dec 19, 2010

How about this? I have just discovered

pavel@pavel-laptop:~/dev/binexp$ ~/usr/ruby19/bin/irb
irb(main):001:0> /(?<digits>\d+)/ =~ 'abc123def'
=> 3
irb(main):002:0> digits
=> "123"
irb(main):003:0>

According to `ri Regexp#=~`

If =~ is used with a regexp literal with named captures, captured
strings (or nil) is assigned to local variables named by the capture
names.

/(?<lhs>\w+)\s*=\s*(?<rhs>\w+)/ =~ " x = y "
p lhs #=> "x"
p rhs #=> "y"

If it is not matched, nil is assigned for the variables.

/(?<lhs>\w+)\s*=\s*(?<rhs>\w+)/ =~ " x = "
p lhs #=> nil
p rhs #=> nil

This assignment is implemented in the Ruby parser. The parser detects
'regexp-literal =~ expression' for the assignment. The regexp must be a
literal without interpolation and placed at left hand side.

=======>

It seems the Ruby parser can do some magic things!
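
The restriction quoted from ri above can be checked with a small sketch. The method names are made up for illustration; only `=~`, `Regexp#match` and `Binding#local_variable_defined?` (available since Ruby 2.1, so later than the 1.9.2 used in this thread) are real API.

```ruby
# Literal regexp on the left-hand side of =~: the parser creates the
# local variable `digits` and assigns the capture (or nil) to it.
def literal_on_left(text)
  /(?<digits>\d+)/ =~ text
  digits
end

# The same pattern held in a variable is not a literal at the =~ site,
# so no local variable named `digits` is created.
def variable_defines_local?(text)
  pattern = /(?<digits>\d+)/
  pattern =~ text
  binding.local_variable_defined?(:digits)
end

# The capture is still available through the MatchData object.
def capture_via_match_data(text)
  pattern = /(?<digits>\d+)/
  match = pattern.match(text)
  match && match[:digits]
end

p literal_on_left('abc123def')
p variable_defines_local?('abc123def')
p capture_via_match_data('abc123def')
```

When the pattern has to live in a variable, going through the MatchData as in the last method is the usual way to reach named captures.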

Abinoam Jr. · Dec 20, 2010

Try

ruby-1.9.2-head > 5.times { p /thesame/.object_id.to_s + ' ' + /thesame/.object_id.to_s}
"21522620 21522400"
"21522620 21522400"
"21522620 21522400"
"21522620 21522400"
"21522620 21522400"
=> 5
ruby-1.9.2-head > 5.times { p 'thesame'.object_id.to_s + ' ' + 'thesame'.object_id.to_s}
"21553480 21553400"
"21553320 21553240"
"21553160 21553080"
"21553000 21552920"
"21552840 21552760"
=> 5
ruby-1.9.2-head >

What is the logic behind this?

Rick DeNatale · Dec 20, 2010

Abinoam Jr. wrote:

Try

ruby-1.9.2-head > 5.times { p /thesame/.object_id.to_s + ' ' + /thesame/.object_id.to_s}
"21522620 21522400"
"21522620 21522400"
"21522620 21522400"
"21522620 21522400"
"21522620 21522400"
=> 5
ruby-1.9.2-head > 5.times { p 'thesame'.object_id.to_s + ' ' + 'thesame'.object_id.to_s}
"21553480 21553400"
"21553320 21553240"
"21553160 21553080"
"21553000 21552920"
"21552840 21552760"
=> 5
ruby-1.9.2-head >

What is the logic behind this?

A regular expression literal like /thesameobject/ causes a Regexp
object to be instantiated at PARSE time.

Since regular expressions are immutable (as contrasted with strings or
arrays, which also have a literal representation), it really doesn't
matter that a new regexp is not created each time the expression
containing the literal is evaluated. The fact that two occurrences of
an apparently 'equal' regular expression generate two different
instances simply reflects the fact that the parser doesn't attempt to
consolidate equal literals.

In the case of a string literal, since strings are mutable, a new
string instance is created each time the expression containing the
string literal is evaluated.
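
The difference can be observed without irb by collecting the objects each literal produces. This is a minimal sketch that assumes the file does not enable frozen string literals (a later Ruby feature that would change the string result); the variable names are illustrative.

```ruby
# Each run of the block returns whatever its literal evaluates to.
# Keeping the objects in arrays stops them from being collected.
regexps = Array.new(3) { /thesame/ }
strings = Array.new(3) { 'thesame' }

# How many distinct Regexp objects did the one literal site produce?
p regexps.map(&:object_id).uniq.size

# How many distinct String objects did the one literal site produce?
p strings.map(&:object_id).uniq.size

# The strings are distinct objects but equal in content.
p strings.uniq.size
```

The regexp count should be 1 and the string count 3, matching the irb sessions earlier in the thread.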

--
Rick DeNatale

Blog: http://talklikeaduck.denhaven2.com/
Github: http://github.com/rubyredrick
Twitter: @RickDeNatale
WWR: http://www.workingwithrails.com/person/9021-rick-denatale
LinkedIn: http://www.linkedin.com/in/rickdenatale

Brian Candler · Dec 20, 2010

Andrea Dallera wrote in post #969438:

Ruby has both mutable and immutable strings. A mutable string is
declared as "string". An immutable string is declared as :string and in
ruby is called a 'symbol'. So, no, there is no way for "string" to
behave as :string, since that's by design.

This is a very misleading description, so I'll bite.

Strings and Symbols are two completely different things in Ruby. In Ruby
1.9, Symbols have gained some more string-like behaviour(*), but they're
still fundamentally different.

Symbols are objects intended for labelling things (e.g. method names,
hash keys). The main property of Symbols is that there only ever exists
one Symbol object which represents the same label, i.e. the same
sequence of characters.(**)

So when your program loads, and it uses the symbol :foo, which hasn't
been used before, then a new symbol called :foo is created in the symbol
table. But every other future use of :foo always returns the same
object.

This makes symbols very cheap to test for equality, because:

* Two symbols are the same iff they have the same object_id
* Two symbols are different iff they have different object_id

So testing equality between :a_very_long_symbol_like_this and
:another_very_long_symbol is only comparing their object_ids, basically
two integers.
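
A small sketch of that identity property, using the symbol names from the paragraph above; the variable names are only for illustration.

```ruby
first  = :a_very_long_symbol_like_this
second = :a_very_long_symbol_like_this
other  = :another_very_long_symbol

# Two uses of the same symbol literal are the very same object.
p first.equal?(second)
p first.object_id == second.object_id

# Different labels are different objects.
p first.equal?(other)

# A string converted to a symbol lands on the existing object too.
p 'a_very_long_symbol_like_this'.to_sym.equal?(first)
```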

The property that any future :foo must return the same object_id means
that the Symbol table is never garbage-collected. A Symbol is for life,
not just for Christmas.

Strings are collections of bytes/characters. They can be mutated. There
can be many String objects in the system which contain the same sequence
of bytes/characters. Therefore, comparing two Strings always has to be
done byte-by-byte.(***)
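
For contrast with symbols, a minimal sketch of two strings that hold the same characters but are separate, independently mutable objects. `String.new` is used so the result does not depend on how literals are treated.

```ruby
a = String.new('foo')
b = String.new('foo')

# Equal content, so == is true.
p a == b

# Separate objects, so identity comparison is false.
p a.equal?(b)

# Mutating one leaves the other untouched.
b << 'bar'
p a
p b
```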

In general, what you want is a String. If you're reading data from a
user (e.g. on STDIN or a web-page POST) then it comes in as a String.
You can convert a String into a Symbol represented by the same set of
characters:

a = "foo"
b = a.intern # b = :foo

but this can be a dangerous thing to do if the string you are converting
came from an untrusted source, because it can lead to a simple
denial-of-service attack as the user floods your symbol table with
garbage.

So to summarise, Symbols are used as method names:

a = 1
b = a.send(:+, 2) # b = a + 2

and are often used as hash keys, because the lookup operations are
cheaper.

def doit(params)
  puts params[:foo]
  puts params[:bar]
end

doit(:foo=>123, :bar=>456)

If coming from a language like C, think of symbols more as enums rather
than strings, where the programmer is using an easy-to-read label like
:foo, but the underlying value is actually a number.

HTH,

Brian.

(*) Example from ruby 1.8, calling :foo.size:

NoMethodError: undefined method `size' for :foo:Symbol
from (irb):2
from :0

But:

1.9.2-p0 > :foo.size
=> 3

(**) Everything you say about Strings or Symbols in 1.9 has to be
qualified, because it's such a complex area. Suffice to say, in 1.9 it's
possible to have two distinct Symbols which are labelled by the same
series of bytes but with different encodings.

Things are far simpler in ruby 1.8, where bytes are real bytes, and
small furry creatures from Alpha Centauri are real small furry creatures
from Alpha Centauri.

(***) There are in fact some optimisations whereby two distinct string
objects can share the same underlying data buffer, with copy-on-write.
But in general comparing strings needs to compare the buffers.

And even though ruby 1.9 has strings of characters, the comparisons
*are* done byte-by-byte, not character by character.

Pavel R. · Dec 20, 2010

OK, but my initial question was slightly different.

Can I write something like

%c(string i do not want to be created again and again, and i do not want
to define it as a constant because it is used once in a code)

?

If this is impossible in Ruby, what is the reason? It seems useful!
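
For readers on later Ruby versions: the 1.9.2 discussed in this thread has no such literal, but two features added afterwards come close. `String#-@` (deduplicating since Ruby 2.5) returns a frozen, interned string, and the `# frozen_string_literal: true` magic comment (Ruby 2.3 and later) freezes every string literal in a file. A minimal sketch of the former, assuming Ruby 2.5 or later and using a made-up method name:

```ruby
# Hypothetical helper: the format string is written inline, yet every
# call returns the same frozen, deduplicated String object.
def header_format
  -'NNvv'
end

p header_format.frozen?
p header_format.equal?(header_format)

# The frozen string works as a pack/unpack format like any other.
p [1, 2, 3, 4].pack(header_format).unpack(header_format)
```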

Abinoam Jr. · Dec 21, 2010

[Advice... I'm new at this... be patient]

Isn't it "%q" ?

I couldn't think of a need for this that something like the following wouldn't cover:

holding_var = %q(string i do not want to be created again and again,
and i do not want to define it as a constant because it is used once
in a code)

5.times { puts "holding_var = #{holding_var} and its object_id is #{holding_var.object_id}" }

If the string is short, you could even use a var name that resembles
the string.

created_once = "created_once"

Could you show an example? ("used once in a code" vs. "created again and again")

Abinoam Jr.

Pavel R. · Dec 21, 2010

Example:

class A
  def m
    a,b,c,d = data.unpack('NNvv')
    e,f,g,h = a.unpack('vNNv')
    # and so on ...
    # do something with data
  end
end

To avoid creating 'NNvv' again and again, I can assign it to a constant

class A
  Format_NNvv = 'NNvv'
end

and use it.

But there are many different format strings ('NNvv', 'vNNv', 'a*c', ...)
in my code, so I would need to assign all of them to constants. This
approach implies a large section of constant assignments.
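
One way to keep that section small is a single frozen table of formats instead of one constant per string. This is only a sketch: the class, constant, key and method names are invented for illustration and do not come from the em-oscar code, and whether the hash lookup is cheaper than allocating a short string would need measuring.

```ruby
class HeaderCodec
  # All format strings live in one frozen table, created once at load time.
  FORMATS = {
    header: 'NNvv',
    body:   'vNNv'
  }.each_value(&:freeze).freeze

  # Pack an array of integers using the named format.
  def self.encode(kind, values)
    values.pack(FORMATS.fetch(kind))
  end

  # Unpack binary data using the named format.
  def self.decode(kind, data)
    data.unpack(FORMATS.fetch(kind))
  end
end

packed = HeaderCodec.encode(:header, [1, 2, 3, 4])
p HeaderCodec.decode(:header, packed)
```

`fetch` raises `KeyError` for an unknown format name, so a typo fails loudly instead of silently calling `unpack(nil)`.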

Pavel R. · Dec 22, 2010

What kind of "problem" is going to need so many packs/unpacks that
this part (constant or string) will make much difference?

Working with binary protocols. Something like
https://github.com/pavelrosputko/em-oscar/blob/master/em-oscar/icbm.rb

Much difference? Actually not so much in the icbm.rb source above.

But someone wrote at http://redmine.ruby-lang.org/issues/show/4184#note-3

I've been able to get 2-3% improvements in Rails apps by simply
rewriting some 'constant's and inline Arrays as CONSTANTs.

I have patches to MRI that use cached, immutable Strings for the
internal #to_s messages on immutable objects; e.g. changing Symbol#to_s,
Float#to_s, Bignum#to_s, Rational#to_s, etc. to return the same frozen
String instance. I measured 1-6% performance improvement in the
standard MRI tests.

Phillip Gawlowski · Dec 22, 2010

I've been able to get 2-3% improvements in Rails apps by simply
rewriting some 'constant's and inline Arrays as CONSTANTs.

2-3% of what? If it's 200ms, the gains are much less impressive when
compared to 2000ms.

--
Phillip Gawlowski

Though the folk I have met,
(Ah, how soon!) they forget
When I've moved on to some other place,
There may be one or two,
When I've played and passed through,
Who'll remember my song or my face.
