Problem using \G in look-behind Construct in Regular Expression (Ruby 1.9)

  • Thread starter Wolfgang Nádasi-Donner

Wolfgang Nádasi-Donner

I've noticed a problem, but I don't know whether it is a bug or a feature.

While using Ruby 1.9 to generate examples of "look-behind" regular expression
constructs, I noticed that "(?<=\G.{10})" is not allowed (it produces an error
message), while the "(?<=^.{10})" construct, which is similar in that it also
uses a positional assertion, works well.

If it is a feature, I don't understand the reason for it, because the length
of the look-behind pattern is still fixed.
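
One way to check which of the two constructs a given Ruby build accepts is to compile the patterns at run time and catch any RegexpError. The following is only a sketch: the helper name `lookbehind_accepted?` is made up for illustration, and the result for the "\G" variant depends on the version of the regex engine in use, so no outcome is claimed for it here.

```ruby
# Returns true if the regexp source compiles on this Ruby, false otherwise.
# (Hypothetical helper, not part of Ruby or of the original post.)
def lookbehind_accepted?(source)
  Regexp.new(source)
  true
rescue RegexpError
  false
end

['(?<=^.{10})', '(?<=\G.{10})'].each do |source|
  puts "#{source}: #{lookbehind_accepted?(source) ? 'accepted' : 'rejected'}"
end

# The "^" variant matches only the character at offset 10 of a line.
p "0123456789abcdef"[/(?<=^.{10})./]   # => "a"
```

Compiling with Regexp.new instead of a literal matters here: an invalid literal would already fail when the file is parsed, before any rescue could run.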

???

Best regards, Wolfgang WoNaDo.
 
Wolfgang Nádasi-Donner

I'm a little surprised that there has been no reaction.

Isn't this the right place for questions related to Oniguruma, the future
pattern matching engine for Ruby?
 
ts

W> I'm a little bit surprised for no reaction.

Well, what do you expect with such a regexp?
 
Wolfgang Nádasi-Donner

T> Well, what do you expect with such a regexp?

Very simple: I'm writing examples for some informational texts about "RegExes
in Ruby"; this part is related to "look-behind" in Ruby 1.9.

The construct was planned as part of a larger Regexp that does simple text
processing. The goal was to produce from

"Liebe Rubyistinnen und Rubyisten! Anlaesslich der immer wieder, ja taeglich
anzutreffenden ausserordentlichen
Freude, die Ihr, liebe Rubyistinnen und Rubyisten, bei der Benutzung unserer
ausserordentlichen Lieblingssprache
Ruby empfinden duerft, haben wir Euch alle hier zum Bankett der Rubyistinnen
und Rubyisten geladen. Benutzt die
Angebote des Bueffets genau so, wie Ihr Ruby nutzt: Einfach nach Wunsch
zusammenstellen und geniessen."

the output

"Liebe Rubyistinnen und Rubyisten!
Anlaesslich der immer wieder, ja
taeglich anzutreffenden
ausserordentlichen Freude, die Ihr,
liebe Rubyistinnen und Rubyisten, bei
der Benutzung unserer ausserordentlichen
Lieblingssprache Ruby empfinden duerft,
haben wir Euch alle hier zum Bankett der
Rubyistinnen und Rubyisten geladen.
Benutzt die Angebote des Bueffets genau
so, wie Ihr Ruby nutzt: Einfach nach
Wunsch zusammenstellen und geniessen."

To get this result, the simplest solution is based on the pattern

/(((\w+[:.,;?!]? )+)(?=.*(?<=\G.{41}))|(\w+[:.,;?!]? ))/

which should be used as the argument for "String#scan" on a preprocessed input
(one line, a blank after each token).

Unfortunately it does not work (the error message says that "\G" is not
allowed in look-behind). I changed the expression to

/(((\w+[:.,;?!]? )+)(?=.*(?<=^.{41}))|(\w+[:.,;?!]? ))/

and used "String#sub!" inside a while loop, which works! I'm not happy with
this solution, because it is artificial (extra lines of code, and
destructive on the input).

I don't understand the reason for it, because "\G.{41}" is of fixed size and
anchored similarly to "^.{41}". If it is a bug - O.K., good that it was found;
if it is a feature, I don't understand why it should be one.
 
ts

W> /(((\w+[:.,;?!]? )+)(?=.*(?=\G.{41}))|(\w+[:.,;?!]? ))/


I'm happy that Oniguruma gives an error :)

Because otherwise the next question would be: why does Oniguruma go into an
infinite loop?
 
Wolfgang Nádasi-Donner

T> I'm happy that Oniguruma gives an error :)
T> Because otherwise the next question would be: why does Oniguruma go into an
T> infinite loop?

????

There is no infinite loop at all. "\G" means "end of last match", which is
very useful in regular expressions for "scan" and "gsub".
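
The effect of "\G" in "scan" and "gsub" can be seen on a small string. This is a minimal sketch with made-up sample data, using "\G" only at the start of the pattern, where it is not in dispute:

```ruby
text = "aaabaa"

# Without \G, scan finds every "a" in the string.
p text.scan(/a/)          # => ["a", "a", "a", "a", "a"]

# With \G, each match must begin exactly where the previous one ended,
# so scanning stops at the first character that breaks the chain ("b").
p text.scan(/\Ga/)        # => ["a", "a", "a"]

# The same holds for gsub: only the leading run is replaced.
p text.gsub(/\Ga/, "x")   # => "xxxbaa"

# Fixed-size chunks: \G acts like a left anchor that moves with each match.
p "abcdefghijklmnopqrstuvwxy".scan(/\G.{10}/)
# => ["abcdefghij", "klmnopqrst"]
```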

Later I used a kind of simulation of "\G": I wrote a "while" loop around a
"sub!" that deletes the matched part of the string, and then used "^"
instead of the original "\G".

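The part of the code that works very well:

l_to_r = false                          # alternates the padding direction per line
outpuff = ''
pat = /(\w+[:.,;?!]? )/                 # one token followed by a blank
breite = 40                             # line width
inpuff = inpuff.gsub(/\s+/, ' ') + ' '  # one line, a blank after each token
while inpuff.length > breite
  # Cut off as many leading tokens as fit into breite + 1 characters.
  inpuff.sub!(/((#{pat}+)(?=.*(?<=^.{#{breite+1}}))|#{pat})/) do |m|
    outpuff << m.fillstring(breite, (l_to_r = !l_to_r)) << "\n"
    ''
  end
end
outpuff << inpuff.fillstring(breite, !l_to_r)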

It works, but the semantically identical version using "\G" and "scan" does
not need the additional "while".

There is no infinite loop at all - "\G" is simply an anchor.
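
For comparison, a greedy line break can also be sketched with a single "scan" and no "\G" at all, by letting each match consume at most `width` characters and end at a word boundary. This is a different technique from the look-behind pattern discussed above, not a drop-in replacement: the method name `wrap_greedy` is made up here, and the sketch only breaks the lines; it does not pad them the way "fillstring" does.

```ruby
# Greedy word wrap with a single scan (hypothetical helper, not from the post).
# Each match starts at a non-blank character, takes at most `width` characters
# and must be followed by whitespace or the end of the string.
# A word longer than `width` is kept whole on its own line.
def wrap_greedy(text, width = 40)
  normalized = text.gsub(/\s+/, ' ').strip
  normalized.scan(/\S.{0,#{width - 1}}(?=\s|\z)|\S+/)
end

sample = "Liebe Rubyistinnen und Rubyisten! Anlaesslich der immer wieder, " \
         "ja taeglich anzutreffenden ausserordentlichen Freude"
puts wrap_greedy(sample)
# Liebe Rubyistinnen und Rubyisten!
# Anlaesslich der immer wieder, ja
# taeglich anzutreffenden
# ausserordentlichen Freude
```

Because the input string is only read, never modified, this variant also avoids the destructive "sub!" loop.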
 
Wolfgang Nádasi-Donner

It will not work with Perl 5. There is a description in "perlre" that may
be valid for Ruby too:

"Currently \G is only fully supported when anchored to the start of the
pattern; while it is permitted to use it elsewhere, as in /(?<=\G..)./g,
some such uses (/.\G/g, for example) currently cause problems, and it is
recommended that you avoid such usage for now."

The general description of "\G" in "perlre":

"\G Match only at pos() (e.g. at the end-of-match position
of prior m//g)"
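
Ruby strings have no pos(), but a close analogue appears to be the optional start offset of "String#match": there "\G" matches only at the offset where the search begins. The sketch below assumes that behaviour, as described in the documentation of Ruby's Regexp class; it illustrates the anchor itself, not the look-behind problem.

```ruby
subject = "hello, world"

# A plain pattern is searched for from offset 3 onwards and finds the comma.
p subject.match(/,/, 3)     # => #<MatchData ",">

# With \G the match must begin exactly at offset 3 ("l"), so it fails.
p subject.match(/\G,/, 3)   # => nil

# At offset 5, where the comma sits, \G is satisfied.
p subject.match(/\G,/, 5)   # => #<MatchData ",">
```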

This means it is a general bug, because it should work, but doesn't.

Try the Ruby program with the "^" workaround. It works.
inpuff = <<INTEXT
Liebe Rubyistinnen und Rubyisten! Anlaesslich der immer wieder, ja taeglich
anzutreffenden ausserordentlichen
Freude, die Ihr, liebe Rubyistinnen und Rubyisten, bei der Benutzung unserer
ausserordentlichen Lieblingssprache
Ruby empfinden duerft, haben wir Euch alle hier zum Bankett der Rubyistinnen
und Rubyisten geladen. Benutzt die
Angebote des Bueffets genau so, wie Ihr Ruby nutzt: Einfach nach Wunsch
zusammenstellen und geniessen.
INTEXT

class String
  # Pads the blanks between words so that the line is len characters wide.
  # lr = true puts the wider gaps on the right, false on the left.
  def fillstring(len, lr = true)
    return '' unless self.match(/[^ ]/)
    temp = self.strip.split(' ')
    return temp[0] if temp.length == 1
    # dm[0]: blanks per gap, dm[1]: number of gaps that get one extra blank
    dm = (len - temp.join('').length).divmod(temp.length - 1)
    if lr
      (temp[0...(temp.length - dm[1])].join(' ' * dm[0]) + ' ' * (dm[0] + 1) +
       temp[(temp.length - dm[1])...(temp.length)].join(' ' * (dm[0] + 1))).strip
    else
      (temp[0...dm[1]].join(' ' * (dm[0] + 1)) + ' ' * (dm[0] + 1) +
       temp[dm[1]...temp.length].join(' ' * dm[0])).strip
    end
  end
end

l_to_r = false
outpuff = ''
pat = /(\w+[:.,;?!]? )/
breite = 40
inpuff = inpuff.gsub(/\s+/, ' ') + ' '
while inpuff.length > breite
  inpuff.sub!(/((#{pat}+)(?=.*(?<=^.{#{breite+1}}))|#{pat})/) do |m|
    outpuff << m.fillstring(breite, (l_to_r = !l_to_r)) << "\n"
    ''
  end
end
outpuff << inpuff.fillstring(breite, !l_to_r)
puts outpuff
 
ts

W> "Currently \G is only fully supported when anchored to the start of the
W> pattern; while it is permitted to use it elsewhere, as in /(?<=\G..)./g,
W> some such uses (/.\G/g, for example) currently cause problems, and it is
W> recommended that you avoid such usage for now."

svg% cat a.pl
$m = "Liebe Rubyistinnen und Rubyisten!
anzutreffenden ausserordentlichen
Freude, die Ihr, liebe Rubyistinnen und Rubyisten, bei der Benutzung unserer
ausserordentlichen Lieblingssprache
Ruby empfinden duerft, haben wir Euch alle hier zum Bankett der Rubyistinnen
und Rubyisten geladen. Benutzt die
Angebote des Bueffets genau so, wie Ihr Ruby nutzt: Einfach nach Wunsch
zusammenstellen und geniessen.";

while ($m =~ /(((\w+[:.,;?!]? )+)(?=.*(?=\G.{41}))|(\w+[:.,;?!]? ))/g) {
    print $1;
}
svg%

svg% ./perl -l a.pl | more
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
Liebe
[etc, etc, etc]
svg%
 
Wolfgang Nádasi-Donner

I received a mail from K. Kosako. He said that he will allow \G in
look-behind in the next release.
 
Wolfgang Nádasi-Donner

T> while ($m =~ /(((\w+[:.,;?!]? )+)(?=.*(?=\G.{41}))|(\w+[:.,;?!]? ))/g) {
T> print $1;
T> }

Maybe this would not work under other circumstances, because the length
check after "while" is necessary. But I am surprised by your result. Why
does the match always start at the beginning? The pattern matching engine
always restarts at the beginning, reusing characters it has already recognized.

Somehow "\G" must confuse the matcher, because in my understanding it is
nothing other than a "moving left anchor", so that a pattern like
"(?<=\G.{10})" means nothing other than "now reached position 10 after the
last match".

O.K. - it is what it is at the moment. Workarounds are possible, so it is
not a blocking problem.
 
