Problem using \G in look-behind Construct in Regular Expression (Ruby 1.9)

Discussion in 'Ruby' started by Wolfgang Nádasi-Donner, Jun 28, 2006.

  1. I've noticed a problem, but I don't know whether it is a bug or a feature.

    While using Ruby 1.9 to generate examples of "look-behind" regular expression
    constructs, I noticed that "(?<=\G.{10})" is not allowed (it produces an
    error message), while the "(?<=^.{10})" construct, which similarly uses a
    positional assertion, works well.

    If it is a feature, I don't understand the reason for it, because the length
    of the look-behind pattern is still fixed.
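
    A minimal sketch of the two constructs, assuming a current Ruby. The "^"
    form is matched directly; the "\G" form is compiled inside begin/rescue,
    because whether it is accepted depends on the regexp engine version. The
    sample string and variable names are made up for illustration.

```ruby
# Sample input, invented for this illustration.
text = "0123456789abcdef"

# Fixed-length look-behind anchored at the line start:
# matches the first character that has exactly 10 characters before it.
caret_match = text[/(?<=^.{10})./]

# The same idea with \G. Compile defensively, since acceptance of \G inside
# look-behind is an engine-version question (assumption: it may raise).
g_anchor_ok = begin
  Regexp.new('(?<=\G.{10}).')
  true
rescue RegexpError
  false
end

puts "caret form matched: #{caret_match.inspect}"
puts "\\G accepted in look-behind: #{g_anchor_ok}"
```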

    ???

    Best regards, Wolfgang WoNaDo.
     
    Wolfgang Nádasi-Donner, Jun 28, 2006
    #1

  2. I'm a little surprised that there has been no reaction.

    Isn't this the right place for questions about Oniguruma, the future
    pattern matching engine for Ruby?
     
    Wolfgang Nádasi-Donner, Jun 29, 2006
    #2

  3. Wolfgang Nádasi-Donner

    ts Guest

    >>>>> "W" == Wolfgang Nádasi-Donner <> writes:

    W> I'm a little bit surprised for no reaction.

    Well, what do you expect from such a regexp?


    --

    Guy Decoux
     
    ts, Jun 30, 2006
    #3
  4. > Well, what do you expect from such a regexp?

    Very simple: I'm writing examples for some informational texts about
    "RegExes in Ruby"; this part covers "look-behind" in Ruby 1.9.

    The construct was meant to be part of a larger regexp that does simple text
    processing. The goal was to produce from

    "Liebe Rubyistinnen und Rubyisten! Anlaesslich der immer wieder, ja taeglich
    anzutreffenden ausserordentlichen
    Freude, die Ihr, liebe Rubyistinnen und Rubyisten, bei der Benutzung unserer
    ausserordentlichen Lieblingssprache
    Ruby empfinden duerft, haben wir Euch alle hier zum Bankett der Rubyistinnen
    und Rubyisten geladen. Benutzt die
    Angebote des Bueffets genau so, wie Ihr Ruby nutzt: Einfach nach Wunsch
    zusammenstellen und geniessen."

    the output

    "Liebe Rubyistinnen und Rubyisten!
    Anlaesslich der immer wieder, ja
    taeglich anzutreffenden
    ausserordentlichen Freude, die Ihr,
    liebe Rubyistinnen und Rubyisten, bei
    der Benutzung unserer ausserordentlichen
    Lieblingssprache Ruby empfinden duerft,
    haben wir Euch alle hier zum Bankett der
    Rubyistinnen und Rubyisten geladen.
    Benutzt die Angebote des Bueffets genau
    so, wie Ihr Ruby nutzt: Einfach nach
    Wunsch zusammenstellen und geniessen."

    To get this result, the simplest solution is based on the pattern

    /(((\w+[:.,;?!]? )+)(?=.*(?<=\G.{41}))|(\w+[:.,;?!]? ))/

    which should be used as argument for "String#scan" on a preprocessed input
    (one line, blank after each token).

    Unfortunately it does not work (the error message says that "\G" is not
    allowed in look-behind). I changed the expression to

    /(((\w+[:.,;?!]? )+)(?=.*(?<=^.{41}))|(\w+[:.,;?!]? ))/

    and used "String#sub!" inside a while loop, which works! I'm not happy with
    this solution, because it is artificial (extra lines of code, and
    destructive on input).

    I don't understand the reason for it, because "\G.{41}" is of fixed size and
    anchored similarly to "^.{41}". If it is a bug, O.K., good to have found it;
    if it is a feature, I don't understand the reason for it.
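
    For comparison, a sketch of a different, non-destructive technique that
    needs neither "\G" nor look-behind: a greedy "String#scan" with a
    look-ahead. This is not the pattern from this post; the helper name
    wrap_lines is hypothetical, and it only wraps the text (no justification
    of the lines).

```ruby
# Hypothetical helper: greedy word wrap using only String#scan and look-ahead.
def wrap_lines(text, width)
  one_line = text.gsub(/\s+/, ' ').strip
  # First alternative: up to `width` characters that start and end on a
  # non-space and are followed by whitespace or the end of the string.
  # Second alternative: a single word that does not fit (or a lone character).
  one_line.scan(/\S.{0,#{width - 2}}\S(?=\s|\z)|\S+/)
end

text = "Liebe Rubyistinnen und Rubyisten! Anlaesslich der immer wieder, ja " \
       "taeglich anzutreffenden ausserordentlichen Freude"
lines = wrap_lines(text, 40)
puts lines
```

    Because "scan" continues after each match by itself, no surrounding
    "while" loop is needed and the input string is left unchanged.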
     
    Wolfgang Nádasi-Donner, Jun 30, 2006
    #4
  5. Wolfgang Nádasi-Donner

    ts Guest

    >>>>> "W" == Wolfgang Nádasi-Donner <> writes:

    W> /(((\w+[:.,;?!]? )+)(?=.*(?<=\G.{41}))|(\w+[:.,;?!]? ))/


    I'm happy that Oniguruma gives an error :)

    Because otherwise the next question would be: why does Oniguruma go into
    an infinite loop?

    --

    Guy Decoux
     
    ts, Jun 30, 2006
    #5
  6. > I'm happy that Oniguruma gives an error :)
    >
    > Because otherwise the next question would be: why does Oniguruma go into
    > an infinite loop?


    ????

    There is no infinite loop at all. "\G" means "end of last match", which is
    very useful in regular expressions passed to "scan" and "gsub".
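
    A small sketch of what "\G" does at the start of a pattern passed to
    "String#scan"; the sample string is invented for illustration.

```ruby
text = "123abc456"

# With \G each match must begin exactly where the previous one ended,
# so scanning stops at the first character that is not a digit.
anchored = text.scan(/\G\d/)

# Without \G the matcher may skip ahead to later digits.
unanchored = text.scan(/\d/)

p anchored    # ["1", "2", "3"]
p unanchored  # ["1", "2", "3", "4", "5", "6"]
```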

    Later I used a kind of simulation of "\G": I wrote a "while" around a
    "sub!", which deletes the matched part of the string, and then used "^"
    instead of the original "\G".

    The part of the code that works very well:

    l_to_r = false
    outpuff = ''
    pat = /(\w+[:.,;?!]? )/
    breite = 40
    inpuff = inpuff.gsub(/\s+/, ' ') + ' '
    while inpuff.length > breite
      inpuff.sub!(/((#{pat}+)(?=.*(?<=^.{#{breite+1}}))|#{pat})/) do |m|
        outpuff << m.fillstring(breite, (l_to_r = !l_to_r)) << "\n"
        ''
      end
    end
    outpuff << inpuff.fillstring(breite, !l_to_r)

    It works, but the semantically identical version using "\G" and "scan" does
    not need any additional "while".

    There is no infinite loop at all; it's simply an "anchor".
     
    Wolfgang Nádasi-Donner, Jun 30, 2006
    #6
  7. Wolfgang Nádasi-Donner

    ts Guest

    >>>>> "W" == Wolfgang Nádasi-Donner <> writes:

    W> There is no infinite loop at all.

    Try your regexp with a P language.

    --

    Guy Decoux
     
    ts, Jun 30, 2006
    #7
  8. It will not work with "perl5". There is a description in "perlre" that may
    be valid for Ruby too:

    "Currently \G is only fully supported when anchored to the start of the
    pattern; while it is permitted to use it elsewhere, as in /(?<=\G..)./g,
    some such uses (/.\G/g, for example) currently cause problems, and it is
    recommended that you avoid such usage for now."

    The general description of "\G" in "perlre":

    "\G Match only at pos() (e.g. at the end-of-match position
    of prior m//g)"

    This means it is a general bug, because it should work but doesn't.

    Try the Ruby program with the "^" workaround. It works.

    >>>>> program >>>>>

    inpuff = <<INTEXT
    Liebe Rubyistinnen und Rubyisten! Anlaesslich der immer wieder, ja taeglich
    anzutreffenden ausserordentlichen
    Freude, die Ihr, liebe Rubyistinnen und Rubyisten, bei der Benutzung unserer
    ausserordentlichen Lieblingssprache
    Ruby empfinden duerft, haben wir Euch alle hier zum Bankett der Rubyistinnen
    und Rubyisten geladen. Benutzt die
    Angebote des Bueffets genau so, wie Ihr Ruby nutzt: Einfach nach Wunsch
    zusammenstellen und geniessen.
    INTEXT

    class String
      def fillstring(len, lr = true)
        return '' unless self.match(/[^ ]/)
        temp = self.strip.split(' ')
        return temp[0] if temp.length == 1
        dm = (len - temp.join('').length).divmod(temp.length - 1)
        if lr
          (temp[0...(temp.length - dm[1])].join(' ' * dm[0]) + ' ' * (dm[0] + 1) +
           temp[(temp.length - dm[1])...(temp.length)].join(' ' * (dm[0] + 1))).strip
        else
          (temp[0...dm[1]].join(' ' * (dm[0] + 1)) + ' ' * (dm[0] + 1) +
           temp[dm[1]...temp.length].join(' ' * dm[0])).strip
        end
      end
    end

    l_to_r = false
    outpuff = ''
    pat = /(\w+[:.,;?!]? )/
    breite = 40
    inpuff = inpuff.gsub(/\s+/, ' ') + ' '
    while inpuff.length > breite
      inpuff.sub!(/((#{pat}+)(?=.*(?<=^.{#{breite+1}}))|#{pat})/) do |m|
        outpuff << m.fillstring(breite, (l_to_r = !l_to_r)) << "\n"
        ''
      end
    end
    outpuff << inpuff.fillstring(breite, !l_to_r)
    puts outpuff
    >>>>> end of program >>>>>
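
    The padding arithmetic inside "fillstring" can be looked at in isolation.
    The following sketch uses a hypothetical helper, gap_widths, that only
    reproduces the "divmod" step for the "lr = true" branch: the missing
    spaces are divided evenly over the gaps, and the remainder widens the
    rightmost gaps by one.

```ruby
# Hypothetical helper (not part of the program above): width of each gap
# between words when the line is padded to `len` characters, with the
# wider gaps placed on the right as in the lr = true branch.
def gap_widths(words, len)
  gaps = words.length - 1
  base, extra = (len - words.join.length).divmod(gaps)
  Array.new(gaps) { |i| i < gaps - extra ? base : base + 1 }
end

# 12 letters in a 20-character line leave 8 spaces for 3 gaps.
p gap_widths(%w[so wie Ihr Ruby], 20)  # [2, 3, 3]
```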
     
    Wolfgang Nádasi-Donner, Jun 30, 2006
    #8
  9. Wolfgang Nádasi-Donner

    ts Guest

    >>>>> "W" == Wolfgang Nádasi-Donner <> writes:

    W> "Currently \G is only fully supported when anchored to the start of the
    W> pattern; while it is permitted to use it elsewhere, as in /(?<=\G..)./g,
    W> some such uses (/.\G/g, for example) currently cause problems, and it is
    W> recommended that you avoid such usage for now."

    svg% cat a.pl
    $m = "Liebe Rubyistinnen und Rubyisten!
    anzutreffenden ausserordentlichen
    Freude, die Ihr, liebe Rubyistinnen und Rubyisten, bei der Benutzung unserer
    ausserordentlichen Lieblingssprache
    Ruby empfinden duerft, haben wir Euch alle hier zum Bankett der Rubyistinnen
    und Rubyisten geladen. Benutzt die
    Angebote des Bueffets genau so, wie Ihr Ruby nutzt: Einfach nach Wunsch
    zusammenstellen und geniessen.";

    while ($m =~ /(((\w+[:.,;?!]? )+)(?=.*(?=\G.{41}))|(\w+[:.,;?!]? ))/g) {
    print $1;
    }
    svg%

    svg% ./perl -l a.pl | more
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    Liebe
    [etc, etc, etc]
    svg%


    --

    Guy Decoux
     
    ts, Jun 30, 2006
    #9
  10. I received a mail from K. Kosako. He said that he will allow \G in
    look-behind in the next release.
     
    Wolfgang Nádasi-Donner, Jun 30, 2006
    #10
  11. > while ($m =~ /(((\w+[:.,;?!]? )+)(?=.*(?=\G.{41}))|(\w+[:.,;?!]? ))/g) {
    > print $1;
    > }


    Maybe this would not work under other circumstances, because the length
    check after "while" is necessary. But I am surprised by your result. Why
    does the match always start at the beginning? The pattern matching engine
    always restarts at the beginning, reusing already recognized characters.

    Somehow "\G" must confuse the matcher, because as I understand it, it is
    nothing other than a "moving left anchor", so that a pattern like
    "(?<=\G.{10})" means nothing more than "now reached position 10 after the
    last match".

    O.K., it is as it is at the moment. Workarounds are possible, so it is
    not a blocking problem.
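
    One possible workaround of this kind, sketched with "StringScanner" from
    Ruby's standard library. Its "scan" method only matches at the current
    scan position, which behaves like a "moving left anchor", and it does not
    modify the input. The helper name wrap_with_scanner is hypothetical, and
    the sketch only wraps the text without justifying the lines.

```ruby
require 'strscan'

# Hypothetical helper: word wrap driven by a scanner position instead of \G.
def wrap_with_scanner(text, width)
  # One line, one blank after each token (same preprocessing as in the thread).
  ss = StringScanner.new(text.gsub(/\s+/, ' ').strip + ' ')
  lines = []
  line = String.new
  # StringScanner#scan matches only at ss.pos and then advances ss.pos.
  while (token = ss.scan(/\S+ /))
    # token carries one trailing blank that does not count toward the width.
    if !line.empty? && line.length + token.length - 1 > width
      lines << line.rstrip
      line = String.new
    end
    line << token
  end
  lines << line.rstrip unless line.empty?
  lines
end

text = "Liebe Rubyistinnen und Rubyisten! Anlaesslich der immer wieder, ja " \
       "taeglich anzutreffenden ausserordentlichen Freude"
lines = wrap_with_scanner(text, 40)
puts lines
```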
     
    Wolfgang Nádasi-Donner, Jun 30, 2006
    #11