Partial Regular Expression Matching

Discussion in 'Ruby' started by Hans Fugal, May 22, 2007.

  1. Hans Fugal

    Hans Fugal Guest

    I would like to identify partial matching of a regular expression, for a
    stream of input, as described in the pcrepartial(3) manpage. Is this
    possible with ruby Regexp, or would I have to wrap (a piece of) pcre?
    (or implement my own regular expression engine, hah!)

    As an aside, what I am really trying to do is write a lexer that works
    on stream input, and can decide whether any of the eligible tokens match
    before reading EOF (which may be a long, long way off both in bytes and
    time). If you can think of another approach (that still uses regexes)
    that'd work too.

    Thanks

    1. http://www.gammon.com.au/pcre/pcrepartial.html
    Hans Fugal, May 22, 2007
    #1

  2. On Tue, May 22, 2007 at 03:20:04PM +0900, Hans Fugal wrote:
    > I would like to identify partial matching of a regular expression, for a
    > stream of input, as described in the pcrepartial(3) manpage. Is this
    > possible with ruby Regexp, or would I have to wrap (a piece of) pcre?
    > (or implement my own regular expression engine, hah!)


    It looks like someone has wrapped pcre already:
    http://raa.ruby-lang.org/project/pcre/
    but that's quite old so you might need to fiddle with it a bit.

    > As an aside, what I am really trying to do is write a lexer that works
    > on stream input, and can decide whether any of the eligible tokens match
    > before reading EOF (which may be a long, long way off both in bytes and
    > time). If you can think of another approach (that still uses regexes)
    > that'd work too.


    Well, you can use regexps to distinguish a complete token from a partial
    one, simply by checking if it is followed by a character which is not part
    of the token. A little care is needed to handle EOF correctly - at worst you
    could just stick a sentinel character onto the end.

    A simple example, which matches (\w+) and (\s+) as tokens:

    require 'stringio'
    stream = StringIO.new("wibble bibble boing")

    token = ""
    chunk = stream.read(1)
    token << chunk if chunk
    loop do
      case token
      when /\A\w+/
        match = $&
      when /\A\s+/
        match = $&
      else
        puts "Syntax error here! " + token.inspect
        break
      end

      if match.size < token.size or chunk.nil?
        puts "Match token: " + token.slice!(0,match.size).inspect
        break if chunk.nil?
      else
        #puts "Partial match: " + token.inspect
        chunk = stream.read(1)
        token << chunk if chunk
      end
    end

    This should also work if you use, say, read(4096) instead of read(1), so it
    ought to be pretty efficient.

    Regards,

    Brian.
    Brian Candler, May 22, 2007
    #2

  3. Hans Fugal

    Hans Fugal Guest

    Brian Candler wrote:

    >> As an aside, what I am really trying to do is write a lexer that works
    >> on stream input, and can decide whether any of the eligible tokens match
    >> before reading EOF (which may be a long, long way off both in bytes and
    >> time). If you can think of another approach (that still uses regexes)
    >> that'd work too.

    >
    > Well, you can use regexps to distinguish a complete token from a partial
    > one, simply by checking if it is followed by a character which is not part
    > of the token. A little care is needed to handle EOF correctly - at worst you
    > could just stick a sentinel character onto the end.
    >
    > A simple example, which matches (\w+) and (\s+) as tokens:
    >
    > require 'stringio'
    > stream = StringIO.new("wibble bibble boing")
    >
    > token = ""
    > chunk = stream.read(1)
    > token << chunk if chunk
    > loop do
    > case token
    > when /\A\w+/
    > match = $&
    > when /\A\s+/
    > match = $&
    > else
    > puts "Syntax error here! " + token.inspect
    > break
    > end
    >
    > if match.size < token.size or chunk.nil?
    > puts "Match token: " + token.slice!(0,match.size).inspect
    > break if chunk.nil?
    > else
    > #puts "Partial match: " + token.inspect
    > chunk = stream.read(1)
    > token << chunk if chunk
    > end
    > end
    >
    > This should also work if you use, say, read(4096) instead of read(1), so it
    > ought to be pretty efficient.
    >


    Well that works for \w+ and \s+, but what if you want to match /01+0/?
    You'd get a syntax error on 0111 even though it's a valid partial match.
    Hans Fugal, May 22, 2007
    #3
  4. On Tue, May 22, 2007 at 03:20:04PM +0900, Hans Fugal wrote:
    > I would like to identify partial matching of a regular expression, for a
    > stream of input, as described in the pcrepartial(3) manpage. Is this
    > possible with ruby Regexp, or would I have to wrap (a piece of) pcre?
    > (or implement my own regular expression engine, hah!)
    >
    > As an aside, what I am really trying to do is write a lexer that works
    > on stream input, and can decide whether any of the eligible tokens match
    > before reading EOF (which may be a long, long way off both in bytes and
    > time). If you can think of another approach (that still uses regexes)
    > that'd work too.

    Why not use rex?
    Also StringScanner may be helpful for this.
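
    For example, here's a rough sketch (the token patterns and the buffer
    handling are just placeholders) of how StringScanner can chew tokens off
    a buffer of accumulated input; note that the last token in the buffer may
    of course still be partial until more data arrives:

    require 'strscan'

    scanner = StringScanner.new("")
    scanner << "wibble bibble "   # append data as it arrives from the stream

    until scanner.eos?
      if tok = scanner.scan(/\w+/)
        puts "word:  #{tok.inspect}"
      elsif tok = scanner.scan(/\s+/)
        puts "space: #{tok.inspect}"
      else
        puts "no token matches at #{scanner.rest.inspect}"
        break
      end
    end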
    Logan Capaldo, May 22, 2007
    #4
  5. On 5/22/07, Hans Fugal <> wrote:

    > Well that works for \w+ an \s+, but what if you want to match /01+0/?
    > You'd get a syntax error on 0111 even though it's a valid partial match.


    Hans, I'm not sure I understand your use case. Perhaps you could
    provide some code as you would write it IF Ruby provided a
    match_partial method for Regexp.

    The normal use case for partial re matching is that you are processing
    interactively accumulated input, and want to check that what the user
    has typed so far is a valid prefix of one of the valid inputs. As far as
    I can see, the best way to do that with the current Ruby regexp engine
    is to write a regexp which fully matches all prefixes:

    $ cat part_mat.rb
    full_pat = /^01+0/
    part_pat = /^((0|01+)0?)?$/

    (%w(0 01 010 0100 011 0110 01100) << "").each do |str|
      m = part_pat.match(str)
      if m
        puts "\"#{str}\" partially matches as \"#{m.string}\""
      else
        puts "\"#{str}\" does not match"
      end
    end

    $ ruby part_mat.rb
    "0" partially matches as "0"
    "01" partially matches as "01"
    "010" partially matches as "010"
    "0100" does not match
    "011" partially matches as "011"
    "0110" partially matches as "0110"
    "01100" does not match
    "" partially matches as ""

    It might be possible to take a regexp and automatically generate
    another regexp which will match its prefixes.

    Might make an interesting rubyquiz.
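
    As a rough illustration of the idea, here's a throwaway sketch (the
    prefix_regexp helper is made up) that only handles plain literal strings;
    anything involving real regexp operators gets much harder:

    # Build a regexp matching every prefix of a literal string,
    # e.g. "abc" => /\A(?:a(b(c)?)?)?\z/
    def prefix_regexp(literal)
      body = literal.split(//).map { |c| Regexp.escape(c) }.
                     reverse.inject(nil) { |acc, c| acc ? "#{c}(#{acc})?" : c }
      Regexp.new("\\A(?:#{body})?\\z")
    end

    p prefix_regexp("begin") =~ "beg"   #=> 0   (valid prefix)
    p prefix_regexp("begin") =~ "bex"   #=> nil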

    --
    Rick DeNatale

    My blog on Ruby
    http://talklikeaduck.denhaven2.com/
    Rick DeNatale, May 22, 2007
    #5
  6. On Wed, May 23, 2007 at 01:00:04AM +0900, Hans Fugal wrote:
    > Well that works for \w+ an \s+, but what if you want to match /01+0/?
    > You'd get a syntax error on 0111 even though it's a valid partial match.


    OK, I see the problem - it's not detecting the end of the expression, it's
    saying that this expression *might* match but only if the right characters
    were appended to the end of the source.

    In the general case I think you'd have to turn each RE into one which
    matches all possible prefixes, perhaps something like

    /(0(1+(0)?)?)/ # (note *)

    However, if you can guarantee that no individual valid token is going to be
    longer than a certain size (let's say 200 characters) then it would be
    simpler to ensure that you read-ahead at least 200 characters into a buffer
    and then match against that.

    Alternatively: perhaps only a few of your token REs have unlimited variable
    length. Those you can code in the prefix form like that shown above. The
    remainder (of fixed or limited length) can just be matched in the simple way
    against a large enough read-ahead buffer.

    Regards,

    Brian.

    (*) Hmm, this isn't quite right, since it partially matches 011112 as well.
    You could check for a partial match (i.e. $3 is nil) and allow it only if it
    consumes the whole string.

    Alternatively, the RE itself needs to say "must be followed by X or end of
    string". This works, but it's a bit ugly:

    /(0(\z|1+(\z|0)))/

    I can't think of a better formulation off the top of my head though.
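
    For what it's worth, a quick check of that form (with a \A anchor added
    here so it only tests from the start of the string) behaves as intended:

    prefix_re = /\A(0(\z|1+(\z|0)))/

    %w(0 01 011 0111 0110 011112 10).each do |s|
      puts "#{s.inspect}: #{prefix_re =~ s ? 'prefix or full match' : 'no match'}"
    end
    # "0", "01", "011" and "0111" are accepted as prefixes, "0110" as a full
    # match, while "011112" and "10" are rejected.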
    Brian Candler, May 22, 2007
    #6
  7. Hans Fugal

    Hans Fugal Guest

    Rick DeNatale wrote:
    > On 5/22/07, Hans Fugal <> wrote:
    >
    >> Well that works for \w+ an \s+, but what if you want to match /01+0/?
    >> You'd get a syntax error on 0111 even though it's a valid partial match.

    >
    > Han's I'm not sure I understand your use case. Perhaps you could
    > provide some code as you would write it IF Ruby provided a
    > match_partial method for Regexp.


    It's a thought exercise. I've been fiddling with parser generators, and
    had an idea for a simple recursive-descent parser that includes the
    lexer by defining terminals as regexes. Example:


    # productions
    expr: term       {. expr0 = term .}
          ( '+' term {. expr0 += term3 .}
          | '-' term {. expr0 -= term4 .}
          )*
          ;

    term: fact       {. term0 = fact1 .}
          ( '*' fact {. term0 *= fact2 .}
          | '/' fact {. term0 /= fact3 .}
          )*
          ;

    fact: ['+'] const  {. fact0 = const1.to_f .}
        | '-' const    {. fact0 = -const2.to_f .}
        | '(' expr ')' {. fact0 = expr1 .}
        ;

    # terminals
    const: /\d+[\.\d+]/ = '0';

    Whether a parser generator or a generic lexer, in order to do the lexing
    generically from a set of regexes, you need to be able to say "doesn't
    match" in order to catch syntax errors in a timely manner. It's easy
    enough if you have all of the input, or "a lot" which is reasonably
    expected to be longer than any token, or if you can count on tokens not
    crossing a guard (such as a newline), but in general you need to do
    partial matching.

    So the code might look like:

    r = /\A#{terminals.inject {|u,r| Regexp.union(u,r)}}/
    until input.eof?
      if r =~ input
        # figure out which token and consume/return it
      elsif r.partial_match(input)
        # wait for more input
      end
    end

    That may not be the most efficient way, but it gives a good idea.

    The problem also applies to input validation, e.g. in a field on a form,
    as has been mentioned.
    Hans Fugal, May 25, 2007
    #7
  8. On Fri, May 25, 2007 at 11:25:09PM +0900, Hans Fugal wrote:
    > # productions
    > expr: term {. expr0 = term .}
    > ( '+' term {. expr0 += term3 .}
    > | '-' term {. expr0 -= term4 .}
    > )*
    > ;
    >
    > term: fact {. term0 = fact1 .}
    > ( '*' fact {. term0 *= fact2 .}
    > | '/' fact {. term0 /= fact3 .}
    > )*
    > ;
    >
    > fact: ['+'] const {. fact0 = const1.to_f .}
    > | '-' const {. fact0 = -const2.to_f .}
    > | '(' expr ')' {. fact0 = expr1 .}
    > ;
    >
    > # terminals
    > const: /\d+[\.\d+]/ = '0';


    I imagine it should be
    const: /\d+(\.\d+)?/

    > It's easy
    > enough if you have all of the input, or "a lot" which is reasonably
    > expected to be longer than any token, or if you can count on tokens not
    > crossing a guard (such as a newline), but in general you need to do
    > partial matching.


    All your tokens above are one character, so a one-character lookahead is
    fine, apart from 'const'.

    If you read your input in 4096 byte blocks, reading a new block when your
    buffer is less than 4096 bytes full, then you'll have somewhere between 4K
    and 8K of lookahead. You'd do that for efficiency anyway, I'd hope.

    So this leaves the case of any terminal regexps which might be required to
    match more than 4K of data as a single token. If you're parsing a language
    like that, then I'd agree that having partial matching makes your code a bit
    simpler. But otherwise, you can write your regexp to match partially:

    const: /\d+(\.(\d+)?)?/

    and if a partial match is detected, keep eating more input as necessary.
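
    Something along these lines, for example (the names are made up; the idea
    is just to keep reading whenever the match runs all the way to the end of
    the buffer, since the token might still grow):

    CONST_RE = /\A\d+(\.(\d+)?)?/

    def read_const(stream, buf)
      loop do
        m = CONST_RE.match(buf)
        return nil unless m                  # buffer doesn't start with a const
        if m.end(0) < buf.size || stream.eof?
          return buf.slice!(0, m.end(0))     # token is definitely complete
        end
        chunk = stream.read(4096)            # match reached the end of the buffer
        buf << chunk if chunk
      end
    end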

    Note: the worst case is that the entire file consists of a single token - in
    which case, you *will* end up reading the whole file into memory anyway.

    Regards,

    Brian.
    Brian Candler, May 25, 2007
    #8
  9. Hans Fugal

    Hans Fugal Guest

    Brian Candler wrote:
    > On Fri, May 25, 2007 at 11:25:09PM +0900, Hans Fugal wrote:
    >> # productions
    >> expr: term {. expr0 = term .}
    >> ( '+' term {. expr0 += term3 .}
    >> | '-' term {. expr0 -= term4 .}
    >> )*
    >> ;
    >>
    >> term: fact {. term0 = fact1 .}
    >> ( '*' fact {. term0 *= fact2 .}
    >> | '/' fact {. term0 /= fact3 .}
    >> )*
    >> ;
    >>
    >> fact: ['+'] const {. fact0 = const1.to_f .}
    >> | '-' const {. fact0 = -const2.to_f .}
    >> | '(' expr ')' {. fact0 = expr1 .}
    >> ;
    >>
    >> # terminals
    >> const: /\d+[\.\d+]/ = '0';

    >
    > I imagine it should be
    > const: /\d+(\.\d+)?/


    Yes, of course. Sorry.

    >
    >> It's easy
    >> enough if you have all of the input, or "a lot" which is reasonably
    >> expected to be longer than any token, or if you can count on tokens not
    >> crossing a guard (such as a newline), but in general you need to do
    >> partial matching.

    >
    > All your tokens above are one character, so a one character lookahead is
    > fine, apart from 'const'


    My example is not meant to be pathological. It's not hard to imagine a
    set of terminal regexes where you need much more than one-character
    lookahead.

    >
    > If you read your input in 4096 byte blocks, reading a new block when your
    > buffer is less than 4096 bytes full, then you'll have somewhere between 4K
    > and 8K of lookahead. You'd do that for efficiency anyway, I'd hope.


    When was the last time you typed 4k faster than the computer could
    process it? The biggest reason for stream parsing is when there's a
    human on the other end. In practice, you usually wouldn't have trouble
    assuming a newline is never part of a token, because nothing arrives on
    stdin until the user presses enter. But that's not entirely a
    guarantee, and there can be other situations (like a network stream)
    where that might not be the case.

    >
    > So this leaves the case of any terminal regexps which might be required to
    > match more than 4K of data as a single token. If you're parsing a language
    > like that, then I'd agree that having partial matching makes your code a bit
    > simpler. But otherwise, you can write your regexp to match partially:
    >
    > const: /\d+(\.(\d+)?)?/
    >
    > and if a partial match is detected, keep eating more input as necessary.
    >


    Easy enough in a hand-written lexer, but I wouldn't want to ask users of
    a parser generator to have to provide partial-match regexes. And parsing
    regexes to discover a partial match regex is not an easy task.
    Hans Fugal, May 26, 2007
    #9
  10. On Sun, May 27, 2007 at 12:45:12AM +0900, Hans Fugal wrote:
    > >If you read your input in 4096 byte blocks, reading a new block when your
    > >buffer is less than 4096 bytes full, then you'll have somewhere between 4K
    > >and 8K of lookahead. You'd do that for efficiency anyway, I'd hope.

    >
    > When was the last time you typed 4k faster than the computer could
    > process it? The biggest reason for stream parsing is when there's a
    > human on the other end.


    OK. So it sounds like the problem is that you *need* partial matching,
    because you need to handle parse-as-you-type, rather than parse an existing
    file presented as a stream.

    In which case the conclusions seem clear:

    (1) Find an existing regular expression engine or lexical analyser engine
    which does what you need. There are several which have been ported to Ruby
    which you could try.

    or:

    (2) Write your own Regular Expression engine.

    Sorry for stating the obvious. But I can't really see what else you want
    from this thread. Commiseration that Ruby's built-in regexp engine isn't
    suitable for your requirements? :)

    Regards,

    Brian.
    Brian Candler, May 26, 2007
    #10