RegEx engine returning empty matches between valid tokens.

Discussion in 'Perl Misc' started by John otac0n Gietzen, Feb 5, 2006.

  1. Dear RegEx Gurus,

    I am writing an application to evaluate mathematical functions. The
    first step in the process of creating the expressions is tokenizing the
    input. I decided to use one large regular expression to perform this
    tokenization:

    ~\G([a-zA-Z]\w*\(|[a-zA-Z]\w*|(<=|>=|!=|<>|==|=)|0x[\da-fA-F.]*|0b[\d.]*|[\d.]*|\s*|.)~

    Now, according to my intuition, this should work. However, any time a
    single character that is not explicitly recognized as a token comes by,
    the regex engine returns two matches: one empty and one of the correct
    character.

    To simplify this odd behavior, I have prepared the following example:

    Match the string
    abcdefghijklmnop
    to the expression
    ~\G(a|b|c*|\w)~
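
The example above can be reproduced with a short script. This is only a sketch in Python rather than Perl: Python's `re` module has no `\G`, so it relies on `finditer`, which resumes each match where the previous one ended and (as of Python 3.7) allows a non-empty match to start right after an empty one. The variable names are illustrative.

```python
import re

# Same alternatives as the simplified example, minus the \G anchor,
# which Python's re module does not support.
pattern = re.compile(r"a|b|c*|\w")
text = "abcdefghijklmnop"

# finditer continues from the end of the previous match, so no
# characters are skipped here: every position matches something.
matches = [m.group(0) for m in pattern.finditer(text)]

# Print each match with repr() so empty strings are visible.
for token in matches:
    print(repr(token))
```

After "a", "b", and "c", the list should contain empty strings interleaved with the remaining letters, which is the behavior described above.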

    This "anomaly" is seen in the Perl, PHP, and C# regex engines (which
    makes me think that it is expected behavior). The final destination
    for this regex is C#, so I cannot just ignore null entries. (The C#
    regex engine stops after the first null match.) Any help or advice
    would be much appreciated.

    Sincerely,
    John "Otac0n" Gietzen
     
    John otac0n Gietzen, Feb 5, 2006
    #1

  2. Xicheng Guest

    John otac0n Gietzen wrote:
    > Dear RegEx Gurus,
    >
    > I am writing an application to evaluate mathematical functions. The
    > first step in the process of creating the expressions is tokenizing the
    > input. I decided to use one large regular expression to perform this
    > tokenization:
    >
    > ~\G([a-zA-Z]\w*\(|[a-zA-Z]\w*|(<=|>=|!=|<>|==|=)|0x[\da-fA-F.]*|0b[\d.]*|[\d.]*|\s*|.)~
    > Now, according to my intuition, this should work. However, any time a
    > single character that is not explicitly recognized as a token comes by,
    > the regex engine returns two matches: one empty and one of the correct
    > character.
    >
    > To simplify this odd behavior, I have prepared the following example:
    >
    > Match the string
    > abcdefghijklmnop
    > to the expression
    > ~\G(a|b|c*|\w)~

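    When you use "c*" as one of the alternatives, the regex effectively
    behaves like this:

    ~\G(a|b|c+||\w)~

    so you have five alternatives (instead of four), one of which is the
    empty string. That empty alternative succeeds at every position where
    "a", "b", and "c+" fail, so it is tried (and matches) before "\w" is
    ever reached. The engine will not accept a second empty match at the
    same position, so on the next attempt it falls through to "\w", which
    is why you get an empty match followed by the correct character. If you
    want one or more "c" characters in your matched text, use "c+" instead
    of "c*".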
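
Applied to the full tokenizer pattern, the same advice means changing the two alternatives that can match nothing, `[\d.]*` and `\s*`, to use `+`. The following is a minimal sketch in Python, not the original poster's C# code: the `tokenize` helper and the `TOKEN` name are hypothetical, and since Python has no `\G`, the anchor is emulated by calling `match` at an explicit position.

```python
import re

# Assumption: same alternatives as the original pattern, except that
# [\d.]* and \s* (the two that can match the empty string) now use +.
TOKEN = re.compile(
    r"[a-zA-Z]\w*\(|[a-zA-Z]\w*|<=|>=|!=|<>|==|=|"
    r"0x[\da-fA-F.]*|0b[\d.]*|[\d.]+|\s+|."
)

def tokenize(text):
    # pos plays the role of \G: each match must start exactly where
    # the previous one ended.
    tokens = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None or m.end() == pos:
            # Defensive guard: no alternative can match empty, so this
            # should not trigger with the pattern above.
            raise ValueError(f"cannot tokenize at position {pos}")
        tokens.append(m.group(0))
        pos = m.end()
    return tokens

print(tokenize("f(x)<=0x1F + 2.5*y"))
```

Because every alternative now consumes at least one character, the catch-all `.` is reached directly for unrecognized characters and no empty tokens are produced.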

    Xicheng

    > This "anomaly" is seen in the Perl, PHP, and C# regex engines (which
    > makes me think that it is expected behavior). The final destination
    > for this regex is C#, so I cannot just ignore null entries. (The C#
    > regex engine stops after the first null match.) Any help or advice
    > would be much appreciated.
    >
    > Sincerely,
    > John "Otac0n" Gietzen
     
    Xicheng, Feb 5, 2006
    #2

  3. Brilliant! Thanks very much.
     
    John otac0n Gietzen, Feb 5, 2006
    #3