regex tidy

Discussion in 'Java' started by Roedy Green, May 23, 2014.

  1. Roedy Green

    Roedy Green Guest

    What would a Java regex tidier do?

    Here are some ideas:

    1. remove nugatory \ quoting

    2. convert \s*\s* --> \s*

    3. alphabetise | lists.

    4. remove (?:... ) that is not doing anything.

    5. put [..] lists in canonical order.

    6. Convert runs of 4+ to a-d notation.

    7. use negative char lists when it would shorten the list.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Young man, in mathematics you don’t understand things.
    You just get used to them.
    ~ John von Neumann (born: 1903-12-28 died: 1957-02-08 at age: 53)
     
    Roedy Green, May 23, 2014
    #1
    1. Advertisements

  2. On 5/23/2014 2:33 PM, Leif Roar Moldskred wrote:
    > Roedy Green <> wrote:
    >>
    >> What would a Java regex tidier do?
    >>

    >
    > Evil.
    >
    > Regular expressions should be written to as closely as
    > possible mirror a mental or conceptual model of the
    > inputs it will match, and not to be as short or compact
    > as possible.
    >
    > For instance, to match dates on the pattern "YY-MM-dd"
    > your regular expression should be "\\d{2}-\\d{2}-\\d{2}"
    > and not "((^|-)\\d{2}){3}"


    Those two regexes aren't actually equivalent. The latter matches
    -00-00-00, for example. I assume you meant "\\d{2}(?:-\\d{2}){2}"?


    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
     
    Joshua Cranmer ðŸ§, May 23, 2014
    #2
    1. Advertisements

  3. On 23.05.2014 20:39, Roedy Green wrote:
    > What would a Java regex tidier do?
    >
    > Here are some ideas:
    >
    > 1. remove nugatory \ quoting
    >
    > 2. convert \s*\s* --> \s*


    More generally convert

    R*R* into R*

    But who writes regexes like this?

    > 3. alphabetise | lists.


    What does that mean? Are you talking about merging common prefixes to
    get a more efficient matching experience? That might actually be done
    by a regex engine if it detects this situation.

    > 4. remove (?:... ) that is not doing anything.


    Hmm... It may still serve the purpose of documentation for the human.

    > 5. put [..] lists in canonical order.


    I don't believe this will improve anything.

    > 6. Convert runs of 4+ to a-d notation.


    Not sure what you mean here.

    > 7. use negative char lists when it would shorten the list.


    I don't believe this has any impact as this is something the regex
    engine will do internally. Also, if the list is longer the way it is
    written then someone probably had a reason to do so. Who would
    voluntarily write longer char classes than necessary?

    Generally I tend to agree with what Leif wrote: it may actually be
    harder to read a regex that you did not author. I am sceptical of the
    effort. And then, you must consider that some changes may actually hurt
    performance if the regex was specifically crafted for a particular engine.

    Kind regards

    robert
     
    Robert Klemme, May 24, 2014
    #3
  4. Roedy Green

    Roedy Green Guest

    On Sat, 24 May 2014 14:55:03 +0200, Robert Klemme
    <> wrote, quoted or indirectly quoted
    someone who said :

    >What does that mean? Are you talking about merging common prefixes to
    >get a more efficient matching experience? That might actually be done
    >by a regex engine if it detects this situation.



    (cat|rabbit|dog|cow)
    could be transformed to
    (cat|cow|dog|rabbit)

    for two reasons:
    make the list easier to proofread
    might help engine optimise handling the c in cat and cow in common.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Young man, in mathematics you don’t understand things.
    You just get used to them.
    ~ John von Neumann (born: 1903-12-28 died: 1957-02-08 at age: 53)
     
    Roedy Green, May 25, 2014
    #4
  5. Roedy Green

    Roedy Green Guest

    On Sat, 24 May 2014 14:55:03 +0200, Robert Klemme
    <> wrote, quoted or indirectly quoted
    someone who said :

    >> 5. put [..] lists in canonical order.

    >
    >I don't believe this will improve anything.


    If lists are in canonical order, they are easier to proofread,
    especially if they are complicated.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Young man, in mathematics you don’t understand things.
    You just get used to them.
    ~ John von Neumann (born: 1903-12-28 died: 1957-02-08 at age: 53)
     
    Roedy Green, May 25, 2014
    #5
  6. Roedy Green

    Roedy Green Guest

    On Sat, 24 May 2014 14:55:03 +0200, Robert Klemme
    <> wrote, quoted or indirectly quoted
    someone who said :

    >> 6. Convert runs of 4+ to a-d notation.

    >
    >Not sure what you mean here.


    see http://mindprod.com/project/regextidier.html

    I would convert

    [acdbde] to [a-e]
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Young man, in mathematics you don’t understand things.
    You just get used to them.
    ~ John von Neumann (born: 1903-12-28 died: 1957-02-08 at age: 53)
     
    Roedy Green, May 25, 2014
    #6
  7. Roedy Green

    Roedy Green Guest

    On Sat, 24 May 2014 14:55:03 +0200, Robert Klemme
    <> wrote, quoted or indirectly quoted
    someone who said :

    >> 7. use negative char lists when it would shorten the list.

    >
    >I don't believe this has any impact as this is something the regex
    >engine will do internally.


    I was thinking of a transform something like one that searched for
    every character but " by specifying every ascii char but one. It could
    be simplified to use a negative. Perhaps the author had never heard
    of negative searches.

    It is pretty clear any tidier will need to be configurable.
    --
    Roedy Green Canadian Mind Products http://mindprod.com
    Young man, in mathematics you don’t understand things.
    You just get used to them.
    ~ John von Neumann (born: 1903-12-28 died: 1957-02-08 at age: 53)
     
    Roedy Green, May 25, 2014
    #7
  8. On 25.05.2014 18:09, Roedy Green wrote:
    > On Sat, 24 May 2014 14:55:03 +0200, Robert Klemme
    > <> wrote, quoted or indirectly quoted
    > someone who said :


    >>> 3. alphabetise | lists.

    >> What does that mean? Are you talking about merging common prefixes to
    >> get a more efficient matching experience? That might actually be done
    >> by a regex engine if it detects this situation.

    >
    >
    > (cat|rabbit|dog|cow)
    > could be transformed to
    > (cat|cow|dog|rabbit)
    >
    > for two reasons:
    > make the list easier to proofread
    > might help engine optimise handling the c in cat and cow in common.


    The engine does not need the reordering to do this optimization. If you
    want to help lesser engines which do not optimize here you would have to
    transform it into

    (c(?:at|ow)|dog|rabbit)

    to get more efficient matching.

    Cheers

    robert
     
    Robert Klemme, May 25, 2014
    #8
  9. On 25.05.2014 19:28, Leif Roar Moldskred wrote:
    > Robert Klemme <> wrote:
    >>
    >> The engine does not need the reordering to do this optimization. If you
    >> want to help lesser engines which do not optimize here you would have to
    >> transform it into
    >>
    >> (c(?:at|ow)|dog|rabbit)
    >>
    >> to get more efficient matching.

    >
    > Besides, when we're talking about performance on this scale, the order
    > in which the terms are evaluated by the regex engine might matter:
    > we'd want it to check against the more common case first to minimize
    > the time spent backtracking.


    Only problem is that this cannot be done by looking at the expression -
    the optimizer would need additional data about inputs. That would make
    the reordering more complicated and the result would probably be no
    longer "tidy".

    I came to think that Roedy's main concern was human readability and not
    performance of the expression. However, changing a human crafted
    expression is probably not wise - for performance reasons (see below)
    but also for readability: it may actually be crucial for readability
    that the expression stays as written.

    > Now, regex engines don't _have_ to use the ordering of the term as a
    > suggestion for the order of evaluation, but as far as I'm aware,
    > that's how it's usually done.


    That does make sense: since NFSs are known to execute in order (as
    opposed to DFAs) it is wise to use the human given order because then we
    have a chance to do that optimization.

    > So if "rabbit" is much more prevalent in your expected input than either
    > "cat", "cow" or "dog", alphabetizing the terms would yield suboptimal
    > performance.


    Right, but see above.

    > (That said, if you find that such low-level details about the regex'
    > efficency is important enough that it is significant for your program,
    > you should probably rethink your whole approach.)


    Yes, for word search there are probably better approaches. I personally
    haven't come across a case where this order would have mattered. In
    many cases cost of IO will be dominant anyway.

    Cheers

    robert
     
    Robert Klemme, May 25, 2014
    #9
  10. On 5/25/2014 11:21 AM, Roedy Green wrote:
    > On Sat, 24 May 2014 14:55:03 +0200, Robert Klemme
    > <> wrote, quoted or indirectly quoted
    > someone who said :
    >
    >>> 7. use negative char lists when it would shorten the list.

    >>
    >> I don't believe this has any impact as this is something the regex
    >> engine will do internally.

    >
    > I was thinking of a transform something like one that searched for
    > every character but " by specifying every ascii char but one. It could
    > be simplified to use a negative. Perhaps the author had never heard
    > of negative searches.


    That wouldn't be correct. The original regex would not match, e.g., Õ,
    while your converted one would.

    [Although this does raise the question of how regular expressions handle
    or fail to handle characters like 💩 or the penguin in my display name.
    Stupid UTF-16 and the "it's fixed-width characters if you look at it
    funny" results.]

    --
    Beware of bugs in the above code; I have only proved it correct, not
    tried it. -- Donald E. Knuth
     
    Joshua Cranmer ðŸ§, May 25, 2014
    #10
    1. Advertisements

Want to reply to this thread or ask your own question?

It takes just 2 minutes to sign up (and it's free!). Just click the sign up button to choose a username and then you can ask your own questions on the forum.
Similar Threads
  1. d davis
    Replies:
    0
    Views:
    634
    d davis
    Apr 27, 2004
  2. Christoph Schneegans

    HTML Tidy in ASP.NET

    Christoph Schneegans, Nov 2, 2003, in forum: ASP .Net
    Replies:
    2
    Views:
    7,620
    mthakershi
    Apr 28, 2009
  3. Eric
    Replies:
    0
    Views:
    777
  4. Chris Harris

    Tidy configuration

    Chris Harris, Jun 24, 2003, in forum: HTML
    Replies:
    3
    Views:
    6,721
    Headless
    Jul 2, 2003
  5. Nick Howes
    Replies:
    6
    Views:
    811
    Toby A Inkster
    Nov 24, 2003
  6. Andrzej Adam Filip

    How to use tidy to "beautify" xhtml ?

    Andrzej Adam Filip, Jun 1, 2004, in forum: XML
    Replies:
    5
    Views:
    2,766
    Andrzej Adam Filip
    Jun 2, 2004
  7. Tobias Wendorff

    HTML Tidy

    Tobias Wendorff, Apr 27, 2005, in forum: C++
    Replies:
    26
    Views:
    2,843
    Phlip
    Apr 28, 2005
  8. Replies:
    3
    Views:
    1,155
    Reedick, Andrew
    Jul 1, 2008
Loading...