regex tidy

Discussion in 'Java' started by Roedy Green, May 23, 2014.

  1. Roedy Green

    What would a Java regex tidier do?

    Here are some ideas:

    1. remove nugatory \ quoting

    2. convert \s*\s* --> \s*

    3. alphabetise | lists.

    4. remove (?:... ) that is not doing anything.

    5. put [..] lists in canonical order.

    6. Convert runs of 4+ to a-d notation.

    7. use negative char lists when it would shorten the list.
    Roedy Green, May 23, 2014
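A tidier along these lines would need some way to check that a rewrite keeps the meaning of the pattern. The sketch below spot-checks ideas 1, 2, 4 and 6 by comparing the original and tidied patterns on sample strings. The class name, the method name and the sample patterns are made up for illustration, and agreement on a handful of samples is only a sanity check, not a proof of equivalence.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class TidyEquivalenceCheck {

    /** True if both patterns accept or reject every sample identically (whole-string match). */
    static boolean agreeOnSamples(String before, String after, List<String> samples) {
        Pattern p = Pattern.compile(before);
        Pattern q = Pattern.compile(after);
        for (String s : samples) {
            if (p.matcher(s).matches() != q.matcher(s).matches()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical samples; a real tidier would need a far more thorough check.
        List<String> samples = Arrays.asList(
                "", "a", "b", "c", "d", "e", "a-b", "ab", "abc", "xy", "x y", "x \t y", "x-y");

        // Idea 1: the backslash before '-' is unnecessary outside a character class.
        System.out.println(agreeOnSamples("a\\-b", "a-b", samples));
        // Idea 2: \s*\s* accepts the same strings as \s*.
        System.out.println(agreeOnSamples("x\\s*\\s*y", "x\\s*y", samples));
        // Idea 4: a non-capturing group around the whole pattern does nothing.
        System.out.println(agreeOnSamples("(?:abc)", "abc", samples));
        // Idea 6: a run of consecutive characters written as a range.
        System.out.println(agreeOnSamples("[abcd]", "[a-d]", samples));
    }
}
```

Each of the four calls should print true, since these particular rewrites do not change which strings are accepted.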

  2. Those two regexes aren't actually equivalent. The latter matches
    -00-00-00, for example. I assume you meant "\\d{2}(?:-\\d{2}){2}"?
    Joshua Cranmer 🐧, May 23, 2014
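The two regexes being compared did not survive in this copy of the thread, so the sketch below only exercises the pattern proposed in the post above. It shows that this pattern accepts 00-00-00 and rejects the string with a leading hyphen. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

final class TwoDigitGroups {

    // The pattern quoted in the post: two digits, then exactly two "-dd" groups.
    static final Pattern SUGGESTED = Pattern.compile("\\d{2}(?:-\\d{2}){2}");

    /** Whole-string match against the suggested pattern. */
    static boolean matchesWhole(String s) {
        return SUGGESTED.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("00-00-00"));   // true
        System.out.println(matchesWhole("-00-00-00"));  // false: no leading digits
        System.out.println(matchesWhole("00-00"));      // false: only one "-dd" group
    }
}
```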

  3. More generally convert

    R*R* into R*

    But who writes regexes like this?

    What does that mean? Are you talking about merging common prefixes to
    get a more efficient matching experience? That might actually be done
    by a regex engine if it detects this situation.

    Hmm... It may still serve the purpose of documentation for the human.

    I don't believe this will improve anything.

    Not sure what you mean here.

    I don't believe this has any impact as this is something the regex
    engine will do internally. Also, if the list is longer the way it is
    written then someone probably had a reason to do so. Who would
    voluntarily write longer char classes than necessary?

    Generally I tend to agree with what Leif wrote: it may actually be
    harder to read a regex that you did not author. I am sceptical of the
    effort. And then, you must consider that some changes may actually hurt
    performance if the regex was specifically crafted for a particular engine.

    Kind regards

    Robert Klemme, May 24, 2014
  4. Roedy Green

    could be transformed to

    for two reasons:

    1. to make the list easier to proofread

    2. it might help the engine optimise handling of the c that cat and cow have in common.
    Roedy Green, May 25, 2014
  5. Roedy Green

    If lists are in canonical order, they are easier to proofread,
    especially if they are complicated.
    Roedy Green, May 25, 2014
  6. Roedy Green

    Roedy Green, May 25, 2014
  7. Roedy Green

    Roedy Green Guest

    I was thinking of a transform for something like a class that matched
    every character but " by listing every ASCII char but that one. It could
    be simplified to use a negated class. Perhaps the author had never heard
    of negated classes.

    It is pretty clear any tidier will need to be configurable.
    Roedy Green, May 25, 2014
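A minimal sketch of the negation idea, with one caveat a tidier would have to handle: an explicit list of ASCII characters and a negated class only agree on ASCII input. The three patterns and the class name below are illustrative. The third pattern uses Java's character class intersection (`&&`) to keep the ASCII-only behaviour while still using a negation.

```java
import java.util.regex.Pattern;

final class NegatedClassDemo {

    // Every ASCII character except the double quote (0x22), written as explicit ranges.
    static final Pattern EXPLICIT = Pattern.compile("[\\x00-\\x21\\x23-\\x7F]");
    // The shorter negated form: any character at all except the double quote.
    static final Pattern NEGATED = Pattern.compile("[^\"]");
    // Negation restricted to ASCII via character class intersection.
    static final Pattern ASCII_NEGATED = Pattern.compile("[\\x00-\\x7F&&[^\"]]");

    static boolean explicit(String s)     { return EXPLICIT.matcher(s).matches(); }
    static boolean negated(String s)      { return NEGATED.matcher(s).matches(); }
    static boolean asciiNegated(String s) { return ASCII_NEGATED.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] inputs = { "a", "\"", "\u00D5" };  // the last one is the non-ASCII letter O with tilde
        for (String s : inputs) {
            System.out.println(explicit(s) + " " + negated(s) + " " + asciiNegated(s));
        }
    }
}
```

For the non-ASCII input only the plain negated class matches, so rewriting the explicit list as `[^"]` would change behaviour unless the input is known to be ASCII.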
  8. The engine does not need the reordering to do this optimization. If you
    want to help lesser engines which do not optimize here you would have to
    transform it into


    to get more efficient matching.


    Robert Klemme, May 25, 2014
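The transformed expression was lost from this copy of the thread. As an assumption about the kind of rewrite meant, the sketch below factors the shared leading c out of a hypothetical alternation of cat, cow and dog, and checks that both forms find the same matches. It makes no claim about speed, which depends on the engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PrefixFactoringDemo {

    // Hypothetical word list; only "cat" and "cow" are mentioned in the thread.
    static final Pattern FLAT = Pattern.compile("cat|cow|dog");
    // Same alternatives with the common prefix "c" factored out.
    static final Pattern FACTORED = Pattern.compile("c(?:at|ow)|dog");

    /** Collects every match of the pattern in the text, in order. */
    static List<String> findAll(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "a cow chased the cat and the dog; catalog";
        System.out.println(findAll(FLAT, text));
        System.out.println(findAll(FACTORED, text));
    }
}
```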
  9. Only problem is that this cannot be done by looking at the expression -
    the optimizer would need additional data about inputs. That would make
    the reordering more complicated and the result would probably be no
    longer "tidy".

    I came to think that Roedy's main concern was human readability and not
    performance of the expression. However, changing a human crafted
    expression is probably not wise - for performance reasons (see below)
    but also for readability: it may actually be crucial for readability
    that the expression stays as written.

    That does make sense: since NFAs are known to execute in order (as
    opposed to DFAs) it is wise to use the human-given order because then we
    have a chance to do that optimization.

    Right, but see above.

    Yes, for word search there are probably better approaches. I personally
    haven't come across a case where this order would have mattered. In
    many cases the cost of I/O will be dominant anyway.


    Robert Klemme, May 25, 2014
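The order of alternatives can matter for correctness as well as speed. java.util.regex is a backtracking engine that tries alternatives left to right and takes the first one that succeeds, so alphabetising a | list can change what find() returns. The two patterns below are hypothetical examples chosen to show this.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlternationOrderDemo {

    /** Returns the first match of the regex in the input, or null if there is none. */
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Order as an author might write it: longer alternative first.
        System.out.println(firstMatch("integer|int", "integer"));  // integer
        // The same list after alphabetising: the shorter alternative now wins.
        System.out.println(firstMatch("int|integer", "integer"));  // int
    }
}
```

With matches() both orders accept the same strings, because the engine backtracks into the second alternative. The difference shows up with find(), lookingAt() and when the alternation is followed by more pattern.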
  10. That wouldn't be correct. The original regex would not match, e.g., Õ,
    while your converted one would.

    [Although this does raise the question of how regular expressions handle
    or fail to handle characters like 💩 or the penguin in my display name.
    Stupid UTF-16 and the "it's fixed-width characters if you look at it
    funny" results.]
    Joshua Cranmer 🐧, May 25, 2014
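On the question of characters outside the Basic Multilingual Plane: java.util.regex matches by code point, so a surrogate pair is treated as a single character even though it occupies two chars in the String. The sketch below checks this for U+1F4A9. The class and helper names are illustrative.

```java
import java.util.regex.Pattern;

final class SupplementaryCharDemo {

    // U+1F4A9 written as its UTF-16 surrogate pair.
    static final String PILE = "\uD83D\uDCA9";

    /** Whole-string match of the input against the regex. */
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(PILE.length());                          // 2 UTF-16 code units
        System.out.println(PILE.codePointCount(0, PILE.length()));  // 1 code point
        System.out.println(matchesWhole(".", PILE));                // true: one code point
        System.out.println(matchesWhole("..", PILE));               // false: not two characters
        System.out.println(matchesWhole("[^\"]", PILE));            // true: negated class, one code point
    }
}
```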