regex tidy

Roedy Green · May 23, 2014

What would a Java regex tidier do?

Here are some ideas:

1. remove nugatory \ quoting

2. convert \s*\s* --> \s*

3. alphabetise | lists.

4. remove (?:... ) that is not doing anything.

5. put [..] lists in canonical order.

6. Convert runs of 4+ to a-d notation.

7. use negative char lists when it would shorten the list.

Joshua Cranmer ðŸ§ · May 23, 2014

Evil.

Regular expressions should be written to as closely as
possible mirror a mental or conceptual model of the
inputs it will match, and not to be as short or compact
as possible.

For instance, to match dates on the pattern "YY-MM-dd"
your regular expression should be "\\d{2}-\\d{2}-\\d{2}"
and not "((^|-)\\d{2}){3}"

Those two regexes aren't actually equivalent. The latter matches
-00-00-00, for example. I assume you meant "\\d{2}(?:-\\d{2}){2}"?

Robert Klemme · May 24, 2014

What would a Java regex tidier do?

Here are some ideas:

1. remove nugatory \ quoting

2. convert \s*\s* --> \s*

More generally convert

R*R* into R*

But who writes regexes like this?

3. alphabetise | lists.

What does that mean? Are you talking about merging common prefixes to
get a more efficient matching experience? That might actually be done
by a regex engine if it detects this situation.

4. remove (?:... ) that is not doing anything.

Hmm... It may still serve the purpose of documentation for the human.

5. put [..] lists in canonical order.

I don't believe this will improve anything.

6. Convert runs of 4+ to a-d notation.

Not sure what you mean here.

7. use negative char lists when it would shorten the list.

I don't believe this has any impact as this is something the regex
engine will do internally. Also, if the list is longer the way it is
written then someone probably had a reason to do so. Who would
voluntarily write longer char classes than necessary?

Generally I tend to agree with what Leif wrote: it may actually be
harder to read a regex that you did not author. I am sceptical of the
effort. And then, you must consider that some changes may actually hurt
performance if the regex was specifically crafted for a particular engine.

Kind regards

robert

Roedy Green · May 25, 2014

What does that mean? Are you talking about merging common prefixes to
get a more efficient matching experience? That might actually be done
by a regex engine if it detects this situation.

(cat|rabbit|dog|cow)
could be transformed to
(cat|cow|dog|rabbit)

for two reasons:
make the list easier to proofread
might help engine optimise handling the c in cat and cow in common.

Roedy Green · May 25, 2014

5. put [..] lists in canonical order.

Click to expand...

I don't believe this will improve anything.

If lists are in canonical order, they are easier to proofread,
especially if they are complicated.

Roedy Green · May 25, 2014

Not sure what you mean here.

see http://mindprod.com/project/regextidier.html

I would convert

[acdbde] to [a-e]

Roedy Green · May 25, 2014

I don't believe this has any impact as this is something the regex
engine will do internally.

I was thinking of a transform something like one that searched for
every character but " by specifying every ascii char but one. It could
be simplified to use a negative. Perhaps the author had never heard
of negative searches.

It is pretty clear any tidier will need to be configurable.

Robert Klemme · May 25, 2014

(cat|rabbit|dog|cow)
could be transformed to
(cat|cow|dog|rabbit)

for two reasons:
make the list easier to proofread
might help engine optimise handling the c in cat and cow in common.

The engine does not need the reordering to do this optimization. If you
want to help lesser engines which do not optimize here you would have to
transform it into

(c(?:at|ow)|dog|rabbit)

to get more efficient matching.

Cheers

robert

Robert Klemme · May 25, 2014

Besides, when we're talking about performance on this scale, the order
in which the terms are evaluated by the regex engine might matter:
we'd want it to check against the more common case first to minimize
the time spent backtracking.

Only problem is that this cannot be done by looking at the expression -
the optimizer would need additional data about inputs. That would make
the reordering more complicated and the result would probably be no
longer "tidy".

I came to think that Roedy's main concern was human readability and not
performance of the expression. However, changing a human crafted
expression is probably not wise - for performance reasons (see below)
but also for readability: it may actually be crucial for readability
that the expression stays as written.

Now, regex engines don't _have_ to use the ordering of the term as a
suggestion for the order of evaluation, but as far as I'm aware,
that's how it's usually done.

That does make sense: since NFSs are known to execute in order (as
opposed to DFAs) it is wise to use the human given order because then we
have a chance to do that optimization.

So if "rabbit" is much more prevalent in your expected input than either
"cat", "cow" or "dog", alphabetizing the terms would yield suboptimal
performance.

Right, but see above.

(That said, if you find that such low-level details about the regex'
efficency is important enough that it is significant for your program,
you should probably rethink your whole approach.)

Yes, for word search there are probably better approaches. I personally
haven't come across a case where this order would have mattered. In
many cases cost of IO will be dominant anyway.

Cheers

robert

Joshua Cranmer ðŸ§ · May 25, 2014

I was thinking of a transform something like one that searched for
every character but " by specifying every ascii char but one. It could
be simplified to use a negative. Perhaps the author had never heard
of negative searches.

That wouldn't be correct. The original regex would not match, e.g., Ã•,
while your converted one would.

[Although this does raise the question of how regular expressions handle
or fail to handle characters like ðŸ’© or the penguin in my display name.
Stupid UTF-16 and the "it's fixed-width characters if you look at it
funny" results.]

Tic Tac Toe Game	2	Mar 10, 2024
Regex to match a numerical IP range	7	Dec 11, 2010
Adding adressing of IPv6 to program	1	Feb 16, 2023
Something is wrong	1	Mar 7, 2022
Trying to build a SARIMAX model to forecast the S&P500 trend	0	Nov 5, 2023
Trouble with prediction code, for the life of me I can't figure out why it isnt running properly. Help would be appreciated.	0	Jul 8, 2023
Tasks	1	Nov 29, 2022
URGENT	1	Jan 31, 2023

regex tidy

Roedy Green

Joshua Cranmer ðŸ§

Robert Klemme

Roedy Green

Roedy Green

Roedy Green

Roedy Green

Robert Klemme

Robert Klemme

Joshua Cranmer ðŸ§

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads