regex tidy

R

Roedy Green

What would a Java regex tidier do?

Here are some ideas:

1. remove nugatory \ quoting

2. convert \s*\s* --> \s*

3. alphabetise | lists.

4. remove (?:... ) that is not doing anything.

5. put [..] lists in canonical order.

6. Convert runs of 4+ to a-d notation.

7. use negative char lists when it would shorten the list.
 
J

Joshua Cranmer ðŸ§

Evil.

Regular expressions should be written to as closely as
possible mirror a mental or conceptual model of the
inputs it will match, and not to be as short or compact
as possible.

For instance, to match dates on the pattern "YY-MM-dd"
your regular expression should be "\\d{2}-\\d{2}-\\d{2}"
and not "((^|-)\\d{2}){3}"

Those two regexes aren't actually equivalent. The latter matches
-00-00-00, for example. I assume you meant "\\d{2}(?:-\\d{2}){2}"?
 
R

Robert Klemme

What would a Java regex tidier do?

Here are some ideas:

1. remove nugatory \ quoting

2. convert \s*\s* --> \s*

More generally convert

R*R* into R*

But who writes regexes like this?
3. alphabetise | lists.

What does that mean? Are you talking about merging common prefixes to
get a more efficient matching experience? That might actually be done
by a regex engine if it detects this situation.
4. remove (?:... ) that is not doing anything.

Hmm... It may still serve the purpose of documentation for the human.
5. put [..] lists in canonical order.

I don't believe this will improve anything.
6. Convert runs of 4+ to a-d notation.

Not sure what you mean here.
7. use negative char lists when it would shorten the list.

I don't believe this has any impact as this is something the regex
engine will do internally. Also, if the list is longer the way it is
written then someone probably had a reason to do so. Who would
voluntarily write longer char classes than necessary?

Generally I tend to agree with what Leif wrote: it may actually be
harder to read a regex that you did not author. I am sceptical of the
effort. And then, you must consider that some changes may actually hurt
performance if the regex was specifically crafted for a particular engine.

Kind regards

robert
 
R

Roedy Green

What does that mean? Are you talking about merging common prefixes to
get a more efficient matching experience? That might actually be done
by a regex engine if it detects this situation.


(cat|rabbit|dog|cow)
could be transformed to
(cat|cow|dog|rabbit)

for two reasons:
make the list easier to proofread
might help engine optimise handling the c in cat and cow in common.
 
R

Roedy Green

I don't believe this has any impact as this is something the regex
engine will do internally.

I was thinking of a transform something like one that searched for
every character but " by specifying every ascii char but one. It could
be simplified to use a negative. Perhaps the author had never heard
of negative searches.

It is pretty clear any tidier will need to be configurable.
 
R

Robert Klemme

(cat|rabbit|dog|cow)
could be transformed to
(cat|cow|dog|rabbit)

for two reasons:
make the list easier to proofread
might help engine optimise handling the c in cat and cow in common.

The engine does not need the reordering to do this optimization. If you
want to help lesser engines which do not optimize here you would have to
transform it into

(c(?:at|ow)|dog|rabbit)

to get more efficient matching.

Cheers

robert
 
R

Robert Klemme

Besides, when we're talking about performance on this scale, the order
in which the terms are evaluated by the regex engine might matter:
we'd want it to check against the more common case first to minimize
the time spent backtracking.

Only problem is that this cannot be done by looking at the expression -
the optimizer would need additional data about inputs. That would make
the reordering more complicated and the result would probably be no
longer "tidy".

I came to think that Roedy's main concern was human readability and not
performance of the expression. However, changing a human crafted
expression is probably not wise - for performance reasons (see below)
but also for readability: it may actually be crucial for readability
that the expression stays as written.
Now, regex engines don't _have_ to use the ordering of the term as a
suggestion for the order of evaluation, but as far as I'm aware,
that's how it's usually done.

That does make sense: since NFSs are known to execute in order (as
opposed to DFAs) it is wise to use the human given order because then we
have a chance to do that optimization.
So if "rabbit" is much more prevalent in your expected input than either
"cat", "cow" or "dog", alphabetizing the terms would yield suboptimal
performance.

Right, but see above.
(That said, if you find that such low-level details about the regex'
efficency is important enough that it is significant for your program,
you should probably rethink your whole approach.)

Yes, for word search there are probably better approaches. I personally
haven't come across a case where this order would have mattered. In
many cases cost of IO will be dominant anyway.

Cheers

robert
 
J

Joshua Cranmer ðŸ§

I was thinking of a transform something like one that searched for
every character but " by specifying every ascii char but one. It could
be simplified to use a negative. Perhaps the author had never heard
of negative searches.

That wouldn't be correct. The original regex would not match, e.g., Õ,
while your converted one would.

[Although this does raise the question of how regular expressions handle
or fail to handle characters like 💩 or the penguin in my display name.
Stupid UTF-16 and the "it's fixed-width characters if you look at it
funny" results.]
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,576
Members
45,054
Latest member
LucyCarper

Latest Threads

Top