Regexp slowdown

C

chris.ritchie

I have a subroutine that does a rather extensive expression match (the
expression is ~80 characters long). It's looking for this expression
in strings limited to 500 characters.

The script runs just fine on many input strings, then slows
considerably without any discernable cause. My debug output looks
like:

Starting regexp match (string size 380)...
<~5 seconds>
Done.

Starting regexp match (string size 381)...
<~9 seconds>
Done.

Starting regexp match (string size 382)...
<~14 seconds>
Done.

I might think that the string size is too large for the expression
(though there's no reason it should be), but it matches previous
strings of size close to 500 with ease. Also, when I limit the size to
100, it will have the same slow down around 70 or 80.

I would wonder about memory management because the entire script is a
few thousand lines long, but there aren't many global variables. And
it does not slow down elsewhere. The code is almost all subroutines
with 'my' variables. I assume these variables are deallocated when the
subroutine is done, please correct me if I'm wrong.

What else could it be?
 
J

John W. Krahn

I have a subroutine that does a rather extensive expression match (the
expression is ~80 characters long). It's looking for this expression
in strings limited to 500 characters.

The script runs just fine on many input strings, then slows
considerably without any discernable cause. My debug output looks
like:

Starting regexp match (string size 380)...
<~5 seconds>
Done.

Starting regexp match (string size 381)...
<~9 seconds>
Done.

Starting regexp match (string size 382)...
<~14 seconds>
Done.

I might think that the string size is too large for the expression
(though there's no reason it should be), but it matches previous
strings of size close to 500 with ease. Also, when I limit the size to
100, it will have the same slow down around 70 or 80.

Without seeing the actual regular expression it is pretty hard to diagnose the
problem but Perl's regular expressions use backtracking which can cause loops
that take a very long time to run. See the "Backtracking" section of perlre:

perldoc perlre

particularly the WARNING at the end of the section.

I would wonder about memory management because the entire script is a
few thousand lines long, but there aren't many global variables. And
it does not slow down elsewhere. The code is almost all subroutines
with 'my' variables. I assume these variables are deallocated when the
subroutine is done, please correct me if I'm wrong.

What else could it be?

This page may provide some hints: http://www.ccl4.org/~nick/P/Fast_Enough/

Also the book "Mastering Regular Expressions"
http://www.oreilly.com/catalog/regex3/ explains in detail how to optimize
regular expressions.



John
 
C

chris.ritchie

Thanks, I'll check out those docs. I may have narrowed it down, though
I don't know how to solve it.

Calling `ps -ly` tells me that the script never exceeds 6 megs of
memory. But the cpu/cpu% increases in step with the slowdown (from
below 10 to the mid-20s). Scheduling priority is at or around 99 for
most of the runtime.

Is this that the expression match becomes more intensive or the OS
stops caring about my process? I wonder if it's the latter because
like I said, when I limit the string to 100 characters the same
slowdown occurs.

The only way I would think it's the complexity of the expression is if
there's some suboptimal backtracking going on. I'll look into this...
 
C

chris.ritchie

[ Please do not top-post ]
As far as I know, this only applies when addressing parts of someone's
post. Aside from the 'thanks I'll check it out', my last post was an
addendum to the first post. Another piece of etiquette is reading all
relevant material before making a comment/criticism :-\
Well, neither do we, as you refuse to show this regex. Is it top secret
or something.
Unfortunately I cannot. The interpretation of regexes is pretty
straightforward. I have no problem analyzing the expression for
correctness and efficiency.

Though I was alarmed by John's mention of backtracking- I was afraid he
meant backtracking across previous expression matches or some
memory-intensive attempt at interpreter optimization by the perl folks.
After reading his suggested information, I found this is not the case.

What I am unfamiliar with is Perl memory management and Unix scheduling
- in practice. That was the direction of my second post.
Please read the posting guidelines for this group to learn how you can
help yourself and help others help you.
John seemed to understand pretty well. Sadly, you made no attempt to
answer any of my questions. You wrongly complained about my etiquette
and suggested I can't write regular expressions. Now who's wasting
peoples' time?
It is easy to write a stupid regex that will just waste time doing
unnecessary things.
If this were the case, it wouldn't have quickly matched to the hundreds
of string inputs I fed it before it started slowing down. Granted I
could have hit a special case that fools the regexp, this is unlikely
considering the number of inputs it works perfectly on.
 
U

Uri Guttman

cr" == chris ritchie said:
[ Please do not top-post ]

cr> As far as I know, this only applies when addressing parts of someone's
cr> post. Aside from the 'thanks I'll check it out', my last post was an
cr> addendum to the first post. Another piece of etiquette is reading all
cr> relevant material before making a comment/criticism :-\

no, it is part of this group's (and the history of usenet's)
etiquette. and the groups rules override any of yours when you come
here. top posting in any form is considered poor netiquette.

cr> Unfortunately I cannot. The interpretation of regexes is pretty
cr> straightforward. I have no problem analyzing the expression for
cr> correctness and efficiency.

then why are you asking for help on your regex? you probably should
write a book on them given your skills. oops, MRE (3rd ed!) is already
written.

cr> Though I was alarmed by John's mention of backtracking- I was afraid he
cr> meant backtracking across previous expression matches or some
cr> memory-intensive attempt at interpreter optimization by the perl folks.
cr> After reading his suggested information, I found this is not the case.

and your proof of that is?

cr> What I am unfamiliar with is Perl memory management and Unix scheduling
cr> - in practice. That was the direction of my second post.

why would unix scheduling have anything to do with how fast a regex
runs? if you think they are related you definitely need more help than
you can get here.

cr> John seemed to understand pretty well. Sadly, you made no attempt to
cr> answer any of my questions. You wrongly complained about my etiquette
cr> and suggested I can't write regular expressions. Now who's wasting
cr> peoples' time?

given your original query with no regex posted, that is a reasonable
assumption. you are to blame for not posting real code and then being
asked for it. but i will predict you will turn on me too. such is the
way of those who wish answers but not to learn.

cr> If this were the case, it wouldn't have quickly matched to the hundreds
cr> of string inputs I fed it before it started slowing down. Granted I
cr> could have hit a special case that fools the regexp, this is unlikely
cr> considering the number of inputs it works perfectly on.

unlikely but possible. and of course without seeing the reges nor the
inputs, no one but you can tell why it slows down. i suspect it is on
line 42 of the regex. remove the + (or is it a []?)

have a fine day,

uri
 
C

chris.ritchie

no, it is part of this group's (and the history of usenet's)
etiquette. and the groups rules override any of yours when you come
here. top posting in any form is considered poor netiquette.
Where was I supposed to put my comments? There's not TOFU because I
wasn't replying to anything in his post. Granted, I could have deleted
the cc of the previous posts, but that's not something to get
up-in-arms about. There was no non-linear scanning required.

"Unappreciated followup styles are referred to as ``top-posting'',
``Jeopardy'' (because the answer comes before the question)." That's
the group rule. My post followed that rule as I had no response to
intersperse. I'm sorry I did not delete the cc. Happy?
then why are you asking for help on your regex?
My last paragraph:
I would wonder about memory management because the entire script is a
few thousand lines long, but there aren't many global variables. And
it does not slow down elsewhere. The code is almost all subroutines
with 'my' variables. I assume these variables are deallocated when the
subroutine is done, please correct me if I'm wrong.
What else could it be?

What of that makes you think I'm worried about the syntax of my regex?

you probably should
write a book on them given your skills. oops, MRE (3rd ed!) is already
written. Okay.

cr> Though I was alarmed by John's mention of backtracking- I was afraid he
cr> meant backtracking across previous expression matches or some
cr> memory-intensive attempt at interpreter optimization by the perl folks.
cr> After reading his suggested information, I found this is not the case.
and your proof of that is?
The 'backtracking' section of the perldoc on regex.
why would unix scheduling have anything to do with how fast a regex
runs? if you think they are related you definitely need more help than
you can get here.
Oh lets see, if my process's quantum is small and priority is low then
it won't get completed very fast, now will it? If you don't understand
the relationship between a process and the os running it you probably
shouldn't be posting on this thread.
given your original query with no regex posted, that is a reasonable
assumption. you are to blame for not posting real code and then being
asked for it.
I DON'T NEED HELP WITH THE SYNTAX OF MY REGEX. I assume shouting is
warranted when you've repeated yourself so many times. I appreciate
John mentioning a possible pitfall based on the particulars of Perl's
matching implementation. But everything I've directly asked has dealt
with the interpreter and/or how it interfaces with UNIX.
but i will predict you will turn on me too. such is the
way of those who wish answers but not to learn.
I'm glad you can be so sage and philosophical but you haven't given me
anything to learn. You've just complained that I haven't supplied
information that you don't need. I appreciate the fact that you are
willing to run my code and input rather than give nebulous answers
about the way Perl works. But it's the nebulous answers that I need.
unlikely but possible. and of course without seeing the reges nor the
inputs, no one but you can tell why it slows down. i suspect it is on
line 42 of the regex. remove the + (or is it a []?)
Yes you can. Runtime isn't just a function of the expression. If I
had a mile-long stack frame I would see a performance hit. If Perl had
an ugly garbage collector and all of my previous subroutine variables
hadn't been deallocated I would see a performance hit. If Perl
parallelized regexes and forked many, many times I would see a
performance hit. If any generally intensive script required that the
coder set a lower 'nice' value I would see a performance hit.

This is what I asked for in the beginning, specified in all of my
subsequent posts, and keeps being ignored.
--Perl Consulting,
Oh! now I understand.
 
U

Uri Guttman

cr> Where was I supposed to put my comments? There's not TOFU because I
cr> wasn't replying to anything in his post. Granted, I could have deleted
cr> the cc of the previous posts, but that's not something to get
cr> up-in-arms about. There was no non-linear scanning required.

it doesn't matter. your choice was wrong for this group. you defend that
with the same weak logic that you use for the regex stuff. keep trying.
cr> My last paragraph:
cr> I would wonder about memory management because the entire script is a
cr> few thousand lines long, but there aren't many global variables. And
cr> it does not slow down elsewhere. The code is almost all subroutines
cr> with 'my' variables. I assume these variables are deallocated when the
cr> subroutine is done, please correct me if I'm wrong.
cr> What else could it be?

a faulty regex? a bug in line 42? shall i whip out the venerable
PSI::ESP module and divine the answer?

cr> What of that makes you think I'm worried about the syntax of my regex?

the syntax may be fine as it compiles (or so you claim). but if you
don't know how to spot the possible issues with a regex you won't
realize that your test data isn't comprehensive enough.

cr> Okay.

cr> Though I was alarmed by John's mention of backtracking- I was afraid he
cr> meant backtracking across previous expression matches or some
cr> memory-intensive attempt at interpreter optimization by the perl folks.
cr> After reading his suggested information, I found this is not the case.cr> The 'backtracking' section of the perldoc on regex.

but without showing the regex and the slow data, you haven't proven
anything yet. you seem to keep missing this point. show the code and
data. otherwise you are really wasting all of our time. but of course
you don't mind wasting your own.
cr> Oh lets see, if my process's quantum is small and priority is low then
cr> it won't get completed very fast, now will it? If you don't understand
cr> the relationship between a process and the os running it you probably
cr> shouldn't be posting on this thread.

wow. oh well, i will have to leave you now. this was the stupidest thing
you have said so far. regexes don't massively slow down just because of
some scheduling issues. they are compute bound in almost all cases
unless they have heat death regexes where they can suck the universe
dry. but as usual without any code we will just have to take your word
that all is fine.

cr> I DON'T NEED HELP WITH THE SYNTAX OF MY REGEX. I assume shouting is
cr> warranted when you've repeated yourself so many times. I appreciate
cr> John mentioning a possible pitfall based on the particulars of Perl's
cr> matching implementation. But everything I've directly asked has dealt
cr> with the interpreter and/or how it interfaces with UNIX.

well, you know best. track all that down. see if you can figure out why
a data dependent behavior would cause major operating system changes but
not be a case of just a poorly written regex.

and you keep confusing syntax and semantics. your regex syntax is
probably correct as you claim it compiles. (hey i said that
already. maybe you will learn it this time?) on the other hand, the
semantics are surely wrong since they will slow down based on the
data.
cr> I'm glad you can be so sage and philosophical but you haven't given me
cr> anything to learn. You've just complained that I haven't supplied
cr> information that you don't need. I appreciate the fact that you are
cr> willing to run my code and input rather than give nebulous answers
cr> about the way Perl works. But it's the nebulous answers that I need.

info i don't need? the code and the data? what are you babbling about??

nebulous answers are for nebulous questions and neither have a place in
this group. if you think the problem is with unix, then go to a unix
group. admit it, you know more perl than anyone here.

unlikely but possible. and of course without seeing the reges nor the
inputs, no one but you can tell why it slows down. i suspect it is on
line 42 of the regex. remove the + (or is it a []?)
cr> Yes you can. Runtime isn't just a function of the expression. If I
cr> had a mile-long stack frame I would see a performance hit. If Perl had
cr> an ugly garbage collector and all of my previous subroutine variables
cr> hadn't been deallocated I would see a performance hit. If Perl
cr> parallelized regexes and forked many, many times I would see a
cr> performance hit. If any generally intensive script required that the
cr> coder set a lower 'nice' value I would see a performance hit.

wow x 10.


cr> This is what I asked for in the beginning, specified in all of my
cr> subsequent posts, and keeps being ignored.
cr> Oh! now I understand.

no you don't. you won't. try banging your head against your
keyboard. you will be nebulous soon.

too bad i won't be consulting for you in the future. but you can't
handle the truth. :)

bye bye.

uri
 
A

axel

Where was I supposed to put my comments? There's not TOFU because I
wasn't replying to anything in his post. Granted, I could have deleted

You did. You mentioned documents.

Even when following up on your own post you should put it into context.
the cc of the previous posts, but that's not something to get
up-in-arms about. There was no non-linear scanning required.

Yes there was... your reply made no sense without doing non-linear
scanning.
My last paragraph:
I would wonder about memory management because the entire script is a
few thousand lines long, but there aren't many global variables. And
it does not slow down elsewhere. The code is almost all subroutines
with 'my' variables. I assume these variables are deallocated when the
subroutine is done, please correct me if I'm wrong.
What else could it be?

Ah... so we are supposed to debug a very long script by some psychic
method.
What of that makes you think I'm worried about the syntax of my regex?

Well, you mentioned it and that that is where your script slows down.
Oh lets see, if my process's quantum is small and priority is low then
it won't get completed very fast, now will it? If you don't understand
the relationship between a process and the os running it you probably
shouldn't be posting on this thread.

I am sure Uri understands these matters well.
I DON'T NEED HELP WITH THE SYNTAX OF MY REGEX. I assume shouting is
warranted when you've repeated yourself so many times. I appreciate
John mentioning a possible pitfall based on the particulars of Perl's
matching implementation. But everything I've directly asked has dealt
with the interpreter and/or how it interfaces with UNIX.

There is no special interface with Unix in this context just as with any
other operating system, perl does not ask Unix for special prorities.
I'm glad you can be so sage and philosophical but you haven't given me
anything to learn. You've just complained that I haven't supplied
information that you don't need. I appreciate the fact that you are

You have not supplied any information beyond that your script seems to
slow down.

Axel
 
D

Dr.Ruud

(e-mail address removed) schreef:
I DON'T NEED HELP WITH THE SYNTAX OF MY REGEX.

Maybe not, maybe you should be told to use something different than
regexen, and certainly to not use your suboptimal regex.
But because you don't show any code that shows the problem (as the
Posting Guidelines tell you should), we'll never know.

*ploink*
 
A

Ala Qumsieh

I have a subroutine that does a rather extensive expression match (the
expression is ~80 characters long). It's looking for this expression
in strings limited to 500 characters.
[snip]

What else could it be?

I suggest you strip down the code to the barest minimum, and see what is
the cause for the slowdown.

Another thought is whether you are using any of the special regexp
variables: $& $' $`

If you are, then that might explain the slowdown.

Of course, as other said, it is impossible to guess what is causing the
problem without seeing any code.

--Ala
 
C

chris.ritchie

For the people who've actually suggested things: thanks for the sincere
responses. I'll work this one out.

For the flamers: settle. I didn't come here to muck up your forum.
I'm sorry I can't post my code, but I can't take it out of context
enough to make it nonproprietary. But I've only asked for help with
ways the intepreter works and interfaces with the OS, which is what I
need. If I can't spot fundamental errors in my regexs, then I have
bigger problems.

And as the arguments have become repetative (as in, someone's not
reading what's already been said) there is no longer any point.
 
U

Uri Guttman

AQ> [snip]

AQ> I suggest you strip down the code to the barest minimum, and see what
AQ> is the cause for the slowdown.

AQ> Another thought is whether you are using any of the special regexp
AQ> variables: $& $' $`

those will slow down other regexes as it forces copies of the string to
be made in other regexes even if they don't grab anything. using those
is not likely to slow a regex down based on the input.

AQ> Of course, as other said, it is impossible to guess what is causing
AQ> the problem without seeing any code.

hmm, do i hear a greek chorus of that chant?

uri
 
A

anno4000

[...]
I'm sorry I can't post my code, but I can't take it out of context
enough to make it nonproprietary.

So you've been wasting everybody's time right from the beginning.
But I've only asked for help with
ways the intepreter works and interfaces with the OS, which is what I
need.

"My car slows down after a mile of driving. No, I can't show you my
car. I'm just asking how the gear box works and how it interfaces
with the motor, which is what I need."
If I can't spot fundamental errors in my regexs, then I have
bigger problems.

You do.

Anno
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,764
Messages
2,569,567
Members
45,041
Latest member
RomeoFarnh

Latest Threads

Top