What is built-in method sub

Jeremy

I just profiled one of my Python scripts and discovered that >99% of
the time was spent in

{built-in method sub}

What is this function and is there a way to optimize it?

Thanks,
Jeremy
 
Carl Banks

Jeremy said:
I just profiled one of my Python scripts and discovered that >99% of
the time was spent in

{built-in method sub}

What is this function and is there a way to optimize it?

I'm guessing this is re.sub (or, more likely, a method sub of an
internal object that is called by re.sub).

If all your script does is to make a bunch of regexp substitutions,
then spending 99% of the time in this function might be reasonable.
Optimize your regexps to improve performance. (We can help you if you
care to share any.)

If my guess is wrong, you'll have to be more specific about what your
script does, and maybe share the profile printout or something.


Carl Banks
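
One way to check a guess like this is to ask the profiler who calls the hot function. The following is a minimal sketch, assuming Python 3 and the standard cProfile and pstats modules; strip_trailing and profile_callers are hypothetical names used only for illustration, and the exact label the profiler prints for the built-in varies between Python versions.

```python
import cProfile
import io
import pstats
import re


def strip_trailing(text):
    # Hypothetical stand-in for the slow part of the script.
    return re.sub(r"[ \t]+\n", "\n", text)


def profile_callers(func, *args):
    """Profile func(*args) and return a report of who calls anything named 'sub'."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # The argument is a regex matched against the function names in the profile.
    stats.sort_stats("cumulative").print_callers("sub")
    return buf.getvalue()


report = profile_callers(strip_trailing, "a  \nb \n" * 1000)
print(report)
```

The caller listing should show which Python-level function invokes the built-in, which is usually enough to tell a regex substitution apart from anything else named sub.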
 
MRAB

Jeremy said:
I just profiled one of my Python scripts and discovered that >99% of
the time was spent in

{built-in method sub}

What is this function and is there a way to optimize it?

I think it's the subtraction operator. The only way to optimise it is to
reduce the number of subtractions that you do!
 
Jeremy

I'm guessing this is re.sub (or, more likely, a method sub of an
internal object that is called by re.sub).

If all your script does is to make a bunch of regexp substitutions,
then spending 99% of the time in this function might be reasonable.
Optimize your regexps to improve performance.  (We can help you if you
care to share any.)

If my guess is wrong, you'll have to be more specific about what your
script does, and maybe share the profile printout or something.

Carl Banks

Your guess is correct. I had forgotten that I was using that
function.

I am using the re.sub command to remove trailing whitespace from lines
in a text file. The commands I use are copied below. If you have any
suggestions on how they could be improved, I would love to know.

Thanks,
Jeremy

lines = self._outfile.readlines()
self._outfile.close()

line = string.join(lines)

if self.removeWS:
    # Remove trailing white space on each line
    trailingPattern = '(\S*)\ +?\n'
    line = re.sub(trailingPattern, '\\1\n', line)
 
Diez B. Roggisch

Jeremy said:
Your guess is correct. I had forgotten that I was using that
function.

I am using the re.sub command to remove trailing whitespace from lines
in a text file. The commands I use are copied below. If you have any
suggestions on how they could be improved, I would love to know.

Thanks,
Jeremy

lines = self._outfile.readlines()
self._outfile.close()

line = string.join(lines)

if self.removeWS:
    # Remove trailing white space on each line
    trailingPattern = '(\S*)\ +?\n'
    line = re.sub(trailingPattern, '\\1\n', line)

line = line.rstrip()?

Diez
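
A minimal sketch of the rstrip() approach, assuming Python 3 and that the whole file content is available as one string; strip_trailing_whitespace is a hypothetical helper name, not something from the original script. Note that rstrip() with no argument removes every kind of trailing whitespace (tabs and carriage returns included), not just spaces.

```python
def strip_trailing_whitespace(text):
    """Strip trailing whitespace from every line using str.rstrip()."""
    cleaned = [line.rstrip() for line in text.splitlines()]
    result = "\n".join(cleaned)
    # splitlines() drops the final line ending, so restore it if there was one.
    if text.endswith("\n"):
        result += "\n"
    return result


print(repr(strip_trailing_whitespace("a  \nb\t\n c")))
```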
 
Philip Semanchuk

Jeremy said:
Yep. I was trying to reinvent the wheel. I just remove the trailing
whitespace before joining the lines.

I second the suggestion to use rstrip(), but for future reference you
should also check out the compile() function in the re module. You
might want to time the code above against a version using a compiled
regex to see how much difference it makes.

Cheers
Philip
 
Diez B. Roggisch

Philip said:
I second the suggestion to use rstrip(), but for future reference you
should also check out the compile() function in the re module. You might
want to time the code above against a version using a compiled regex to
see how much difference it makes.

For his use case, none. There is caching built into re that will
take care of this.

Diez
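
Anyone who wants to measure this can time both forms with timeit. This is only a sketch, assuming Python 3; PATTERN and TEXT are illustrative values rather than the data from the original script, and the numbers will differ by machine and Python version, so no expected output is shown.

```python
import re
import timeit

PATTERN = r"[ \t]+\n"  # illustrative pattern, not taken from the thread
TEXT = "some text   \nmore text\t\n" * 200

compiled = re.compile(PATTERN)


def with_module_function():
    # re.sub looks the pattern up in the module's internal cache on each call.
    return re.sub(PATTERN, "\n", TEXT)


def with_compiled_pattern():
    # Calling the method on a precompiled pattern skips that lookup.
    return compiled.sub("\n", TEXT)


module_time = timeit.timeit(with_module_function, number=200)
compiled_time = timeit.timeit(with_compiled_pattern, number=200)
print("re.sub:       %.4f s" % module_time)
print("compiled.sub: %.4f s" % compiled_time)
```

With a single pattern applied to one large string, any difference comes from one cache lookup per call, which is why it tends to be lost in the noise.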
 
Chris Rebert

On Mon, Jan 11, 2010 at 12:34 PM, Steven D'Aprano wrote:
If you can avoid regexes in favour of ordinary string methods, do so. In
general, something like:

source.replace(target, new)

will potentially be much faster than:

regex = re.compile(target)
regex.sub(new, source)
# equivalent to re.sub(target, new, source)

(assuming of course that target is just a plain string with no regex
specialness). If you're just cracking a peanut, you probably don't need
the 30 lb sledgehammer of regular expressions.

Of course, but is the regex library really not smart enough to
special-case and optimize vanilla string substitutions?

Cheers,
Chris
 
Steven D'Aprano

Chris Rebert said:
Of course, but is the regex library really not smart enough to
special-case and optimize vanilla string substitutions?


Apparently not in Python 2.5:

Inquisition!")',
    'from re import compile; x = compile("Spanish")')
t2 = Timer('x.replace("Spanish", "Dutch")',
    'x="Nobody expects the Spanish Inquisition!"')

t1.repeat()
[3.7209370136260986, 2.7262279987335205, 2.6416280269622803]
t2.repeat()
[2.2915709018707275, 1.2584249973297119, 1.2730350494384766]


Even if it did, I wouldn't rely on that sort of special casing unless the
language guaranteed it. Keep in mind that regexes are essentially a
programming language (although not Turing Complete), and the engine
implementation may choose purity and simplicity over such optimizations.
 
John Machin

Jeremy said:
Yep. I was trying to reinvent the wheel. I just remove the trailing
whitespace before joining the lines.

Actually you don't do that. Your regex has three components:

(1) (\S*) zero or more occurrences of not-whitespace
(2) \ +? one or more (non-greedy) occurrences of SPACE
(3) \n a newline

Component (2) should be \s+?

In any case this is a round-about way of doing it. Try writing a regex
that does it simply: replace trailing whitespace by an empty string.

Another problem with your approach: it doesn't work if the line is not
terminated by \n -- this is quite possible if the lines are being read
from a file.

A wise person once said: Re-inventing the wheel is often accompanied
by forgetting to re-invent the axle.
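
The two problems described above can be reproduced in a few lines. This is a sketch, assuming Python 3; the pattern in strip_simple is one possible way to "replace trailing whitespace by an empty string", not necessarily the one the poster had in mind, and both function names are hypothetical.

```python
import re


def strip_original(text):
    # Same pattern and replacement as the posted code.
    return re.sub(r'(\S*)\ +?\n', '\\1\n', text)


# Trailing spaces or tabs before each line end; with re.MULTILINE, "$"
# also matches just before every newline, not only at the end of the text.
SIMPLE = re.compile(r'[ \t]+$', re.MULTILINE)


def strip_simple(text):
    return SIMPLE.sub('', text)


sample = "tabs\t\nspaces   \nlast line  "
print(repr(strip_original(sample)))  # trailing tab and unterminated last line survive
print(repr(strip_simple(sample)))
```

The character class is deliberately [ \t] rather than \s: \s also matches newlines, so with re.MULTILINE it could swallow blank lines along with the trailing whitespace.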
 
Phlip

Jeremy said:
trailingPattern = '(\S*)\ +?\n'

What happens with this?

trailingPattern = '\s+$'
line = re.sub(trailingPattern, '', line)

I'm guessing that $ terminates \s+'s greediness without snarfing the underlying
\n. Then I'm guessing that the lack of a \1 replacer will help the sub work
faster with less internal string shuffling.

rstrip() is probably faster still, but there might be a technical reason to avoid it.

But these uncertainties are why I write unit tests, including tests for the edge
cases. (What if it's a \r\n? What if the \n is missing? etc.) That way I don't
need to memorize re's exact behavior, and if I find a reason to swap in a
.rstrip(), I can pass all the tests and make sure the substitution works the same.
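
A minimal sketch of such edge-case tests, assuming Python 3 and the standard unittest module. strip_trailing is a hypothetical helper built on rstrip() that keeps each line's original ending; a regex-based version could be swapped in and checked against the same tests.

```python
import unittest


def strip_trailing(text):
    """Hypothetical helper: strip trailing whitespace, keep each line ending."""
    out = []
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")      # the line without its ending
        ending = line[len(body):]       # "\n", "\r\n", or "" for the last line
        out.append(body.rstrip() + ending)
    return "".join(out)


class StripTrailingTests(unittest.TestCase):
    def test_plain_newline(self):
        self.assertEqual(strip_trailing("a \nb\t\n"), "a\nb\n")

    def test_crlf_line_ending(self):
        self.assertEqual(strip_trailing("a  \r\n"), "a\r\n")

    def test_missing_final_newline(self):
        self.assertEqual(strip_trailing("a  "), "a")

    def test_empty_input(self):
        self.assertEqual(strip_trailing(""), "")


# Run the suite without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(StripTrailingTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```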
 
