Fastest way to calculate leading whitespace

D

dasacc22

Hi

This is a simple question. I'm looking for the fastest way to
calculate the leading whitespace (as a string, ie ' ').

Here are some different methods I have tried so far
--- solution 1

a = ' some content\n'
b = a.strip()
c = ' '*(len(a)-len(b))

--- solution 2

a = ' some content\n'
b = a.strip()
c = a.partition(b[0])[0]

--- solution 3

def get_leading_whitespace(s):
def _get():
for x in s:
if x != ' ':
break
yield x
return ''.join(_get())

---

Solution 1 seems to be about as fast as solution 2 except in certain
circumstances where the value of b has already been determined for
other purposes. Solution 3 is slower due to the function overhead.

Curious to see what other types of solutions people might have.

Thanks,
Daniel
 
P

Patrick Maupin

Hi

This is a simple question. I'm looking for the fastest way to
calculate the leading whitespace (as a string, ie '    ').

Here are some different methods I have tried so far
--- solution 1

a = '    some content\n'
b = a.strip()
c = ' '*(len(a)-len(b))

--- solution 2

a = '    some content\n'
b = a.strip()
c = a.partition(b[0])[0]

--- solution 3

def get_leading_whitespace(s):
    def _get():
        for x in s:
            if x != ' ':
                break
            yield x
    return ''.join(_get())

---

Solution 1 seems to be about as fast as solution 2 except in certain
circumstances where the value of b has already been determined for
other purposes. Solution 3 is slower due to the function overhead.

Curious to see what other types of solutions people might have.

Thanks,
Daniel

Well, you could try a solution using re, but that's probably only
likely to be faster if you can use it on multiple concatenated lines.
I usually use something like your solution #1. One thing to be aware
of, though, is that strip() with no parameters will strip *any*
whitespace, not just spaces, so the implicit assumption in your code
that what you have stripped is spaces may not be justified (depending
on the source data). OTOH, depending on how you use that whitespace
information, it may not really matter. But if it does matter, you can
use strip(' ')

If speed is really an issue for you, you could also investigate
mxtexttools, but, like re, it might perform better if the source
consists of several batched lines.

Regards,
Pat
 
D

dasacc22

This is a simple question. I'm looking for the fastest way to
calculate the leading whitespace (as a string, ie '    ').
Here are some different methods I have tried so far
--- solution 1
a = '    some content\n'
b = a.strip()
c = ' '*(len(a)-len(b))
--- solution 2
a = '    some content\n'
b = a.strip()
c = a.partition(b[0])[0]
--- solution 3
def get_leading_whitespace(s):
    def _get():
        for x in s:
            if x != ' ':
                break
            yield x
    return ''.join(_get())

Solution 1 seems to be about as fast as solution 2 except in certain
circumstances where the value of b has already been determined for
other purposes. Solution 3 is slower due to the function overhead.
Curious to see what other types of solutions people might have.
Thanks,
Daniel

Well, you could try a solution using re, but that's probably only
likely to be faster if you can use it on multiple concatenated lines.
I usually use something like your solution #1.  One thing to be aware
of, though, is that strip() with no parameters will strip *any*
whitespace, not just spaces, so the implicit assumption in your code
that what you have stripped is spaces may not be justified (depending
on the source data).  OTOH, depending on how you use that whitespace
information, it may not really matter.  But if it does matter, you can
use strip(' ')

If speed is really an issue for you, you could also investigate
mxtexttools, but, like re, it might perform better if the source
consists of several batched lines.

Regards,
Pat

Hi,

thanks for the info. Using .strip() to remove all whitespace in
solution 1 is a must. If you only stripped ' ' spaces then line
endings would get counted in the len() call and when multiplied
against ' ', would produce an inaccurate result. Regex is
significantly slower for my purposes but ive never heard of
mxtexttools. Even if it proves slow its spurred my curiousity as to
what functionality it provides (on an unrelated note)
 
S

Steven D'Aprano

Hi

This is a simple question. I'm looking for the fastest way to calculate
the leading whitespace (as a string, ie ' ').

Is calculating the amount of leading whitespace really the bottleneck in
your application? If not, then trying to shave off microseconds from
something which is a trivial part of your app is almost certainly a waste
of your time.


[...]
a = ' some content\n'
b = a.strip()
c = ' '*(len(a)-len(b))


I take it that you haven't actually tested this code for correctness,
because it's buggy. Let's test it:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError


Not only doesn't it get the whitespace right, but it doesn't even get the
*amount* of whitespace right:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AssertionError

It doesn't even work correctly if you limit "whitespace" to mean spaces
and nothing else! It's simply wrong in every possible way.

This is why people say that premature optimization is the root of all
(programming) evil. Instead of wasting time and energy trying to optimise
code, you should make it correct first.

Your solutions 2 and 3 are also buggy. And solution 3 can be easily re-
written to be more straightforward. Instead of the complicated:
def get_leading_whitespace(s):
def _get():
for x in s:
if x != ' ':
break
yield x
return ''.join(_get())

try this version:

def get_leading_whitespace(s):
accumulator = []
for c in s:
if c in ' \t\v\f\r\n':
accumulator.append(c)
else:
break
return ''.join(accumulator)

Once you're sure this is correct, then you can optimise it:

def get_leading_whitespace(s):
t = s.lstrip()
return s[:len(s)-len(t)]

Unless your strings are very large, this is likely to be faster than any
other pure-Python solution you can come up with.
 
W

Wolfram Hinderer

def get_leading_whitespace(s):
    t = s.lstrip()
    return s[:len(s)-len(t)]

Unless your strings are very large, this is likely to be faster than any
other pure-Python solution you can come up with.

Returning s[:-1 - len(t)] is faster.
 
S

Steven D'Aprano

def get_leading_whitespace(s):
    t = s.lstrip()
    return s[:len(s)-len(t)]
c = get_leading_whitespace(a)
assert c == leading_whitespace

Unless your strings are very large, this is likely to be faster than
any other pure-Python solution you can come up with.

Returning s[:-1 - len(t)] is faster.


I'm sure it is. Unfortunately, it's also incorrect.

z = "*****abcde"
z[:-1-5] '****'
z[:len(z)-5]
'*****'


However, s[:-len(t)] should be both faster and correct.
 
M

Mark Dickinson

On 8 Mai, 20:46, Steven D'Aprano <st...@REMOVE-THIS- cybersource.com.au>
wrote:
def get_leading_whitespace(s):
    t = s.lstrip()
    return s[:len(s)-len(t)]
c = get_leading_whitespace(a)
assert c == leading_whitespace
Unless your strings are very large, this is likely to be faster than
any other pure-Python solution you can come up with.
Returning s[:-1 - len(t)] is faster.

I'm sure it is. Unfortunately, it's also incorrect.
z = "*****abcde"
z[:-1-5] '****'
z[:len(z)-5]

'*****'

However, s[:-len(t)] should be both faster and correct.

Unless len(t) == 0, surely?
 
D

dasacc22

U presume entirely to much. I have a preprocessor that normalizes
documents while performing other more complex operations. Theres
nothing buggy about what im doing

This is a simple question. I'm looking for the fastest way to calculate
the leading whitespace (as a string, ie '    ').

Is calculating the amount of leading whitespace really the bottleneck in
your application? If not, then trying to shave off microseconds from
something which is a trivial part of your app is almost certainly a waste
of your time.

[...]
a = '    some content\n'
b = a.strip()
c = ' '*(len(a)-len(b))

I take it that you haven't actually tested this code for correctness,
because it's buggy. Let's test it:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

Not only doesn't it get the whitespace right, but it doesn't even get the
*amount* of whitespace right:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

It doesn't even work correctly if you limit "whitespace" to mean spaces
and nothing else! It's simply wrong in every possible way.

This is why people say that premature optimization is the root of all
(programming) evil. Instead of wasting time and energy trying to optimise
code, you should make it correct first.

Your solutions 2 and 3 are also buggy. And solution 3 can be easily re-
written to be more straightforward. Instead of the complicated:
def get_leading_whitespace(s):
    def _get():
        for x in s:
            if x != ' ':
                break
            yield x
    return ''.join(_get())

try this version:

def get_leading_whitespace(s):
    accumulator = []
    for c in s:
        if c in ' \t\v\f\r\n':
            accumulator.append(c)
        else:
            break
    return ''.join(accumulator)

Once you're sure this is correct, then you can optimise it:

def get_leading_whitespace(s):
    t = s.lstrip()
    return s[:len(s)-len(t)]

Unless your strings are very large, this is likely to be faster than any
other pure-Python solution you can come up with.
 
P

Patrick Maupin

Hi
This is a simple question. I'm looking for the fastest way to
calculate the leading whitespace (as a string, ie '    ').
Here are some different methods I have tried so far
--- solution 1
a = '    some content\n'
b = a.strip()
c = ' '*(len(a)-len(b))
--- solution 2
a = '    some content\n'
b = a.strip()
c = a.partition(b[0])[0]
--- solution 3
def get_leading_whitespace(s):
    def _get():
        for x in s:
            if x != ' ':
                break
            yield x
    return ''.join(_get())
---
Solution 1 seems to be about as fast as solution 2 except in certain
circumstances where the value of b has already been determined for
other purposes. Solution 3 is slower due to the function overhead.
Curious to see what other types of solutions people might have.
Thanks,
Daniel
Well, you could try a solution using re, but that's probably only
likely to be faster if you can use it on multiple concatenated lines.
I usually use something like your solution #1.  One thing to be aware
of, though, is that strip() with no parameters will strip *any*
whitespace, not just spaces, so the implicit assumption in your code
that what you have stripped is spaces may not be justified (depending
on the source data).  OTOH, depending on how you use that whitespace
information, it may not really matter.  But if it does matter, you can
use strip(' ')
If speed is really an issue for you, you could also investigate
mxtexttools, but, like re, it might perform better if the source
consists of several batched lines.
Regards,
Pat

Hi,

thanks for the info. Using .strip() to remove all whitespace in
solution 1 is a must. If you only stripped ' ' spaces then line
endings would get counted in the len() call and when multiplied
against ' ', would produce an inaccurate result. Regex is
significantly slower for my purposes but ive never heard of
mxtexttools. Even if it proves slow its spurred my curiousity as to
what functionality it provides (on an unrelated note)

Could you reorganize your code to do multiple lines at a time? That
might make regex competitive.

Regards,
Pat
 
D

dasacc22

Hi
This is a simple question. I'm looking for the fastest way to
calculate the leading whitespace (as a string, ie '    ').
Here are some different methods I have tried so far
--- solution 1
a = '    some content\n'
b = a.strip()
c = ' '*(len(a)-len(b))
--- solution 2
a = '    some content\n'
b = a.strip()
c = a.partition(b[0])[0]
--- solution 3
def get_leading_whitespace(s):
    def _get():
        for x in s:
            if x != ' ':
                break
            yield x
    return ''.join(_get())
---
Solution 1 seems to be about as fast as solution 2 except in certain
circumstances where the value of b has already been determined for
other purposes. Solution 3 is slower due to the function overhead.
Curious to see what other types of solutions people might have.
Thanks,
Daniel
Well, you could try a solution using re, but that's probably only
likely to be faster if you can use it on multiple concatenated lines.
I usually use something like your solution #1.  One thing to be aware
of, though, is that strip() with no parameters will strip *any*
whitespace, not just spaces, so the implicit assumption in your code
that what you have stripped is spaces may not be justified (depending
on the source data).  OTOH, depending on how you use that whitespace
information, it may not really matter.  But if it does matter, you can
use strip(' ')
If speed is really an issue for you, you could also investigate
mxtexttools, but, like re, it might perform better if the source
consists of several batched lines.
Regards,
Pat

thanks for the info. Using .strip() to remove all whitespace in
solution 1 is a must. If you only stripped ' ' spaces then line
endings would get counted in the len() call and when multiplied
against ' ', would produce an inaccurate result. Regex is
significantly slower for my purposes but ive never heard of
mxtexttools. Even if it proves slow its spurred my curiousity as to
what functionality it provides (on an unrelated note)

Could you reorganize your code to do multiple lines at a time?  That
might make regex competitive.

Regards,
Pat

I have tried this already, the problem here is that it's not a trivial
matter. Iterating over each line is unavoidable, and I found that
using various python builtins to perform string operations (like say
the wonderful partition builtin) during each iteration works 3 fold
faster then regexing the entire document with various needs. Another
issue is having to keep a line count and when iterating over regex
matches and counting lines, it doesn't scale nearly as well as a
straight python solution using builtins to process the information.

At the heart of this here, determining the leading white-space is a
trivial matter. I have much more complex problems to deal with. I was
much more interested in seeing what kind of solutions ppl would come
up with to such a problem, and perhaps uncover something new in python
that I can apply to a more complex problem. What spurred the thought
was this piece written up by guido concerning "what's the best way to
convert a list of integers into a string". It's a simple question
where concepts are introduced that can lead to solving more complex
problems.

http://www.python.org/doc/essays/list2str.html
 
D

dasacc22

On 8 Mai, 20:46, Steven D'Aprano <st...@REMOVE-THIS- cybersource.com.au>
wrote:
def get_leading_whitespace(s):
    t = s.lstrip()
    return s[:len(s)-len(t)]
c = get_leading_whitespace(a)
assert c == leading_whitespace
Unless your strings are very large, this is likely to be faster than
any other pure-Python solution you can come up with.
Returning s[:-1 - len(t)] is faster.

I'm sure it is. Unfortunately, it's also incorrect.
z = "*****abcde"
z[:-1-5] '****'
z[:len(z)-5]

'*****'

However, s[:-len(t)] should be both faster and correct.

This is without a doubt faster and simpler then any solution thus far.
Thank you for this
 
S

Steven D'Aprano

U presume entirely to much. I have a preprocessor that normalizes
documents while performing other more complex operations. Theres
nothing buggy about what im doing

I didn't *presume* anything, I took your example code and ran it and
discovered that it didn't do what you said it was doing.
 
J

John Machin

dasacc22 said:
U presume entirely to much. I have a preprocessor that normalizes
documents while performing other more complex operations. Theres
nothing buggy about what im doing

Are you sure?

Your "solution" calculates (the number of leading whitespace characters) + (the
number of TRAILING whitespace characters).

Problem 1: including TRAILING whitespace.
Example: "content" + 3 * " " + "\n" has 4 leading spaces according to your
reckoning; should be 0.
Fix: use lstrip() instead of strip()

Problem 2: assuming all whitespace characters have *effective* width the same as
" ".
Examples: TAB has width 4 or 8 or whatever you want it to be. There are quite a
number of whitespace characters, even when you stick to ASCII. When you look at
Unicode, there are heaps more. Here's a list of BMP characters such that
character.isspace() is True, showing the Unicode codepoint, the Python repr(),
and the name of the character (other than for control characters):

U+0009 u'\t' ?
U+000A u'\n' ?
U+000B u'\x0b' ?
U+000C u'\x0c' ?
U+000D u'\r' ?
U+001C u'\x1c' ?
U+001D u'\x1d' ?
U+001E u'\x1e' ?
U+001F u'\x1f' ?
U+0020 u' ' SPACE
U+0085 u'\x85' ?
U+00A0 u'\xa0' NO-BREAK SPACE
U+1680 u'\u1680' OGHAM SPACE MARK
U+2000 u'\u2000' EN QUAD
U+2001 u'\u2001' EM QUAD
U+2002 u'\u2002' EN SPACE
U+2003 u'\u2003' EM SPACE
U+2004 u'\u2004' THREE-PER-EM SPACE
U+2005 u'\u2005' FOUR-PER-EM SPACE
U+2006 u'\u2006' SIX-PER-EM SPACE
U+2007 u'\u2007' FIGURE SPACE
U+2008 u'\u2008' PUNCTUATION SPACE
U+2009 u'\u2009' THIN SPACE
U+200A u'\u200a' HAIR SPACE
U+200B u'\u200b' ZERO WIDTH SPACE
U+2028 u'\u2028' LINE SEPARATOR
U+2029 u'\u2029' PARAGRAPH SEPARATOR
U+202F u'\u202f' NARROW NO-BREAK SPACE
U+205F u'\u205f' MEDIUM MATHEMATICAL SPACE
U+3000 u'\u3000' IDEOGRAPHIC SPACE

Hmmm, looks like all kinds of widths, from zero upwards.
 
M

Mark Dickinson

However, s[:-len(t)] should be both faster and correct.
Unless len(t) == 0, surely?

Doh! The hazards of insufficient testing. Thanks for catching that.

I have a love-hate relationship with the negative index semantics for
exactly this reason: code like 'x[-n]' always seems smelly to me.
It's often not what the code author actually wanted, except when n is
guaranteed strictly positive for some reason. 'x[-1]' is fine, of
course.
 
D

dasacc22

dasacc22 <dasacc22 <at> gmail.com> writes:




Are you sure?

Your "solution" calculates (the number of leading whitespace characters) + (the
number of TRAILING whitespace characters).

Problem 1: including TRAILING whitespace.
Example: "content" + 3 * " " + "\n" has 4 leading spaces according to your
reckoning; should be 0.
Fix: use lstrip() instead of strip()

Problem 2: assuming all whitespace characters have *effective* width the same as
" ".
Examples: TAB has width 4 or 8 or whatever you want it to be. There are quite a
number of whitespace characters, even when you stick to ASCII. When you look at
Unicode, there are heaps more. Here's a list of BMP characters such that
character.isspace() is True, showing the Unicode codepoint, the Python repr(),
and the name of the character (other than for control characters):

U+0009 u'\t' ?
U+000A u'\n' ?
U+000B u'\x0b' ?
U+000C u'\x0c' ?
U+000D u'\r' ?
U+001C u'\x1c' ?
U+001D u'\x1d' ?
U+001E u'\x1e' ?
U+001F u'\x1f' ?
U+0020 u' ' SPACE
U+0085 u'\x85' ?
U+00A0 u'\xa0' NO-BREAK SPACE
U+1680 u'\u1680' OGHAM SPACE MARK
U+2000 u'\u2000' EN QUAD
U+2001 u'\u2001' EM QUAD
U+2002 u'\u2002' EN SPACE
U+2003 u'\u2003' EM SPACE
U+2004 u'\u2004' THREE-PER-EM SPACE
U+2005 u'\u2005' FOUR-PER-EM SPACE
U+2006 u'\u2006' SIX-PER-EM SPACE
U+2007 u'\u2007' FIGURE SPACE
U+2008 u'\u2008' PUNCTUATION SPACE
U+2009 u'\u2009' THIN SPACE
U+200A u'\u200a' HAIR SPACE
U+200B u'\u200b' ZERO WIDTH SPACE
U+2028 u'\u2028' LINE SEPARATOR
U+2029 u'\u2029' PARAGRAPH SEPARATOR
U+202F u'\u202f' NARROW NO-BREAK SPACE
U+205F u'\u205f' MEDIUM MATHEMATICAL SPACE
U+3000 u'\u3000' IDEOGRAPHIC SPACE

Hmmm, looks like all kinds of widths, from zero upwards.

I unfortunately mixed the solution with a string that would never make
it in the state i typed it in, the trailing whitespace

This is my fault
 
S

Stefan Behnel

dasacc22, 08.05.2010 19:19:
This is a simple question. I'm looking for the fastest way to
calculate the leading whitespace (as a string, ie ' ').

Here is an (untested) Cython 0.13 solution:

from cpython.unicode cimport Py_UNICODE_ISSPACE

def leading_whitespace(unicode ustring):
cdef Py_ssize_t i
cdef Py_UNICODE uchar

for i, uchar in enumerate(ustring):
if not Py_UNICODE_ISSPACE(uchar):
return ustring[:i]
return ustring

Cython compiles this to the obvious C code, so this should be impossible to
beat in plain Python code.

However, since Cython 0.13 hasn't been officially released yet (may take
another couple of weeks or so), you'll need to use the current developer
version from here:

http://hg.cython.org/cython-devel

Stefan
 
S

Stefan Behnel

Stefan Behnel, 10.05.2010 08:54:
dasacc22, 08.05.2010 19:19:
This is a simple question. I'm looking for the fastest way to
calculate the leading whitespace (as a string, ie ' ').

Here is an (untested) Cython 0.13 solution:

from cpython.unicode cimport Py_UNICODE_ISSPACE

def leading_whitespace(unicode ustring):
cdef Py_ssize_t i
cdef Py_UNICODE uchar

for i, uchar in enumerate(ustring):
if not Py_UNICODE_ISSPACE(uchar):
return ustring[:i]
return ustring

Cython compiles this to the obvious C code, so this should be impossible
to beat in plain Python code.

.... and it is. For a simple string like

u = u" abcdefg" + u"fsdf"*20

timeit gives me this for "s=u.lstrip(); u[:-len(s)]":

1000000 loops, best of 3: 0.404 usec per loop

and this for "leading_whitespace(u)":

10000000 loops, best of 3: 0.0901 usec per loop

It's closer for the extreme case of an all whitespace string like " "*60,
where I get this for the lstrip variant:

1000000 loops, best of 3: 0.277 usec per loop

and this for the Cython code:

10000000 loops, best of 3: 0.177 usec per loop

But I doubt that this is the main use case of the OP.

Stefan
 
D

dasacc22

Stefan Behnel, 10.05.2010 08:54:




dasacc22, 08.05.2010 19:19:
Here is an (untested) Cython 0.13 solution:
    from cpython.unicode cimport Py_UNICODE_ISSPACE
    def leading_whitespace(unicode ustring):
        cdef Py_ssize_t i
        cdef Py_UNICODE uchar
        for i, uchar in enumerate(ustring):
            if not Py_UNICODE_ISSPACE(uchar):
                return ustring[:i]
        return ustring
Cython compiles this to the obvious C code, so this should be impossible
to beat in plain Python code.

... and it is. For a simple string like

     u = u"   abcdefg" + u"fsdf"*20

timeit gives me this for "s=u.lstrip(); u[:-len(s)]":

1000000 loops, best of 3: 0.404 usec per loop

and this for "leading_whitespace(u)":

10000000 loops, best of 3: 0.0901 usec per loop

It's closer for the extreme case of an all whitespace string like " "*60,
where I get this for the lstrip variant:

1000000 loops, best of 3: 0.277 usec per loop

and this for the Cython code:

10000000 loops, best of 3: 0.177 usec per loop

But I doubt that this is the main use case of the OP.

Stefan

indeed, actually ive been going back and forth on the idea to use
cython for some more intensive portions. That bit of code looks really
simple so I think I'll give cython a shot. Only deal is I need to be
able to use w/e the latest cython is available via easy_install, but
this should prove an interesting experience.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,444
Messages
2,571,709
Members
48,796
Latest member
Greg L.
Top