Splitting a string

Thomas Heller · Apr 2, 2010

Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like this:

'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

Thanks for ideas,
Thomas

Alex Willmer · Apr 2, 2010

Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like this:

'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

This is very probably not the fastest execution wise, it was the
fastest development time wise:

import re

def maybe_int(x):
try:
return int(x)
except ValueError:
return x

def strings_n_ints(s):
return tuple(maybe_int(x) for x in re.findall('(\d+|\D+)', s))
('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

Alex

Peter Otten · Apr 2, 2010

Thomas said:
Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like
this:

'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

parts = re.compile("([+-]?\d+)").split('si_pos_99_rep_1_0.ita')
parts[1::2] = map(int, parts[1::2])
parts

Click to expand...

Click to expand...

['si_pos_', 99, '_rep_', 1, '_', 0, '.ita']

Peter

Patrick Maupin · Apr 2, 2010

Thomas said:
Thomas said:

Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like
this:

Click to expand...

'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

parts = re.compile("([+-]?\d+)").split('si_pos_99_rep_1_0.ita')
parts[1::2] = map(int, parts[1::2])
parts

Click to expand...

Click to expand...

['si_pos_', 99, '_rep_', 1, '_', 0, '.ita']

Peter

You beat me to it. re.split() seems underappreciated for some
reason. When I first started using it (even though it was faster for
the tasks I was using it for than other things) I was really annoyed
at the empty strings it was providing between matches. It is only
within the past couple of years that I have come to appreciate the
elegant solutions that those empty strings allow for. In short,
re.split() is by far the fastest and most elegant way to use the re
module for a large class of problems.

So, the only thing I have to add to this solution is that, for this
particular regular expression, if the source string starts with or
ends with digits, you will get empty strings at the beginning or end
of the resultant list, so if this is a problem, you will want to check
for and discard those.

Regards,
Pat

Bearophile · Apr 2, 2010

I don't know how fast this is (Python 2.x):
('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

It doesn't work with unicode strings.

Bye,
bearophile

Thomas Heller · Apr 2, 2010

Patrick said:
Thomas said:

Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like
this:

Click to expand...

'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')
parts = re.compile("([+-]?\d+)").split('si_pos_99_rep_1_0.ita')
parts[1::2] = map(int, parts[1::2])
parts

Click to expand...

['si_pos_', 99, '_rep_', 1, '_', 0, '.ita']

Peter

Click to expand...

You beat me to it. re.split() seems underappreciated for some
reason. When I first started using it (even though it was faster for
the tasks I was using it for than other things) I was really annoyed
at the empty strings it was providing between matches. It is only
within the past couple of years that I have come to appreciate the
elegant solutions that those empty strings allow for. In short,
re.split() is by far the fastest and most elegant way to use the re
module for a large class of problems.

So, the only thing I have to add to this solution is that, for this
particular regular expression, if the source string starts with or
ends with digits, you will get empty strings at the beginning or end
of the resultant list, so if this is a problem, you will want to check
for and discard those.

Thanks to all for these code snippets. Peter's solution is the winner -
most elegant and also the fastest. With an additional list comprehension
to remove the possible empty strings at the start and at the end I get
16 us. Interesting is that Xavier's solution (which is similar to
some code that I wrote myself) isn't so much slower; it get timings of
around 22 us.

Peter Otten · Apr 2, 2010

Thomas said:
Thanks to all for these code snippets. Peter's solution is the winner -
most elegant and also the fastest. With an additional list comprehension
to remove the possible empty strings at the start and at the end I get
16 us. Interesting is that Xavier's solution (which is similar to
some code that I wrote myself) isn't so much slower; it get timings of
around 22 us.

Deleting the first or last item is probably faster than looping over the
whole list. If there aren't any empty strings the overhead is constant.

_split = re.compile(r"(\d+)").split
def split(s):
if not s:
return ()
parts = _split(s)
parts[1::2] = map(int, parts[1::2])
if parts[-1] == "":
del parts[-1]
if parts[0] == "":
del parts[0]
return tuple(parts)

Peter

Patrick Maupin · Apr 2, 2010

_split = re.compile(r"(\d+)").split
def split(s):
if not s:
return ()
parts = _split(s)
parts[1::2] = map(int, parts[1::2])
if parts[-1] == "":
del parts[-1]
if parts[0] == "":
del parts[0]
return tuple(parts)

That's certainly faster than a list comprehension (at least on long
lists), but it might be a little obscure why the "if not s:" is
needed, so unless Thomas has a really long result list, he might want
to just keep the list comprehension, which is (IMO) very readable.

Alternatively, this is halfway between the previous example and the
list comprehension:

_split = re.compile(r"(\d+)").split
def split(s):
parts = _split(s)
parts[1::2] = map(int, parts[1::2])
for index in (-1, 0):
if parts and parts[index] == "":
del parts[index]
return tuple(parts)

BTW, I just remembered that, although I have often used the fact that
split returns alternating non-match/match/.../match/non-match in the
past, the last time I did this particular task (of splitting out
digits from a string), I didn't make use of that fact. But I wasn't
expecting very many splits for this case. FWIW, here's a class I
wrote that does this to a string for the express purpose of making
sorts work better:

http://code.google.com/p/pyeda/source/browse/trunk/kipy/kipy/utility/istring.py

Regards,
Pat

Peter Otten · Apr 3, 2010

Patrick said:
_split = re.compile(r"(\d+)").split
def split(s):
if not s:
return ()
parts = _split(s)
parts[1::2] = map(int, parts[1::2])

Click to expand...

# because s is non-empty parts contains at least one
# item != "", and parts[x] below cannot fail with an
# IndexError

if parts[-1] == "":
del parts[-1]
if parts[0] == "":
del parts[0]
return tuple(parts)

Click to expand...

That's certainly faster than a list comprehension (at least on long
lists), but it might be a little obscure why the "if not s:" is
needed,

The function is small; with a test suite covering the corner cases and
perhaps a comment* nothing should go wrong.

(*) you can certainly improve on my attempt

so unless Thomas has a really long result list, he might want
to just keep the list comprehension, which is (IMO) very readable.

Generally speaking performing tests of which you know they can't fail can
confuse the reader just as much as tests with unobvious interdependencies.

Alternatively, this is halfway between the previous example and the
list comprehension:

_split = re.compile(r"(\d+)").split
def split(s):
parts = _split(s)
parts[1::2] = map(int, parts[1::2])
for index in (-1, 0):
if parts and parts[index] == "":
del parts[index]
return tuple(parts)

I don't think that this is similar to the list comprehension approach
because it only tests the first and the last item instead of the whole list.
Both variants should therefore perform equally well for all but the empty
string argument. If that is a theoretical case you are free to choose the
more readable variant.

Peter

Patrick Maupin · Apr 3, 2010

Patrick said:
Patrick said:

_split = re.compile(r"(\d+)").split
def split(s):
if not s:
return ()
parts = _split(s)
parts[1::2] = map(int, parts[1::2])

Click to expand...

Click to expand...

# because s is non-empty parts contains at least one
# item != "", and parts[x] below cannot fail with an
# IndexError

if parts[-1] == "":
del parts[-1]
if parts[0] == "":
del parts[0]
return tuple(parts)

Click to expand...

Click to expand...

That's certainly faster than a list comprehension (at least on long
lists), but it might be a little obscure why the "if not s:" is
needed,

Click to expand...

The function is small; with a test suite covering the corner cases and
perhaps a comment* nothing should go wrong.

(*) you can certainly improve on my attempt

so unless Thomas has a really long result list, he might want
to just keep the list comprehension, which is (IMO) very readable.

Click to expand...

Generally speaking performing tests of which you know they can't fail can
confuse the reader just as much as tests with unobvious interdependencies..

Yes, I see your point. The list comprehension will only treat the
ends differently, and will always pass the middle, so someone could be
confused about why the comprehension is there in the first place. I
guess I'm used to doing this same thing on lists that *could* have
empty strings in the middle (with more complicated regular expressions
with multiple match cases), so I didn't consider that.

Alternatively, this is halfway between the previous example and the
list comprehension:

Click to expand...

_split = re.compile(r"(\d+)").split
def split(s):
parts = _split(s)
parts[1::2] = map(int, parts[1::2])
for index in (-1, 0):
if parts and parts[index] == "":
del parts[index]
return tuple(parts)

Click to expand...

I don't think that this is similar to the list comprehension approach
because it only tests the first and the last item instead of the whole list.
Both variants should therefore perform equally well for all but the empty
string argument. If that is a theoretical case you are free to choose the
more readable variant.

I agree that "halfway" was not a very precise way of describing the
differences. Like your solution, this only tests the outer elements.
Like the list comprehension, no short-circuit test before doing the
re.split is required. Also like the list comprehension, the act of
doing the same operation on multiple elements is refactored such that
operation is only coded once.

BUT...

All of this is just a user preference, and is extremely minor compared
to the observation that re.split() and extended string slicing can be
combined to give a very elegant solution to the problem!

Regards,
Pat

Steven D'Aprano · Apr 4, 2010

The function is small; with a test suite covering the corner cases and
perhaps a comment* nothing should go wrong.

(*) you can certainly improve on my attempt

Generally speaking performing tests of which you know they can't fail
can confuse the reader just as much as tests with unobvious
interdependencies.

I'm not sure I agree with you.

Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail

Alf P. Steinbach · Apr 4, 2010

* Steven D'Aprano:

Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail

It's the umbrella law.

Cheers,

- Alf

Patrick Maupin · Apr 4, 2010

Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail

Well, yes, but that can get rather tedious at times:

a = 1
assert 0 < a < 2
b = a + 3
assert 2 < b - a < 4
c = b * 5
assert not c % 5

At least, I usually ameliorate the pain a little bit by not bothering
to print any debugging information at the assertions until one of them
actually fails; otherwise I'd *never* get any real coding done

Steven D'Aprano · Apr 4, 2010

Well, yes, but that can get rather tedious at times:

a = 1
assert 0 < a < 2

Please tell me that's just an exaggerated example for illustration
purposes, and that you don't *actually* do that!

In any case, the *right* test would be:

a = 1
assert a == 1 and a*5==5 and str(a)=='1' and [None,a,None][a] is a

*wink*

At least, I usually ameliorate the pain a little bit by not bothering to
print any debugging information at the assertions until one of them
actually fails; otherwise I'd *never* get any real coding done

Ditto.

Peter Otten · Apr 4, 2010

Steven said:
I'm not sure I agree with you.

I'm going to help you make up your mind

Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail

Note that I said /can/ not /do/ confuse. Consider the actual context,
removing a special value from the start/end of a list:

(1)
if parts[0] == "":
del parts[0]
if not parts: return parts
if parts[-1] == "":
del parts[-1]
return parts

(2)
return [item for item in parts if item != ""]

Now assume you have to familiarize yourself with the above code. Variant (1)
clearly expresses that the code is meant to touch only the first and last
item. (2) could be just removing all empty strings from a list.

Another way to look at it: it is much easier to refactor from (1) to (2)
than from (2) to (1).

As to assertions etc: you don't perform arbitrary tests like assert 42 ==
42, you make estimates about what can go wrong if you get unexpected input
or make an error in an algorithm that you cannot safely grasp in its
entirety. As an example you could add

assert "" not in parts[1:-1]

as a precondition to the above snippets. The superfluous tests in the list
comprehension are the opposite of that assertion: they could eat items == ""
in the list that either are intended to pass through or that indicate an
error in code above.

Personally, though, I prefer unit tests over assertions.

Peter

Patrick Maupin · Apr 4, 2010

Personally, though, I prefer unit tests over assertions.

IMO, the primary use cases for assertions and unit tests are not the
same.

When I have a well-defined, clearly understood specification that I am
coding to and fully implementing with the aim of producing a fully
general purpose solution that I are offering to others, unit tests are
appropriate to make sure that the code matches the spec, and even to
provide usage examples.

When I am using Python in what I would call a more personal way, for
example in an exploratory fashion to extract value from some large
mass of data, or even to do something like parsing on a subset of
another language, where I have no intention, desire, or time to
implement the entire spec, assertions are perfectly appropriate to
make sure that my code complains at a convenient point about input
data that it would not handle properly.

It is often the case that a more personal project will grow up and
become more general purpose and widely used. At this point, it is
appropriate to remove assertions, add unit tests, etc., but to
complain about not having unit tests in the first place is to avoid
acknowledging that there may be a higher cost associated with unit
tests than with assertions, especially with immature code that may be
severely refactored several times before settling on the best
solution.

Regards,
Pat

Patrick Maupin · Apr 4, 2010

In any case, the *right* test would be:

a = 1
assert a == 1 and a*5==5 and str(a)=='1' and [None,a,None][a] is a

You're right. I was very tired when I wrote that, and forgot the last
3 assertions...

Problem Splitting Text String	2	Dec 29, 2022
String splitting by spaces question	6	Nov 23, 2011
Hello guys ! How do I convert a string from an array into numbers ? Javascript	3	Dec 19, 2022
Converting an Array to a String in JavaScript	7	Sep 22, 2023
String splitting with exceptions	7	Aug 28, 2013
Splitting a file from specific column content	14	Jan 22, 2012
Sequence splitting	32	Jul 3, 2009
Splitting a line while keeping quoted items together	1	Nov 20, 2012

Splitting a string

Thomas Heller

Alex Willmer

Peter Otten

Patrick Maupin

Bearophile

Thomas Heller

Peter Otten

Patrick Maupin

Peter Otten

Patrick Maupin

Steven D'Aprano

Alf P. Steinbach

Patrick Maupin

Steven D'Aprano

Peter Otten

Patrick Maupin

Patrick Maupin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads