Splitting a string

T

Thomas Heller

Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like this:


'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

Thanks for ideas,
Thomas
 
A

Alex Willmer

Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like this:

'si_pos_99_rep_1_0.ita'  -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

This is very probably not the fastest execution wise, it was the
fastest development time wise:

import re

def maybe_int(x):
try:
return int(x)
except ValueError:
return x

def strings_n_ints(s):
return tuple(maybe_int(x) for x in re.findall('(\d+|\D+)', s))
('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

Alex
 
P

Peter Otten

Thomas said:
Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like
this:


'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')
parts = re.compile("([+-]?\d+)").split('si_pos_99_rep_1_0.ita')
parts[1::2] = map(int, parts[1::2])
parts
['si_pos_', 99, '_rep_', 1, '_', 0, '.ita']

Peter
 
P

Patrick Maupin

Thomas said:
Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like
this:
'si_pos_99_rep_1_0.ita'  -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')
parts = re.compile("([+-]?\d+)").split('si_pos_99_rep_1_0.ita')
parts[1::2] = map(int, parts[1::2])
parts

['si_pos_', 99, '_rep_', 1, '_', 0, '.ita']

Peter

You beat me to it. re.split() seems underappreciated for some
reason. When I first started using it (even though it was faster for
the tasks I was using it for than other things) I was really annoyed
at the empty strings it was providing between matches. It is only
within the past couple of years that I have come to appreciate the
elegant solutions that those empty strings allow for. In short,
re.split() is by far the fastest and most elegant way to use the re
module for a large class of problems.

So, the only thing I have to add to this solution is that, for this
particular regular expression, if the source string starts with or
ends with digits, you will get empty strings at the beginning or end
of the resultant list, so if this is a problem, you will want to check
for and discard those.

Regards,
Pat
 
B

Bearophile

I don't know how fast this is (Python 2.x):
('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

It doesn't work with unicode strings.

Bye,
bearophile
 
T

Thomas Heller

Patrick said:
Thomas said:
Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like
this:
'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')
parts = re.compile("([+-]?\d+)").split('si_pos_99_rep_1_0.ita')
parts[1::2] = map(int, parts[1::2])
parts

['si_pos_', 99, '_rep_', 1, '_', 0, '.ita']

Peter

You beat me to it. re.split() seems underappreciated for some
reason. When I first started using it (even though it was faster for
the tasks I was using it for than other things) I was really annoyed
at the empty strings it was providing between matches. It is only
within the past couple of years that I have come to appreciate the
elegant solutions that those empty strings allow for. In short,
re.split() is by far the fastest and most elegant way to use the re
module for a large class of problems.

So, the only thing I have to add to this solution is that, for this
particular regular expression, if the source string starts with or
ends with digits, you will get empty strings at the beginning or end
of the resultant list, so if this is a problem, you will want to check
for and discard those.

Thanks to all for these code snippets. Peter's solution is the winner -
most elegant and also the fastest. With an additional list comprehension
to remove the possible empty strings at the start and at the end I get
16 us. Interesting is that Xavier's solution (which is similar to
some code that I wrote myself) isn't so much slower; it get timings of
around 22 us.
 
P

Peter Otten

Thomas said:
Thanks to all for these code snippets. Peter's solution is the winner -
most elegant and also the fastest. With an additional list comprehension
to remove the possible empty strings at the start and at the end I get
16 us. Interesting is that Xavier's solution (which is similar to
some code that I wrote myself) isn't so much slower; it get timings of
around 22 us.

Deleting the first or last item is probably faster than looping over the
whole list. If there aren't any empty strings the overhead is constant.

_split = re.compile(r"(\d+)").split
def split(s):
if not s:
return ()
parts = _split(s)
parts[1::2] = map(int, parts[1::2])
if parts[-1] == "":
del parts[-1]
if parts[0] == "":
del parts[0]
return tuple(parts)

Peter
 
P

Patrick Maupin

_split = re.compile(r"(\d+)").split
def split(s):
    if not s:
        return ()
    parts = _split(s)
    parts[1::2] = map(int, parts[1::2])
    if parts[-1] == "":
        del parts[-1]
    if parts[0] == "":
        del parts[0]
    return tuple(parts)

That's certainly faster than a list comprehension (at least on long
lists), but it might be a little obscure why the "if not s:" is
needed, so unless Thomas has a really long result list, he might want
to just keep the list comprehension, which is (IMO) very readable.

Alternatively, this is halfway between the previous example and the
list comprehension:

_split = re.compile(r"(\d+)").split
def split(s):
parts = _split(s)
parts[1::2] = map(int, parts[1::2])
for index in (-1, 0):
if parts and parts[index] == "":
del parts[index]
return tuple(parts)

BTW, I just remembered that, although I have often used the fact that
split returns alternating non-match/match/.../match/non-match in the
past, the last time I did this particular task (of splitting out
digits from a string), I didn't make use of that fact. But I wasn't
expecting very many splits for this case. FWIW, here's a class I
wrote that does this to a string for the express purpose of making
sorts work better:

http://code.google.com/p/pyeda/source/browse/trunk/kipy/kipy/utility/istring.py

Regards,
Pat
 
P

Peter Otten

Patrick said:
_split = re.compile(r"(\d+)").split
def split(s):
if not s:
return ()
parts = _split(s)
parts[1::2] = map(int, parts[1::2])

# because s is non-empty parts contains at least one
# item != "", and parts[x] below cannot fail with an
# IndexError
if parts[-1] == "":
del parts[-1]
if parts[0] == "":
del parts[0]
return tuple(parts)

That's certainly faster than a list comprehension (at least on long
lists), but it might be a little obscure why the "if not s:" is
needed,

The function is small; with a test suite covering the corner cases and
perhaps a comment* nothing should go wrong.

(*) you can certainly improve on my attempt
so unless Thomas has a really long result list, he might want
to just keep the list comprehension, which is (IMO) very readable.

Generally speaking performing tests of which you know they can't fail can
confuse the reader just as much as tests with unobvious interdependencies.
Alternatively, this is halfway between the previous example and the
list comprehension:

_split = re.compile(r"(\d+)").split
def split(s):
parts = _split(s)
parts[1::2] = map(int, parts[1::2])
for index in (-1, 0):
if parts and parts[index] == "":
del parts[index]
return tuple(parts)

I don't think that this is similar to the list comprehension approach
because it only tests the first and the last item instead of the whole list.
Both variants should therefore perform equally well for all but the empty
string argument. If that is a theoretical case you are free to choose the
more readable variant.

Peter
 
P

Patrick Maupin

Patrick said:
_split = re.compile(r"(\d+)").split
def split(s):
    if not s:
        return ()
    parts = _split(s)
    parts[1::2] = map(int, parts[1::2])

       # because s is non-empty parts contains at least one
       # item != "", and parts[x] below cannot fail with an
       # IndexError
    if parts[-1] == "":
        del parts[-1]
    if parts[0] == "":
        del parts[0]
    return tuple(parts)
That's certainly faster than a list comprehension (at least on long
lists), but it might be a little obscure why the "if not s:" is
needed,

The function is small; with a test suite covering the corner cases and
perhaps a comment* nothing should go wrong.

(*) you can certainly improve on my attempt
so unless Thomas has a really long result list, he might want
to just keep the list comprehension, which is (IMO) very readable.

Generally speaking performing tests of which you know they can't fail can
confuse the reader just as much as tests with unobvious interdependencies..

Yes, I see your point. The list comprehension will only treat the
ends differently, and will always pass the middle, so someone could be
confused about why the comprehension is there in the first place. I
guess I'm used to doing this same thing on lists that *could* have
empty strings in the middle (with more complicated regular expressions
with multiple match cases), so I didn't consider that.
Alternatively, this is halfway between the previous example and the
list comprehension:
_split = re.compile(r"(\d+)").split
def split(s):
    parts = _split(s)
    parts[1::2] = map(int, parts[1::2])
    for index in (-1, 0):
        if parts and parts[index] == "":
            del parts[index]
    return tuple(parts)

I don't think that this is similar to the list comprehension approach
because it only tests the first and the last item instead of the whole list.
Both variants should therefore perform equally well for all but the empty
string argument. If that is a theoretical case you are free to choose the
more readable variant.

I agree that "halfway" was not a very precise way of describing the
differences. Like your solution, this only tests the outer elements.
Like the list comprehension, no short-circuit test before doing the
re.split is required. Also like the list comprehension, the act of
doing the same operation on multiple elements is refactored such that
operation is only coded once.

BUT...

All of this is just a user preference, and is extremely minor compared
to the observation that re.split() and extended string slicing can be
combined to give a very elegant solution to the problem!

Regards,
Pat
 
S

Steven D'Aprano

The function is small; with a test suite covering the corner cases and
perhaps a comment* nothing should go wrong.

(*) you can certainly improve on my attempt


Generally speaking performing tests of which you know they can't fail
can confuse the reader just as much as tests with unobvious
interdependencies.


I'm not sure I agree with you.

Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail :)
 
A

Alf P. Steinbach

* Steven D'Aprano:
Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail :)

:)

It's the umbrella law.


Cheers,

- Alf
 
P

Patrick Maupin

Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail :)

Well, yes, but that can get rather tedious at times:

a = 1
assert 0 < a < 2
b = a + 3
assert 2 < b - a < 4
c = b * 5
assert not c % 5

At least, I usually ameliorate the pain a little bit by not bothering
to print any debugging information at the assertions until one of them
actually fails; otherwise I'd *never* get any real coding done :)
 
S

Steven D'Aprano

Well, yes, but that can get rather tedious at times:

a = 1
assert 0 < a < 2

Please tell me that's just an exaggerated example for illustration
purposes, and that you don't *actually* do that!

In any case, the *right* test would be:

a = 1
assert a == 1 and a*5==5 and str(a)=='1' and [None,a,None][a] is a

*wink*


At least, I usually ameliorate the pain a little bit by not bothering to
print any debugging information at the assertions until one of them
actually fails; otherwise I'd *never* get any real coding done :)


Ditto.
 
P

Peter Otten

Steven said:
I'm not sure I agree with you.

I'm going to help you make up your mind ;)
Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail :)

Note that I said /can/ not /do/ confuse. Consider the actual context,
removing a special value from the start/end of a list:

(1)
if parts[0] == "":
del parts[0]
if not parts: return parts
if parts[-1] == "":
del parts[-1]
return parts

(2)
return [item for item in parts if item != ""]

Now assume you have to familiarize yourself with the above code. Variant (1)
clearly expresses that the code is meant to touch only the first and last
item. (2) could be just removing all empty strings from a list.

Another way to look at it: it is much easier to refactor from (1) to (2)
than from (2) to (1).

As to assertions etc: you don't perform arbitrary tests like assert 42 ==
42, you make estimates about what can go wrong if you get unexpected input
or make an error in an algorithm that you cannot safely grasp in its
entirety. As an example you could add

assert "" not in parts[1:-1]

as a precondition to the above snippets. The superfluous tests in the list
comprehension are the opposite of that assertion: they could eat items == ""
in the list that either are intended to pass through or that indicate an
error in code above.

Personally, though, I prefer unit tests over assertions.

Peter
 
P

Patrick Maupin

Personally, though, I prefer unit tests over assertions.

IMO, the primary use cases for assertions and unit tests are not the
same.

When I have a well-defined, clearly understood specification that I am
coding to and fully implementing with the aim of producing a fully
general purpose solution that I are offering to others, unit tests are
appropriate to make sure that the code matches the spec, and even to
provide usage examples.

When I am using Python in what I would call a more personal way, for
example in an exploratory fashion to extract value from some large
mass of data, or even to do something like parsing on a subset of
another language, where I have no intention, desire, or time to
implement the entire spec, assertions are perfectly appropriate to
make sure that my code complains at a convenient point about input
data that it would not handle properly.

It is often the case that a more personal project will grow up and
become more general purpose and widely used. At this point, it is
appropriate to remove assertions, add unit tests, etc., but to
complain about not having unit tests in the first place is to avoid
acknowledging that there may be a higher cost associated with unit
tests than with assertions, especially with immature code that may be
severely refactored several times before settling on the best
solution.

Regards,
Pat
 
P

Patrick Maupin

In any case, the *right* test would be:

a = 1
assert a == 1 and a*5==5 and str(a)=='1' and [None,a,None][a] is a

You're right. I was very tired when I wrote that, and forgot the last
3 assertions...
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,780
Messages
2,569,611
Members
45,280
Latest member
BGBBrock56

Latest Threads

Top