Need a specific sort of string modification. Can someone help?

S

Sia

I have strings such as:

tA.-2AG.-2AG,-2ag
or
..+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:

tA..,
and
....

How can I do that?
Thanks.
 
F

Frank Millman

I have strings such as:

tA.-2AG.-2AG,-2ag
or
.+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:

tA..,
and
...

How can I do that?
Thanks.

Here is a naive solution (I am sure there are more elegant ones) -

def strip(old_string):
new_string = ''
max = len(old_string)
pos = 0
while pos < max:
char = old_string[pos]
if char in ('-', '+'):
num_pos = pos+1
num_str = ''
while old_string[num_pos].isdigit():
num_str += old_string[num_pos]
num_pos += 1
pos = num_pos + int(num_str)
else:
new_string += old_string[pos]
pos += 1
return new_string

It caters for the possibility that the number following the +/- could be
greater than 9 - I don't know if you need that.

It works with your examples, except that the second one returns '....',
which I think is correct - there are 4 dots in the original string.

HTH

Frank Millman
 
C

Chris Angelico

I have strings such as:

tA.-2AG.-2AG,-2ag
or
.+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:

tA..,
and
...

How can I do that?

Interesting. Are you guaranteed that there are no other plus or minus
signs? Is the number after the sign just one digit? Assuming the
answers to both are "yes", here's a one-liner:

s=".+3ACG.+5CAACG.+3ACG.+3ACG"

result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

Split on either - or +, then trim off that many characters from each
(that's the list comp), then join them back into a string.

ChrisA
 
R

Roy Smith

Sia said:
I have strings such as:

tA.-2AG.-2AG,-2ag
or
.+3ACG.+5CAACG.+3ACG.+3ACG

Some kind of DNA binding site?

A couple of questions. Are the numbers always single digits? How much
data is there? Are we talking a few hundred 20-character strings, or
all of Genbank?
The plus and minus signs are always followed by a number (say, i). I want
python to find each single plus or minus, remove the sign, the number after
it and remove i characters after that. So the two strings above become:

tA..,
and
...

If I follow your description properly, the last output should be "...."
(4 dots), right? This looks like it should work. I'm sure there's more
efficient ways to do it, but for small inputs, this should be ok.˜

The general pattern here is a state machine. It's a good pattern to
learn if you're going to be doing any kind of sequence analysis. See,
for example, http://en.wikipedia.org/wiki/State_machine.

# Build up the new string as a list (for efficiency)
new = []

# Keep track of what state we're in. The three possible states
# are 1) scanning for a region to be deleted, 2) looking for the
# number, and 3) reading past the letters to be deleted.
SCANNING = 1
NUMBER = 2
DROPPING = 3
state = SCANNING

# If we are in state DROPPING, dropcount is the number of
# letters remaining to be dropped.
dropcount = 0

old = '.+3ACG.+5CAACG.+3ACG.+3ACG'
for c in old:
if state == SCANNING:
if c in '+-':
state = NUMBER
else:
new.append(c)

elif state == NUMBER:
# Assume the counts are all single digits. If not, then
# we'll need a 4th state for accumulating the digits.
dropcount = int(c)
state = DROPPING

else:
assert state == DROPPING
dropcount -= 1
if dropcount == 0:
state = SCANNING

print ''.join(new)
 
R

Roy Smith

Chris Angelico said:
result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

That's exceedingly clever. But bordering on line noise. At the very
least, I would break it up into a couple of lines to make it easier to
understand (plus you can print out the intermediate values to see what's
going on):

chunks = ("0"+s).replace("-","+").split("+")
result = "".join([x[int(x[0])+1:] for x in chunks]
 
C

Chris Angelico

Chris Angelico said:
result = "".join([x[int(x[0])+1:] for x in ("0"+s).replace("-","+").split("+")])

That's exceedingly clever. But bordering on line noise. At the very
least, I would break it up into a couple of lines to make it easier to
understand (plus you can print out the intermediate values to see what's
going on):

chunks = ("0"+s).replace("-","+").split("+")
result = "".join([x[int(x[0])+1:] for x in chunks]

Sure. You can always split a one-liner to taste, doesn't much matter
where. You could split it majorly into half a dozen lines if you want,
doesn't make a lot of diff. It'll work the same way :)

ChrisA
 
R

Roy Smith

Chris Angelico said:
Chris Angelico said:
result = "".join([x[int(x[0])+1:] for x in
("0"+s).replace("-","+").split("+")])

That's exceedingly clever. But bordering on line noise. At the very
least, I would break it up into a couple of lines to make it easier to
understand (plus you can print out the intermediate values to see what's
going on):

chunks = ("0"+s).replace("-","+").split("+")
result = "".join([x[int(x[0])+1:] for x in chunks]

Sure. You can always split a one-liner to taste, doesn't much matter
where.

Well, there are better and worse places to split things. Consider these
three statements:

result = f1().f2().f3().f4().f5()
result = f1(f2.f3(f4().f5()))
result = f1(f2(3(f4(f5()))))

Same number of function calls, but the middle one is clearly the most
difficult to understand. The reason is because the scan direction keeps
changing. The first one is strictly left-to-right. The last one is
strictly inside-to-outside. The middle one is all over the place.

That's why I chose to split this where I did. It was where the scan
direction changed.

Readability matters.
 
C

Chris Angelico

That's why I chose to split this where I did. It was where the scan
direction changed.

Ah, good point. In any case, this is a fairly simple and clear way of
doing things; it may or may not run faster than the explicit state
machine, but IMHO it's a lot clearer to read a split() than something
that changes state when a particular character is found.

ChrisA
 
R

Roy Smith

Chris Angelico said:
it may or may not run faster than the explicit state machine,

Hmmm, hard to say. Both are O(n), but yours makes several more copies
of the data than mine (the string addition, the replace(), the split(),
the string slicing). We both make copies as we put fragments into a
list, and when we do the join().

On the other hand, yours does all that inside the library, which is
generally faster than random python code.

Let's see:

My way:
real 0m2.097s
user 0m2.078s
sys 0m0.016s

Your way:
real 0m0.649s
user 0m0.622s
sys 0m0.025s

You got me by a factor of 3 or 4. Not bad. And not terribly
surprising. Once you've got things to the same O() group, pushing as
much data manipulation as you can down into the library is usually a
pretty effective optimization technique.
but IMHO it's a lot clearer to read a split() than something
that changes state when a particular character is found.

Maybe. But being familiar with state machines is still a handy skill.
DNA sequence analysis has lots of problems like "find a start codon
which is within about 50 bases of a binding site, and then copy
everything up until you find a stop codon". Things like that often map
well to state machines. Especially if you're trying to do it in
parallel in all three reading frames.
 
C

Chris Angelico

You got me by a factor of 3 or 4. Not bad.

You miss my point, though. I went for simple Pythonic code, and never
measured its performance, on the expectation that it's "good enough".
Written in C, the state machine is probably WAY faster than splitting
and then iterating. My C++ MUD client uses code similar to that to
parse TELNET and ANSI codes from a stream of bytes in a socket (and
one of its "states" is that there's no more data available, so wait on
the socket); the rewrite in a high level language divides the string
on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
then afterward splits on "\n" to divide into lines. The code's much
less convoluted, it's easier to test different parts (because I can
simply call the ANSI parser with a block of text), and on a modern
computer, you can't see the performance difference (since you spend
most of your time waiting for socket data anyway).

But it's gratifying to know that the obvious and brief way to do
things is fast too :)
Maybe. But being familiar with state machines is still a handy skill.
DNA sequence analysis has lots of problems like "find a start codon
which is within about 50 bases of a binding site, and then copy
everything up until you find a stop codon". Things like that often map
well to state machines. Especially if you're trying to do it in
parallel in all three reading frames.

Sure. And if you're working with petabytes of data, these
considerations become fairly important. When that happens, you start
rewriting your algorithms in C, or using Cython, or something; at very
least, you start rewriting clear and simple algorithms into more
complex ones. But all this happens *after* the code has been tested
and proven. All the rewrites can be verified as being identical to
their reference implementations; you can test one piece at a time as
you change them. It's ever so much easier to work that way.

<anecdote>At work, we had one employee whose code was, shall we say,
less than stellar. At one point, he embarked on a months-long rewrite
of one of his modules; meanwhile, I was unable to adequately test code
that called on it. Once the rewrite was finally complete, I discovered
myriad bugs in my own code, ones that would have been found and fixed
instantly if I'd had even a slow version of the code to work against.
Starting with something you can easily debug helps enormously with
that, because debugging doesn't demand mega-TPS throughput.</anecdote>

ChrisA
 
T

Tim Chase

I have strings such as:

tA.-2AG.-2AG,-2ag
or
.+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I want python to find each single plus or minus, remove the sign, the number after it and remove i characters after that. So the two strings above become:

tA..,
and
...

With the same caveat as Frank posted about the second one being
"...." (4 dots), I don't know how this version times out:

import re
r = re.compile(r"[-+](\d+)([^-+]*)")
def modify(m):
result = m.group(2)[int(m.group(1)):]
return result
for test, expected in (
("tA.-2AG.-2AG,-2ag", "tA..,"),
(".+3ACG.+5CAACG.+3ACG.+3ACG", "...."),
):
s = r.sub(modify, test)
print "%r -> %r (%r)" % (
test, s, expected
)
assert s == expected, "Nope"

(it passes the tests as modified to "....")

-tkc
 
T

Tim Chase

I don't know how this version times out:

import re
r = re.compile(r"[-+](\d+)([^-+]*)")
def modify(m):
result = m.group(2)[int(m.group(1)):]
return result

Doh, I intended to change this after testing, making it just

returm m.group(2)[int(m.group(1)):]

rather than expending an intermediate variable I used for testing

-tkc
 
I

Ian Kelly

You miss my point, though. I went for simple Pythonic code, and never
measured its performance, on the expectation that it's "good enough".
Written in C, the state machine is probably WAY faster than splitting
and then iterating. My C++ MUD client uses code similar to that to
parse TELNET and ANSI codes from a stream of bytes in a socket (and
one of its "states" is that there's no more data available, so wait on
the socket); the rewrite in a high level language divides the string
on "\xFF" for TELNET and "\x1B" for ANSI, working them separately, and
then afterward splits on "\n" to divide into lines. The code's much
less convoluted, it's easier to test different parts (because I can
simply call the ANSI parser with a block of text), and on a modern
computer, you can't see the performance difference (since you spend
most of your time waiting for socket data anyway).

Anecdotally and somewhat off-topic, when I wrote my own MUD client in
Python, I implemented both TELNET and ANSI parsing in Python using a
state machine processing one byte at a time (actually two state
machines - one at the protocol layer and one at the client layer; the
telnet module is a modified version of the twisted.conch.telnet
module). I was worried that the processing would be too slow in
Python. When I got it running, it turned out that there was a
noticeable lag between input being received and displayed. But when I
profiled the issue it actually turned out that the rich text control I
was using, which was written in C++, wasn't able to handle a large
buffer gracefully. The parsing code in Python did just fine and
didn't contribute to the lag issue at all.
 
C

Chris Angelico

Anecdotally and somewhat off-topic, when I wrote my own MUD client in
Python, I implemented both TELNET and ANSI parsing in Python using a
state machine processing one byte at a time (actually two state
machines - one at the protocol layer and one at the client layer; the
telnet module is a modified version of the twisted.conch.telnet
module). I was worried that the processing would be too slow in
Python. When I got it running, it turned out that there was a
noticeable lag between input being received and displayed. But when I
profiled the issue it actually turned out that the rich text control I
was using, which was written in C++, wasn't able to handle a large
buffer gracefully. The parsing code in Python did just fine and
didn't contribute to the lag issue at all.

The opposite decision, the same result. Performance was *good enough*.
With Gypsum, my early performance problems were almost completely
solved by properly handling the expose-event and only painting the
parts that changed (I don't use a rich text control, I use a
GTK2.DrawingArea). Text processing of something that's come over a
network is seldom a problem; if it were, we'd see noticeably slower
downloads over HTTPS than HTTP, and that just doesn't happen. I think
(though I can't prove) that crypto written in C is probably more
expensive than ANSI parsing written in Python.

ChrisA
 
R

Roy Smith

Chris Angelico said:
The opposite decision, the same result. Performance was *good enough*.
With Gypsum, my early performance problems were almost completely
solved by properly handling the expose-event and only painting the
parts that changed (I don't use a rich text control, I use a
GTK2.DrawingArea). Text processing of something that's come over a
network is seldom a problem; if it were, we'd see noticeably slower
downloads over HTTPS than HTTP, and that just doesn't happen. I think
(though I can't prove) that crypto written in C is probably more
expensive than ANSI parsing written in Python.

ChrisA

It's rare to find applications these days that are truly CPU bound.
Once you've used some reasonable algorithm, i.e. not done anything in
O(n^2) that could have been done in O(n) or O(n log n), you will more
often run up against I/O speed, database speed, network latency, memory
exhaustion, or some such as the reason your code is too slow.

The upshot of this is that for most things, Python (even though it runs
an order of magnitude slower than C), will always be *good enough*.

I'm sure I've mentioned this before, but the application code for
Songza.com is 100% Python. We pretty much can't even measure how much
CPU time is spent running Python code. Everything is database, network,
and (occasionally) disk throughput.
 
M

Mitya Sirenef

I have strings such as:

tA.-2AG.-2AG,-2ag
or
.+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I
want python to find each single plus or minus, remove the sign, the
number after it and remove i characters after that. So the two strings
above become:
tA..,
and
...

How can I do that?
Thanks.


I think it's a bit cleaner and nicer to do something similar to
itertools.takewhile but takewhile 'eats' a single next value.
I was actually doing some stuff that also needed this. I wonder if
there's a more elegant, robust way to do this?

Here's what I got for now:


class BIterator(object):
"""Iterator with 'buffered' takewhile."""

def __init__(self, seq):
self.seq = iter(seq)
self.buffer = []
self.end_marker = object()
self.last = None

def consume(self, n):
for _ in range(n): self.next()

def next(self):
val = self.buffer.pop() if self.buffer else next(self.seq,
self.end_marker)
self.last = val
return val

def takewhile(self, test):
lst = []
while True:
val = self.next()
if val is self.end_marker:
return lst
elif test(val):
lst.append(val)
else:
self.buffer.append(val)
return lst

def joined_takewhile(self, test):
return ''.join(self.takewhile(test))

def done(self):
return bool(self.last is self.end_marker)


s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
not_plusminus = lambda x: x not in "+-"
isdigit = lambda x: x.isdigit()

def process(s):
lst = []
s = BIterator(s)

while True:
lst.extend(s.takewhile(not_plusminus))
if s.done(): break
s.next()
n = int(s.joined_takewhile(isdigit))
s.consume(n)

return ''.join(lst)


print(process(s))


Obviously it assumes the input is well-formed, but the logic would be
very easy to change to, for example, check for s.done() after each step.

- mitya
 
R

Roy Smith

Roy Smith said:
It's rare to find applications these days that are truly CPU bound.
Once you've used some reasonable algorithm, i.e. not done anything in
O(n^2) that could have been done in O(n) or O(n log n), you will more
often run up against I/O speed, database speed, network latency, memory
exhaustion, or some such as the reason your code is too slow.

Well, I just found a counter-example :)

I've been doing some log analysis. It's been taking a grovelingly long
time, so I decided to fire up the profiler and see what's taking so
long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
might be (looking up IP addresses in the geolocation database, or
producing some pretty pictures using matplotlib). It was just a matter
of figuring out which it was.

As with most attempts to out-guess the profiler, I was totally,
absolutely, and embarrassingly wrong.

It turns out we were spending most of the time parsing timestamps!
Since there's no convenient way (I don't consider strptime() to be
convenient) to parse isoformat strings in the standard library, our
habit has been to use the oh-so-simple parser from the third-party
dateutil package. Well, it turns out that's slow as all get-out
(probably because it's trying to be smart about auto-recognizing
formats). For the test I ran (on a few percent of the real data), we
spent 90 seconds in parse().

OK, so I dragged out the strptime() docs and built the stupid format
string (%Y-%m-%dT%H:%M:%S+00:00). That got us down to 25 seconds in
strptime().

But, I could also see it was spending a significant amount in routines
that looked like they were computing things like day of the week that we
didn't need. For what I was doing, we only really needed the hour and
minute. So I tried:

t_hour = int(date[11:13])
t_minute = int(date[14:16])

that got us down to 12 seconds overall (including the geolocation and
pretty pictures).

I think it turns out we never do anything with the hour and minute other
than print them back out, so just

t_hour_minute = date[11:16]

would probably be good enough, but I think I'm going to stop where I am
and declare victory :)
 
M

Mitya Sirenef

I have strings such as:

tA.-2AG.-2AG,-2ag
or
.+3ACG.+5CAACG.+3ACG.+3ACG

The plus and minus signs are always followed by a number (say, i). I
want python to find each single plus or minus, remove the sign, the
number after it and remove i characters after that. So the two strings
above become:
tA..,
and
...

How can I do that?
Thanks.


I think it's a bit cleaner and nicer to do something similar to
itertools.takewhile but takewhile 'eats' a single next value.
I was actually doing some stuff that also needed this. I wonder if
there's a more elegant, robust way to do this?

Here's what I got for now:


class BIterator(object):
"""Iterator with 'buffered' takewhile."""

def __init__(self, seq):
self.seq = iter(seq)
self.buffer = []
self.end_marker = object()
self.last = None

def consume(self, n):
for _ in range(n): self.next()

def next(self):
val = self.buffer.pop() if self.buffer else next(self.seq,
self.end_marker)
self.last = val
return val

def takewhile(self, test):
lst = []
while True:
val = self.next()
if val is self.end_marker:
return lst
elif test(val):
lst.append(val)
else:
self.buffer.append(val)
return lst

def joined_takewhile(self, test):
return ''.join(self.takewhile(test))

def done(self):
return bool(self.last is self.end_marker)


s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
not_plusminus = lambda x: x not in "+-"
isdigit = lambda x: x.isdigit()

def process(s):
lst = []
s = BIterator(s)

while True:
lst.extend(s.takewhile(not_plusminus))
if s.done(): break
s.next()
n = int(s.joined_takewhile(isdigit))
s.consume(n)

return ''.join(lst)


print(process(s))


Obviously it assumes the input is well-formed, but the logic would be
very easy to change to, for example, check for s.done() after each step.

- mitya

I've added some refinements:



class BIterator(object):
"""Iterator with 'buffered' takewhile and takeuntil."""

def __init__(self, seq):
self.seq = iter(seq)
self.buffer = []
self.end_marker = object()
self.last = None

def __bool__(self):
return self.last is not self.end_marker

def __next__(self):
val = self.buffer.pop() if self.buffer else next(self.seq,
self.end_marker)
self.last = val
return val

def consume(self, n):
for _ in range(n): next(self)

def takewhile(self, test):
lst = []
while True:
val = next(self)
if val is self.end_marker:
return lst
elif test(val):
lst.append(val)
else:
self.buffer.append(val)
return lst

def takeuntil(self, test):
negtest = lambda x: not test(x)
return self.takewhile(negtest)

def joined_takewhile(self, test):
return ''.join(self.takewhile(test))

def joined_takeuntil(self, test):
return ''.join(self.takeuntil(test))


def process(s):
s = BIterator(s)
lst = []
plusminus = lambda x: x in "+-"
isdigit = lambda x: x.isdigit()

while s:
lst.extend(s.takeuntil(plusminus))
next(s)
n = s.joined_takewhile(isdigit) or 0
s.consume(int(n))

return ''.join(lst)


s = ".+3ACG.+5CAACG.+3ACG.+3ACG"
print(process(s))
 
S

Steven D'Aprano

I've been doing some log analysis. It's been taking a grovelingly long
time, so I decided to fire up the profiler and see what's taking so
long. I had a pretty good idea of where the ONLY TWO POSSIBLE hotspots
might be (looking up IP addresses in the geolocation database, or
producing some pretty pictures using matplotlib). It was just a matter
of figuring out which it was.

As with most attempts to out-guess the profiler, I was totally,
absolutely, and embarrassingly wrong.

+1 QOTW

Thank you for sharing this with us.
 
N

Nick Mellor

Hi Sia,

Here's another variation. It's within my tolerance for readability :) and also quick, if that's an issue. It does 100,000 of your longer string in a couple of seconds on my venerable laptop.

It handles only single-digit numbers. For multi-digit, I'd be inclined to have a look at takewhile and/or dropwhile, both in itertools (Python 2.7 and 3.x only.)


from string import maketrans

class redux:

def __init__(self):
intab = '+-'
outtab = ' ' # two spaces
self.trantab = maketrans(intab, outtab)


def reduce_plusminus(self, s):
list_form = [r[int(r[0]) + 1:] if r[0].isdigit() else r
for r
in s.translate(self.trantab).split()]
return ''.join(list_form)

if __name__ == "__main__":
p = redux()
print p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")
print p.reduce_plusminus("tA.-2AG.-2AG,-2ag")

for n in range(100000): p.reduce_plusminus(".+3ACG.+5CAACG.+3ACG.+3ACG")
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
473,769
Messages
2,569,580
Members
45,054
Latest member
TrimKetoBoost

Latest Threads

Top